Validating Data

One of the downsides of crowd-sourced data, plus the quality of public domain data from organizations like the Census, is having to get used to imperfect data. When doing an importing or editing session, it's important to properly validate the data before it gets imported into OpenStreetMap. While there usually is a post-import cleanup of things that got missed, the trick is to limit that as much as possible. Many rural areas have a limited budget for GIS, so often their data is old and out of date. Also, different GIS staff over the years have had different styles, leading to inconsistent naming conventions.

I've been focused lately on making navigation work better (at least here in rural Colorado) so OpenStreetMap can be used for emergency response in remote areas like mine. We have many old roads left from the mining era, and many still have people living in remote cabins on those roads (like me). Most of our roads are also barely maintained forest service roads, as much of the land here is public and remote.

Google doesn't know everything... Seriously, in my fire district we've been to several structure fires in just the last year whose roads didn't even exist in Google, but did exist in OpenStreetMap because somebody had added them. Even worse, my county renamed many of the roads 20 years ago to reduce duplication, and Google hasn't updated most of those to the new names. Even more fun: here it's legal to name your private access road, but those aren't considered real roads in the governmental sense, so they don't exist in GIS.

The primary way to make navigation work is simple: address data with a house number and a street name, and optionally a unit number. The spelling of the street must match what OpenStreetMap already has for the name. Also, a highway name may be stored under a variety of tags: name, ref, ref:usfs, ref:blm, etc... This is where the fun starts. Some of my software does fuzzy pattern matching on the names and refs, but it works better if everything is roughly consistent. When I find problems, it's better to fix them in the upstream data than to have a local workaround.
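
As a rough sketch of what that fuzzy matching can look like (this is not my actual code, and the 0.8 threshold is just a guess for illustration), Python's difflib can score how close two spellings are:

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Return a 0.0-1.0 similarity score between two road names."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def best_match(addr_street, candidates, threshold=0.8):
        """Pick the closest candidate name, or None if nothing is close enough."""
        scored = [(name_similarity(addr_street, c), c) for c in candidates if c]
        scored.sort(reverse=True)
        if scored and scored[0][0] >= threshold:
            return scored[0][1]
        return None

    # The address data says "North Foobar Street"; the candidates come from
    # the name, ref, and ref:usfs tags of nearby ways.
    print(best_match("North Foobar Street", ["Foobar Street", "CR 124", "FR 512.1"]))

Anything that scores below the threshold gets flagged for manual review instead of being matched automatically.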

While there are some great online tools like OSM Inspector, they only work after you've uploaded everything to OpenStreetMap. Fixing mistakes at that level can be tedious and time consuming. I still use these tools, but have managed to reduce problems to a small handful by validating carefully before importing. The problem with validation is that you need other data sources to check against, and often they disagree on spelling, so the final result needs to be the consensus amongst the datasets. A typical problem is that the official road name may be North Foobar Street, but in OSM (via Tiger 2008) the name may only be Foobar Street, and the county may call it "South Foobar Drive". Of course Tiger may be wrong, as it often is, so you're stuck with a "What the $%#^@" moment.

My Datasets

To help with this, I've collected multiple datasets for Colorado, all public domain, and use them to make the best judgment call on the name. My own software builds a decision tree from all the data sources, and uses whatever the consensus is. Any mismatches are flagged in the output file to be evaluated manually. I also validate by driving there and seeing what the street sign says. My current datasets are:

Bing Footprints
Microsoft donated the Bing building footprints, extracted from satellite imagery by AI. These are useful for counties that don't release building footprints. The quality varies though; see my rants on other pages on this site.
USGS
All the data from the current 7.5-minute quad topographic maps. Very useful for road names.
USDA
Road data on every jeep road, hiking trail, etc... on public land.
BLM
Road data for the desert...
Tiger
Yeah, I know, Tiger data sucks bad, but it is a resource and also supports reverse geocoding. Of course reverse geocoding is wildly inaccurate much of the time.
Colorado State
Colorado has released aggregated address and parcel data from some counties. These have been used for other imports, and are usually the most accurate source, although not always... The state has also released road data for the entire state.
Colorado Counties
Many counties also release their address, parcel, road, and building footprints.
OpenStreetMap (of course)
Often the old bad data in OSM has been corrected, roads have been added that weren't in Tiger in 2008, etc... so OSM is considered the primary reference dataset. The other sources are used to validate the road name and spelling.

My Validation Setup

The current system I'm using is primarily home-grown, and works well for my process. All of the software is in python3 and SQL, and often resorts to ugly brute force algorithms (slow...) rather than any attempt at performance. However the code is heavily modular and flexible, so I can tweak things when needed for weird data issues. I use a PostgreSQL database with the PostGIS and hstore extensions. Being able to work on large amounts of data is necessary. On the client side I use JOSM and QGIS for editing, and ogr2ogr for data manipulation.

I run two duplicate systems, with a few websites like TM3 only on the server side. Data validation requires multiple datasets, all huge, and my current setup consumes several terabytes of storage. The server runs the more stable version of my software on a current CentOS release, while my laptop runs current Ubuntu and is often under heavy development.

I use ogr2ogr heavily to initially filter, convert, and import the data into PostgreSQL. At one time I converted everything to OSM format so I could use tools like osmium and osmconvert, but now I work at a lower level, which gives me access to the raw data. Working with the raw data gives the most control over the process.
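
As an illustration, a typical ogr2ogr import into PostgreSQL looks something like the sketch below; the file, database, and table names are placeholders, and wrapping the call in Python just makes it easy to repeat per dataset:

    import subprocess

    # Hypothetical example: load a county address shapefile into PostgreSQL,
    # reprojecting to WGS84 so it lines up with OpenStreetMap data.
    cmd = [
        "ogr2ogr",
        "-f", "PostgreSQL",          # output driver
        "PG:dbname=validation",      # target database (placeholder name)
        "county_addresses.shp",      # source file (placeholder name)
        "-nln", "county_addresses",  # destination table name
        "-t_srs", "EPSG:4326",       # reproject to WGS84
    ]
    subprocess.run(cmd, check=True)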

Validating across multiple databases, while doable in PostgreSQL, would add more complexity than needed right now. Instead I use Python to manage querying the multiple databases and processing the information. Rather than write a nice GUI, my software just generates an OSM formatted file. This file is pre-processed to avoid duplicates, and when conflicts arise in names, all of the candidates are added as extra debugging tags so they can be found in JOSM. In JOSM I load all the relevant datasets, and by clicking through layers I can make a final determination on the name.
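
As a rough sketch (the debug:* key names here are invented for this example, not the tags my software actually emits), writing a conflicted address point out as OSM XML with the competing names attached might look like:

    import xml.etree.ElementTree as ET

    def make_debug_node(node_id, lat, lon, addr, candidates):
        """Build an OSM XML <node> carrying the address tags, plus any
        conflicting name candidates as extra tags visible in JOSM."""
        node = ET.Element("node", id=str(node_id), lat=str(lat), lon=str(lon), version="1")
        for key, value in addr.items():
            ET.SubElement(node, "tag", k=key, v=value)
        # When the sources disagree, keep every candidate under a debug:* key
        # (hypothetical key names) so a human can pick the winner in JOSM.
        if len(set(candidates.values())) > 1:
            for source, name in candidates.items():
                ET.SubElement(node, "tag", k="debug:" + source, v=name)
        return node

    root = ET.Element("osm", version="0.6", generator="validation-sketch")
    root.append(make_debug_node(
        -1, 38.9, -105.3,
        {"addr:housenumber": "124", "addr:street": "Foobar Street"},
        {"osm": "Foobar Street", "county": "South Foobar Drive", "usgs": "North Foobar Street"},
    ))
    ET.ElementTree(root).write("debug.osm", encoding="UTF-8", xml_declaration=True)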

Normalizing Highway Names

Normalizing all the various ways roads are named is key to a validation process that doesn't bury you in false negatives, which otherwise happens frequently. I follow the standard OSM naming convention, with some additions based on discussions on the OSM Tagging mailing list. The first recommendation is to not use abbreviations, so they all need to get expanded. For example St should be expanded to Street, etc... Note that these abbreviations can be at the beginning, the end, or the middle of the string used for the highway name. At one time I cleaned this up later in the process, but it's time consuming and tedious. Now I fix the names in the database as well as possible, instead of doing much ugly hacking in the parsing code.
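
A minimal sketch of that kind of expansion, assuming a small abbreviation table (the real list is much longer, and this isn't my actual code):

    # A few common abbreviations; the real table is much longer.
    ABBREVIATIONS = {
        "St": "Street", "Rd": "Road", "Ave": "Avenue", "Dr": "Drive",
        "Ln": "Lane", "Ct": "Court", "N": "North", "S": "South",
        "E": "East", "W": "West",
    }

    def expand_name(name):
        """Expand abbreviations whether they appear at the start, middle,
        or end of a highway name."""
        words = []
        for word in name.split():
            words.append(ABBREVIATIONS.get(word.rstrip("."), word))
        return " ".join(words)

    print(expand_name("N Foobar St"))   # North Foobar Street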

While there are accepted practices on naming conventions, here's my take based on a long thread on the OSM Tagging list.

As many roads have multiple names, here's what I do. Reference names are things like CR 124, while a name is more likely to be something like Foobar Road. Around here roads also have a third designation, namely the Forest Service reference. References are tagged with ref or ref:usfs, etc... So for proper tagging, a road may have several values. Whatever is in the address data for the parcel or building needs to match any one of the names, not all of them.
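
In other words, a check along these lines (the tag keys are standard OSM ones, but the helper and the values are just an illustration):

    def street_matches_road(addr_street, road_tags):
        """True if the address street matches any of the road's names or refs."""
        candidates = {
            road_tags.get(key, "").strip().lower()
            for key in ("name", "alt_name", "ref", "ref:usfs", "ref:blm")
        }
        return addr_street.strip().lower() in candidates

    road = {"name": "Foobar Road", "ref": "CR 124", "ref:usfs": "FR 512"}
    print(street_matches_road("CR 124", road))       # True
    print(street_matches_road("Foobar Road", road))  # True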

In addition to my own software, I also run a full setup of both Nominatim and Tiger 2019. These are both used for reverse geocoding, which is a work in progress. I find reverse geocoding won't be very accurate until all the data uses a consistent naming convention. At this time the decoded addresses are not accurate enough to be useful, so they are not worth uploading (yet). I'm working on improving that as a side project.

I often still find myself using a text editor or sed to make changes in data files before importing them into PostgreSQL. Global search and replace can often cause more trouble than it's worth unless you are really careful about the regular expression you use, to limit side effects. You can also use grep to check that your change did what you expected. In the database I don't care what the field names are, only which ones to use. Each dataset uses a different schema of course, but all we really need is the house number, the street name, an optional unit, and the location.
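
As a sketch of that idea, each dataset gets mapped down to the same handful of fields; the source column names below are made-up examples of the kind of variation you see, not any county's actual schema:

    # Map each dataset's column names onto the few fields that matter.
    FIELD_MAPS = {
        "tiger":  {"housenumber": "fromhn",   "street": "fullname",   "unit": None},
        "county": {"housenumber": "addr_num", "street": "road_name",  "unit": "unit_no"},
        "state":  {"housenumber": "house",    "street": "streetname", "unit": "unittype"},
    }

    def normalize_record(source, row):
        """Reduce one source row to housenumber/street/unit plus location."""
        fields = FIELD_MAPS[source]
        return {
            "housenumber": row.get(fields["housenumber"]),
            "street": row.get(fields["street"]),
            "unit": row.get(fields["unit"]) if fields["unit"] else None,
            "lat": row.get("lat"),
            "lon": row.get("lon"),
        }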

The Tiger import in 2008 was important to give OpenStreetMap a head start on roads in the US, but the quality varies heavily. It also causes problems when trying to validate names. While much of my software does parse through the variety of issues reasonably well, to limit code duplication and with a minor nod to performance, lately I just fix it in the database. PostgreSQL is very efficient at large scale changes. Any global search and replace needs to be heavily validated by other means later; fixing the data up front just reduces the time that takes. I spend a lot of time doing queries and looking at the raw data, so after a while I have a good feel for safe changes, and how to validate them. For example, this would fix an abbreviation used at the end of a name. Edges is the table in the Tiger data that contains the name and address ranges.

      UPDATE edges SET fullname = REGEXP_REPLACE(fullname,' Rd$',' Road');
    
I always make a backup of the table so it's easy to find and revert changes. This way it's a slow, one-time change instead of something that runs every time you want to process data. I often see issues when processing data, then fix them in the data and rerun the validation. The other advantage of fixing the data is that others can then take advantage of it too without going through the painful process of fixing it themselves.

When there is a conflict, which name is correct? Tough question, as there has been bad data (Tiger) in OpenStreetMap for over a decade. Still, rather than have future mappers curse our name, we try to do better. If the OSM name isn't derived from Tiger, it's usually the correct one, as somebody added it. Usually it was added by a local, so it's the best source. Most of the old Tiger data still has the Tiger tags left from the import, so it's easy to tell the difference. Of course this often has spelling and name normalization issues as well.

After assuming the OSM data is likely the closest to correct, I check parcel data, if I have it. This is often the next most accurate source, and useful if trying to find the address of a building. County address data I check next; it is often reasonably good, but usually out of date. It should match the parcel data though, if you have any. After that I trust the USGS topographic map data, since we trust the paper maps already. That data often has names not in any other source, so I add them as an alternate name so OSM matches what people see on those maps. In addition to the topo map data, I also have the USDA forest service road data, which may also have a new name for a road. Around here in rural Colorado, which is mostly national forest, most roads have two or three references and a name.
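
A rough sketch of that precedence, assuming each source's name has already been normalized (the source labels are just illustrative):

    # Sources in rough order of trust; the labels are placeholders.
    PRECEDENCE = ["osm", "parcel", "county", "usgs", "usda"]

    def pick_name(candidates):
        """candidates maps source -> normalized road name (missing sources omitted).
        Prefer whatever most sources agree on; break ties by source precedence."""
        names = [n for n in candidates.values() if n]
        if not names:
            return None
        counts = {n: names.count(n) for n in set(names)}
        best = max(counts.values())
        tied = {n for n, c in counts.items() if c == best}
        for source in PRECEDENCE:
            if candidates.get(source) in tied:
                return candidates[source]
        return tied.pop()

    print(pick_name({"osm": "Foobar Road", "county": "Foobar Road", "usgs": "North Foobar Road"}))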

Reverse Geocoding

And finally there's reverse geocoding, trying to guess the address based on GPS coordinates. This is useful if you have building footprints but no address or parcel data. I haven't yet found a geocoding solution I think is sufficiently accurate, so I only use it during development. I think this can be improved, but it's still a work in progress. OpenStreetMap has the Nominatim geocoding server, Google of course has one, and so do Bing, the Census, ArcGIS, etc... and none of them agree on the house number. It requires good and complete data to work well.
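
For reference, a minimal reverse lookup against a Nominatim server looks roughly like this; the coordinates are placeholders, and a local instance would just use a different URL:

    import requests

    def reverse_geocode(lat, lon, server="https://nominatim.openstreetmap.org"):
        """Ask Nominatim what address it thinks is at these coordinates."""
        resp = requests.get(
            server + "/reverse",
            params={"format": "jsonv2", "lat": lat, "lon": lon},
            headers={"User-Agent": "address-validation-sketch"},  # required by the usage policy
            timeout=10,
        )
        resp.raise_for_status()
        address = resp.json().get("address", {})
        # The interesting bits for validation: do these match the local data?
        return address.get("house_number"), address.get("road")

    # Placeholder coordinates somewhere in Colorado.
    print(reverse_geocode(38.9, -105.3))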

My Software

While there are good online tools for validating OSM data, there are several problems. The big one is you have to be online... and you have to have already uploaded your changes, so now they're public in all their potentially screwed-up glory. Then you get to fix your issues with the community watching. As a software engineer with over 40 years of experience, I wrote my own pre-validation tools, which are focused on working offline. This is critical, because when I'm in the field I have no internet connection for prolonged periods.

I'm not a user interface designer, I'm a systems engineer, so my validation software is command line only. Maybe when it's less of a prototype I'll turn it into a JOSM plugin. What I do is generate an OSM file for JOSM that has had extra tags added during processing. Then I use JOSM as my UI. The advantage of this is that I get to use JOSM's advanced editing and additional validation features. The biggest problem with bulk imports is that without sufficient review by human beings, many subtle errors slide through, get uploaded, and are never corrected (think Tiger).

To make navigation work, metadata needs to be consistent. Much of what my software does is compare data between multiple sources and do basic checking for duplicate buildings and addresses, and for what's missing. More than just the location, the street name must be an exact match between the address metadata and the highway metadata. For navigation to fully work, the highway data in OSM must also be relatively complete, with no broken spots in the highway network (very common). That's a whole different discussion.
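
As a sketch of that kind of check (the table and column names are placeholders for however the data was imported, not a real schema), finding address points whose street doesn't exactly match any nearby highway name might look like:

    import psycopg2

    # Placeholder table and column names. Finds address points whose street
    # name doesn't match any highway within roughly 100 meters.
    QUERY = """
    SELECT a.housenumber, a.street
    FROM addresses AS a
    WHERE NOT EXISTS (
        SELECT 1 FROM highways AS h
        WHERE h.name = a.street
          AND ST_DWithin(h.geom::geography, a.geom::geography, 100)
    );
    """

    with psycopg2.connect(dbname="validation") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for housenumber, street in cur.fetchall():
                print("no matching highway near", housenumber, street)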


Copyright © 2019,2020 Seneca Software & Solar, Inc