My Data Import Process

This documents the process I use for importing address and building footprint data. I use a lot of my own software for various steps in this process, but there are other ways to do the same tasks. My software partially automates many of the steps and is geared towards offline processing of large files. I'm an experienced systems-level software engineer, so why not? :-) You can read the details of my current system here.

I use a few different processes depending on the data source, but the choice is mostly based on size. I will emphasize that regardless of size, I find working on small chunks the most sensible way to not screw things up. With too much data, problems get buried. So a big part of my process is chopping big files into smaller ones. I find county-sized files a good initial size, but I still work on small sections at a time. Plus you get more familiar with an area, which helps when correcting mistakes.
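
As a rough illustration of the chopping step, here is a minimal sketch using pyosmium (not my production tooling; the file names and bounding box are made up) that clips just the address nodes inside a working area out of a larger extract:

    import osmium

    # Hypothetical working-area bounds: (min_lon, min_lat, max_lon, max_lat)
    BBOX = (-105.4, 39.9, -105.1, 40.1)

    class AddressClipper(osmium.SimpleHandler):
        """Copy address nodes that fall inside BBOX into a smaller file."""

        def __init__(self, writer):
            super().__init__()
            self.writer = writer

        def node(self, n):
            if 'addr:housenumber' in n.tags:
                lon, lat = n.location.lon, n.location.lat
                if BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3]:
                    self.writer.add_node(n)

    writer = osmium.SimpleWriter('county-chunk.osm.pbf')
    AddressClipper(writer).apply_file('statewide-addresses.osm.pbf')
    writer.close()

Buildings and roads need their ways and relations carried along too, which is where a real extract tool earns its keep; this only shows the idea.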

There are a few JOSM plugins I use for validating data during an import, namely TODO and Validation. I use other plugins as well, like filters and conflation. I don't use any online editors, as I'm often working remotely offline. There is no fast way to import and validate data, so I just try to get in the groove and not rush. A good music collection, a nice stereo, and good tea are what work for me.

The biggest problem with imports is conflating the new data with existing data, especially when it comes from a variety of sources of varying quality. Often conflation requires an exact hit on location, area, and tags. There are multiple tools for conflating data; none are perfect, so I use several in series. Remember, we're going for accuracy, not speed. For conflating addresses and building footprints, I start with my own software, because it's oriented to bulk processing of large files and fixes many common problems. Roads are harder to conflate. I usually have to conflate the same data multiple times, each pass correcting tags to make automatic processing work better, e.g., street names used in address nodes need to match the road name's spelling. Even after uploading, there is still cleanup to be done.
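
A big chunk of that tag correction is just normalizing street names so an addr:street value compares equal to the road's name tag. A minimal sketch of that kind of normalization (the abbreviation table here is invented and far from complete):

    import re

    # A few common suffix abbreviations; a real table is much longer
    ABBREVIATIONS = {
        'St': 'Street', 'Ave': 'Avenue', 'Rd': 'Road',
        'Dr': 'Drive', 'Ln': 'Lane', 'Ct': 'Court',
    }

    def normalize_street(name):
        """Expand abbreviations and tidy whitespace so an addr:street value
        can be compared against a highway's name tag."""
        words = re.split(r'\s+', name.strip())
        words = [ABBREVIATIONS.get(w.rstrip('.'), w) for w in words]
        return ' '.join(words)

    # normalize_street('Main St.') -> 'Main Street'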

For a small and simple import (addresses and buildings, plus adding metadata) I find the TODO plugin works pretty well. First I'll do a search of the existing data to see if any of the data I want to import already exists. Then, using the boundary of that area, I'll select the data I want to import from the other dataset and cut and paste it into the working data layer. Validation will usually find a bunch of things to fix; those go into the TODO window so I don't have to keep searching for them. Then I just work through the list. It's not uncommon to have to fix a few hundred entries. I find if I get the proper combination of mouse clicks and keyboard commands, I can crank through the list pretty quickly. JOSM is very good about remembering your previous entry in most dialogs, so often all you are doing is confirming OK. I often find a good flow, and some things can be fixed more globally, reducing the list size.

After it's pasted into the working layer, you can then do further cleanup, like correcting old typos (TIGER data is pretty poor). Cleanup is catching anything that slipped through the prior conversion and validation processes. This is also the time to fix issues in existing metadata, some of which is many years old. The JOSM validator will flag duplicate address nodes and duplicate buildings reasonably well, but then you can wind up with thousands of duplicates to correct manually. I eventually wrote my own conflation software to limit what I had to clean up in JOSM. I'll note that often a manual cleanup is still necessary. If you only need to change a few tags, that shouldn't be automated.
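
The core idea behind an offline duplicate check is simple, even though real conflation software has to do much more. A sketch (not my actual code) that flags address nodes sharing the same street and housenumber:

    import osmium
    from collections import defaultdict

    class DupFinder(osmium.SimpleHandler):
        """Group address node IDs by (street, housenumber)."""

        def __init__(self):
            super().__init__()
            self.seen = defaultdict(list)

        def node(self, n):
            street = n.tags.get('addr:street')
            number = n.tags.get('addr:housenumber')
            if street and number:
                self.seen[(street.lower(), number)].append(n.id)

    finder = DupFinder()
    finder.apply_file('county-chunk.osm.pbf')
    for key, ids in finder.seen.items():
        if len(ids) > 1:
            print('possible duplicate address:', key, ids)

A real check also compares locations, since the same street and housenumber can legitimately appear in two different towns within one file.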

For example, I was recently changing old TIGER data to use ref:* instead of name for county and US Forest Service roads. Many have a bogus name_1 tag that basically got ignored. Changing name to the proper variation of ref, sorting out what belongs in name, and also correcting any typos and normalization issues can really only be a manual operation. The advantage is that roads with an actual name now get that name used, so they appear properly. And the ref tag can be used later for navigation. While it is possible to do that conversion in a semi-automated fashion, the data quality made the results unreliable.
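
The mechanical part of that rewrite is easy to express; the judgment calls are not. Here is a sketch of the name-to-ref mapping (the patterns and ref prefixes are my shorthand for illustration, not an authoritative tagging convention):

    import re

    def name_to_ref(name):
        """Map a TIGER-style road name to a short ref, or return None."""
        m = re.match(r'(?i)^county\s+(?:road|rd)\s+(\S+)$', name.strip())
        if m:
            return 'CR ' + m.group(1).upper()
        m = re.match(r'(?i)^(?:forest\s+service|national\s+forest)\s+(?:road|rd)\s+(\S+)$',
                     name.strip())
        if m:
            return 'FR ' + m.group(1).upper()
        return None

    # name_to_ref('County Road 68J') -> 'CR 68J'
    # name_to_ref('Forest Service Road 505.1') -> 'FR 505.1'

Everything that doesn't match a clean pattern, and every bogus name_1, still needs a human eye, which is why the conversion stayed mostly manual.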

Currently I use the HOT Tasking Manager to manage any import. I set up a county-sized task, and TM then breaks it into multiple smaller sections. I can then track my progress; otherwise it gets confusing over time. I use the JOSM TODO plugin to manage the import within each section. You can also use smaller boundaries, like cities or fire districts, to reduce the size of the data to something comfortable.

In JOSM you can always search for modified to see the data you are working with before you upload it. I often search for modified and run the JOSM validator on that data to fix the things it finds before uploading. For buildings, I use the SelectBuildingPlugin script as well. For addresses, I search for addr:street and addr:housenumber to get all the existing ones validated along with anything I've added. The JOSM validator is pretty good about finding any duplicates. If I find a duplicate, often due to a spelling difference that slipped through conflation, then I merge them together after correcting any spelling mistakes.
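
An offline version of that spelling cross-check is also possible; this sketch (mine, not a JOSM feature) compares every addr:street value in a file against the set of highway names:

    import osmium

    class StreetCrossCheck(osmium.SimpleHandler):
        """Collect highway names and addr:street values for comparison."""

        def __init__(self):
            super().__init__()
            self.road_names = set()
            self.addr_streets = set()

        def way(self, w):
            if 'highway' in w.tags and 'name' in w.tags:
                self.road_names.add(w.tags['name'])

        def node(self, n):
            street = n.tags.get('addr:street')
            if street:
                self.addr_streets.add(street)

    check = StreetCrossCheck()
    check.apply_file('county-chunk.osm.pbf')
    for street in sorted(check.addr_streets - check.road_names):
        print('addr:street with no matching road name:', street)

In practice both sides should be normalized first (see the earlier street-name sketch), or every abbreviation shows up as a false mismatch.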

I find I can also get a good feel for the data using Layers in JOSM. You can enable or disable entire layers via a toggle. The current layer is displayed in color and can be manipulated; the data for the other layers is grayed out. So by toggling layers, you can get a rough approximation of the new data. There are also several other paint styles in JOSM that display data differently and are useful for different datasets.

Once I've gone through the data and am satisfied with its quality, I upload it to OpenStreetMap, and then I'm prepared to fix anything I missed. I think this has been a problem in the past, where people or organizations upload a large amount of data of borderline quality and then drop off the face of the earth. Then somebody else (like you and me) has to fix it later. Avoid this problem by following up after a successful import. I often tag my new data with a fixme so I don't upload it while still working with it. So one of my ways of catching my own mistakes is to search for that tag; I remove it from anything I consciously validate. That usually catches anything that got uploaded that I wasn't planning on. Before uploading, I also remove a few other tags, like addr:county or addr:full. The online validation tools have a lag time since they only update their data every few hours or days. My own software does similar validation checks; it just works offline.
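
A stripped-down example of that kind of offline pre-upload sweep (again a sketch, not my actual software): drop the throwaway address tags and count whatever is still marked fixme so it can be checked in JOSM first.

    import osmium

    STRIP = {'addr:county', 'addr:full'}    # tags I don't want to upload

    class PreUploadSweep(osmium.SimpleHandler):
        """Drop throwaway tags and count nodes still marked fixme."""

        def __init__(self, writer):
            super().__init__()
            self.writer = writer
            self.fixmes = 0

        def node(self, n):
            tags = {t.k: t.v for t in n.tags if t.k not in STRIP}
            if 'fixme' in tags:
                self.fixmes += 1            # still needs a manual look in JOSM
            self.writer.add_node(n.replace(tags=tags))

        # Pass ways and relations through untouched
        def way(self, w):
            self.writer.add_way(w)

        def relation(self, r):
            self.writer.add_relation(r)

    writer = osmium.SimpleWriter('ready-to-check.osm')
    sweep = PreUploadSweep(writer)
    sweep.apply_file('working-data.osm')
    writer.close()
    print(sweep.fixmes, 'nodes still tagged fixme')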

That's the rough list of steps I follow to import addresses or buildings. It's not fast, but it is efficient and focuses on data quality over speed.

Copyright © 2019,2020 Seneca Software & Solar, Inc