My Data Import Process

This documents the process I use for importing address and building footprint data. I use a lot of my own software for various steps in this process, but there are other ways to do the same tasks. My software partially automates many of the steps and is geared towards offline processing of large files. I'm an experienced systems-level software engineer, so why not? :-) You can read the details of my current system here.

I use a few different processes depending on the data source, but the choice is mostly based on size. I will emphasize that regardless of size, I find working on small chunks the most sensible way to not screw things up. With too much data, problems get buried. So a big part of my process is chopping big files into smaller ones. I find county-sized files a good initial size, but I still work on small sections at a time. Plus you get more familiar with an area, which helps when correcting mistakes.
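
As a rough illustration of the chopping step, here is a minimal sketch using pyosmium (not my production tooling; the file names and bounding box are made up) that clips just the address nodes inside a working area out of a larger extract:

    import osmium

    # Hypothetical working-area bounds: (min_lon, min_lat, max_lon, max_lat)
    BBOX = (-105.4, 39.9, -105.1, 40.1)

    class AddressClipper(osmium.SimpleHandler):
        """Copy address nodes that fall inside BBOX into a smaller file."""

        def __init__(self, writer):
            super().__init__()
            self.writer = writer

        def node(self, n):
            if 'addr:housenumber' in n.tags:
                lon, lat = n.location.lon, n.location.lat
                if BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3]:
                    self.writer.add_node(n)

    writer = osmium.SimpleWriter('county-chunk.osm.pbf')
    AddressClipper(writer).apply_file('statewide-addresses.osm.pbf')
    writer.close()

Buildings and roads need their ways and relations carried along too, which is where a real extract tool earns its keep; this only shows the idea.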

There are a few JOSM plugins I use for validating data during an import, namely TODO and Validation. I use other plugins as well, like filters and conflation. I don't use any online editors, as I'm often working remotely offline. There is no fast way to import and validate data, so I just try to get in the groove and not rush. A good music collection, a nice stereo, and good tea are what work for me.

The biggest problem with imports is conflating the new data with existing data, especially when it comes from a variety of sources of varying quality. Often conflation requires an exact hit on location, area, and tags. There are multiple tools for conflating data; none are perfect, so I use several in series. Remember, we're going for accuracy, not speed. For conflating addresses and building footprints, I start with my own software, because it's oriented to bulk processing of large files and fixes many common problems. Roads are harder to conflate. I usually have to conflate the same data multiple times, each pass correcting tags to make automatic processing work better, e.g., street names used in address nodes need to match the road name's spelling. Even after uploading, there is still cleanup to be done.
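
A big chunk of that tag correction is just normalizing street names so an addr:street value compares equal to the road's name tag. A minimal sketch of that kind of normalization (the abbreviation table here is invented and far from complete):

    import re

    # A few common suffix abbreviations; a real table is much longer
    ABBREVIATIONS = {
        'St': 'Street', 'Ave': 'Avenue', 'Rd': 'Road',
        'Dr': 'Drive', 'Ln': 'Lane', 'Ct': 'Court',
    }

    def normalize_street(name):
        """Expand abbreviations and tidy whitespace so an addr:street value
        can be compared against a highway's name tag."""
        words = re.split(r'\s+', name.strip())
        words = [ABBREVIATIONS.get(w.rstrip('.'), w) for w in words]
        return ' '.join(words)

    # normalize_street('Main St.') -> 'Main Street'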

For a small and simple import (addresses and buildings, plus adding metadata) I find the TODO plugin works pretty well. First I'll do a search of the existing data to see if any of the data I want to import already exists. Then, using the boundary of that area, I'll select the data I want to import from the other dataset and cut and paste it into the working data layer. Validation will usually find a bunch of things to fix; those go into the TODO window so I don't have to keep searching for them. Then I just work through the list. It's not uncommon to have to fix a few hundred entries. I find if I get the proper combination of mouse clicks and keyboard commands, I can crank through the list pretty quickly. JOSM is very good about remembering your previous entry in most dialogs, so often all you are doing is confirming OK. I often find a good flow, and some things can be fixed more globally, reducing the list size.

After it's pasted into the working layer, you can then do further cleanup, like correcting old typos (TIGER data is pretty poor). Cleanup is catching anything that slipped through the prior conversion and validation processes. This is also the time to fix issues in existing metadata, some of which is many years old. The JOSM validator will flag duplicate address nodes and duplicate buildings reasonably well, but then you can wind up with thousands of duplicates to correct manually. I eventually wrote my own conflation software to limit what I had to clean up in JOSM. I'll note that often a manual cleanup is still necessary. If you only need to change a few tags, that shouldn't be automated.
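
The core idea behind an offline duplicate check is simple, even though real conflation software has to do much more. A sketch (not my actual code) that flags address nodes sharing the same street and housenumber:

    import osmium
    from collections import defaultdict

    class DupFinder(osmium.SimpleHandler):
        """Group address node IDs by (street, housenumber)."""

        def __init__(self):
            super().__init__()
            self.seen = defaultdict(list)

        def node(self, n):
            street = n.tags.get('addr:street')
            number = n.tags.get('addr:housenumber')
            if street and number:
                self.seen[(street.lower(), number)].append(n.id)

    finder = DupFinder()
    finder.apply_file('county-chunk.osm.pbf')
    for key, ids in finder.seen.items():
        if len(ids) > 1:
            print('possible duplicate address:', key, ids)

A real check also compares locations, since the same street and housenumber can legitimately appear in two different towns within one file.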

For example, I was recently changing old TIGER data to use ref:* instead of name for county and US Forest Service roads. Many have a bogus name_1 tag that basically got ignored. Changing name to the proper variation of ref, sorting out what belongs in name, and also correcting any typos and normalization issues can really only be a manual operation. The advantage is that roads with an actual name now get that name used, so they appear properly. And the ref tag can be used later for navigation. While it is possible to do that conversion in a semi-automated fashion, the data quality made the results unreliable.
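
The mechanical part of that rewrite is easy to express; the judgment calls are not. Here is a sketch of the name-to-ref mapping (the patterns and ref prefixes are my shorthand for illustration, not an authoritative tagging convention):

    import re

    def name_to_ref(name):
        """Map a TIGER-style road name to a short ref, or return None."""
        m = re.match(r'(?i)^county\s+(?:road|rd)\s+(\S+)$', name.strip())
        if m:
            return 'CR ' + m.group(1).upper()
        m = re.match(r'(?i)^(?:forest\s+service|national\s+forest)\s+(?:road|rd)\s+(\S+)$',
                     name.strip())
        if m:
            return 'FR ' + m.group(1).upper()
        return None

    # name_to_ref('County Road 68J') -> 'CR 68J'
    # name_to_ref('Forest Service Road 505.1') -> 'FR 505.1'

Everything that doesn't match a clean pattern, and every bogus name_1, still needs a human eye, which is why the conversion stayed mostly manual.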

Currently I use the HOT Tasking Manager to manage any import. I set up a county-sized task, and TM then breaks it into multiple smaller sections. I can then track my progress; otherwise it gets confusing over time. I use the JOSM TODO plugin to manage the import within each section. You can also use smaller boundaries, like cities or fire districts, to reduce the size of the data to something comfortable.

In JOSM you can always search for modified to see the data you are working with before you upload it. I often search for modified and run the JOSM validator on that data to fix the things it finds before uploading. For buildings, I use the SelectBuildingPlugin script as well. For addresses, I search for addr:street and addr:housenumber to get all the existing ones validated along with anything I've added. The JOSM validator is pretty good about finding any duplicates. If I find a duplicate, often due to a spelling difference that slipped through conflation, then I merge them together after correcting any spelling mistakes.
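
An offline version of that spelling cross-check is also possible; this sketch (mine, not a JOSM feature) compares every addr:street value in a file against the set of highway names:

    import osmium

    class StreetCrossCheck(osmium.SimpleHandler):
        """Collect highway names and addr:street values for comparison."""

        def __init__(self):
            super().__init__()
            self.road_names = set()
            self.addr_streets = set()

        def way(self, w):
            if 'highway' in w.tags and 'name' in w.tags:
                self.road_names.add(w.tags['name'])

        def node(self, n):
            street = n.tags.get('addr:street')
            if street:
                self.addr_streets.add(street)

    check = StreetCrossCheck()
    check.apply_file('county-chunk.osm.pbf')
    for street in sorted(check.addr_streets - check.road_names):
        print('addr:street with no matching road name:', street)

In practice both sides should be normalized first (see the earlier street-name sketch), or every abbreviation shows up as a false mismatch.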

I find I can also get a good feel for the data using Layers in JOSM. You can enable or disable entire layers via a toggle. The current layer is displayed in color and can be manipulated; the data for the other layers is grayed out. So by toggling layers, you can get a rough approximation of the new data. There are also several other paint styles in JOSM that display data differently and are useful for different datasets.

Once I've gone through the data and am satisfied with its quality, I upload it to OpenStreetMap, and then I'm prepared to fix anything I missed. I think this has been a problem in the past, where people or organizations upload a large amount of data of borderline quality and then drop off the face of the earth. Then somebody else (like you and me) has to fix it later. Avoid this problem by following up after a successful import. I often tag my new data with a fixme so I don't upload it while still working with it. So one of my ways of catching my own mistakes is to search for that tag; I remove it from anything I consciously validate. That usually catches anything that got uploaded that I wasn't planning on. Before uploading, I also remove a few other tags, like addr:county or addr:full. The online validation tools have a lag time since they only update their data every few hours or days. My own software does similar validation checks; it just works offline.
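
A stripped-down example of that kind of offline pre-upload sweep (again a sketch, not my actual software): drop the throwaway address tags and count whatever is still marked fixme so it can be checked in JOSM first.

    import osmium

    STRIP = {'addr:county', 'addr:full'}    # tags I don't want to upload

    class PreUploadSweep(osmium.SimpleHandler):
        """Drop throwaway tags and count nodes still marked fixme."""

        def __init__(self, writer):
            super().__init__()
            self.writer = writer
            self.fixmes = 0

        def node(self, n):
            tags = {t.k: t.v for t in n.tags if t.k not in STRIP}
            if 'fixme' in tags:
                self.fixmes += 1            # still needs a manual look in JOSM
            self.writer.add_node(n.replace(tags=tags))

        # Pass ways and relations through untouched
        def way(self, w):
            self.writer.add_way(w)

        def relation(self, r):
            self.writer.add_relation(r)

    writer = osmium.SimpleWriter('ready-to-check.osm')
    sweep = PreUploadSweep(writer)
    sweep.apply_file('working-data.osm')
    writer.close()
    print(sweep.fixmes, 'nodes still tagged fixme')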

That's the rough list of steps I follow to import addresses or buildings. It's not fast, but it is efficient and focuses on data quality over speed.

Copyright © 2019,2020 Seneca Software & Solar, Inc