Importing Data

The Project

My fire department wanted to put tablets in our fire trucks to help with emergency responses in our large rural fire district. Our county is poorly mapped, being rural with many old roads left over from the mining era. We had been using a large hard-copy map book, but it was huge, and finding locations in it was hard.

The existing OpenStreetMap (OSM) data for my rural county was not very complete, but Google Maps wasn't any better. In fact, Google Maps turned out to be based on an older dataset from before my county renamed a number of roads to reduce duplication and confusion. Also, many roads here have both a county name and a USFS designation, and both are important to us.

Getting Started

Importing a lot of data into OSM can be problematic and cause problems for other mappers. A good place to start is to read the OSM guidelines for bulk imports. Small mistakes are somewhat unavoidable; what's important is that the person importing the data corrects them quickly. I've found I avoid major mistakes by working in small sections, since beyond a certain data size, potential mistakes get hidden. For tagging issues caused by data conversion, I'll sometimes upload a few small changes and then wait to see if anyone flags a mistake.

I started by downloading OpenStreetMap data from Geofabrik and producing simple KML maps from the existing data. The problem with that approach was there wasn't much existing data, and what was there was fairly inaccurate. On top of this, my fire district grew by a large area, mostly containing a few uninhabited cabins, mines, and 4x4 roads, very little of which had been mapped.

I then started digging around for other data sources to fill in the missing data. Conveniently, these days many organizations and government agencies put their data online with a good open license. Initially I grabbed road centerlines and trail data from my county GIS department. I also got updated road and trail data from the USGS, as much of our fire district is national forest. Later on I added building footprints and addresses to see if I could get offline routing working.

Garbage In, Garbage Out

At some point around 2008, TIGER census road data was imported into OSM to kick-start map data. Unfortunately there seem to have been a few aborted attempts at this, as there was a lot of duplication. Many roads had two to four copies with only GPS coordinates, and only one had any metadata tags. This was not obvious at first; I only noticed it when producing KML files from a PostgreSQL database. So to import data for my county, I first had to clean up the previous mess.
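Because the TIGER duplicates usually had identical geometry, with only one copy carrying tags, they can be detected mechanically. Here's a minimal Python sketch of that idea, using toy dictionaries rather than real OSM objects (this is an illustration of the approach, not the actual cleanup script):

```python
def dedupe_ways(ways):
    """Keep one copy of each geometry, preferring the way that has tags.

    `ways` is a list of dicts: {"coords": [(lat, lon), ...], "tags": {...}}.
    """
    best = {}
    for way in ways:
        key = tuple(way["coords"])  # identical geometry -> identical key
        # Take the first copy seen, but replace it if a later copy
        # has tags and the kept one doesn't.
        if key not in best or (way["tags"] and not best[key]["tags"]):
            best[key] = way
    return list(best.values())
```

In practice JOSM's validator flags duplicated ways for you; a script like this is mainly useful when processing files outside the editor.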

Most of the problems could be found by a mix of several things: good use of JOSM search strings, JOSM validation, and primarily a pair of human eyeballs. Importing data accurately is not a fast process. For my small rural county, much data was missing, which simplified importing, but it was still very time-consuming to validate the imported data. I found it best to work on small chunks, as my eyeballs were the primary detector of problems.

Another problem was that much of the OSM data had good metadata (where it existed), but poor road and trail alignment. The new data had good alignment but was often lacking the kind of metadata we wanted. Often OSM had only a partial road or trail segment, while the county-supplied data had the whole thing.

Converting Data

All of the data I collected was in shapefile format. I tried using QGIS or JOSM to do some global search and replace, but the huge size of the data files often made this tedious. After digging around for a good shapefile-to-OSM conversion program, I decided it was time to write my own in Python. While I'm not a fan of reinventing the world, I wanted something more configurable than what I could find, and conveniently, I'm a long-time software engineer.

I cover many of the ways one can convert and edit OSM data in a SOTM workshop I did in 2017. Converting the data only gets it into the right format for the next part of the process. Since I produce maps used for emergency response, the metadata is very important to us. Converting metadata between two unrelated formats often leaves the details of the conversion up to the person doing it. While OSM has wonderfully deep and detailed metadata support, it can often be difficult to know what the right OSM tag is. My experience has shown me that as long as tags are applied consistently, it's easy to correct them in bulk later.
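At its core, this kind of conversion boils down to a mapping table from source fields to OSM tags. A minimal Python sketch of the idea, with hypothetical field names standing in for the county's actual schema:

```python
# Hypothetical mapping from shapefile attribute fields to OSM tags.
# The field names here are examples, not any county's real schema.
FIELD_MAP = {
    "ROAD_NAME": "name",
    "SURF_TYPE": "surface",
    "FS_NUM": "ref",       # USFS road designation
}

def convert_fields(record):
    """Translate one shapefile attribute record into an OSM tag dict.

    Fields not listed in FIELD_MAP are dropped; empty values are skipped.
    """
    tags = {}
    for field, value in record.items():
        if field in FIELD_MAP and value:
            tags[FIELD_MAP[field]] = value
    return tags
```

Keeping the mapping in one table is what makes the conversion configurable, and also what makes consistent tagging (and later bulk fixes) possible.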

Later, importing addresses and building footprints brought problems that required different techniques. For some reason, some of the data files had all the text in CAPITAL LETTERS, and often used abbreviations like RD or LN instead of the full spelling. OSM prefers text to be capitalized correctly and road-type abbreviations to be expanded. I wrote another Python script that corrected all the capitalization issues and expanded all the road types before importing. Once I got this imported, offline routing of roads started to work!
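That cleanup script was essentially a name normalizer. A simplified sketch of the idea (the abbreviation table here is illustrative, not the full list a real import would need):

```python
# Common road-type abbreviations and their expansions. A real import
# needs a much longer table than this.
ABBREVIATIONS = {
    "RD": "Road", "LN": "Lane", "DR": "Drive", "CT": "Court",
    "AVE": "Avenue", "ST": "Street", "HWY": "Highway", "CIR": "Circle",
}

def normalize_name(raw):
    """Fix ALL-CAPS road names and expand road-type abbreviations."""
    words = []
    for word in raw.strip().split():
        key = word.upper().rstrip(".")
        if key in ABBREVIATIONS:
            words.append(ABBREVIATIONS[key])
        else:
            words.append(word.capitalize())
    return " ".join(words)

print(normalize_name("LUMP GULCH RD"))  # Lump Gulch Road
```

Note that naive capitalization gets names like "McDonald" or "O'Brien" wrong, which is one reason the results still need human review before uploading.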

Conflating Data

I used several programs for conflating data, but even after cleaning up the existing data for my county, there were enough problems to make fully automating the process impossible. A typical problem was buildings. Quite a bit of OSM data is traced from satellite imagery, and the satellite imagery isn't always perfectly aligned, or more commonly, is fuzzy. Often a building is mapped as a simple rectangle when it's actually more complex than that. To really avoid import errors, different techniques often have to be applied to the same data.

JOSM's validation window can often spot problems with buildings; it flags them as a building inside another building. If you have both files loaded as layers in JOSM, all the data is displayed, but the data from the inactive layers is grayed out and can't be selected. With the current OSM data layer active, the other file's building footprints are still visible. This is a quick way of finding areas where there will be little to no conflict with existing data. JOSM's conflation plugin works reasonably well too, and the duplicates it fails to find often get caught by JOSM validation.

Roads have other problems. Some were tracked with a high-quality GPS, but even those sometimes lose reception, so there are dropped-out portions. For some reason, in my county many intersections didn't physically connect, which messed up road routing. The TIGER import is often way off from currently available satellite imagery as well, and sometimes has non-existent roads. The trick with using two layers in JOSM also works for roads, which makes it easier to find roads in the reference source that aren't in OSM yet.
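Disconnected intersections can be detected mechanically: two ways with nodes at the same coordinates but different node IDs look connected on a map, yet a router treats them as separate roads. A rough Python sketch of that check, using toy tuples rather than real OSM objects:

```python
def find_unshared_crossings(ways):
    """Find coordinates where multiple ways meet without sharing a node.

    `ways` is a list of (way_id, [(node_id, (lat, lon)), ...]) tuples.
    Returns the coordinates where more than one way has a node, but
    no single node is shared -- i.e. intersections a router ignores.
    """
    seen = {}  # coord -> set of (way_id, node_id) pairs at that coord
    for way_id, nodes in ways:
        for node_id, coord in nodes:
            seen.setdefault(coord, set()).add((way_id, node_id))
    problems = []
    for coord, entries in seen.items():
        way_ids = {w for w, _ in entries}
        node_ids = {n for _, n in entries}
        if len(way_ids) > 1 and len(node_ids) > 1:
            problems.append(coord)
    return problems
```

Real GPS traces rarely land on exactly the same coordinate, so in practice this check needs a distance tolerance rather than exact equality; tools like OSM Inspector and JOSM validation handle that for you.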

When conflating, I usually focus on importing the easy stuff first, basically data in the reference file that doesn't exist in OSM. This is also the first pass through the tags from the data conversion process. Often new data is in the same area, so the tagging is similar. Here some of that was easy: all but a tiny few of our roads are compacted dirt.

In addition to the JOSM conflation plugin, which is OK for finding new data, I also use Hootenanny. Hootenanny has both a command-line interface and a web-based one. I primarily use its ability to diff OSM files. Osmconvert can also diff OSM files, but only after they've been converted to .o5m format. Different conflation programs often work better or worse on different data files, so sometimes trying more than one can help.

I've also tried RoadMatcher, but it can't export in OSM format. It doesn't lose the tags, but they can't be used in any OSM application without heavy conversion. One of these days I'll write a conversion script for this format. RoadMatcher only works on roads, but it works reasonably well.

Conflation software is not a silver bullet. It can both help and hinder your import project. Often conflation or validation will flag things that are totally correct; identifying these cases gets easier as you gain experience with your data sets. Much of the time for me, conflation has turned into a time-consuming manual process if I want to get it right.

Finding Errors

It's always better to find your own errors, preferably before uploading the changes to OSM. One way to do this is to use the JOSM validator. It checks everything you have selected and flags both errors (which should be fixed) and warnings. Often this will find pre-existing errors in the OSM data; I try to fix as many of those as possible before uploading. The validation process can also find duplication issues in your import.

The other good way to find errors is to use OSM Inspector. This only works after you've uploaded data. It catches many issues JOSM validation doesn't, like islands of roads not connected to the network.

The final way I catch errors is when I convert to KML format: my maps color-code many things, like roads based on their classification. But that's not practical for most people.

Supporting Routing

For us, the holy grail of this project would be complete road and trail directions with audio output to any location in our district, no matter how obscure. And it all had to work offline. Once I had all the addresses imported and all the roads added with the correct names, routing was starting to work. It turned out the name of the road a house was on sometimes didn't match the street in the address data, so I fixed all of those either by comparing with other data sources or by driving there and seeing what the street sign said.

Often I'd stumble across problems while working on something else and just fix them quickly. This was often road intersections that didn't actually connect, so routing ignored them. Sometimes entire sections of roads weren't connected to the road network. I found OSM Inspector very useful for finding those. The other problem was that roads were often broken into many segments, often at intersections. So when searching, I'd get a page of the same road name, each with a handful of house addresses. Making sure the roads were contiguous and connected fixed most of the problems. There are still instances of these two problems; fixing them all for thousands of roads is very time-consuming. Volunteers to help me finish are welcome!
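The segment problem can be reduced by joining ways that share a name and an endpoint node. A toy Python sketch of the merge logic (real OSM ways also need their tags compared before joining, which this skips):

```python
def merge_segments(ways):
    """Join road segments that share a name and an endpoint.

    `ways` is a list of (name, node_list) tuples -- a stand-in for real
    OSM ways. Repeatedly joins pairs until no more merges are possible.
    """
    ways = [(name, list(nodes)) for name, nodes in ways]
    merged = True
    while merged:
        merged = False
        for i in range(len(ways)):
            for j in range(i + 1, len(ways)):
                name_i, a = ways[i]
                name_j, b = ways[j]
                if name_i != name_j:
                    continue
                if a[-1] == b[0]:      # end of a meets start of b
                    joined = a + b[1:]
                elif b[-1] == a[0]:    # end of b meets start of a
                    joined = b + a[1:]
                else:
                    continue
                ways[i] = (name_i, joined)
                del ways[j]
                merged = True
                break
            if merged:
                break
    return ways
```

JOSM's "Combine Ways" does this interactively for a selected pair, but with thousands of fragmented roads, knowing the mechanics helps when scripting bulk fixes.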

Metadata Tagging

For those of us that produce maps that get used in the field, the metadata, i.e. the OSM tagging, is very important. If all you have is a line on a map, you can't tell if it's a road, a trail, or a stream. Often these lines are traced from satellite imagery, where you can only guess. Usually in that case there is no metadata beyond the track itself.

Since I primarily map rural areas, how to tag a road is a bit confusing, even if you know about the road. Based on the various warnings and errors, here's the rough policy I use. A line with only one house on it is a driveway; this is usually very obvious on satellite imagery. The base tag is highway, and it's not a public road, so I make this highway=service. If it has more than one house on it, it's not a driveway. Sometimes these have a name; in that case I use highway=residential. If it doesn't have a name, it's highway=unclassified. Around here we have many old 4wd-only roads that go to old mines. Since nobody lives there, I use highway=service. Other roads that go nowhere in particular (around here they go to hilltops, campsites, etc.) are obviously highway=track. Biking or hiking trails are easy: highway=path or highway=footway. Whatever you use, be consistent. For driveways, there is an additional tag service=driveway, and of course driveways are access=private.
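That policy is simple enough to write down as code. Here's a Python sketch of the decision tree above; the input flags are hypothetical, standing in for whatever you can determine from imagery or field data:

```python
def classify_road(house_count, has_name, is_trail=False, to_mine=False):
    """Return OSM tags for a rural road under the rough policy above."""
    if is_trail:
        return {"highway": "path"}           # or footway, as appropriate
    if to_mine:
        return {"highway": "service"}        # old 4wd road, nobody lives there
    if house_count == 1:
        # A single house means a driveway: service road, private access.
        return {"highway": "service", "service": "driveway",
                "access": "private"}
    if house_count > 1:
        return {"highway": "residential"} if has_name \
            else {"highway": "unclassified"}
    return {"highway": "track"}              # goes nowhere in particular
```

Encoding the policy like this is mostly useful as documentation: it forces the rules to be explicit, which is what makes the tagging consistent across an import.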

Now that we've tagged the highway type, there is other data worth adding. The important one is drivability. This is a hard quality to determine, since one person's idea of a really bad road is what my driveway looks like... If importing from a USDA data source, there is usually a 4wd field, so I just use that. Otherwise I use the clearance. A really good driver can take almost any vehicle anywhere with enough time, but firefighters are usually in a hurry. So if it's passable by your average car at a reasonable speed (2nd or 3rd gear), it's not 4wd-only.

The other important metadata is smoothness and surface. Smoothness is related to whether it's 4wd-only or not. Most any road bad enough to be classified as 4wd-only is smoothness=bad. If it's ATV/UTV or high-clearance-vehicle only, I use smoothness=very_bad. There are categories beyond that, but at what point does a highway=track become a highway=footway? Any maintained compacted dirt road is smoothness=good. Less important is the surface type. In my county, only a handful of roads are paved; everything else is dirt or grass. There is a wide variety of surface types, so you can be very specific. USDA data has a surface field, so I just use that when importing data.
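The smoothness policy can be sketched the same way (again, just an illustration of the rules above, with hypothetical input flags):

```python
def access_tags(fwd_only=False, atv_only=False, maintained=True):
    """Return smoothness tags under the rough policy described above."""
    if atv_only:
        # ATV/UTV or high-clearance-vehicle only
        return {"smoothness": "very_bad"}
    if fwd_only:
        # Bad enough to need 4wd; tag that explicitly too
        return {"smoothness": "bad", "4wd_only": "yes"}
    if maintained:
        # Any maintained compacted dirt road
        return {"smoothness": "good"}
    return {}
```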

Data I've Been Adding

Since I've been adding a lot of new data to OSM, some people ask where I've been focusing my energy. Primarily I focus on Gilpin County and western Boulder County, as this is our fire response area. I've been doing a lot of field data collection as well, primarily locations specific to emergency response, like accepted helicopter landing zones and emergency water sources.

I have other random contributions based on my travels around the US, plus in Cuba, Thailand, Nepal, and Bhutan. That data was all field collected, so importing wasn't a problem.


Copyright © 2019 Seneca Software & Solar, Inc