top | item 7732513

Ask HN: what features do you want in a geocoder?

50 points| freyfogle | 12 years ago | reply

Hi,

we're building a public-facing geocoding service (forward and reverse) on top of our own technology, OpenStreetMap, and various other open geo services. What features would make such a service compelling for developers? What is your wishlist?

Thanks for taking the time to answer.

66 comments

[+] michaelt|12 years ago|reply
Depends on what your target customers are doing!

For a delivery company, inaccurate results can send drivers to the wrong place, so it's important to get the best accuracy available.

You want it to work way out in the country, where buildings are few and far between; and with named buildings as well as numbered ones. So a search like [1] should come back accurate to less than 100m.

Fuzzy/imprecise matching should be used with care. If there's a search for Manor Close in London, it should ask which of the four Manor Closes you mean [2]. If the only part of the address that matches is London, that's not enough information to send a delivery driver - the address should be rejected.

If there are parts of the address you can't match that's sometimes a problem - you don't want to map 1 Hopton Parade, Streatham High Road to 1 Streatham High Road. But you do want to map Some Company Ltd, 1 Streatham High Road to the latter.

On the other hand, if your target users are dating websites wanting to show rough distances between members, just matching city might be plenty accurate enough; property search engines like Zoopla will show any shitty approximation on their maps if they don't recognise a street or postcode.

[1] https://maps.google.co.uk/maps?q=Paradise+Wildlife+Park,+Whi... [2] https://maps.google.co.uk/maps?q=from:E17+5RT+to:NW7+3NG+to:...
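
The accept/reject logic described above could be sketched roughly like this (a minimal illustration, not real geocoder code; the match structure, precision labels, and threshold are all assumptions):

```python
# Sketch: only accept a geocoding result for delivery routing when it is
# both unambiguous and precise enough. The match format is hypothetical.

ACCEPTABLE_PRECISION = {"building", "street"}  # assumed precision labels

def pick_delivery_location(matches):
    """Return a single match, or None if the address must be rejected."""
    precise = [m for m in matches if m["precision"] in ACCEPTABLE_PRECISION]
    if len(precise) == 1:
        return precise[0]  # exactly one good candidate
    # Zero precise matches, or several "Manor Close"-style candidates:
    # not enough information to send a driver, so reject.
    return None

# Four equally plausible "Manor Close" results -> ambiguous -> reject
candidates = [{"precision": "street", "name": f"Manor Close #{i}"} for i in range(4)]
print(pick_delivery_location(candidates))  # None
```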

[+] spacemanmatt|12 years ago|reply
A downloadable bulk geocoding service. Some address databases are not licensed for exposure to 3rd parties, but geocoding is very interesting.

If you added CASS or address validation (deliverability) service it would increase the value even more.

Having recently installed PostGIS and imported loads of TIGER data, I'd find it useful if you provided some discussion of the data sets backing your geocoder, especially if you do better than just TIGER.

I am only interested in U.S./Canada addresses (thus the mentions of CASS/verification) but I understand OSM is a global project.

[+] freyfogle|12 years ago|reply
thanks for the feedback. If I may ask: there are several providers who supply exactly what you're asking for. Why are you not using them?

To be open with you: the North American market is well served and highly competitive, while many other parts of the world are the opposite. Thus the US/CA probably won't be our area of focus. That said, we're very keen to learn what isn't working for you in current solutions.

[+] crucialfelix|12 years ago|reply
Ability to report inaccurate addresses without just telling the customer to go to OpenStreetMap and edit it.

We have a lot of problems with Google Maps not knowing about odd addresses like 235B Whatever St. or 113-76 XYZ Street. This is perhaps because it's a new construction and the address isn't in the database yet.

Town names in the Hamptons are quite contentious. What the post office knows it as is not what the locals call it, and nothing like what the real estate agents call it (but they make up their own and they lie about the locations).

But primarily you need to parse native-language and messy addresses: 2 Ave, First Street, Fifth Ave, Madison Ave (same as 5th), CPW (Central Park West).

This is where google wins.

[+] freyfogle|12 years ago|reply
It's funny you mention that, as our main business is the real estate search engine Nestoria (http://www.nestoria.com). We parse about 15M listing addresses in 9 different countries every day (though not the US). We work in pretty chaotic markets like India and Brazil. The world is a very diverse place, but there is one constant - agents do not feel the need to let themselves be bound by the "on the ground" truth of where a listing is.

Regarding Google, you are right, they do a good job, especially in the US. Full credit to them. The problem is the cost and usage restrictions.

Coming back to your initial point of reporting inaccuracies: what would be your preferred way to report problems? Some sort of API you could automate? Would you just tell us there is a problem, or also want to tell us the solution?

[+] ManAboutCouch|12 years ago|reply
One thing that I make use of but don't see too many services providing is some kind of 'match level' - where the geocoder returns a code indicating how confident it is about the quality of its result. A result of 1 might mean a building-level match, while 100 might mean street level, etc.

IIRC Google's geocoder does something like this, but it's pretty inaccurate, consistently overstating its match level.

As others have said, geocoding is very hard to do well, but I commend the efforts being made with Nominatim and komoot/photon.
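
The "match level" idea above could look something like this in a response (a minimal sketch; the field name and code values beyond 1 = building and 100 = street are assumptions):

```python
# Sketch of a "match level" field in a geocoder response. The scheme
# (lower code = more precise match) follows the comment above; the
# postcode/city values are arbitrary placeholders.
MATCH_LEVELS = {"building": 1, "street": 100, "postcode": 200, "city": 300}

def annotate(result, matched_component):
    """Attach a match-level code so clients can judge result quality."""
    result["match_level"] = MATCH_LEVELS[matched_component]
    return result

r = annotate({"lat": 51.5074, "lng": -0.1278}, "street")
print(r["match_level"])  # 100
```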

[+] freyfogle|12 years ago|reply
agree, a simple to understand confidence score is critical.

Also agree nominatim and photon are impressive

[+] lobster_johnson|12 years ago|reply
One thing that our application (processing real-estate data feeds) needs is the ability to figure out an approximate location if the address is a little vague.

For example, when we get a vague address such as "Ichobod Crane Circle", we still want to get a position because that road is very short. However, if the address is something like "Sleepy Hollow Road" or "Murders Kill Road" (Coxsackie sure has some weird names), those are very long roads, and placing a marker anywhere on them would be meaningless.

Google solves this for us by providing, in the results, the bounding box of the result. When the match is not "street_address" but something else such as "route", "premise", "point_of_interest" or the like, what we do is take the bounding_box, calculate the area, and use the area if it's less than 500x500 meters. It's not optimal, but it's better than having no location at all.
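
The bounding-box heuristic described above can be sketched in a few lines (an approximation using a spherical-Earth formula; the example coordinates are made up):

```python
import math

# Sketch: accept a vague match ("route", "premise", etc.) only if its
# bounding box is smaller than roughly 500 x 500 metres.
EARTH_RADIUS_M = 6_371_000

def bbox_small_enough(south, west, north, east, limit_m=500):
    """Approximate the box's size in metres (fine at these scales)."""
    height_m = math.radians(north - south) * EARTH_RADIUS_M
    mean_lat = math.radians((north + south) / 2)
    width_m = math.radians(east - west) * EARTH_RADIUS_M * math.cos(mean_lat)
    return height_m <= limit_m and width_m <= limit_m

# A short road's box: small enough to place a marker
print(bbox_small_enough(42.35, -73.80, 42.353, -73.796))  # True
# A long road's box: too big for a meaningful marker
print(bbox_small_enough(42.30, -73.90, 42.40, -73.70))    # False
```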

Another thing that Google does semi-well is constrain the search to a specific area, like a country or a state. Unfortunately, Google doesn't let you pass in more than one state, but other than that, it works well. Some of the addresses we get are so vague that they would geocode to other countries (Oxford, England instead of Oxford, NY) if it were not for this filtering ability.

[+] freyfogle|12 years ago|reply
we have a long history in real estate, very familiar with exactly the problems you describe.

[+] troysandal|12 years ago|reply
Awesome. We need bulk requests (one or more lat/lng) and reverse geocoding with locale components (state, county, city, neighborhood). Extending tzaman's localization request: a globally unique identifier, e.g. an ISO code, for every piece of locale when reverse geocoding is critical for us. When storing reverse geocoded points in our own database I want to key off the unique values, but look up the locale-specific versions later on client devices (ideally via REST, or an offline API if possible).

/geocode?latlng=47.639548,-122.356957&language=en,fr

{
  "ISO": { "Country": "US", "Administrative": "WA", "SubAdministrative": "King", "Locality": "Seattle", "SubLocality": "Queen Anne" },
  "fr": { "Country": "États-Unis", "Administrative": "Washington", "SubAdministrative": "Roi County", "Locality": "Seattle", "SubLocality": "Renne Anne" },
  "en": { "Country": "US", "Administrative": "Washington", "SubAdministrative": "King County", "Locality": "Seattle", "SubLocality": "Queen Anne" }
}

[+] vicchi|12 years ago|reply
Can I dig a little deeper into this? Your example has me indulging in some furious head scratching.

Providing language synonyms makes perfect sense where these exist (cf: London in English, Londra in Italian, Londres in French).

But your example implies translation of place names into their language-specific equivalents. King County in Washington state is, unless I'm mistaken, King County in all other languages. Although the local residents may disagree, this county isn't blessed with a language synonym, as it doesn't fall into the (ill-defined) category of "well known place with a language variant".

Unless you're suggesting that if, say, French is requested as a language, a geocoder should translate place names, so "King County" would (maybe) be "Comté Roi" in French. This approach sounds odd to me, as (AFAIK) no one else refers to the place this way.

[+] thomersch_|12 years ago|reply
God, no. Everyone is trying to build a Geocoder and everyone is failing, because no one is actually realizing that geocoding is probably the most complex topic in GIS.

"Yeah, let's put some Elasticsearch and PostgreSQL and it will work out fine." No, it won't - you have no idea. And of course you won't believe me, but let me list some problems you will have that you don't realize right now:

* There are a lot of different scripts: Latin, Cyrillic, umlauts, RTL text, weird abbreviations, language standards you don't know about because you don't know enough about foreign cultures.

* It's a shitload of data: OpenStreetMap is about 700 GB when expanded (not including history). And you will want autocompletion or autosuggestion, so response times will have to be < 100 ms.

* Ranking. Your user types "Tokyo". Is it the restaurant next to the user, is it the capital of Japan or is it some village next to Shitfuckistan?

No matter what, it will take you about a year to get any usable result. So I suggest you look into Nominatim (the standard geocoder of OpenStreetMap, which has actually gotten a lot better) or Photon (a geocoder based on the Nominatim DB, but with autosuggestion).

[+] nodata|12 years ago|reply
> or is it some village next to Shitfuckistan?

Wow, where did that come from?

[+] ronaldx|12 years ago|reply
Agree that Nominatim is awesome. The way it handles disambiguation is thoughtful.

In my opinion, it's better than any closed source competitor.

[+] freyfogle|12 years ago|reply
Thanks, but it's too late, we've fallen under the geo spell. We're very familiar with Nominatim (its many strengths, but also its significant weaknesses), and with the challenge of geocoding across many different parts of the world, which as you mention is not trivial.

[+] jessebushkar|12 years ago|reply
I've used pretty much all of the big geocoding services, and here are problems I've run into.

1. Rate limiting: I get it, you have to make money and/or limit your freeloading, but rate limiting has killed things I've built in the past, especially Google's hard rate limit. A soft rate limit, or an alternate way to monetize, would be huge.

2. Accuracy: MapBox's geocoder is not good. Aside from inaccurate map tiles, their geocoder misses entire US zip codes. PLEASE at least include helpful error messages and a path to report incorrect results.

3. A solution for shared IPs and rate limiting. I have helped several small websites that come nowhere near Google's daily rate limit, but because their IP was shared with someone else's traffic, they were not allowed to make geocoding calls. This forced us to use a different service.

Honorable mention: It would be nice to be able to specify what data I get back from a call. If all I need is lat/lng, I don't need another kilobyte of neighborhood/city/time zone info in my result.

Hope this helps.

[+] jessaustin|12 years ago|reply
> rate limiting has killed things I've built in the past, especially Google's hard rate limit.

Sometimes it's possible to shift queries to the client, and then build in enough intelligence to: run only ten queries at a time, delay queries by a period that backs off, save results in localStorage, etc.

This won't solve all problems, and perhaps it annoys users to see the first ten locations pop up immediately while subsequent locations have some random delay the first time they visit a particular resource, but it does make some things possible that would not be otherwise.
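
The batching/backoff/caching strategy above could be sketched like this (a minimal illustration; `geocode_one` is a stand-in for the real API call, and the dict cache stands in for localStorage in a browser client):

```python
import time

# Sketch: cap how many queries go out at once, back off between batches,
# and cache results so repeat visits are free.

def geocode_one(address):
    """Placeholder for the real geocoding API call."""
    return {"address": address, "lat": 0.0, "lng": 0.0}

_cache = {}  # stands in for localStorage on a real client

def geocode_all(addresses, batch_size=10, base_delay=0.5):
    results = []
    delay = base_delay
    for i in range(0, len(addresses), batch_size):
        for addr in addresses[i:i + batch_size]:
            if addr not in _cache:
                _cache[addr] = geocode_one(addr)
            results.append(_cache[addr])
        if i + batch_size < len(addresses):
            time.sleep(delay)  # stay under the provider's rate limit
            delay *= 2         # back off progressively between batches
    return results
```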

[+] freyfogle|12 years ago|reply
It helps a lot, thanks.

re: alternate ways to monetize, what do you propose?

[+] scraplab|12 years ago|reply
An understanding of colloquial geography, such as:

- informal place names

- boundaries of neighbourhoods

- nesting of those things within administrative boundaries

Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates. GeoNames is pretty messy and inaccurate, and the copyright status has never been cleared up.

[+] vicchi|12 years ago|reply
"Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates."

Hi Tom - as Ed says, big fan of the work you did with Flickr's shapefiles and I still use your boundaries site on a regular basis.

Yes, GeoPlanet has vanished for download from the YDN site but all versions up to 7.10 are still on archive.org (http://archive.org/search.php?query=geoplanet) thanks to the combination of Aaron of Montreal and the CC-BY-SA license we released the data under.

[+] freyfogle|12 years ago|reply
Hi Tom,

thanks for commenting. Big fan of your work on flickr neighbourhood boundaries.

We hear you and are on it, which doesn't mean we'll be perfect of course, but definitely aware of this issue.

[+] micro_cam|12 years ago|reply
* Support for, and intelligent detection of, a variety of coordinate formats including Lat/Lon, UTM, and township and range. This would be really useful when dealing with old well or surveyor's logs etc.

* Better support for terrain features. Google is getting better here but for a while "Mount Rainier" was sending you to the parks business office.

* Better support for localized search. This ties into the last one, a frequent use for me is to be zoomed in on a general area and want to find an obscure creek or peak.

* Better support for non driving use cases. Google has a nasty habit of resolving things like unquoted locations to the nearest drivable street address which is really stupid when you are using it to find a wilderness lake or something.

* Finer grained search by type.

(FWIW I run hillmap.com so most of my desires spring from the needs of a service targeted at hikers and backcountry skiers.)
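
The coordinate-format detection in the first bullet could start as simply as a few regexes (a sketch with deliberately simplified patterns - real UTM band letters, hemisphere handling, and DMS variants need more care):

```python
import re

# Sketch: guess which coordinate format a free-text input is in.
PATTERNS = [
    # Decimal lat/lon, e.g. "46.85, -121.76"
    ("latlon", re.compile(r"^\s*(-?\d+(?:\.\d+)?)[,\s]+(-?\d+(?:\.\d+)?)\s*$")),
    # UTM zone + band + easting + northing, e.g. "10T 594000 5189000"
    ("utm", re.compile(r"^\s*(\d{1,2})([C-X])\s+(\d+)\s+(\d+)\s*$", re.I)),
    # Township/range/section, e.g. "T2N R3E S14"
    ("trs", re.compile(r"^\s*T\d+[NS]\s+R\d+[EW]\s+S\d+\s*$", re.I)),
]

def detect_format(text):
    for name, pattern in PATTERNS:
        if pattern.match(text):
            return name
    return None

print(detect_format("46.85, -121.76"))      # latlon
print(detect_format("10T 594000 5189000"))  # utm
print(detect_format("T2N R3E S14"))         # trs
```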

[+] lukecampbell|12 years ago|reply
Simple.

I've worked in GIS for a number of years. I've worked on marine and scientific data management on top of GIS support. From google maps/earth to ArcGIS and pulling data from KML to OGC services.

If a service takes me nearly a month to learn to use, I'm going to push adamantly to use something else.

[+] seamusabshere|12 years ago|reply
* API client with batching and parallelization built in (100 queries in a single request, multiple requests run in parallel, etc.)

* robustness in the face of bad street suffixes (for example, in Burlington, VT, you may find data with "CR" meaning "CIRCLE" instead of the official USPS "CREEK")

* fuzzy street name matching (PAKCER -> PACKER)

* accurate geocoding in rural United States

* fuzzy international place matching (like "ST PANCRAS ST STATION" in London)
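
The fuzzy street-name matching in the list above ("PAKCER" -> "PACKER") can be approximated with stdlib difflib (a sketch; the street list is made up, and a real geocoder would use a tuned index rather than a linear scan):

```python
import difflib

# Sketch: correct a misspelled street name against a known-streets list.
KNOWN_STREETS = ["PACKER", "MAIN ST", "STREATHAM HIGH ROAD"]

def correct_street(name, cutoff=0.8):
    """Return the closest known street name, or None if nothing is close."""
    matches = difflib.get_close_matches(name.upper(), KNOWN_STREETS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_street("PAKCER"))  # PACKER
```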

[+] freyfogle|12 years ago|reply
In your experience is the rural US geocoding problem a software problem or a lack of underlying data?

[+] gloubibou|12 years ago|reply
Don't forget desktop and mobile applications. Most mapping and geocoding services do.

- I have to ship my API keys to end users. Someone could grab a key and repurpose it.

- Rate limiting by API key penalizes one end customer for another's misbehaving.

I would love an API that is aware of the end user: applies rate limiting on a per-user basis, and allows for anonymized user-based usage reports, e.g. number of end users, average number of API calls per user, …
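
Per-end-user rate limiting as requested above could be sketched with a fixed-window counter (a minimal illustration; the limit and window values are arbitrary, and production systems usually prefer sliding windows or token buckets):

```python
import time

# Sketch: rate-limit each end user independently, so one misbehaving
# user of an app doesn't block the others sharing the same API key.
class PerUserLimiter:
    def __init__(self, limit=60, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.counters = {}  # user_id -> (window_start, count)

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        start, count = self.counters.get(user_id, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # start a fresh window
        if count >= self.limit:
            return False           # this user is throttled
        self.counters[user_id] = (start, count + 1)
        return True

lim = PerUserLimiter(limit=2, window_s=60)
print(lim.allow("alice", now=0))  # True
print(lim.allow("alice", now=1))  # True
print(lim.allow("alice", now=2))  # False (alice over her limit)
print(lim.allow("bob", now=2))    # True  (bob unaffected)
```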

[+] freyfogle|12 years ago|reply
Your comments generated a lot of discussion for us here, thanks. Our conclusion: if you build an app (mobile, desktop, whatever) that becomes popular and depends on a third party service - in our case a geocoder - it generates real costs for that third party service. So there are three potential groups who can pay the cost:

a. the third party service provider can just provide service for free. We can't, at least not indefinitely.

b. the end consumer can somehow be billed by the third party service. Feels complicated, especially as the use of the service may be deep in the internals and behind the scenes of the app. The consumer may well have no idea it is being used.

c. the application developer can pay. Either directly or via billing the end consumer.

Option c. feels like the only sustainable one. Happy to hear your thoughts on it though.

[+] freyfogle|12 years ago|reply
Thanks everyone for the feedback, very useful. Please keep it coming. I need to be offline for a bit, but will check in later. If you're interested in learning more about our progress please follow us on twitter. ta.

https://twitter.com/opencagedata

[+] tzaman|12 years ago|reply
Apart from the most obvious (being accurate), I would say a well documented API and properly localised results.

[+] freyfogle|12 years ago|reply
thanks for commenting. If you don't mind, what exactly do you mean by "properly localised"? Can you give me an example, ideally via a service you're currently using that is doing it badly. Cheers.
Get the timezone (and related information: UTC offset, local time, etc.) from a lat/lng point or address.
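
The response shape for such a lookup could be sketched like this (mapping a lat/lng point to a timezone name requires boundary data and is assumed to have already happened; the offset/local-time part is plain stdlib):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Sketch: given a tz name (assumed resolved from a point by a separate
# boundary lookup), return the related info the comment asks for.
def tz_info(tz_name, when=None):
    when = when or datetime.now(timezone.utc)
    local = when.astimezone(ZoneInfo(tz_name))
    return {
        "timezone": tz_name,
        "utc_offset": local.utcoffset().total_seconds() / 3600,
        "local_time": local.isoformat(),
    }

# Assume a point lookup resolved (47.6, -122.3) to this zone:
info = tz_info("America/Los_Angeles",
               datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc))
print(info["utc_offset"])  # -7.0 (PDT)
```
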
[+] dpcan|12 years ago|reply
Quick and cheap daily batch geocoding with any type of export option, from CSV to JSON to XML.

[+] freyfogle|12 years ago|reply
Not sure what your definitions of cheap or quick are, nor what country your data is in, but there are lots of people who do bulk geocoding. Why don't you use them?
Publicly visible and flexible pricing. Not $10k+ per year like the Google Maps API.
[+] BugBrother|12 years ago|reply
Fast for single queries, not for batch geocoding?

So a web query can come from a user of a web service, the external(?) geocoding API can be called -- and the reply can go back to the user [applying lon/lat processing] without waiting too long.

(I haven't done anything like this in a while, so please apply NaCl.)