Ask HN: what features do you want in a geocoder?
We're building a public-facing geocoding service (forward and reverse) on top of our own technology, OpenStreetMap, and various other open geo services. What features would make such a service compelling for developers? What is your wishlist?
Thanks for taking the time to answer.
[+] [-] michaelt|12 years ago|reply
For a delivery company, inaccurate results can send drivers to the wrong place, so it's important to get the best accuracy available.
You want it to work way out in the country, where buildings are few and far between; and with named buildings as well as numbered ones. So a search like [1] should come back accurate to less than 100m.
Fuzzy/imprecise matching should be used with care. If there's a search for Manor Close in London, it should ask which of the four Manor Closes you mean [2]. If the only part of the address that matches is London, that's not enough information to send a delivery driver - the address should be rejected.
If there are parts of the address you can't match that's sometimes a problem - you don't want to map 1 Hopton Parade, Streatham High Road to 1 Streatham High Road. But you do want to map Some Company Ltd, 1 Streatham High Road to the latter.
On the other hand, if your target users are dating websites wanting to show rough distances between members, just matching city might be plenty accurate enough; property search engines like Zoopla will show any shitty approximation on their maps if they don't recognise a street or postcode.
[1] https://maps.google.co.uk/maps?q=Paradise+Wildlife+Park,+Whi... [2] https://maps.google.co.uk/maps?q=from:E17+5RT+to:NW7+3NG+to:...
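The policy michaelt describes - reject city-only matches, ask the user to disambiguate among several equally precise matches - could be sketched roughly like this (the granularity labels and result shape are invented for illustration, not any real API):

```python
def delivery_decision(candidates, min_granularity="house"):
    """Toy policy sketch:
    - no candidate precise enough      -> reject the address
    - several equally precise matches  -> ask the user which one they mean
    - exactly one precise match        -> accept it
    """
    ORDER = {"city": 0, "street": 1, "house": 2}
    precise = [c for c in candidates
               if ORDER[c["granularity"]] >= ORDER[min_granularity]]
    if not precise:
        return ("reject", None)           # only e.g. "London" matched
    if len(precise) > 1:
        return ("disambiguate", precise)  # four Manor Closes: ask which one
    return ("accept", precise[0])
```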
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] spacemanmatt|12 years ago|reply
If you added CASS or address validation (deliverability) service it would increase the value even more.
Having recently installed PostGIS and imported loads of TIGER data, it would be useful if you provided some discussion about data sets backing your geocoder, especially if you do better than just TIGER.
I am only interested in U.S./Canada addresses (thus the mentions of CASS/verification) but I understand OSM is a global project.
[+] [-] freyfogle|12 years ago|reply
To be open with you, the North American market is well served and highly competitive, many other parts of the world are the opposite. Thus the US/CA probably won't be our area of focus. That being said, very keen to learn what isn't working for you in current solutions.
[+] [-] crucialfelix|12 years ago|reply
We have a lot of problems with Google Maps not knowing about odd addresses like 235B Whatever St. or 113-76 XYZ Street. This is perhaps because it's new construction and the address isn't in the database yet.
Town names in the Hamptons are quite contentious. What the post office knows it as is not what the locals call it, and nothing like what the real estate agents call it (but they make up their own and they lie about the locations).
But primarily you need to parse native-language and messy addresses: 2 Ave, First Street, Fifth Ave, Madison Ave (same as 5th), CPW (Central Park West).
This is where google wins.
[+] [-] freyfogle|12 years ago|reply
Regarding Google, you are right, they do a good job, especially in the US. Full credit to them. The problem is the cost and usage restrictions.
Coming back to your initial point about reporting inaccuracies: what would be your preferred way to report problems? Some sort of API you could automate? Would you just tell us there is a problem, or would you also want to tell us the solution?
[+] [-] ManAboutCouch|12 years ago|reply
IIRC Google's geocoder does something like this, but it's pretty inaccurate, consistently overstating its match level.
As others have said, geocoding is very hard to do well, but I commend the efforts being made with Nominatim and komoot/photon.
[+] [-] freyfogle|12 years ago|reply
Also agree Nominatim and Photon are impressive.
[+] [-] lobster_johnson|12 years ago|reply
For example, when we get a vague address such as "Ichabod Crane Circle", we still want to get a position because that road is very short. However, if the address is something like "Sleepy Hollow Road" or "Murders Kill Road" (Coxsackie sure has some weird names), those are very long roads, and placing a marker anywhere on them would be meaningless.
Google solves this for us by providing the bounding box of the result in the response. When the match is not "street_address" but something else such as "route", "premise", "point_of_interest" or the like, we take the bounding box, calculate its area, and use the result only if the area is less than 500x500 meters. It's not optimal, but it's better than having no location at all.
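That heuristic can be sketched in a few lines (an equirectangular approximation, fine for boxes this small; the result dict shape is invented, and the 500 m threshold is the one mentioned above):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def bbox_area_m2(south, west, north, east):
    """Approximate area of a small lat/lng bounding box in square metres."""
    lat_mid = math.radians((south + north) / 2)
    height = math.radians(north - south) * EARTH_RADIUS_M
    width = math.radians(east - west) * EARTH_RADIUS_M * math.cos(lat_mid)
    return abs(height * width)

def usable_location(result, max_side_m=500):
    """Accept a non-street-level match only if its bounding box is small."""
    if result["type"] == "street_address":
        return result["location"]
    s, w, n, e = result["bounds"]
    if bbox_area_m2(s, w, n, e) <= max_side_m * max_side_m:
        return result["location"]
    return None  # box too large: a marker anywhere on it would be meaningless
```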
Another thing that Google does semi-well is constrain the search to a specific area, like a country or a state. Unfortunately, Google doesn't let you pass in more than one state, but other than that, it works well. Some of the addresses we get are so vague that they would geocode to other countries (Oxford, England instead of Oxford, NY) if it were not for this filtering ability.
[+] [-] freyfogle|12 years ago|reply
[+] [-] troysandal|12 years ago|reply
/geocode?latlng=47.639548,-122.356957&language=en,fr
{
  "ISO": { "Country": "US", "Administrative": "WA", "SubAdministrative": "King", "Locality": "Seattle", "SubLocality": "Queen Anne" },
  "fr": { "Country": "Etats-Unis", "Administrative": "Washington", "SubAdministrative": "Roi County", "Locality": "Seattle", "SubLocality": "Renne Anne" },
  "en": { "Country": "US", "Administrative": "Washington", "SubAdministrative": "King County", "Locality": "Seattle", "SubLocality": "Queen Anne" }
}
[+] [-] vicchi|12 years ago|reply
Providing language synonyms makes perfect sense where these exist (cf: London in English, Londra in Italian, Londres in French).
But your example implies translation of place names into their language-specific equivalents. King County in Washington state is, unless I'm mistaken, King County in all other languages. Although the local residents may disagree, this county isn't blessed with a language synonym, as it doesn't fall into the (ill-defined) category of "well known place with a language variant".
Unless you're suggesting that if, say, French is requested as a language, a geocoder should translate place names, so "King County" would (maybe) become "Comté Roi" in French. That approach sounds odd to me, as (AFAIK) no one else refers to the place that way.
[+] [-] thomersch_|12 years ago|reply
"Yeah, let's just put together some Elasticsearch and PostgreSQL and it will work out fine." No, it won't - you have no idea. And of course you won't believe me, but let me list some problems you will have that you don't realize right now:
* There are a lot of different character sets and scripts: Latin, Cyrillic, umlauts, RTL text, weird abbreviations, language conventions you don't know because you don't know enough about foreign cultures.
* It's a shitload of data: OpenStreetMap expands to about 700 GB (not including history). And you will want autocompletion or autosuggestion, so response times will have to be < 100 ms.
* Ranking. Your user types "Tokyo". Is it the restaurant next to the user, is it the capital of Japan or is it some village next to Shitfuckistan?
No matter what, it will take you about a year to get any usable result. So I suggest you look into Nominatim (the standard geocoder of OpenStreetMap, which has actually gotten a lot better) or Photon (a geocoder based on the Nominatim DB, but with autosuggestion).
[+] [-] amiramir|12 years ago|reply
[+] [-] nodata|12 years ago|reply
Wow, where did that come from?
[+] [-] ronaldx|12 years ago|reply
In my opinion, it's better than any closed source competitor.
[+] [-] freyfogle|12 years ago|reply
[+] [-] jessebushkar|12 years ago|reply
1. Rate limiting: I get it, you have to make money and/or limit your freeloading, but rate limiting has killed things I've built in the past, especially Google's hard rate limit. A soft rate limit, or an alternate way to monetize, would be huge.
2. Accuracy: MapBox's geocoder is not good. Aside from inaccurate map tiles, their geocoder misses entire US zip codes. PLEASE at least include helpful error messages and a path to report incorrect results.
3. A solution for shared IPs and rate limiting. I have helped several small websites that do not come close to Google's daily rate limit, but because their IP was shared with someone else, they were not allowed to make geocoding calls. This forced us to use a different service.
Honorable mention: It would be nice to be able to specify what data I get back from a call. If all I need is lat/lng, I don't need another kilobyte of neighborhood/city/time zone info in my result.
Hope this helps.
[+] [-] jessaustin|12 years ago|reply
Sometimes it's possible to shift queries to the client, and then build in enough intelligence to: run only ten queries at a time, delay queries by a period that backs off, save results in localStorage, etc.
This won't solve all problems, and perhaps it annoys users to see the first ten locations pop up immediately while subsequent locations have some random delay the first time they visit a particular resource, but it does make some things possible that would not be otherwise.
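In the browser this would be done in JavaScript against localStorage; here is a language-neutral sketch of the same batching/backoff/caching idea (all names are invented, and the in-memory dict stands in for localStorage):

```python
import time

class GeocodeQueue:
    """Toy sketch of the client-side strategy above: cache results,
    run a limited batch immediately, and back off between later batches.
    `geocode_fn` is a stand-in for whatever API call the client makes."""

    def __init__(self, geocode_fn, batch_size=10, base_delay=0.5):
        self.geocode_fn = geocode_fn
        self.batch_size = batch_size
        self.base_delay = base_delay
        self.cache = {}  # stands in for localStorage

    def geocode_all(self, addresses, sleep=time.sleep):
        results = {a: self.cache[a] for a in addresses if a in self.cache}
        pending = [a for a in addresses if a not in self.cache]
        delay = 0.0
        for i in range(0, len(pending), self.batch_size):
            if delay:
                sleep(delay)  # later batches wait; the delay doubles each time
            for a in pending[i:i + self.batch_size]:
                self.cache[a] = results[a] = self.geocode_fn(a)
            delay = self.base_delay if delay == 0 else delay * 2
        return results
```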
[+] [-] freyfogle|12 years ago|reply
re: alternate ways to monetize, what do you propose?
[+] [-] scraplab|12 years ago|reply
- informal place names
- boundaries of neighbourhoods
- nesting of those things within administrative boundaries
Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates. GeoNames is pretty messy and inaccurate, and the copyright status has never been cleared up.
[+] [-] vicchi|12 years ago|reply
Hi Tom - as Ed says, big fan of the work you did with Flickr's shapefiles and I still use your boundaries site on a regular basis.
Yes, GeoPlanet has vanished for download from the YDN site but all versions up to 7.10 are still on archive.org (http://archive.org/search.php?query=geoplanet) thanks to the combination of Aaron of Montreal and the CC-BY-SA license we released the data under.
[+] [-] freyfogle|12 years ago|reply
thanks for commenting. Big fan of your work on flickr neighbourhood boundaries.
We hear you and are on it, which doesn't mean we'll be perfect of course, but definitely aware of this issue.
[+] [-] micro_cam|12 years ago|reply
* Better support for terrain features. Google is getting better here but for a while "Mount Rainier" was sending you to the parks business office.
* Better support for localized search. This ties into the last one, a frequent use for me is to be zoomed in on a general area and want to find an obscure creek or peak.
* Better support for non driving use cases. Google has a nasty habit of resolving things like unquoted locations to the nearest drivable street address which is really stupid when you are using it to find a wilderness lake or something.
* Finer grained search by type.
(FWIW I run hillmap.com so most of my desires spring from the needs of a service targeted at hikers and backcountry skiers.)
[+] [-] freyfogle|12 years ago|reply
[+] [-] lukecampbell|12 years ago|reply
I've worked in GIS for a number of years, on marine and scientific data management on top of GIS support - from Google Maps/Earth to ArcGIS, and pulling data from KML to OGC services.
If a service takes me nearly a month to learn to use, I'm going to push adamantly to use something else.
[+] [-] seamusabshere|12 years ago|reply
* robustness in the face of bad street suffixes (for example, in Burlington, VT, you may find data with "CR" meaning "CIRCLE" instead of the official USPS "CREEK")
* fuzzy street name matching (PAKCER -> PACKER)
* accurate geocoding in rural United States
* fuzzy international place matching (like "ST PANCRAS ST STATION" in London)
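The fuzzy matching asked for in this list can be approximated with plain edit distance (a toy sketch; a production geocoder would combine this with phonetic and token-level matching):

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(query, street_names, max_distance=2):
    """Return the closest known street name within a small edit distance."""
    best = min(street_names, key=lambda name: edit_distance(query, name))
    return best if edit_distance(query, best) <= max_distance else None
```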
[+] [-] freyfogle|12 years ago|reply
[+] [-] gloubibou|12 years ago|reply
- I have to ship my API keys to end users. Someone could grab a key and repurpose it.
- Rate limiting by API key penalizes one end customer for another's misbehaving.
I would love an API that is aware of the end user: one that applies rate limiting on a per-user basis and allows for anonymized user-based usage reports, e.g. number of end users, average number of API calls per user, …
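Per-user rate limiting of this kind is often built as a token bucket keyed by an end-user identifier sent alongside the API key; a minimal sketch (the `user_id` parameter is hypothetical - a real API would define its own way to identify end users):

```python
import time

class PerUserRateLimiter:
    """Token-bucket limiter keyed by end-user id, so one misbehaving
    user doesn't exhaust the whole API key's quota."""

    def __init__(self, rate_per_sec=1.0, burst=10):
        self.rate = rate_per_sec          # tokens refilled per second
        self.burst = burst                # maximum bucket size
        self.buckets = {}                 # user_id -> (tokens, last_timestamp)

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(user_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[user_id] = (tokens - 1, now)
            return True
        self.buckets[user_id] = (tokens, now)
        return False
```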
[+] [-] freyfogle|12 years ago|reply
a. the third party service provider can just provide service for free. We can't, at least not indefinitely.
b. the end consumer can somehow be billed by the third-party service. This feels complicated, especially as the use of the service may be deep in the internals and behind the scenes of the app; the consumer may well have no idea it is being used.
c. the application developer can pay. Either directly or via billing the end consumer.
Option c. feels like the only sustainable one. Happy to hear your thoughts on it though.
[+] [-] freyfogle|12 years ago|reply
https://twitter.com/opencagedata
[+] [-] tzaman|12 years ago|reply
[+] [-] freyfogle|12 years ago|reply
[+] [-] hernantz|12 years ago|reply
[+] [-] thecodemonkey|12 years ago|reply
Shameless plug, but this is something we recently started offering [1] for our geocoding service. I'm happy to help if you have any questions.
[1] http://geocod.io/docs/#toc_21
[+] [-] dpcan|12 years ago|reply
[+] [-] freyfogle|12 years ago|reply
[+] [-] vgrichina|12 years ago|reply
[+] [-] unknown|12 years ago|reply
[deleted]
[+] [-] BugBrother|12 years ago|reply
So a web query can come from a user of a web service, the (external?) geocoding API can be called, and the reply can go back to the user [after applying lon/lat processing] without waiting too long.
(I haven't done anything like this in a while, so please apply a grain of NaCl.)