Integrating geonames.org

Relating places to a Geonames identifier can be seen as a way to prepare for merging historical data with Wikidata.
Places expressed as strings are ambiguous because spelling varies from one source to another (typos, abbreviations, truncated names, misspellings...). An identifier is needed to handle places programmatically.
G5 uses two ways to associate data with Geonames: a local database and the geonames.org web service.
In both cases, code can be developed to improve the matching between Gauquelin data and Geonames.

Local database

The program uses Geonames data stored in a local PostgreSQL database; see page Install.

The advantage of storing the data in a local database is that many requests can be made without worrying about the limits imposed on remote calls.
For the moment, this has been used to try a quite restrictive matching:
  • CY (country code) must be exactly the same.
  • C1 (State in the USA, Province in Italy...) or C2 (département in France) must be exactly the same.
  • The "slug" must be exactly the same (the slug is a name with all letters lowercased, accents removed and all non-alphanumeric characters converted to hyphens; e.g. the slug of Saint-Jean de Védas is saint-jean-de-vedas).
This gives low matching rates but prevents accidental association with wrong Geonames ids.
This is used in series A, E1 and E3 (step addGeo).
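
The three rules above can be sketched as follows. This is a minimal illustration in Python, not the G5 code: the function names, the dictionary keys (name, geonameid) and the sample ids are hypothetical. Two points are assumptions: runs of non-alphanumeric characters are collapsed into a single hyphen, and the "C1 or C2" rule is read as "at least one non-empty administrative code is identical". Letters that have no ASCII decomposition (such as "ø") are simply dropped by this version.

```python
import re
import unicodedata


def slug(name):
    """Lowercase, remove accents, replace non-alphanumeric runs with a hyphen."""
    # NFKD splits "é" into "e" plus a combining accent; encoding to ASCII drops the accent.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Assumption: consecutive separators become one hyphen, none at the ends.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def same_admin(a, b):
    """True if C1 or C2 is filled in and identical in both records."""
    return any(a.get(key) and a.get(key) == b.get(key) for key in ("C1", "C2"))


def strict_match(place, candidates):
    """Keep only the candidates with the same CY, the same C1 or C2 and the same slug."""
    return [
        c for c in candidates
        if c["CY"] == place["CY"]
        and same_admin(place, c)
        and slug(c["name"]) == slug(place["name"])
    ]


# Hypothetical records; the geonameid values are placeholders.
place = {"name": "Saint-Jean de Védas", "CY": "FR", "C1": "", "C2": "34"}
candidates = [
    {"geonameid": 1, "name": "Saint-Jean-de-Védas", "CY": "FR", "C1": "", "C2": "34"},
    {"geonameid": 2, "name": "Saint-Jean-de-Vedas", "CY": "FR", "C1": "", "C2": "11"},
]
matches = strict_match(place, candidates)
```

Here only the first candidate is kept, because the second one has a different C2.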

Better matching rates could be obtained using approximate string matching (like the Levenshtein distance), but this may need human validation to prevent wrong associations.
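
As an illustration of what approximate matching might look like, here is a minimal Python sketch of the Levenshtein distance applied to slugs. It is not part of G5: the function names and the max_distance threshold of 2 are hypothetical, and the output is meant as a list of candidates to submit to human validation, not as automatic associations.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def candidates_for_review(slug, known_slugs, max_distance=2):
    """Return (distance, slug) pairs within max_distance, closest first."""
    scored = [(levenshtein(slug, known), known) for known in known_slugs]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# A truncated name still finds its probable match, to be confirmed by a human.
review = candidates_for_review(
    "saint-jean-de-veda",
    ["saint-jean-de-vedas", "saint-jean-de-luz", "paris"],
)
```

Restricting the comparison to places that already share the same CY and C1 or C2 would keep the number of false candidates low.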

Geonames web service

This has been used for file D6, which doesn't contain place names. The question asked to the web service is "given a longitude and a latitude, give me a place name" (reverse geocoding). This gives very poor results for place names, but was useful to restore the country; see page on file D6.
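
A reverse geocoding request of this kind can be sketched as below. This page does not say which endpoint G5 calls; the sketch assumes the findNearbyPlaceNameJSON service of geonames.org, which takes lat, lng and username parameters and answers with a "geonames" list. The function names are hypothetical, and the code only builds the URL and reads a response body, so it can be tried without network access.

```python
import json
from urllib.parse import urlencode

# Assumption: the nearby-place-name service is the one being queried.
GEONAMES_URL = "http://api.geonames.org/findNearbyPlaceNameJSON"


def build_request_url(lat, lng, username):
    """Build the reverse geocoding URL for one (latitude, longitude) pair."""
    return GEONAMES_URL + "?" + urlencode({"lat": lat, "lng": lng, "username": username})


def parse_response(body):
    """Return id, name and country code of the first place, or None if the list is empty."""
    places = json.loads(body).get("geonames", [])
    if not places:
        return None
    first = places[0]
    return {
        "geonameId": first.get("geonameId"),
        "name": first.get("name"),
        "countryCode": first.get("countryCode"),
    }


url = build_request_url(48.85, 2.35, "demo")
```

The body passed to parse_response would come from fetching this URL, for instance with urllib.request.urlopen. For D6, countryCode is the useful field, since the returned name is often not the one of the original birth place.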

Here too, approximate matching could be tried to get better results.

Note: this kind of request could also be done on the local database; the only reason to use the web service is to avoid coding something that already exists.
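
To give an idea of what the local equivalent would involve, here is a minimal Python sketch of a nearest-place lookup using the haversine formula. It is an assumption about how it could be done, not existing G5 code: the function names and the row keys (name, lat, lng) are hypothetical, the coordinates are approximate, and the rows are supposed to have been read from the local database beforehand. In practice the search would more likely be expressed in SQL and limited to a bounding box around the point.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius, enough for a nearest-place search


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_place(lat, lng, rows):
    """Return the row closest to (lat, lng), or None if rows is empty."""
    return min(rows, key=lambda row: haversine_km(lat, lng, row["lat"], row["lng"]), default=None)


# Hypothetical rows with approximate coordinates.
rows = [
    {"name": "Paris", "lat": 48.85, "lng": 2.35},
    {"name": "Lyon", "lat": 45.76, "lng": 4.84},
]
closest = nearest_place(48.9, 2.3, rows)
```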