Wikidata integration

An important feature of "good data" is the ability to connect them to other datasets. This makes it possible to check and, if needed, correct the data, and provides a basis for building new datasets. This page explores the possibility of connecting Gauquelin data with wikidata.org. From Wikidata, it is possible to link to other standard identifiers (ISNI, VIAF, etc.).
Matching Wikidata with Gauquelin data has not (yet) been coded; this page only explores ways to retrieve relevant data from wikidata.org.
The directory img/wikidata.org contains draft development notes.
Wikidata can be queried through the Wikidata Query Service (WDQS), using the SPARQL query language. This service makes it possible to retrieve lists of persons.
Manual tests to retrieve lists of persons of a given professional group showed that the full details of the persons can't be obtained, because WDQS "times out" for queries that ask for too much information about each person.
Example of a query which times out: a query to retrieve mathematicians (Q170790) with information about each person:
SELECT DISTINCT ?person ?personLabel ?familynameLabel ?givennameLabel ?linkcount ?isni ?macTutor ?birthdate ?birthplace
                ?birthplaceLabel ?birthiso3166 ?birthgeonamesid ?birthcoords ?deathdate ?deathplace ?deathplaceLabel
                ?deathiso3166 ?deathgeonamesid ?deathcoords ?deathcause ?deathcauseLabel WHERE {
    ?person wdt:P106 wd:Q170790;
        wdt:P734 ?familyname;
        wdt:P735 ?givenname;
        wdt:P569 ?birthdate;
        wdt:P19 ?birthplace;
        wikibase:sitelinks ?linkcount .
    OPTIONAL { ?person wdt:P1563 ?macTutor } .
    OPTIONAL { ?person wdt:P213 ?isni } .
    # birth
    ?birthplace wdt:P625 ?birthcoords .
    OPTIONAL { ?birthplace wdt:P1566 ?birthgeonamesid } .
    OPTIONAL { ?birthplace wdt:P17 ?birthcountry }.
    OPTIONAL { ?birthcountry wdt:P297 ?birthiso3166 }.
    # death
    OPTIONAL { ?person wdt:P570 ?deathdate } .
    OPTIONAL { ?person wdt:P20 ?deathplace } .
    OPTIONAL { ?deathplace wdt:P625 ?deathcoords }.
    OPTIONAL { ?deathplace wdt:P1566 ?deathgeonamesid } .
    OPTIONAL { ?deathplace wdt:P17 ?deathcountry }.
    OPTIONAL { ?deathcountry wdt:P297 ?deathiso3166 }.
    OPTIONAL { ?person wdt:P509 ?deathcause }.
    #
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?linkcount)
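Such queries can be submitted programmatically to the WDQS endpoint at https://query.wikidata.org/sparql. The following Python sketch is only an illustration (it is not part of g5; the helper name run_sparql and the use of the requests library are assumptions made for this example):

    # Minimal sketch: send a SPARQL query to the public WDQS endpoint.
    import requests

    WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

    def run_sparql(query):
        """Return the result bindings of a SPARQL query as a list of dicts."""
        response = requests.get(
            WDQS_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "g5-wikidata-exploration (test script)"},
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    # Example: count the human beings (Q5) known to Wikidata.
    rows = run_sparql("SELECT (COUNT(?human) AS ?count) WHERE { ?human wdt:P31 wd:Q5 }")
    print(rows[0]["count"]["value"])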
But to match Gauquelin data with Wikidata, the g5 program needs full details of the persons.
Several solutions were tested to achieve this:
  1. Retrieve the list of all humans and query them one by one (a sketch of this one-by-one retrieval is given after this list). The following query retrieves all human ids (a query that also asks for the labels times out):
    SELECT ?human WHERE { ?human wdt:P31 wd:Q5 }
    This gives 5 489 277 records (execution 2019-11-01). This is too much: the full data for a single human is around 100 KB (JSON format), which would amount to more than 500 GB of uncompressed data to download.
  2. Retrieve the list of occupation codes, in order to retrieve only the humans that have a profession code. The first step of this process was coded, with the following query as a starting point:
    SELECT ?profession ?professionLabel
    WHERE{
        ?profession wdt:P31 wd:Q28640.
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
    }
    ORDER BY (?professionLabel)
    
    This gave more than 6000 profession codes, leading to 2 647 452 person ids, which is still too many.
  3. Start from a subset of profession codes that correspond to Gauquelin data:
    Wikidata id    Profession
    Q483501        artist
    Q2066131       athlete
    Q482980        author
    Q189290        military-officer
    Q82955         politician
    Q39631         physician
    Q901           scientist
    This solution involves 3 steps (sketches of these steps are given after this list):
    • Step 1: store the lists of profession codes.
    • Step 2: use these lists of profession codes to store lists of persons.
    • Step 3: use these lists of persons to store detailed person records.

    Preparatory code executed on 2019-11-02 gave 592 profession codes covering 790 394 person ids, which would mean downloading around 70 GB to the local machine.
  4. Download the full dump from dumps.wikimedia.org/wikidatawiki/entities/ to a local machine; on 2019-11-09, the file latest-all.json.bz2 was 43 GB.
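For the one-by-one retrieval mentioned in solution 1 (and for step 3 of solution 3), the full data of a single entity can be fetched through the Special:EntityData interface rather than through SPARQL. A minimal sketch, again assuming the requests library (the helper name fetch_entity is an assumption for this example, not the g5 implementation):

    # Minimal sketch: download the full JSON record of one Wikidata entity.
    # Special:EntityData/<id>.json returns all statements, labels and sitelinks.
    import requests

    def fetch_entity(qid):
        """Return the full JSON data of one Wikidata entity."""
        url = "https://www.wikidata.org/wiki/Special:EntityData/{}.json".format(qid)
        response = requests.get(url)
        response.raise_for_status()
        # The payload has the form {"entities": {"Q42": {...}}}
        return response.json()["entities"][qid]

    entity = fetch_entity("Q42")            # Q42 = Douglas Adams, used only as an example id
    print(entity["claims"]["P569"][0])      # raw statement for date of birth (P569)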
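For steps 1 and 2 of solution 3, one plausible approach (an assumption made for this sketch, not necessarily the queries used by the preparatory code) is to expand each top-level profession code into its sub-occupations with the "subclass of" property (P279), then to list the persons having one of these occupations (P106). The person ids obtained this way can then be fetched one by one as in the previous sketch (step 3):

    # Sketch of steps 1 and 2, assuming sub-occupations are reached through P279
    # and persons through P106 (occupation).
    # run_sparql() is the same helper as in the WDQS sketch above, repeated here
    # so that this sketch is self-contained.
    import requests

    def run_sparql(query):
        response = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": query, "format": "json"},
            headers={"User-Agent": "g5-wikidata-exploration (test script)"},
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    def sub_occupations(top_code):
        """Step 1: occupation codes reachable from a top-level profession code."""
        rows = run_sparql("SELECT ?occ WHERE { ?occ wdt:P279* wd:%s }" % top_code)
        return [row["occ"]["value"].split("/")[-1] for row in rows]

    def persons_with_occupation(occ_code):
        """Step 2: ids of the persons having a given occupation."""
        rows = run_sparql("SELECT ?person WHERE { ?person wdt:P106 wd:%s }" % occ_code)
        return [row["person"]["value"].split("/")[-1] for row in rows]

    # Example with Q39631 (physician)
    codes = sub_occupations("Q39631")
    print(len(codes), "occupation codes")
    print(len(persons_with_occupation(codes[0])), "persons for", codes[0])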
Coming back to solution 4: as it is possible to extract information from a Wikidata dump without uncompressing it, working with a full dump on a local machine seems to be the most convenient solution.
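A minimal sketch of such an extraction, assuming the dump has been downloaded as latest-all.json.bz2 in the current directory: the dump is a single JSON array with one entity per line, so it can be read as a compressed stream, one entity at a time, without ever writing the uncompressed JSON (hundreds of GB) to disk.

    # Minimal sketch: stream the compressed dump and keep only humans (P31 = Q5)
    # having at least one occupation (P106), printing their id and English label.
    import bz2
    import json

    def stream_entities(path):
        """Yield one entity (as a dict) per line of the compressed dump."""
        with bz2.open(path, "rt", encoding="utf-8") as dump:
            for line in dump:
                line = line.strip()
                if line in ("[", "]"):          # the dump is one big JSON array
                    continue
                yield json.loads(line.rstrip(","))

    for entity in stream_entities("latest-all.json.bz2"):
        claims = entity.get("claims", {})
        is_human = any(
            claim["mainsnak"].get("datavalue", {}).get("value", {}).get("id") == "Q5"
            for claim in claims.get("P31", [])
        )
        if is_human and "P106" in claims:
            label = entity.get("labels", {}).get("en", {}).get("value", "")
            print(entity["id"], label)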