Wikidata integration

An important feature of "good data" is the ability to connect them to other datasets. This makes it possible to check and, if needed, correct the data, and provides a basis for building new datasets. This page explores the possibility of connecting Gauquelin data with wikidata.org. From Wikidata, it is possible to link to other standard identifiers (ISNI, VIAF, etc.).
Matching Wikidata with Gauquelin data has not (yet) been coded; this page only explores ways to retrieve relevant data from wikidata.org.
The directory img/wikidata.org contains draft development notes.
Wikidata can be queried through the Wikidata Query Service (WDQS), using the SPARQL query language. This service makes it possible to retrieve lists of persons.
Manual tests to retrieve lists of persons of a given professional group showed that the full details of the persons can't be obtained, because WDQS "times out" for queries that ask for too much information about each person.
Example of a query which times out: a query to retrieve mathematicians (Q170790) with information about each person:
SELECT DISTINCT ?person ?personLabel ?familynameLabel ?givennameLabel ?linkcount ?isni ?macTutor ?birthdate ?birthplace
                ?birthplaceLabel ?birthiso3166 ?birthgeonamesid ?birthcoords ?deathdate ?deathplace ?deathplaceLabel
                ?deathiso3166 ?deathgeonamesid ?deathcoords ?deathcause ?deathcauseLabel WHERE {
    ?person wdt:P106 wd:Q170790;
        wdt:P734 ?familyname;
        wdt:P735 ?givenname;
        wdt:P569 ?birthdate;
        wdt:P19 ?birthplace;
        wikibase:sitelinks ?linkcount .
    OPTIONAL { ?person wdt:P1563 ?macTutor } .
    OPTIONAL { ?person wdt:P213 ?isni } .
    # birth
    ?birthplace wdt:P625 ?birthcoords .
    OPTIONAL { ?birthplace wdt:P1566 ?birthgeonamesid } .
    OPTIONAL { ?birthplace wdt:P17 ?birthcountry }.
    OPTIONAL { ?birthcountry wdt:P297 ?birthiso3166 }.
    # death
    OPTIONAL { ?person wdt:P570 ?deathdate } .
    OPTIONAL { ?person wdt:P20 ?deathplace } .
    OPTIONAL { ?deathplace wdt:P625 ?deathcoords }.
    OPTIONAL { ?deathplace wdt:P1566 ?deathgeonamesid } .
    OPTIONAL { ?deathplace wdt:P17 ?deathcountry }.
    OPTIONAL { ?deathcountry wdt:P297 ?deathiso3166 }.
    OPTIONAL { ?person wdt:P509 ?deathcause }.
    #
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?linkcount)
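Such queries can be submitted programmatically to the WDQS endpoint at https://query.wikidata.org/sparql. The following Python sketch is only an illustration (it is not part of g5; the helper name run_sparql and the use of the requests library are assumptions made for this example):

    # Minimal sketch: send a SPARQL query to the public WDQS endpoint.
    import requests

    WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

    def run_sparql(query):
        """Return the result bindings of a SPARQL query as a list of dicts."""
        response = requests.get(
            WDQS_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "g5-wikidata-exploration (test script)"},
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    # Example: count the human beings (Q5) known to Wikidata.
    rows = run_sparql("SELECT (COUNT(?human) AS ?count) WHERE { ?human wdt:P31 wd:Q5 }")
    print(rows[0]["count"]["value"])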
But to match Gauquelin data with Wikidata, the g5 program needs full details of the persons.
Several solutions were tested to achieve this:
  1. Retrieve the list of all humans and query them one by one (a sketch of this one-by-one retrieval is given after this list). The following query retrieves all human ids (a query that also asks for the labels times out):
    SELECT ?human WHERE { ?human wdt:P31 wd:Q5 }
    This gives 5 489 277 records (execution 2019-11-01). This is too much: the full data for a single human is around 100 KB (JSON format), which would amount to more than 500 GB of uncompressed data to download.
  2. Retrieve the list of occupation codes, in order to retrieve only the humans that have a profession code. The first step of this process was coded, with the following query as a starting point:
    SELECT ?profession ?professionLabel
    WHERE{
        ?profession wdt:P31 wd:Q28640.
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
    }
    ORDER BY (?professionLabel)
    
    This gave more than 6000 profession codes, leading to 2 647 452 person ids, which is still too many.
  3. Start from a subset of profession codes that correspond to Gauquelin data:
    Wikidata id    Profession
    Q483501        artist
    Q2066131       athlete
    Q482980        author
    Q189290        military-officer
    Q82955         politician
    Q39631         physician
    Q901           scientist
    This solution involves 3 steps (sketches of these steps are given after this list):
    • Step 1: store the lists of profession codes.
    • Step 2: use these lists of profession codes to store lists of persons.
    • Step 3: use these lists of persons to store detailed person records.

    Preparatory code executed on 2019-11-02 gave 592 profession codes covering 790 394 person ids, which would mean downloading around 70 GB to the local machine.
  4. Download the full dump from dumps.wikimedia.org/wikidatawiki/entities/ to a local machine; on 2019-11-09, the file latest-all.json.bz2 was 43 GB.
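For the one-by-one retrieval mentioned in solution 1 (and for step 3 of solution 3), the full data of a single entity can be fetched through the Special:EntityData interface rather than through SPARQL. A minimal sketch, again assuming the requests library (the helper name fetch_entity is an assumption for this example, not the g5 implementation):

    # Minimal sketch: download the full JSON record of one Wikidata entity.
    # Special:EntityData/<id>.json returns all statements, labels and sitelinks.
    import requests

    def fetch_entity(qid):
        """Return the full JSON data of one Wikidata entity."""
        url = "https://www.wikidata.org/wiki/Special:EntityData/{}.json".format(qid)
        response = requests.get(url)
        response.raise_for_status()
        # The payload has the form {"entities": {"Q42": {...}}}
        return response.json()["entities"][qid]

    entity = fetch_entity("Q42")            # Q42 = Douglas Adams, used only as an example id
    print(entity["claims"]["P569"][0])      # raw statement for date of birth (P569)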
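For steps 1 and 2 of solution 3, one plausible approach (an assumption made for this sketch, not necessarily the queries used by the preparatory code) is to expand each top-level profession code into its sub-occupations with the "subclass of" property (P279), then to list the persons having one of these occupations (P106). The person ids obtained this way can then be fetched one by one as in the previous sketch (step 3):

    # Sketch of steps 1 and 2, assuming sub-occupations are reached through P279
    # and persons through P106 (occupation).
    # run_sparql() is the same helper as in the WDQS sketch above, repeated here
    # so that this sketch is self-contained.
    import requests

    def run_sparql(query):
        response = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": query, "format": "json"},
            headers={"User-Agent": "g5-wikidata-exploration (test script)"},
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    def sub_occupations(top_code):
        """Step 1: occupation codes reachable from a top-level profession code."""
        rows = run_sparql("SELECT ?occ WHERE { ?occ wdt:P279* wd:%s }" % top_code)
        return [row["occ"]["value"].split("/")[-1] for row in rows]

    def persons_with_occupation(occ_code):
        """Step 2: ids of the persons having a given occupation."""
        rows = run_sparql("SELECT ?person WHERE { ?person wdt:P106 wd:%s }" % occ_code)
        return [row["person"]["value"].split("/")[-1] for row in rows]

    # Example with Q39631 (physician)
    codes = sub_occupations("Q39631")
    print(len(codes), "occupation codes")
    print(len(persons_with_occupation(codes[0])), "persons for", codes[0])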
Coming back to solution 4: as it is possible to extract information from a Wikidata dump without uncompressing it, working with a full dump on a local machine seems to be the most convenient solution.
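A minimal sketch of such an extraction, assuming the dump has been downloaded as latest-all.json.bz2 in the current directory: the dump is a single JSON array with one entity per line, so it can be read as a compressed stream, one entity at a time, without ever writing the uncompressed JSON (hundreds of GB) to disk.

    # Minimal sketch: stream the compressed dump and keep only humans (P31 = Q5)
    # having at least one occupation (P106), printing their id and English label.
    import bz2
    import json

    def stream_entities(path):
        """Yield one entity (as a dict) per line of the compressed dump."""
        with bz2.open(path, "rt", encoding="utf-8") as dump:
            for line in dump:
                line = line.strip()
                if line in ("[", "]"):          # the dump is one big JSON array
                    continue
                yield json.loads(line.rstrip(","))

    for entity in stream_entities("latest-all.json.bz2"):
        claims = entity.get("claims", {})
        is_human = any(
            claim["mainsnak"].get("datavalue", {}).get("value", {}).get("id") == "Q5"
            for claim in claims.get("P31", [])
        )
        if is_human and "P106" in claims:
            label = entity.get("labels", {}).get("en", {}).get("value", "")
            print(entity["id"], label)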