Manual tests to retrieve lists of persons belonging to a given professional group showed that full details of the persons can't be obtained in a single query, because WDQS (the Wikidata Query Service) "times out" for queries that ask for too much information about each person.
Details : example of a query which times out
Query to retrieve mathematicians (Q170790) with information about each person :
SELECT DISTINCT ?person ?personLabel ?familynameLabel ?givennameLabel ?linkcount ?isni ?macTutor ?birthdate ?birthplace
?birthplaceLabel ?birthiso3166 ?birthgeonamesid ?birthcoords ?deathdate ?deathplace ?deathplaceLabel
?deathiso3166 ?deathgeonamesid ?deathcoords ?deathcause ?deathcauseLabel WHERE {
?person wdt:P106 wd:Q170790; # P106 = occupation
wdt:P734 ?familyname;
wdt:P735 ?givenname;
wdt:P569 ?birthdate;
wdt:P19 ?birthplace;
wikibase:sitelinks ?linkcount .
OPTIONAL { ?person wdt:P1563 ?macTutor } .
OPTIONAL { ?person wdt:P213 ?isni } .
# birth
?birthplace wdt:P625 ?birthcoords .
OPTIONAL { ?birthplace wdt:P1566 ?birthgeonamesid } .
OPTIONAL { ?birthplace wdt:P17 ?birthcountry }.
OPTIONAL { ?birthcountry wdt:P297 ?birthiso3166 }.
# death
OPTIONAL { ?person wdt:P570 ?deathdate } .
OPTIONAL { ?person wdt:P20 ?deathplace } .
OPTIONAL { ?deathplace wdt:P625 ?deathcoords }.
OPTIONAL { ?deathplace wdt:P1566 ?deathgeonamesid } .
OPTIONAL { ?deathplace wdt:P17 ?deathcountry }.
OPTIONAL { ?deathcountry wdt:P297 ?deathiso3166 }.
OPTIONAL { ?person wdt:P509 ?deathcause }.
#
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?linkcount)
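Details can nevertheless be obtained in two steps : first retrieve a list of person ids, then retrieve each person individually. This can be achieved because :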
- Wikidata makes it possible to retrieve full information about a person through URLs like https://www.wikidata.org/wiki/Special:EntityData/Q6256830.json.
- Tests showed that queries asking for minimal information about persons (their Wikidata id and their name) don't time out.
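The two points above can be sketched in Python with the standard library only. This is an illustration under stated assumptions, not the code actually used : the function names are hypothetical, persons are assumed to be selected through property P106 (occupation), and the User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def person_ids_query(occupation_qid):
    """Minimal query: only the ids of persons having the given occupation.

    Assumption: the professional group is matched with P106 (occupation).
    """
    return "SELECT ?person WHERE { ?person wdt:P106 wd:%s }" % occupation_qid


def entity_data_url(qid):
    """URL of the full JSON description of one entity."""
    return ENTITY_DATA.format(qid=qid)


def qid_from_uri(uri):
    """Extract the id from an entity URI such as http://www.wikidata.org/entity/Q42."""
    return uri.rsplit("/", 1)[-1]


def fetch_json(url, params=None):
    """GET a URL and decode its JSON response (needs network access)."""
    if params:
        url = url + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "persons-retrieval-sketch/0.1 (placeholder)"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def retrieve_persons(occupation_qid):
    """Yield (id, full entity) pairs: one light query, then one request per person."""
    params = {"query": person_ids_query(occupation_qid), "format": "json"}
    result = fetch_json(WDQS_ENDPOINT, params)
    for binding in result["results"]["bindings"]:
        qid = qid_from_uri(binding["person"]["value"])
        data = fetch_json(entity_data_url(qid))
        # Take the first entity returned rather than indexing by qid,
        # in case the requested id is a redirect to another entity.
        yield qid, next(iter(data["entities"].values()))
```

Only `retrieve_persons` touches the network ; in a real run, a pause between requests and local storage of each JSON file would probably be needed.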
Several solutions were tested :
- Retrieve the list of all humans and query them one by one. This query retrieves the ids of all humans (a query that also asks for labels times out) :
SELECT ?human WHERE { ?human wdt:P31 wd:Q5 }
This gives 5 489 277 records (execution 2019-11-01). This is too much : full data for a single human is around 100 KB (JSON format), which would mean downloading more than 500 GB of data (uncompressed).
- Retrieve the list of occupation codes, in order to retrieve only humans with a profession code. The first step of this process was coded, with the following query as starting point :
SELECT ?profession ?professionLabel WHERE { ?profession wdt:P31 wd:Q28640 . SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } } ORDER BY (?professionLabel)
This gave more than 6000 profession codes, leading to 2 647 452 person ids, which is still too much.
- Start from a subset of profession codes that correspond to Gauquelin data :
Wikidata id   Profession
Q483501       artist
Q2066131      athlete
Q482980       author
Q189290       military-officer
Q82955        politician
Q39631        physician
Q901          scientist
This solution involves 3 steps :
- Step 1 : store list of profession codes.
- Step 2 : use these lists of profession codes to store lists of persons.
- Step 3 : use these lists of persons to store detailed persons.
Preparatory code executed on 2019-11-02 gave 592 profession codes containing 790 394 person ids, which would imply downloading around 70 GB to a local machine.
- Download the full dump from dumps.wikimedia.org/wikidatawiki/entities/ to a local machine ; on 2019-11-09, the file latest-all.json.bz2 was 43 GB.
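For the dump-based solution, the file does not need to be decompressed on disk : it can be read as a stream. The sketch below is a possible approach, not the code actually used. It relies on the dump layout where the file is one JSON array with one entity per line ; the function names are hypothetical, and filtering on P31 = Q5 (human) and P106 (occupation) is an assumption about how persons of a professional group would be selected.

```python
import bz2
import json


def iter_dump_entities(path):
    """Yield entities one by one from a bz2-compressed Wikidata JSON dump."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Each entity sits on its own line, followed by a comma;
            # the first and last lines hold the array brackets.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def claim_ids(entity, prop):
    """Ids of the items given as values of a property in an entity's claims."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "somevalue" / "novalue" snaks carry no item id
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def iter_persons(path, occupation_ids):
    """Yield humans (P31 = Q5) having at least one of the given occupations (P106)."""
    wanted = set(occupation_ids)
    for entity in iter_dump_entities(path):
        if "Q5" not in claim_ids(entity, "P31"):
            continue
        if wanted.intersection(claim_ids(entity, "P106")):
            yield entity
```

A call such as `iter_persons("latest-all.json.bz2", ["Q170790"])` would then go through the whole dump once, keeping only one entity in memory at a time.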