Manual tests to retrieve lists of persons of a given professional group showed that full details of the persons cannot be obtained, because WDQS times out for queries that ask for too much information about each person.
Details: example of a query which times out
Query to retrieve mathematicians (Q170790) with information about each person:

SELECT DISTINCT ?person ?personLabel ?familynameLabel ?givennameLabel ?linkcount ?isni ?macTutor
                ?birthdate ?birthplace ?birthplaceLabel ?birthiso3166 ?birthgeonamesid ?birthcoords
                ?deathdate ?deathplace ?deathplaceLabel ?deathiso3166 ?deathgeonamesid ?deathcoords
                ?deathcause ?deathcauseLabel
WHERE {
  ?person wdt:P106 wd:Q170790;
          wdt:P734 ?familyname;
          wdt:P735 ?givenname;
          wdt:P569 ?birthdate;
          wdt:P19 ?birthplace;
          wikibase:sitelinks ?linkcount .
  OPTIONAL { ?person wdt:P1563 ?macTutor } .
  OPTIONAL { ?person wdt:P213 ?isni } .
  # birth
  ?birthplace wdt:P625 ?birthcoords .
  OPTIONAL { ?birthplace wdt:P1566 ?birthgeonamesid } .
  OPTIONAL { ?birthplace wdt:P17 ?birthcountry } .
  OPTIONAL { ?birthcountry wdt:P297 ?birthiso3166 } .
  # death
  OPTIONAL { ?person wdt:P570 ?deathdate } .
  OPTIONAL { ?person wdt:P20 ?deathplace } .
  OPTIONAL { ?deathplace wdt:P625 ?deathcoords } .
  OPTIONAL { ?deathplace wdt:P1566 ?deathgeonamesid } .
  OPTIONAL { ?deathplace wdt:P17 ?deathcountry } .
  OPTIONAL { ?deathcountry wdt:P297 ?deathiso3166 } .
  OPTIONAL { ?person wdt:P509 ?deathcause } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?linkcount)
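The failure can be reproduced programmatically. Below is a minimal sketch, assuming Python with the requests library, of how such a query can be submitted to the public WDQS endpoint (https://query.wikidata.org/sparql); when the server-side execution limit (about 60 seconds) is exceeded, the request comes back as an HTTP error.

    import requests

    WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

    def run_query(sparql, timeout=90):
        """Send a SPARQL query to WDQS and return the result bindings.

        A query that exceeds the server-side execution limit comes back
        as an HTTP error, which raise_for_status() turns into an exception.
        """
        response = requests.get(
            WDQS_ENDPOINT,
            params={"query": sparql, "format": "json"},
            # Wikimedia asks clients to identify themselves; this name is a placeholder.
            headers={"User-Agent": "wikidata-person-tests/0.1"},
            timeout=timeout,
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    # A small query like this one succeeds; the full query above does not.
    rows = run_query("SELECT ?human WHERE { ?human wdt:P31 wd:Q5 } LIMIT 10")
    print(len(rows))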
Obtaining the full details is nevertheless possible, because:
- Wikidata makes it possible to retrieve the full information about a person through URLs like https://www.wikidata.org/wiki/Special:EntityData/Q6256830.json (see the sketch after this list).
- Tests showed that queries asking only for minimal information about persons (their Wikidata id and their name) don't time out.
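The first point is what makes a person-by-person download possible. A minimal sketch, assuming Python with the requests library (Q6256830 is just the example id used above):

    import requests

    def fetch_entity(qid):
        """Download the full JSON record of one Wikidata entity
        through the Special:EntityData interface (no WDQS involved)."""
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        response = requests.get(
            url,
            # Placeholder User-Agent; Wikimedia asks clients to identify themselves.
            headers={"User-Agent": "wikidata-person-tests/0.1"},
            timeout=30,
        )
        response.raise_for_status()
        # Payload shape: {"entities": {"Q6256830": {"labels": ..., "claims": ..., ...}}}
        return response.json()["entities"][qid]

    entity = fetch_entity("Q6256830")
    print(entity["labels"].get("en", {}).get("value"))  # English label, if present
    print(len(entity.get("claims", {})))                # number of distinct properties used

Since this interface does not go through WDQS at all, the timeout problem disappears; what remains is the volume of data to download, which is what the solutions below try to reduce.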
Several solutions were tested:
- Retrieve the list of all humans and query them one by one. The following query retrieves the ids of all humans (a query that also asks for labels times out):
  SELECT ?human WHERE { ?human wdt:P31 wd:Q5 }
  It returned 5 489 277 records (executed on 2019-11-01). This is too much: the full data for a single human is around 100 KB (JSON format), which would give more than 500 GB of data (uncompressed) to download.
- Retrieve the list of occupation codes, in order to query only humans that have a profession code. The first step of this process was coded, with the following query as starting point:
  SELECT ?profession ?professionLabel WHERE { ?profession wdt:P31 wd:Q28640. SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } } ORDER BY (?professionLabel)
  This gave more than 6000 profession codes, leading to 2 647 452 person ids, which is still too much.
- Start from a subset of profession codes that correspond to Gauquelin data:

      Wikidata id   Profession
      Q483501       artist
      Q2066131      athlete
      Q482980       author
      Q189290       military-officer
      Q82955        politician
      Q39631        physician
      Q901          scientist

  - Step 1: store the lists of profession codes.
  - Step 2: use these lists of profession codes to store lists of persons.
  - Step 3: use these lists of persons to store detailed persons.

  Preparatory code executed on 2019-11-02 gave 592 profession codes containing 790 394 person ids, which would mean downloading around 70 GB to the local machine.
- Download the full dump from dumps.wikimedia.org/wikidatawiki/entities/ to a local machine; on 2019-11-09, the file latest-all.json.bz2 was 43 GB (a sketch for streaming this dump is given below).
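For the last option, the dump does not need to be decompressed on disk before use. A minimal sketch, assuming Python and the usual layout of latest-all.json (a single JSON array with one entity per line, each line ending with a comma); the filter on P106 (occupation) is only an illustration of how the profession-based selection above could be applied to the dump:

    import bz2
    import json

    def iter_entities(path="latest-all.json.bz2"):
        """Stream entities out of the compressed dump one at a time,
        without decompressing the whole file to disk."""
        with bz2.open(path, mode="rt", encoding="utf-8") as dump:
            for line in dump:
                line = line.strip().rstrip(",")
                if line in ("[", "]", ""):  # skip the array delimiters and empty lines
                    continue
                yield json.loads(line)

    # Example: keep only humans (P31 = Q5) that have an occupation (P106).
    for entity in iter_entities():
        claims = entity.get("claims", {})
        is_human = any(
            stmt["mainsnak"].get("datavalue", {}).get("value", {}).get("id") == "Q5"
            for stmt in claims.get("P31", [])
        )
        if is_human and "P106" in claims:
            print(entity["id"])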