Gauquelin LERRCP - series A

LERRCP Series A consists of the first booklets published by the Gauquelin laboratory, in 1970 and 1971. Each booklet contains a list of famous persons used by the Gauquelins in their statistical experiments.
G5 uses Cura database version 5 to include these data.
Cura announces 15 940 persons in A files.
G5 finds 15 788 unique persons.
The difference comes from the presence of duplicates (the same person being listed in 2 or 3 different files).

This page describes the corrections made to files A1 to A6 while importing the html pages of cura.free.fr.
The structure of Cura web pages makes the import far from straightforward.
The main problems with these files are :
  • Names and birth information are given in two distinct lists that need to be merged.
  • The pages do not list legal birth times, but UTC (universal time).
These problems are partially handled by the current g5 code, but not completely.

G5 integration

Raw files are imported into the database with the following commands :
php run-g5.php gauq A raw2tmp small
php run-g5.php gauq A tweak2tmp
php run-g5.php gauq A addGeo small
php run-g5.php gauq A legalTime

php run-g5.php gauq A tmp2db
Input data are a copy of Cura html pages, located in data/raw/gauq/lerrcp/.
Step raw2tmp converts these html pages to CSV files stored in data/tmp/gauq/lerrcp.
Steps tweak2tmp, addGeo and legalTime apply corrections to these tmp files.
Step tmp2db finally imports the corrected tmp files into the database.

raw2tmp

The main task of this command is to merge the two lists (list with names and list with precise birth data) ; see section "Name restoration" below.

This command needs a parameter to indicate what it should print :
php run-g5.php gauq A1 raw2tmp
MISSING PARAMETER : raw2tmp needs a parameter to specify which output it displays. Can be :
  small : echoes only global results
  full : prints the details of problematic rows

Profession codes

In some files of series A, the precise profession codes are not associated with the records. This can be fixed thanks to the notices present on Cura pages.
This information was included in the program (see constant PROFESSIONS_DETAILS of class g5\commands\gauq\A\A) ; each record is associated with its precise profession in the resulting csv file.

Small errors

902gdA1y.html

  • In page 902gdA1y.html, the two lists are inconsistent for one record :
    1817	3	25	C	185	F	5	16	24	0	48N 0	4W 6	29	CONCARNEAU
    and
    1817	3	5	Lebris Jean
    The birth certificate settles the case.
    Online civil registry : Registre 1 MI EC 53/10 Naissances, p 377 / 559 => "Né ce jour en cette ville à cinq heures du matin" ("Born this day in this town at five o'clock in the morning")
    => date = 1817-03-25 05:00
    Check that it matches the UT time given by Cura :
    From "Traité de l'heure dans le monde", TU = HLO ; 4°6' = 00:16:24 => hour = 05:16:24
    OK => The first line is exact, the second line must be replaced by
    1817	3	25	Lebris Jean
    This fix is included in the code of raw2tmp.
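
The check above relies on the rule that local mean time differs from UT by 4 minutes of time per degree of longitude. A minimal sketch of that arithmetic, written in Python for brevity ; the function name longitude_to_offset is hypothetical and is not part of the g5 code :

```python
from datetime import datetime, timedelta

def longitude_to_offset(degrees, minutes, west):
    """Return the local-mean-time offset from UT for a longitude.

    The Earth turns 360 degrees in 24 h, so 1 degree = 240 s of time
    and 1 arc minute = 4 s of time. West longitudes are behind UT.
    """
    seconds = degrees * 240 + minutes * 4
    offset = timedelta(seconds=seconds)
    return -offset if west else offset

# Record 185 of A1: Concarneau, longitude 4W 6, born at 05:00 local time.
offset = longitude_to_offset(4, 6, west=True)
local = datetime(1817, 3, 25, 5, 0, 0)
ut = local - offset  # UT = local time - offset
print(abs(offset), ut)  # 0:16:24 1817-03-25 05:16:24
```

The result 05:16:24 is the H, MN, SEC triple (5, 16, 24) of the first line.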

NUM with exclamation marks

Some records of series A have a ! in their NUM :
A1 : 909 1876
A2 : 2641
A4 : 159 320 439 1350 1443 1480 2136 2312
A5 : 1435 1557 1813 1829 2349 
A6 : 15 139 148 225 232 265 448 574 622 668 718 727 737 738  
This mark is not present in the pages that include names (for example, it is present in 902gdA1.html but not in 902gdA1y.html).
The explanation is given in the main Cura page : these records contained errors in the original publication (LERRCP), which were corrected in Françoise Gauquelin's journal Astro-Psychological Problems.
This information is not yet included in the g5 database.

Name restoration

This problem is handled by step raw2tmp.

Each page of series A contains 3 lists :
  • One list with precise birth data, but without names.
  • Two lists with names, but without precise birth data ; these two lists are sorted differently. I assumed (but didn't check by program) that these two lists are equivalent.

The purpose is to obtain records containing both precise birth data and the name of the person.
To summarize, this can be partially done by program from cura.free.fr web pages ; some cases can be fixed by hand using the Gauquelin 1955 book.
raw2tmp matching could be improved for some cases, using newalchemypress.com data. See below, section "Benefit from other files".

Merge lists

The program must merge the two lists.
A check done by program
php run-g5.php cura A look lists
shows that the list with precise birth data differs from the lists with names (lists with names contain fewer persons), so a trivial merge is not possible.

Unfortunately, these 2 lists don't share a common unique identifier that would allow an unambiguous merge.
These two lists have the birth day in common. It was used to perform the merge, but a given birth day can correspond to several persons. In this case, ambiguity remains and can't be resolved by program.

To solve some ambiguities, the Gauquelin 1955 book was used in an iterative process :
  • Build two arrays with birth day as key.
  • Merge the clear cases, with only one person for a given day.
  • Print the ambiguous cases.
  • Look in the Gauquelin 1955 book to see whether the ambiguous persons are present.
  • Inject the information in the program (constant CORRECTIONS_1955 in class g5\commands\cura\A\A).
  • Run the program again.
Corrections using Gauquelin 1955 have so far been made only for files A1 and A2.

For the cases that could not be solved, a name like "Gauquelin-A1-1352" was built, using file name and NUM field.
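
The merge described above can be sketched as follows. This is an illustration in Python, not the g5 PHP code : the function merge_by_day, the row layout (keys NUM, DAY, NAME) and the corrections dictionary (a stand-in for the CORRECTIONS_1955 constant) are assumptions. Apart from record 185 (Lebris Jean), the sample rows are made up.

```python
from collections import defaultdict

def merge_by_day(data_rows, name_rows, corrections, file_name):
    """Merge birth-data rows with name rows, using the birth day as key.

    data_rows:   list of dicts with keys 'NUM' and 'DAY'
    name_rows:   list of dicts with keys 'DAY' and 'NAME'
    corrections: dict NUM -> name, filled by hand (hypothetical stand-in
                 for the CORRECTIONS_1955 constant)
    """
    names_by_day = defaultdict(list)
    for row in name_rows:
        names_by_day[row['DAY']].append(row['NAME'])
    data_by_day = defaultdict(list)
    for row in data_rows:
        data_by_day[row['DAY']].append(row)

    merged, ambiguous = [], []
    for day, rows in data_by_day.items():
        names = names_by_day.get(day, [])
        for row in rows:
            if row['NUM'] in corrections:
                name = corrections[row['NUM']]  # fixed by hand
            elif len(rows) == 1 and len(names) == 1:
                name = names[0]  # clear case: one person for this day
            else:
                # Unsolved case: build a name from file name and NUM.
                name = f"Gauquelin-{file_name}-{row['NUM']}"
                ambiguous.append(row['NUM'])
            merged.append({**row, 'NAME': name})
    return merged, ambiguous

# Made-up sample: one clear day, one day shared by two persons.
data_rows = [
    {'NUM': 185, 'DAY': '1817-03-25'},
    {'NUM': 1, 'DAY': '1900-01-01'},
    {'NUM': 2, 'DAY': '1900-01-01'},
]
name_rows = [
    {'DAY': '1817-03-25', 'NAME': 'Lebris Jean'},
    {'DAY': '1900-01-01', 'NAME': 'Person X'},
    {'DAY': '1900-01-01', 'NAME': 'Person Y'},
]
merged, ambiguous = merge_by_day(data_rows, name_rows, {2: 'Person Y'}, 'A1')
for row in merged:
    print(row['NUM'], row['NAME'])
print('ambiguous:', ambiguous)
```

Each pass prints the remaining ambiguous NUMs ; adding entries to the corrections dictionary and running again shrinks that list, which mirrors the iterative process.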

The results of this name matching are :
Series   OK                  Not OK
A1       1968 (94.3 %)       119 (5.7 %)
A2       3436 (94.32 %)      207 (5.68 %)
A3       2640 (86.67 %)      406 (13.33 %)
A4       2486 (91.4 %)       234 (8.6 %)
A5       2184 (90.62 %)      226 (9.38 %)
A6       1262 (62.29 %)      764 (37.71 %)
TOTAL    13 976 (87.72 %)    1956 (12.28 %)

Benefit from other files

Fortunately, name restoration can be improved because other files contain records in common with series A :
External file             Cura files
Ertel 4391 sportsmen      A1
Müller 1083 physicians    A2
Müller 402 writers        A1, A2, A4, A6
Gauquelin 1955            A1, A2, A3, A4, A5, A6
Names are modified when these files are imported into the database.

Name fixes coming from the Ertel file are applied with this command :
php run-g5.php ertel sport fixA1 update
It restores 100 % of the names not identified by step raw2tmp in A1.
See page on Ertel 4391, paragraph "Fix cura A1 names".

NB : raw2tmp leaves 118 names unidentified and this step restores 117, because one name is handled by step tweak2tmp.

Arno Müller's file of 1083 medical doctors is used to improve names and birth days in A2.
This fixes only 12 unidentified names in A2.
See page on Müller 1083, paragraphs "Fix Gauquelin names" and "Fix Gauquelin days".

tweak2tmp

Auxiliary YAML files located in data/db/init/lerrcp-tweak store the corrections used to fix errors detected in steps addGeo and legalTime.

addGeo

This step corrects place names and computes the geonames id for non-ambiguous cases.
It is useful for several purposes:
  • Correct city names, which are approximate in Cura web pages.
  • Add precision to longitudes, which is useful for legal time computation.
    Longitudes given in Cura web pages are precise to 1 arc minute, and are sometimes erroneous.
    See for example record A1-558 Charles Farroux ; the longitude given by Cura is 2°05', while geonames gives 2°59'59''
  • Prepare matching with wikidata.
It uses an auxiliary postgresql database where geonames information is stored (see page install).

This command needs a parameter to indicate what it should print :
php run-g5.php gauq A1 addGeo
WRONG USAGE - This command needs a parameter indicating the type of report
- full : lists all the place names that can't be matched to geonames.
- medium : lists the places with several matches to geonames.
- small : only echoes global information about matching.
It modifies records of data/tmp/gauq/lerrcp/ only if there is a unique match to geonames. This restrictive rule gives low matching rates but guarantees that the command doesn't introduce errors.
When a match is found, it fills column GEOID and updates column PLACE (because place names in geonames are generally better).
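
The unique-match rule can be illustrated by a short sketch. It is written in Python and is not the g5 implementation : the function apply_geonames_match, the shape of the candidate rows and the geoid value are hypothetical.

```python
def apply_geonames_match(record, candidates):
    """Update a record only when exactly one geonames candidate exists.

    record:     dict with at least the key 'PLACE'
    candidates: list of dicts with keys 'geoid' and 'name' (hypothetical
                shape for the rows returned by the geonames lookup)
    Returns a (record, status) pair; status is 'matched', 'ambiguous'
    or 'unmatched'.
    """
    if len(candidates) == 1:
        match = candidates[0]
        updated = {**record, 'GEOID': match['geoid'], 'PLACE': match['name']}
        return updated, 'matched'
    # Zero or several candidates: leave the record untouched.
    status = 'ambiguous' if candidates else 'unmatched'
    return record, status

record = {'NUM': 798, 'PLACE': 'PLEYBEN'}
# 12345 is a placeholder, not a real geonames id.
updated, status = apply_geonames_match(record, [{'geoid': 12345, 'name': 'Pleyben'}])
print(status, updated)
same, status2 = apply_geonames_match(record, [])
print(status2, same)
```

The three statuses correspond to what the reports list : unmatched places in full, places with several matches in medium.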

See page about Geonames.

The problem of time

Gauquelin data available on Cura web pages are sometimes given in a format that causes a problem.
This problem occurs in all files of series A, B, F, NH, and in files D9a, D9b, D9c. Fortunately, D6, D10, E1, E3 are not affected.

For example, record n° 798 of file A1 is given like this in cura.free.fr :
YEA	MON	DAY	PRO	NUM	COU	H	MN	SEC	TZ	LAT	LON	COD	CITY		
(...)
1909	3	14	C	798	F	4	50	40	0	48N 0	4W 6	29	PLEYBEN
This means that birth time is 04:50:40 and that timezone offset = 0.

This can be compared with a scan of the Gauquelins' LERRCP publication provided on cura.free.fr, which shows that the same record was given in a different form :
[Image : Record A1-798 of Gauquelins' LERRCP]
This means that birth time is 05:00:00, and timezone offset is not provided.

Cura's value 04:50:40 is the result of a computation : legal time - timezone offset.
This is consistent with the notice found in Cura A1 page :
"Birthtime is converted to standard time, either for zone 0 (= GMT) or -1 (= CET)"
The way birth times are expressed on the Cura site adds information (the timezone offset). This is convenient, because the information can be used directly to compute a birth chart.
But expressing times this way should be avoided because two distinct pieces of information are mixed in one field :
  • The legal time, as it can be read in civil registries.
  • The timezone offset.
This is a problem because anyone who wants to check the data cannot check the legal time without also dealing with the timezone offset.
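
For record A1-798, the relation between the two values can be written out explicitly. The short Python sketch below is only an illustration ; the offset 00:09:20 is not stated on either source, it is deduced here as the difference between the LERRCP time and the Cura time.

```python
from datetime import datetime

# Record A1-798: legal time read from the LERRCP booklet,
# and the time published by Cura.
legal = datetime(1909, 3, 14, 5, 0, 0)
cura = datetime(1909, 3, 14, 4, 50, 40)

# Cura's value = legal time - timezone offset,
# so the offset is the difference between the two.
offset = legal - cura
print(offset)  # 0:09:20

# Storing legal time and offset separately lets each be checked on its own,
# and Cura's value can be recomputed at any time.
recomputed = legal - offset
print(recomputed == cura)  # True
```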

Expressing times this way causes a loss of information.
If the data is not correct, it is not possible to answer the question : does the error come from the legal time or from the timezone offset computation ?

The command :
php run-g5.php cura A legalTime
adds two columns, DATE-C (= date corrected) and TZO (timezone offset) to the tmp files of data/tmp/gauq/lerrcp.

Current code performs the restoration only for persons born in France, excluding all cases that can't be fixed by program without ambiguity (ambiguity comes from World Wars 1 and 2, for parts of France that were invaded by Germany ; the precise timezone offset depends on local conditions ; see page about timezone).

Example for record A1-1 Alard Pierre
Information extracted from Cura file gives
DATE-UT = 1937-09-17 17:00:00
Command legalTime adds 2 columns:
DATE-C = 1937-09-17 18:00
TZO = +01:00

This is OK, but a problem appears in France for dates prior to 1891-03-15 because timezone offset computation involves longitude:
Example for record A1-2 André Georges
Information extracted from Cura file gives
DATE-UT = 1889-08-13 12:20:40
Command legalTime adds 2 columns:
DATE-C = 1889-08-13 12:30:04
TZO = +00:09:24

Here the birth time is obviously 12:30 and not 12:30:04.
A probable explanation is that the longitudes given in Cura files are not precise enough to permit an exact computation.
If this hypothesis is correct, it means that the UT times given in Cura pages are not exact. Fortunately, this error is small (max observed = 3 minutes of time), but it illustrates the problems generated by providing UT times instead of legal times.

This can be fixed by program (infer the correct legal time and recompute the timezone offset), but it has not been implemented yet.
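
One possible way to do this is sketched below in Python. It is not g5 code : the function infer_legal_time and the rounding step are assumptions. The sketch supposes that civil registries recorded times as multiples of 5 minutes, rounds the computed legal time accordingly, then recomputes the offset from the UT value.

```python
from datetime import datetime, timedelta

def infer_legal_time(date_ut, date_c, step_minutes=5):
    """Round a computed legal time and recompute the timezone offset.

    Assumption: times in civil registries were written as multiples of
    step_minutes; this rounding step is a guess, not a g5 rule.
    """
    step = step_minutes * 60
    midnight = date_c.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds = (date_c - midnight).total_seconds()
    # Round to the nearest multiple of the step.
    rounded = midnight + timedelta(seconds=round(seconds / step) * step)
    return rounded, rounded - date_ut

# Record A1-2 André Georges
date_ut = datetime(1889, 8, 13, 12, 20, 40)
date_c = datetime(1889, 8, 13, 12, 30, 4)
legal, tzo = infer_legal_time(date_ut, date_c)
print(legal, tzo)  # 1889-08-13 12:30:00 0:09:20
```

Rounding to a 5-minute step only absorbs errors below 2 minutes 30 seconds ; with the 3-minute maximum observed, a larger step or a manual check would be needed for the worst cases.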

Restoration rates are low, except for A4 :
A1 : restored 1029 / 2087 dates (49.31 %)
A2 : restored 1614 / 3643 dates (44.3 %)
A3 : restored 1003 / 3046 dates (32.93 %)
A4 : restored 2333 / 2720 dates (85.77 %)
A5 : restored 882 / 2410 dates (36.6 %)
A6 : restored 776 / 2026 dates (38.3 %)