Gauquelin LERRCP - series A

This page describes the corrections made to files A1 to A6 while importing the HTML pages from cura.free.fr.
The main tasks for these files are:
  • Reconstitute the names.
  • Restore legal birth times.
cura.free.fr announces 15 940 persons in the A files. G5 finds 15 790 unique persons; the difference comes from duplicates (one person listed in 2 or 3 files).
Raw files are imported into the database with the following commands:
php run-g5.php cura A raw2tmp small
php run-g5.php cura A addGeo small
php run-g5.php cura A tmp2db

raw2tmp

This command needs a parameter indicating what it should print. Called without it, the command lists the possible values:
php run-g5.php cura A raw2tmp
MISSING PARAMETER : raw2tmp needs a parameter to specify which output it displays. Can be :
  small : echoes only global results
  full : prints the details of problematic rows

Profession codes

In some files of series A, the precise profession codes are not associated with the records. This can be fixed using the notices present on the cura.free.fr pages.
This information was included in the program (see constant PROFESSIONS_DETAILS of class g5\commands\cura\A\A); each record is associated with its precise profession in the resulting csv file.

The page "output format" contains the list of profession codes used in the generated files.

Name restoration

Each page of series A contains 3 lists:
  • One list with precise birth data, but without names.
  • Two lists with names, but without precise birth data; these two lists are sorted differently. I assumed (but did not check by program) that these two lists are equivalent.

The purpose is to obtain records containing both the precise birth data and the name of the person.
In summary, this can be partially done by program from the cura.free.fr web pages; some remaining cases can be fixed by hand using the Gauquelin 1955 book.
The matching done by raw2tmp can be improved for some cases using newalchemypress.com data. See below, paragraphs Use Ertel 4391 and Use Müller 1083.

Merge lists

The program must merge the list containing birth data with the lists containing names.
A check done by program
php run-g5.php cura A build lists
shows that the list with precise birth data differs from the lists with names (the lists with names contain fewer persons), so a trivial merge is not possible.

Unfortunately, these lists do not share a unique identifier that would allow an unambiguous merge.
The only field they have in common is the birth day. It was used to perform the merge, but a given birth day can correspond to several persons. In that case the ambiguity remains and cannot be resolved by program.

To resolve some ambiguities, the Gauquelin 1955 book was used in an iterative process:
  • Build two arrays with birth day as key.
  • Merge the clear cases, with only one person for a given day.
  • Print the ambiguous cases.
  • Look in the Gauquelin 1955 book to see whether the ambiguous persons are present.
  • Inject the information in the program (constant CORRECTIONS_1955 in class g5\commands\cura\A\A).
  • Execute again.
So far, corrections using Gauquelin 1955 have been done only for files A1 and A2.

For the cases that could not be resolved, a name like "Gauquelin-A1-1352" was built from the file name and the NUM field.
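
The merge described above can be sketched as follows. This is only an illustration in Python, not the PHP code of g5: the function name, the row layout (keys NUM, DAY, NAME) and the sample rows are assumptions, and the corrections argument is a hypothetical stand-in for the constant CORRECTIONS_1955.

```python
from collections import defaultdict


def merge_by_birth_day(data_rows, name_rows, file_name, corrections=None):
    """Merge rows with birth data and rows with names, using the birth day as key.

    data_rows   : list of dicts with keys 'NUM' and 'DAY'
    name_rows   : list of dicts with keys 'DAY' and 'NAME'
    corrections : optional dict NUM -> name, filled by hand
                  (hypothetical stand-in for CORRECTIONS_1955)
    Returns (merged rows, rows that remain ambiguous).
    """
    corrections = corrections or {}
    data_by_day = defaultdict(list)
    names_by_day = defaultdict(list)
    for row in data_rows:
        data_by_day[row["DAY"]].append(row)
    for row in name_rows:
        names_by_day[row["DAY"]].append(row["NAME"])

    merged, ambiguous = [], []
    for day, rows in data_by_day.items():
        names = names_by_day.get(day, [])
        for row in rows:
            if row["NUM"] in corrections:
                # Case fixed by hand
                name = corrections[row["NUM"]]
            elif len(rows) == 1 and len(names) == 1:
                # Clear case: only one person for this day in both lists
                name = names[0]
            else:
                # Unresolved case: build a placeholder name
                name = f"Gauquelin-{file_name}-{row['NUM']}"
                ambiguous.append(row)
            merged.append({**row, "NAME": name})
    return merged, ambiguous


# Made-up sample rows, for illustration only
data = [
    {"NUM": 1, "DAY": "1817-03-25"},
    {"NUM": 2, "DAY": "1850-01-01"},
    {"NUM": 3, "DAY": "1850-01-01"},
]
names = [
    {"DAY": "1817-03-25", "NAME": "Lebris Jean"},
    {"DAY": "1850-01-01", "NAME": "Name X"},
    {"DAY": "1850-01-01", "NAME": "Name Y"},
]
merged, ambiguous = merge_by_birth_day(data, names, "A1")
for row in merged:
    print(row["NUM"], row["NAME"])
```

In this sketch, the rows printed as ambiguous are the ones to look up in the book; the names found are then passed through corrections and the function is executed again.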

The results of this name matching are:
Series  OK                 Not OK
A1      1968 (94.3 %)      119 (5.7 %)
A2      3436 (94.32 %)     207 (5.68 %)
A3      2640 (86.67 %)     406 (13.33 %)
A4      2486 (91.4 %)      234 (8.6 %)
A5      2184 (90.62 %)     226 (9.38 %)
A6      1262 (62.29 %)     764 (37.71 %)
TOTAL   13 976 (87.72 %)   1956 (12.28 %)

Benefit from other files

Fortunately, name restoration can be improved because other files contain records in common with the Cura files:
External file             Cura files
Ertel 4391 sportsmen      A1
Müller 1083 physicians    A2
Müller 402 writers        A1, A2, A4, A6
Names are modified when these files are imported into the database.

Small errors

902gdA1y.html

  • In page 902gdA1y.html, the two lists are inconsistent for one record:
    1817	3	25	C	185	F	5	16	24	0	48N 0	4W 6	29	CONCARNEAU
    and
    1817	3	5	Lebris Jean
    The birth certificate resolves the case.
    Online civil registry: Registre 1 MI EC 53/10 Naissances, p 377 / 559 => "Né ce jour en cette ville à cinq heures du matin" ("Born this day in this town at five o'clock in the morning")
    => date = 1817-03-25 05:00
    Check that it matches the UT time given by Cura:
    From "Traité de l'heure dans le monde", TU = HLO; 4°6' of longitude = 00:16:24 in time => hour = 05:16:24
    OK => The first line is correct; the second line must be replaced by
    1817	3	25	Lebris Jean
    This fix is included in the code of raw2tmp.
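
The check above can be reproduced with a few lines of Python. This is an illustrative sketch, not code from g5; the function name is an assumption. It relies only on the fact that the Earth turns 360° in 24 hours, so 1° of longitude corresponds to 4 minutes of time and 1 arc minute to 4 seconds.

```python
from datetime import datetime, timedelta


def longitude_to_time_offset(degrees, arc_minutes):
    """Convert a longitude (degrees, arc minutes) to a time offset.

    1 degree of longitude = 4 minutes of time,
    1 arc minute of longitude = 4 seconds of time.
    """
    return timedelta(minutes=4 * degrees, seconds=4 * arc_minutes)


offset = longitude_to_time_offset(4, 6)   # 4°6' West
local = datetime(1817, 3, 25, 5, 0)       # local time from the birth certificate
ut = local + offset                       # West of Greenwich: UT is later than local time

print(offset)     # 0:16:24
print(ut.time())  # 05:16:24
```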

NUM with exclamation marks

Some records of series A have a ! in their NUM:
A1 : 909 1876
A2 : 2641
A4 : 159 320 439 1350 1443 1480 2136 2312
A5 : 1435 1557 1813 1829 2349 
A6 : 15 139 148 225 232 265 448 574 622 668 718 727 737 738  
These marks do not appear in the pages that include names (for example, they are present in 902gdA1.html but not in 902gdA1y.html).
The explanation is given on the main Cura page: they mark records that contained errors in the original publication (LERRCP) and were corrected in Françoise Gauquelin's journal Astro-Psychological Problems.
This information is not yet included in the g5 database.
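
One possible way to keep this information would be to separate the mark from the number when NUM is read. The following Python sketch is hypothetical and not part of g5; since the exact position of the ! inside the raw field is not described here, it accepts the mark anywhere in the field.

```python
def parse_num(raw):
    """Split a raw NUM field into (number, corrected).

    corrected is True when the field contains a "!".
    Assumption: the rest of the field is a plain integer.
    """
    text = raw.strip()
    corrected = "!" in text
    number = int(text.replace("!", "").strip())
    return number, corrected


print(parse_num("909!"))  # (909, True)
print(parse_num("910"))   # (910, False)
```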

addGeo

This step corrects place names and computes the geonames id for non-ambiguous cases.
The following documentation is obsolete: it corresponds to an earlier stage of development, before all imported files were merged into a database.

Better A1 names with ertel4391

The command:
php run-g5.php ertel sport fixA1 update
restores 100 % of the names not identified by step raw2csv in A1.
See the page on Ertel 4391, paragraph "Fix cura A1 names".

NB: raw2csv leaves 118 names unidentified and this step restores 117, because one name is handled by step tweak2tmp.

Better A2 days and names with muller1083

Arno Müller's file of 1083 medical doctors is used to improve names and birth days in A2.
This fixes only 12 unidentified names in A2.
See the page on Müller 1083, paragraphs "Fix Cura names" and "Fix Cura days".

legalTime

The command:
php run-g5.php cura A legalTime
adds a column DATE_C (= date corrected) to the generated files.
Example for record A2-1 (Joseph Jean Abadie):
DATE = 1873-12-15 15:59:40+00:00 : information from cura.free.fr unchanged
DATE_C = 1873-12-15 16:04+00:05:03 : time is modified and timezone offset (+00:05:03) is added.

A bug appears here: the legal birth time is probably 16:00, not the 16:04 shown in DATE_C. For dates prior to 1891-03-15, the timezone offset computation involves longitude and equation of time.
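
The arithmetic of this example can be checked with a short Python sketch (an illustration, not the g5 code). It assumes that DATE_C is simply DATE shifted by the offset, with the seconds dropped when the value is written; how the offset +00:05:03 itself is derived is not shown here.

```python
from datetime import datetime, timedelta, timezone

ut = datetime(1873, 12, 15, 15, 59, 40, tzinfo=timezone.utc)  # DATE
offset = timedelta(minutes=5, seconds=3)                      # +00:05:03
local = ut.astimezone(timezone(offset))                       # same instant, local offset

print(local.isoformat(sep=" "))  # 1873-12-15 16:04:43+00:05:03
```

The result, 16:04:43, is 4 minutes and 43 seconds away from a round 16:00, which is the discrepancy discussed above.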

Restoration rates are low :
A1 : restored 1029 / 2087 dates (49.31 %)
A2 : restored 1614 / 3643 dates (44.3 %)
A3 : restored 1003 / 3046 dates (32.93 %)
A4 : restored 2333 / 2720 dates (85.77 %)
A5 : restored 882 / 2410 dates (36.6 %)
A6 : restored 776 / 2026 dates (38.3 %)
The current code performs restoration only for persons born in France, and excludes all cases that cannot be fixed by program without ambiguity. Ambiguity comes from World Wars 1 and 2, for the parts of France that were invaded by Germany, where the precise timezone offset depends on local conditions. This can be improved by implementing the non-ambiguous cases for the other countries present in the A files.

Better A1 names and place with Gauquelin 1955

The restoration of the 1955 group "570 sportifs" is used to improve family names, given names and place names in file A1.
This step must be performed after newalch ertel4391 fixA1, because name spelling in the Gauquelin 1955 book is better, and after legalTime.

The corrections use the columns added for human corrections in the files of 3-g55-edited/ (these column names end with _55). Other columns of these files are not used. The reason is that once a file located in 3-g55-edited/ has been edited by a human, it is never updated by program again, so the columns generated by program may contain obsolete information (in fact they do, because the file 3-g55-edited/570SPO.csv was generated before the commands that add corrections were written).

Names

The command
php run-g5.php g55 570SPO edited2cura name list
lists the differences between the names of Cura A files and those of Gauquelin 1955 groups.
Random checks, comparing the names with Wikipedia and other sources, show that Gauquelin names are generally better than Cura names (although Gauquelin 1955 names also contain errors).
The command
php run-g5.php g55 570SPO edited2cura name update
copies the contents of columns GIVEN_55 and FAMILY_55 to the files of 5-cura-csv/.

This command updates 59 names in file A1.

Places

Gauquelin 1955 place names are generally better than Cura place names.
The commands work as for names:
php run-g5.php g55 570SPO edited2cura place list
php run-g5.php g55 570SPO edited2cura place update

addGeo

This command needs a parameter indicating what it should print. Called without it, the command lists the possible values:
php run-g5.php cura A1 addGeo
WRONG USAGE - This command needs a parameter indicating the type of report
- full : lists all the place names that can't be matched to geonames.
- medium : lists the places with several matches to geonames.
- small : only echoes global information about matching.
It modifies the records of 5-cura-csv/ only if there is a unique match in geonames.
When a match is found, it fills column GEOID and updates column PLACE (because place names in geonames are generally better).
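
The rule "modify only on a unique match" can be sketched as below. This Python fragment is an illustration, not the g5 implementation: the function name, the shape of the candidates list and the geonames id used in the example are made up.

```python
def add_geo(record, candidates):
    """Apply the unique-match rule to one record.

    record     : dict with at least the key 'PLACE'
    candidates : list of (geoid, name) pairs found in geonames for this
                 place (how they are looked up is outside this sketch)
    Returns (record, category); the record is changed only when there is
    exactly one candidate.
    """
    if len(candidates) == 1:
        geoid, name = candidates[0]
        return {**record, "GEOID": geoid, "PLACE": name}, "matched"
    if not candidates:
        return record, "no match"        # kind of case listed by report "full"
    return record, "several matches"     # kind of case listed by report "medium"


# The id 1 is a placeholder, not a real geonames id
print(add_geo({"PLACE": "CONCARNEAU"}, [(1, "Concarneau")]))
```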

See page about Geonames.