The raw data
Raw files
From data.gouv.fr, we download one file per year:
data.gouv.fr
└── datasets
    └── fichier-des-personnes-decedees
        ├── deces-1970.txt
        ├── ...
        └── deces-2025.txt
Each file contains one line per person, for a total of 28 917 511 lines.
Here are the first lines of deces-1970.txt:
DUCRET*MARIE ANTOINETTE/ 21922010901004AMBERIEU-EN-BUGEY 19701210014216
GRANGEON*ERIC JEAN REMY/ 11969032901004AMBERIEU-EN-BUGEY 19700425693831059
VELLET*PHILIPPE/ 11970020101004AMBERIEU-EN-BUGEY 197002030100412
PRESSAVIN*LYDIE/ 21970040601004AMBERIEU-EN-BUGEY 197004060100433
DOUAT*MARIE-SYLVIA MARTINE/ 21970070801004AMBERIEU-EN-BUGEY 1970070801053457
ROSIER*FELIX/ 11891112501004AMBERIEU-EN-BUGEY 197011143001215
BOUVEYRON*PIERRE/ 11900042701005AMBERIEUX-EN-DOMBES 19701211693832094
MILLET*MARIE-LOUISE/ 21900082901017ARGIS 19701225060885310
GIVORD*JACQUES/ 11910081201026BAGE-LE-CHATEL 19701124060884880
CROZET*MARIE CECILE/ 21904092101029BEAUPONT 19701102392093
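The padding between columns has been collapsed in the sample as displayed here. As an illustration only, here is a minimal Python sketch of how one raw line could be split. It assumes fixed-width fields whose widths follow the column sizes of the person table described in the next section (80, 1, 8, 5, 30, 8, 5, 9), plus a 30-character country field. These widths, the FIELDS list and the parse_line function are assumptions made for this example, not taken from g5.

```python
# Assumed fixed-width layout: (field name, width in characters).
# The widths are an assumption for illustration, not confirmed by the text.
FIELDS = [
    ("name", 80),      # "FAMILY NAME*GIVEN NAMES/"
    ("sex", 1),        # 1 or 2 in the sample lines
    ("bday", 8),       # birth date, YYYYMMDD
    ("bcode", 5),      # birth place code
    ("bname", 30),     # birth place name
    ("bcountry", 30),  # birth country (assumed width)
    ("dday", 8),       # death date, YYYYMMDD
    ("dcode", 5),      # death place code
    ("dact", 9),       # death act number
]


def parse_line(line: str) -> dict:
    """Split one fixed-width line into a dict of stripped string fields."""
    record = {}
    pos = 0
    for name, width in FIELDS:
        record[name] = line[pos:pos + width].strip()
        pos += width
    # The name field looks like "FAMILY*GIVEN NAMES/": split it in two.
    full_name = record.pop("name").rstrip("/")
    fname, _, gname = full_name.partition("*")
    record["fname"] = fname
    record["gname"] = gname
    return record


# Usage: rebuild the first sample line with the assumed padding.
sample = (
    "DUCRET*MARIE ANTOINETTE/".ljust(80)
    + "2" + "19220109" + "01004"
    + "AMBERIEU-EN-BUGEY".ljust(30)
    + "".ljust(30)
    + "19701210" + "01421" + "6".ljust(9)
)
person = parse_line(sample)
```

Keeping every field as a string preserves leading zeros in place codes such as 01004.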
death-fr.sqlite3
These text files are first loaded into a SQLite database, death-fr.sqlite3.
It contains a single table, person, with this structure:
create table person(
    fname       varchar(80),
    gname       varchar(80),
    sex         character(1),
    bday        character(8),
    bcode       character(5),
    bname       character(30),
    bcountry    varchar(80),
    dday        character(8),
    dcode       character(5),
    dact        character(9)
);
create index idx_bday ON person(bday);
create index idx_dday ON person(dday);
death-fr.sqlite3 is built by another program called g5, github.com/tig12/g5.
The build process is described on github.com/tig12/g5/tree/main/src/commands/enrich/deathfr.
php run-g5.php enrich deathfr raw2sqlite 1970-2025 > data/tmp/enrich/death-fr/sqlite-build-report.log
-------------------------------------------------------
Total Execution time: 547.3 s - 00:12:07
-------------------------------------------------------
28 917 511 lines parsed
28 803 832 lines inserted
----------------------- ERRORS ------------------------
ERR_NAME: 67 incorrect name - inserted anyway
ERR_BDAY: 112 808 incorrect birth day - not inserted
ERR_DDAY: 798 incorrect death day - not inserted
ERR_POSTERIOR: 71 birth posterior to death - not inserted
ERR_EXCEPTION: 2 exceptions
=> skipped 113 677 lines because of date problem (0.4 %)
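In this report, the three date-related counts (112 808 + 798 + 71) add up to the 113 677 skipped lines. The checks behind these error codes are not detailed here; the following Python sketch only shows one way such a classification could be written, assuming dates are 8-digit YYYYMMDD strings and that the birth date is tested first. The functions parse_yyyymmdd and classify_dates are hypothetical and are not the g5 implementation.

```python
from datetime import date
from typing import Optional


def parse_yyyymmdd(value: str) -> Optional[date]:
    """Return a date for a valid 8-digit YYYYMMDD string, else None."""
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        return date(int(value[:4]), int(value[4:6]), int(value[6:]))
    except ValueError:
        # Impossible dates, e.g. month 00, day 32 or year 0000.
        return None


def classify_dates(bday: str, dday: str) -> Optional[str]:
    """Return an error code for a row, or None when both dates are usable."""
    birth = parse_yyyymmdd(bday)
    if birth is None:
        return "ERR_BDAY"
    death = parse_yyyymmdd(dday)
    if death is None:
        return "ERR_DDAY"
    if birth > death:
        return "ERR_POSTERIOR"
    return None
```

With this ordering, a row is counted under a single error code, which is consistent with the counts summing exactly to the number of skipped lines.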
Cleaning the data
For this first try, the elimination of rows containing errors is incomplete:
- The day-of-birth distribution shows an excess of persons born on January 1st.
- The age-at-death distribution shows that some persons lived 144 years (the world record is 122 years)!
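Both symptoms can be looked for with plain SQL on the person table. The Python sketch below is an illustration, assuming that bday and dday hold YYYYMMDD strings as in the raw sample. It uses an in-memory database with three made-up rows, and the 122-year threshold is simply the world record mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the person table, with made-up rows for the example.
conn.execute(
    "create table person(fname, gname, bday character(8), dday character(8))"
)
rows = [
    ("A", "X", "19000101", "19700315"),  # born on January 1st
    ("B", "Y", "18260612", "19700920"),  # 144 years old: implausible
    ("C", "Z", "19220109", "19701210"),
]
conn.executemany("insert into person values (?, ?, ?, ?)", rows)

# Number of persons born on January 1st, to compare with total / 365.
jan1, total = conn.execute(
    "select sum(substr(bday, 5, 4) = '0101'), count(*) from person"
).fetchone()

# Age in completed years: integer difference of the YYYYMMDD values,
# divided by 10000 (SQLite truncates integer division).
too_old = conn.execute(
    """
    select fname, (cast(dday as integer) - cast(bday as integer)) / 10000 as age
    from person
    where (cast(dday as integer) - cast(bday as integer)) / 10000 > 122
    """
).fetchall()
```

With evenly spread births, roughly one person in 365 would be born on January 1st; a much larger share suggests that this date was used as a placeholder for unknown birth dates.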
Two variants
Full dataset
In the full dataset, the distributions of the interaspects between the same planets show an anomaly for all planets: in a significant excess of cases, a planet's position at death is the same as its position at birth.
But the distribution of age at death shows a peak of deaths just after birth, which could explain this anomaly.
Filtered dataset
That is why distributions were computed for a second variant, from which all persons deceased before their first birthday were removed. In this case, the interaspects between the same planets show no anomaly, which leads to the conclusion that the effect was due to demography.
Execution
To build the control groups, death-fr.sqlite3 is used directly, to avoid loading data.csv.bz2 into memory.
death-fr.sqlite3 is queried in batches of 1000 rows; the distributions are computed for these rows and stored in an intermediate database, var/studies/death-fr/tmp.sqlite3 (the distributions are encoded in JSON and stored in a text field).
At the end of each batch of 1000 rows, the current distributions are added to the distributions stored in tmp.sqlite3.
This makes it possible to stop and resume the computation of a control group without losing the computations already done.
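The stop-and-resume mechanism can be sketched as follows in Python. This is a simplified illustration, not the actual code: the state table, its columns, the use of rowid as a cursor and the functions init_tmp, chunk_distribution and run are all assumptions. A count of birth months stands in for the real distributions.

```python
import json
import sqlite3
from collections import Counter

CHUNK = 1000  # rows read from the source database at a time


def init_tmp(tmp: sqlite3.Connection) -> None:
    """Create the single-row state table (hypothetical layout) if missing."""
    tmp.execute(
        "create table if not exists state("
        "id integer primary key check (id = 1), "
        "last_rowid integer not null, "
        "distrib text not null)"
    )
    tmp.execute("insert or ignore into state values (1, 0, '{}')")
    tmp.commit()


def chunk_distribution(rows) -> Counter:
    """Stand-in for the real computation: count births per month."""
    return Counter(bday[4:6] for (_, bday) in rows)


def run(src: sqlite3.Connection, tmp: sqlite3.Connection, max_chunks=None) -> None:
    """Process the person table chunk by chunk, resuming where it stopped."""
    init_tmp(tmp)
    done = 0
    while max_chunks is None or done < max_chunks:
        last_rowid, stored = tmp.execute(
            "select last_rowid, distrib from state where id = 1"
        ).fetchone()
        rows = src.execute(
            "select rowid, bday from person where rowid > ? "
            "order by rowid limit ?",
            (last_rowid, CHUNK),
        ).fetchall()
        if not rows:
            break  # every row has been processed
        # Add the distribution of this chunk to the stored one.
        total = Counter(json.loads(stored))
        total.update(chunk_distribution(rows))
        # Save the sum and the new position in one transaction.
        with tmp:
            tmp.execute(
                "update state set last_rowid = ?, distrib = ? where id = 1",
                (rows[-1][0], json.dumps(total)),
            )
        done += 1
```

Because the position and the accumulated distributions are written in the same transaction, an interruption loses at most the chunk being processed; calling run again continues from the last saved rowid.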