Gauquelin heredity data

The source file comes from newalchemypress.com:
https://newalchemypress.com/gauquelin/gauquelin_docs/mom_dad_kid_final20916_3a-m_column_first.pdf

It contains Gauquelin heredity data.
N = 20 916
The PDF was first converted to a CSV file:
  • Convert the PDF to text:
    pdftotext -layout mom_dad_kid_final20916_3a-m_column_first.pdf tmp1.txt
  • Edit tmp1.txt by hand: with find and replace, delete every occurrence of one special character.
    About 20 badly formatted lines also had to be corrected (misplaced newlines, and some months and years not separated by a space).
  • The following command generates the CSV. It performs these steps:
    - Remove blank lines
    - Remove useless lines (the footer containing the URL of the file)
    - Change strings like '48N 5' to '48N05'
    - Squeeze multiple spaces into one space
    - Remove the leading space of each line
    - Convert spaces to semicolons (;)
    - Write the result to a CSV file:
    grep -v -e '^$' tmp1.txt | \
    grep -v 'file:///' | \
    sed -e 's#\([0-9][NSEW]\) \([0-9]\)#\10\2#g' | \
    tr -s '[:space:]' | \
    sed -e 's/^ //g' | \
    sed -e 's/ /;/g' \
    > gq-heredity-newalch-20916.csv
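
To see what each step does, the pipeline can be tried on a few lines before running it on the whole file. The sketch below wraps the same commands in a helper function and feeds it a tiny sample. The names to_csv and sample are hypothetical, and the sample lines are made up: their column layout is an assumption and does not reproduce the real file.

```shell
#!/bin/sh
# Hypothetical helper wrapping the pipeline above: reads pdftotext output
# on stdin and writes semicolon-separated lines on stdout.
to_csv() {
    grep -v -e '^$' | \
    grep -v 'file:///' | \
    sed -e 's#\([0-9][NSEW]\) \([0-9]\)#\10\2#g' | \
    tr -s '[:space:]' | \
    sed -e 's/^ //g' | \
    sed -e 's/ /;/g'
}

# Made-up sample: two data lines, one blank line, one footer line.
# The column layout is invented and does not reproduce the real file.
sample() {
    printf '%s\n' \
        '  1  12  3 1900  48N 5   2E20' \
        '' \
        'file:///tmp/example.html' \
        '  2   4 11 1925  45N46   4E50'
}

sample | to_csv
# prints:
# 1;12;3;1900;48N05;2E20
# 2;4;11;1925;45N46;4E50

# Count fields per line; a cleanly converted file should show one field count only.
sample | to_csv | awk -F';' '{ print NF }' | sort -n | uniq -c
```

The same awk check can be run on gq-heredity-newalch-20916.csv to spot lines that were split or merged during conversion. Assuming the file has no header line, wc -l on it should also report 20916.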