https://newalchemypress.com/gauquelin/gauquelin_docs/mom_dad_kid_final20916_3a-m_column_first.pdf
Contains Gauquelin heredity data
N = 20 916
The pdf was first converted to a csv :
- Convert the PDF to text :
  pdftotext -layout mom_dad_kid_final20916_3a-m_column_first.pdf tmp1.txt
- Edit tmp1.txt and do a manual find and replace : one special character is replaced by nothing. About 20 incorrectly formatted lines also had to be corrected by hand (stray newlines, and some months and years not separated by a space).
- The following command generates the csv :
- Remove blank lines
- Remove useless lines (footer containing the url of the file)
- Change strings like '48N 5' to '48N05'
- Convert multiple spaces to one space
- Convert spaces to semicolon (;)
- Generate a csv file :
  grep -v -e '^$' tmp1.txt | \
  grep -v 'file:///' | \
  sed -e 's#\([0-9][NSEW]\) \([0-9]\)#\10\2#g' | \
  tr -s '[:space:]' | \
  sed -e 's/^ //g' | \
  sed -e 's/ /;/g' \
  > gq-heredity-newalch-20916.csv
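The cleaning steps listed above can be tried on a small sample before running them on the whole file. The sketch below wraps the same filters in a function that reads standard input, and adds a helper that counts how many lines have each number of fields, which may help locate lines that still need manual correction. The function names (clean_layout_text, field_counts) and the sample line are hypothetical and are not taken from the real data.

```shell
#!/bin/sh
# Sketch only: same filters as the command above, reading from stdin.
# Function names and the sample line are illustrative assumptions.

clean_layout_text() {
  grep -v -e '^$' |                                   # drop empty lines
  grep -v 'file:///' |                                # drop footer lines containing the file url
  sed -e 's#\([0-9][NSEW]\) \([0-9]\)#\10\2#g' |      # '48N 5' -> '48N05'
  tr -s '[:space:]' |                                 # squeeze repeated whitespace
  sed -e 's/^ //' |                                   # drop the leading space
  sed -e 's/ /;/g'                                    # spaces -> semicolons
}

# Print "<number of fields> <number of lines>" pairs, sorted by field count.
field_counts() {
  awk -F';' '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Made-up sample: one data line, one empty line, one footer line.
printf '  1   DUPONT   48N 5   2E20\n\nfile:///tmp/x.pdf\n' | clean_layout_text
```

On the generated file, running `field_counts < gq-heredity-newalch-20916.csv` should ideally report a single field count; any other count points at lines that were not split as expected.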