Data transformations
Here is a summary of the data manipulated by the program.
- Auxiliary data
  Not directly used by the program; useful as a reference, to check that g5 has not introduced errors.
  Contain copies of original documents, like scans or files.
  Versioned in another repository: github.com/tig12/g5-aux
- Raw data
  Input of g5.
  Contain usable versions of auxiliary data, like files converted to UTF-8 or scans transformed to lists through OCR and human corrections.
  Conversion from auxiliary to raw data is done by humans, not by the program.
  Versioned with g5 code, in data/raw
- Intermediate data
  - Temporary data
    Raw data are sanitized, corrected, standardized and stored in temporary CSV files.
    These CSV files are stored by default in the directory data/tmp.
    Not versioned.
  - Human corrections and additions
    The conversion from raw to tmp uses human corrections stored in YAML files.
    Versioned with g5, in data/db.
- G5 database
  The files in data/tmp are then loaded and merged into a PostgreSQL database.
- Exports
gauquelin5
└── data
    ├── auxiliary   # can be absent or removed
    ├── raw         # location imposed (because versioned with g5)
    ├── tmp         # location set in config.yml
    ├── db          # location imposed (because versioned with g5)
    └── output      # location set in config.yml

Note: the fact that raw data are versioned with the program has an interesting consequence: cloning the g5 repository is enough to build the database from scratch.
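
To make the raw to tmp step more concrete, here is a minimal Python sketch of the idea: raw rows are sanitized, then human corrections are applied on top before the result is serialized as CSV. This is not g5's actual code. The column names, the ';' separator, the sanitizing rules and the shape of the corrections are all assumptions chosen for illustration; the corrections are written as a plain dict standing in for what a YAML file in data/db might contain.

```python
import csv
import io

# Hypothetical raw input: column names and the ';' separator are assumptions.
RAW_CSV = (
    "ID;NAME;BIRTH_DATE\n"
    "1; Dupont Jean ;1901-02-03\n"
    "2;MARTIN paul;1910-13-05\n"
)

# Hypothetical human corrections, keyed by record ID.
# Shown as the dict a YAML file might load into (the real format is not specified here).
CORRECTIONS = {
    "2": {"BIRTH_DATE": "1910-05-13"},
}


def sanitize(row):
    """Trim whitespace from every field and normalize the capitalization of NAME."""
    clean = {key: value.strip() for key, value in row.items()}
    clean["NAME"] = clean["NAME"].title()
    return clean


def raw_to_tmp(raw_text, corrections):
    """Return sanitized raw rows with the human corrections applied on top."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text), delimiter=";"):
        clean = sanitize(row)
        # A correction overrides only the fields it names.
        clean.update(corrections.get(clean["ID"], {}))
        rows.append(clean)
    return rows


def to_csv(rows):
    """Serialize rows as CSV text, as it could be written to a file under data/tmp."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0].keys()), delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


tmp_rows = raw_to_tmp(RAW_CSV, CORRECTIONS)
print(to_csv(tmp_rows))
```

Keeping the corrections in a separate, versioned structure rather than editing the raw rows means the raw data stay identical to their source, and the temporary files can be regenerated at any time.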