G5 Organisation

The Gauquelin5 program gathers data from various sources and stores them in a database.
This involves numerous steps, called commands. G5 is a CLI (command-line interface) tool used to issue these commands.

Data transformations

Here is a summary of the data manipulated by the program.
Gauquelin5 data transformations
  • Auxiliary data
    Not directly used by the program. Useful as a reference, to check that g5 has not introduced errors.
    Contain copies of original documents, like scans or files.
    Versioned in another repository: github.com/tig12/g5-aux
  • Raw data
    Input of g5.
    Contain usable versions of auxiliary data, like files converted to UTF-8
    or scans transformed to lists through OCR and human corrections.
    Conversion from auxiliary to raw data is done by humans, not by the program.
    Versioned with g5 code, in data/raw
  • Intermediate data
    • Temporary data
      Raw data are sanitized, corrected, standardized and stored in temporary CSV files.
      These CSV files are stored by default in directory data/tmp.
      Not versioned
    • Human corrections and additions
      The conversion from raw to temporary data uses human corrections stored in YAML files.
      Versioned with g5, in data/db.
  • G5 database
    The files in data/tmp are then loaded and merged into a PostgreSQL database.
  • Exports
The data directory thus contains:
gauquelin5
    └── data
        ├── auxiliary   # can be absent or removed
        ├── raw         # location imposed (because versioned with g5)
        ├── tmp         # location set in config.yml
        ├── db          # location imposed (because versioned with g5)
        └── output      # location set in config.yml
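
The split between imposed and configurable locations can be illustrated with a minimal Python sketch. It is not g5's actual code: the function name, the configuration keys (tmp, output) and the default values are assumptions made for the example, and the configuration is passed as an already-parsed dictionary rather than read from config.yml.

```python
from pathlib import Path

# Locations imposed because they are versioned with the g5 code.
FIXED_DIRS = {"raw": "data/raw", "db": "data/db"}

# Hypothetical defaults for the locations that the configuration may override.
CONFIGURABLE_DEFAULTS = {"tmp": "data/tmp", "output": "data/output"}


def resolve_data_dirs(root, config):
    """Map each directory role to its path, given the repository root.

    `config` stands for the parsed content of config.yml; the key names
    "tmp" and "output" are assumptions, not g5's real configuration keys.
    """
    root = Path(root)
    dirs = {name: root / rel for name, rel in FIXED_DIRS.items()}
    for name, default in CONFIGURABLE_DEFAULTS.items():
        # An absolute path in the configuration replaces the root entirely.
        dirs[name] = root / config.get(name, default)
    return dirs


dirs = resolve_data_dirs("gauquelin5", {"tmp": "/var/tmp/g5"})
```

In this sketch, raw and db always resolve inside the repository, whereas tmp and output follow whatever the configuration says.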
Note: The fact that raw data are versioned with the program has an interesting consequence:
cloning the g5 repository is enough to build the database from scratch.
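
To make the raw-to-temporary step described above more concrete, here is a hedged Python sketch of how raw rows might be sanitized and combined with human corrections before being written as CSV. The column names, the semicolon delimiter and the shape of the corrections are hypothetical; the corrections are given as a dictionary, standing in for the content of a YAML file, and the sketch assumes that every row has no more fields than the header.

```python
import csv
import io


def raw_to_tmp(raw_text, corrections, key="ID"):
    """Return temporary CSV text built from raw CSV text.

    `corrections` maps a record identifier to the fields to override,
    e.g. {"2": {"DATE": "1901-02-28"}}. Corrections are assumed to touch
    existing columns only.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    fieldnames = [name.strip() for name in (reader.fieldnames or [])]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=";",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Sanitize: trim whitespace around column names and values.
        clean = {k.strip(): (v or "").strip() for k, v in row.items()}
        # Human corrections take precedence over the raw values.
        clean.update(corrections.get(clean[key], {}))
        writer.writerow(clean)
    return out.getvalue()


raw = "ID;NAME;DATE\n1; Dupont ;1900-01-01\n2;Martin;1901-02-30\n"
corrections = {"2": {"DATE": "1901-02-28"}}
tmp_csv = raw_to_tmp(raw, corrections)
```

Because the raw text and the corrections are both plain versioned inputs, running such a step again always yields the same temporary file, which is why the temporary data do not need to be versioned.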