Gauquelin5 usage

Configure and start using the Gauquelin5 software.
This software has been developed and tested under Linux. In principle, it should also work under Windows and macOS.

It is used through the command line.

Installation

  1. Open a terminal and clone the repository on your local machine:
    git clone https://github.com/tig12/gauquelin5
    (or download the code).
  2. Install PHP (version 7.2 or higher) on your machine.
  3. Install the PECL extension "yaml".
    On Debian-based systems:
    sudo apt install php-yaml
    For other systems, see the PHP manual.
  4. Install PostgreSQL on your machine (see below for configuration).
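The installation steps above can be checked with a short script. This is only a sketch: the helper name version_at_least is hypothetical (not part of g5), and the presence of the psql client is used as a rough hint that PostgreSQL is installed, which is an assumption.

```shell
# Hypothetical helper, not part of g5.
# Succeeds if version $1 (e.g. 7.4.3) is at least "major.minor" $2 (e.g. 7.2).
version_at_least() {
    actual_major=${1%%.*}
    rest=${1#*.}
    actual_minor=${rest%%.*}
    required_major=${2%%.*}
    required_minor=${2#*.}
    [ "$actual_major" -gt "$required_major" ] ||
        { [ "$actual_major" -eq "$required_major" ] &&
          [ "$actual_minor" -ge "$required_minor" ]; }
}

if command -v php >/dev/null 2>&1; then
    php_version=$(php -r 'echo PHP_VERSION;')
    if version_at_least "$php_version" 7.2; then
        echo "php $php_version: ok"
    else
        echo "php $php_version: too old, 7.2 or higher is needed"
    fi
    # extension_loaded() tells whether the yaml extension is available
    if php -r 'exit(extension_loaded("yaml") ? 0 : 1);'; then
        echo "yaml extension: ok"
    else
        echo "yaml extension: missing"
    fi
else
    echo "php: not found in PATH"
fi

# Assumption: finding the psql client suggests PostgreSQL is installed.
if command -v psql >/dev/null 2>&1; then
    echo "psql client: ok"
else
    echo "psql client: not found in PATH"
fi
```

Each line of output reports one prerequisite, so a missing item can be installed before going further.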

Optional steps

  • Geonames.org matching uses a PostgreSQL database filled by Python code; see the geonames page (only useful for some commands).
  • Wikidata retrieval also needs the curl and sqlite3 PHP extensions.
    This is not necessary for data restoration, only to retrieve Wikidata data on the local machine:
    sudo apt install php-curl
    sudo apt install php-sqlite3
            

Directory structure

The important files and directories are:
gauquelin5/
    ├── data/
    │   ├── build/
    │   ├── output/
    │   ├── raw/
    │   └── tmp/
    ├── docs/
    ├── src/
    ├── vendor/
    ├── config.yml.dist
    └── run-g5.php
In the rest of this document, the gauquelin5/ directory is called the root directory.
All the commands that run the program must be issued from the root directory.

The files you need to know about are:
  • run-g5.php is the entry point to use the program.
  • data/ contains the data generated and manipulated by the program (see below).
  • config.yml.dist needs to be copied to config.yml (see below).

Configuration

Create a file config.yml by copying config.yml.dist:
cp config.yml.dist config.yml
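When the copy is scripted, it may be safer not to overwrite a config.yml that has already been edited. A minimal sketch, where the function name init_config is hypothetical (not part of g5):

```shell
# Hypothetical helper, not part of g5.
# Copies config.yml.dist to config.yml in directory $1 (default: current
# directory), without overwriting an existing config.yml.
init_config() {
    dir=${1:-.}
    if [ -e "$dir/config.yml" ]; then
        echo "config.yml already exists, left untouched"
    elif [ -f "$dir/config.yml.dist" ]; then
        cp "$dir/config.yml.dist" "$dir/config.yml"
        echo "config.yml created"
    else
        echo "config.yml.dist not found in $dir" >&2
        return 1
    fi
}
```

Called as init_config from the root directory, it behaves like the cp command above on a fresh installation and does nothing on later runs.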
Edit config.yml and adapt the values described below:

dirs

This directive specifies the unversioned directories that contain data.
The values can be either absolute paths or paths relative to the root directory.
The default values are all relative to the root directory:
dirs:
  output: data/output
  tmp:    data/tmp
At program installation, the data/ directory contains 3 sub-directories: db/, init/ and raw/.
These directories contain data necessary to g5 and are versioned with the program. Their locations are fixed and not configurable.

The other sub-directories of data/ are not versioned; they are ignored by git.
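Since the directories listed under dirs are not versioned, they may be absent after a fresh clone. The sketch below shows how a relative or absolute value can be resolved against the root directory, and creates the directories. Both function names are hypothetical, and the resolution rule is an assumption based on the description above, not taken from the g5 code.

```shell
# Hypothetical helpers, not part of g5.
# Prints $2 unchanged if it is an absolute path, otherwise prefixes it
# with the root directory $1 (assumed resolution rule).
resolve_dir() {
    case "$2" in
        /*) printf '%s\n' "$2" ;;
        *)  printf '%s\n' "$1/$2" ;;
    esac
}

# Creates every directory given after the root directory $1.
make_data_dirs() {
    root=$1
    shift
    for dir in "$@"; do
        mkdir -p "$(resolve_dir "$root" "$dir")" || return 1
    done
}
```

With the default values, this would be called from the root directory as make_data_dirs "$PWD" data/output data/tmp.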

db5

This concerns the g5 database, used to store the data imported by the program.
It contains only one section: postgresql. Specify here the parameters used to connect to a local PostgreSQL database.
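Before running g5, the connection parameters can be tried out with the psql client. This is a sketch under assumptions: the function names are hypothetical, and the host, port, database and user values are placeholders to replace with the ones written in config.yml.

```shell
# Hypothetical helpers, not part of g5.
# Builds a libpq connection string from:
# $1 = host, $2 = port, $3 = database name, $4 = user
pg_conninfo() {
    printf 'host=%s port=%s dbname=%s user=%s\n' "$1" "$2" "$3" "$4"
}

# Runs a trivial query; succeeds only if the connection works.
# The password, if any, is asked by psql or read from PGPASSWORD / ~/.pgpass.
check_pg() {
    psql "$(pg_conninfo "$@")" -c 'SELECT 1' >/dev/null
}
```

For example, check_pg localhost 5432 g5 g5user && echo "connection ok", where g5 and g5user are placeholder names.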

geonames

G5 uses geonames.org to match place names to geonames ids. Some steps use a PostgreSQL database where geonames information is stored; other steps use the geonames.org web service. See the page about geonames for details.

Section postgresql specifies the connection parameters, which can be identical to or different from those of the main g5 database.
Section name specifies the user name used to call the geonames web service.

Usage

To check that the program works, type:
php run-g5.php
A message is displayed, saying that you must provide additional arguments:
WRONG USAGE - run-g5.php needs at least 3 arguments
-------
Usage : 
    php run-g5.php <argument1> <argument2> <argument3> [optional arguments]
Example :
    php run-g5.php cura A2 raw2csv
-------
Possible values for argument1 : acts, csicop, cura, db, g55, newalch, wd
The program takes 3 arguments:
  • argument1: generally represents an information source, like cura or newalch.
  • argument2: generally represents one or several files contained in a given information source.
  • argument3: generally represents a treatment applied to a given file.

Each time an incomplete command is given, the program prints the general error message, followed by the possible values for the next missing argument.

Example 1
php run-g5.php cura
WRONG USAGE - need at least 3 arguments
... (general message) ...

Possible argument2 for argument1 = cura : all, look, A, A1, A2, A3, A4, A5, A6, D6, D10, E1, E3

Example 2
php run-g5.php cura A3
WRONG USAGE - need at least 3 arguments
... (general message) ...

Possible argument3 for cura / A3 : build, export, look, raw2tmp, tmp2db, tweak2tmp

Example 3
php run-g5.php cura A3 raw2tmp
This performs an actual transformation: it converts the A3 raw HTML file to a CSV file in data/tmp/cura.

Complete generation of the database

As described in the page about g5 organisation, the program first converts raw data to temporary files, and then imports these temporary files into the database.
The steps must be executed in a precise order, because some steps need the results of previous ones.

The order of execution is given by the code of class g5\commands\db\init\all. Run without a parameter, the command lists the possible values:
php run-g5.php db init all
PARAMETER MISSING
Possible values for parameter :
    tmp : Build files in data/tmp
    db : Fill database with tmp files
    all : Build tmp files and fill db
If 'db' or 'all' is chosen, the command also drops the existing tables and creates empty ones.
The following command then builds the database from scratch:
php run-g5.php db init all all
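The two phases can also be run one after the other, which makes it possible to inspect the files in data/tmp before the tables are dropped. The sketch below relies on the tmp and db parameters listed above. The function name rebuild_db and the G5_CMD variable are hypothetical, and the script assumes that run-g5.php returns a non-zero exit status when a step fails, which is not confirmed here.

```shell
# Hypothetical variable: command used to invoke g5.
# Setting G5_CMD=echo gives a dry run that only prints the arguments.
G5_CMD=${G5_CMD:-php run-g5.php}

# Hypothetical helper, not part of g5.
rebuild_db() {
    # Phase 1: build the files in data/tmp.
    # Assumption: a failure is reported through a non-zero exit status.
    $G5_CMD db init all tmp || return 1
    # Phase 2: drop and recreate the tables, then fill them with the tmp files.
    $G5_CMD db init all db
}
```

Calling rebuild_db from the root directory should then be equivalent to the single command above, with a stop between the two phases if the first one fails.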

Generating output files

Output capabilities are still limited.

A specific export was written for each historical file, because some fields coming from the raw files are copied to the output.
So each file has its own command to generate a CSV file, for example:
php run-g5.php cura A2 export

php run-g5.php newalch muller1083 export
Generic exports can also generate files from the database (currently only by profession code).
The profession codes and the target file must be specified, for example:
php run-g5.php db export occu SP data/output/new/sport/sportsmen.csv

php run-g5.php db export occu WR+JO data/output/new/letters/writers+journalists.csv
A more flexible mechanism needs to be developed to specify precisely what to output.
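Until then, several generic exports can be grouped in a small script. This is a sketch only: the function name export_occu and the G5_CMD variable are hypothetical, and creating the target directory beforehand is a precaution, since it is not stated here whether g5 creates missing directories itself.

```shell
# Hypothetical variable: command used to invoke g5.
# Setting G5_CMD=echo gives a dry run that only prints the arguments.
G5_CMD=${G5_CMD:-php run-g5.php}

# Hypothetical helper, not part of g5.
# $1 = profession code(s), e.g. SP or WR+JO ; $2 = target csv file
export_occu() {
    codes=$1
    target=$2
    # Precaution (assumption): make sure the target directory exists.
    mkdir -p "$(dirname "$target")" || return 1
    $G5_CMD db export occu "$codes" "$target"
}
```

The two exports shown above would then be written export_occu SP data/output/new/sport/sportsmen.csv and export_occu WR+JO data/output/new/letters/writers+journalists.csv.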