Computation steps
| init | Performs specific initializations needed by some studies |
| import | Converts the original dataset to a common format, a compressed file called data.csv.bz2, which contains the dates in the order specified in configuration entry dates. |
| observed | Computes the distributions of the original data |
| control | Builds the control groups |
| expected | Builds the expected distributions from control groups |
| stats | Computes statistical information about the distributions, stored in a file called stats.csv |
| dim2 | Adds distributions and statistics using dim2 arrays |
| output | Generates a presentation of the results |
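As an illustration of the common format produced by the import step, the following Python sketch writes and reads a bz2-compressed CSV file using only the standard library. The program itself is written in PHP. The function names write_rows and read_rows, the comma separator, the absence of a header row and the ISO date strings are assumptions made for this example, not the program's documented layout.

```python
import bz2
import csv


def write_rows(path, rows):
    """Write rows (lists of date strings) to a bz2-compressed CSV file."""
    # Assumption: comma-separated values, no header row.
    with bz2.open(path, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_rows(path, date_names):
    """Yield one dict per row, keyed by the date names in configured order."""
    # date_names plays the role of the "dates" configuration entry:
    # the n-th column of the file is taken to hold the n-th configured date.
    with bz2.open(path, "rt", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            yield dict(zip(date_names, row))
```

With dates configured as birth then death, each row read back maps "birth" to the first column and "death" to the second.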
Configuration file
The characteristics of a study are described in a YAML configuration file, located in directory config/.
See the comments in the sample configurations versioned with the program.
- slug: Configuration files can be organized in subdirectories, but each study must have a unique slug. A slug must be composed of lowercase letters, digits and hyphens ("-").
- fqcn: While most computations are common to all studies, some steps (init, import and control, see Computation steps) need a specific implementation because they depend on the dataset. For example, control group computation for the Deaths in France study queries a local database, while for the Births in France in 2000 study, the dataset is small enough to be loaded entirely in memory without a database. This setting specifies the PHP class handling these specific steps.
- date-precision: Currently, only "day" is implemented.
- dates: Like the slug, date names must be composed of lowercase letters, digits and hyphens ("-").
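The naming rule for slugs and date names can be expressed as a regular expression. The Python sketch below is only an illustration of that rule and of the uniqueness constraint on slugs. The helper names is_valid_name and find_duplicate_slugs are hypothetical and are not part of the program.

```python
import re

# Lowercase letters, digits and hyphens only, at least one character.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_name(name):
    """Return True if name uses only lowercase letters, digits and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_duplicate_slugs(slugs):
    """Return the sorted list of slugs that appear more than once."""
    seen = set()
    duplicates = set()
    for slug in slugs:
        if slug in seen:
            duplicates.add(slug)
        seen.add(slug)
    return sorted(duplicates)
```

For example, is_valid_name("death-fr") holds, while "Death-FR" and "death_fr" are rejected.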
Specific configurations
As each study can have custom steps, some configurations can be added, for example to specify the path to a raw file, or the characteristics of an intermediate database.
The working directory
This directory (specified in entry working-dir of the study configuration) contains all the intermediate files (mainly CSV files containing distributions).
Taking the example of the "Deaths in France" study, the main subdirectories are:
var/studies/death-fr
├── controls
│   ├── control-001
│   │   ├── birth
│   │   ├── birth-death
│   │   └── death
│   ├── ...
│   └── control-100
│       ├── birth
│       ├── birth-death
│       └── death
├── expected
│   ├── birth
│   ├── birth-death
│   ├── death
│   └── stats.csv
├── observed
│   ├── birth
│   ├── birth-death
│   ├── death
│   └── stats.csv
└── data.csv.bz2
The birth and death directories contain distributions of type distrib1 (single-date distributions).
The birth-death directories contain distributions of type distrib2 (two-date distributions).
As you can see, each subdirectory of controls/, as well as observed/ and expected/, shares the same structure.
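The shared structure can be made explicit with a short Python sketch that enumerates the distribution directories of a study. It assumes that two-date directories are named by joining the two date names with a hyphen in configured order, and that control groups are numbered with three digits starting at 001, as in the listing above. The function name distribution_dirs and its parameters are hypothetical.

```python
from itertools import combinations
from pathlib import PurePosixPath


def distribution_dirs(working_dir, date_names=("birth", "death"), n_controls=100):
    """Return the distribution directories expected under working_dir."""
    # Single-date distributions: one directory per configured date.
    names = list(date_names)
    # Two-date distributions: one directory per pair (assumed naming).
    names += ["-".join(pair) for pair in combinations(date_names, 2)]

    root = PurePosixPath(working_dir)
    parents = [root / "observed", root / "expected"]
    parents += [
        root / "controls" / f"control-{i:03d}" for i in range(1, n_controls + 1)
    ]
    # Every parent directory holds the same set of subdirectories.
    return [str(parent / name) for parent in parents for name in sorted(names)]
```

Calling distribution_dirs("var/studies/death-fr") yields the birth, birth-death and death directories under observed/, expected/ and each of the 100 control groups.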
Variants of a study
A common need is to analyze different variants of a dataset. For example, in the Deaths in France study, two variants were computed: the whole dataset, and the dataset excluding children who died before their first birthday.
The current version of the program does not handle this notion; a variant is handled by defining a new study. If the structure of the data doesn't change, this doesn't require new code.