Studies | Observe

Computation steps

init	Performs specific initializations needed by some studies
import	Converts the original dataset to a common format, a compressed file called `data.csv.bz2`, which contains the dates in the order specified in configuration entry `dates`.
observed	Computes the distributions of the original data
control	Builds the control groups
expected	Builds the expected distributions from control groups
stats	Computes statistical information about the distributions, stored in a file called `stats.csv`
dim2	Adds distributions and statistics using dim2 arrays
output	Generates a presentation of the results
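
To illustrate the common format produced by the import step, the sketch below writes a small bz2-compressed CSV file and reads it back. It is written in Python purely for illustration and is not part of the program. It assumes a comma-separated layout with one row per record and one column per configured date; the actual delimiter and column layout of `data.csv.bz2` are not specified here, and the date names and sample records are made up.

```python
import bz2
import csv
import os
import tempfile

# Hypothetical date names, in the order they would appear in the `dates` entry.
DATES = ["birth", "death"]


def write_records(path, records):
    """Write records as bz2-compressed CSV, with columns ordered as in DATES."""
    with bz2.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for record in records:
            writer.writerow([record[name] for name in DATES])


def read_records(path):
    """Read the compressed file back into a list of dicts keyed by date name."""
    with bz2.open(path, "rt", newline="", encoding="utf-8") as handle:
        return [dict(zip(DATES, row)) for row in csv.reader(handle)]


def roundtrip_demo():
    """Write two made-up records to a temporary file and read them back."""
    sample = [
        {"birth": "1920-05-14", "death": "1998-11-02"},
        {"birth": "1931-01-30", "death": "2004-07-19"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.csv.bz2")
        write_records(path, sample)
        return read_records(path)
```

Keeping the column order tied to a single list mirrors the rule that the file contains the dates in the order given by `dates`.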

Configuration file

The characteristics of a study are described in a YAML configuration file, located in directory config/.
See the comments in the sample configurations versioned with the program.

slug
Configuration files can be organized in subdirectories, but each study must have a unique slug.
A slug must be composed of lowercase letters, digits and hyphens ("-").
fqcn
While most computations are common to all studies, some steps (init, import and control, see Computation steps) need a specific implementation because they depend on the dataset.
For example, control group computation for the Deaths in France study queries a local database, while for the Births in France in 2000 study, the dataset is small enough to be loaded entirely in memory, without needing a database.
This setting specifies the fully qualified name of the PHP class that handles these specific steps.
date-precision
Currently, only "day" is implemented.
dates
Like the slug, date names must be composed of lowercase letters, digits and hyphens ("-").
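
The naming rule shared by slugs and date names, together with the uniqueness rule for slugs, can be sketched as follows. This is an illustrative Python sketch, not the program's own validation code. It assumes that "letters" means ASCII `a` to `z` and that at least one character is required; whether a leading or trailing hyphen is accepted is not stated, so the sketch allows it. The function names are hypothetical.

```python
import re

# Lowercase ASCII letters, digits and hyphens only, at least one character.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_name(name):
    """Return True if name uses only lowercase letters, digits and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_duplicate_slugs(slugs):
    """Return the sorted list of slugs that appear more than once."""
    seen = set()
    duplicates = set()
    for slug in slugs:
        if slug in seen:
            duplicates.add(slug)
        seen.add(slug)
    return sorted(duplicates)
```

For example, `is_valid_name("death-fr")` is accepted, while `"Death-FR"` and `"death_fr"` are rejected.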

Specific configurations

As each study can have custom steps, some configurations can be added, for example to specify the path to a raw file, or the characteristics of an intermediate database.

The working directory

This directory (specified in the working-dir entry of the study configuration) contains all the intermediate files (mainly CSV files containing distributions).
Taking the example of the "Deaths in France" study, the main subdirectories are:

var/studies/death-fr
    ├── controls
    │   ├── control-001
    │   │   ├── birth
    │   │   ├── birth-death
    │   │   └── death
    │   ├── ...
    │   └── control-100
    │       ├── birth
    │       ├── birth-death
    │       └── death
    ├── expected
    │   ├── birth
    │   ├── birth-death
    │   ├── death
    │   └── stats.csv
    ├── observed
    │   ├── birth
    │   ├── birth-death
    │   ├── death
    │   └── stats.csv
    └── data.csv.bz2

The birth and death directories contain distributions of type distrib1 (single-date distributions).
The birth-death directories contain distributions of type distrib2 (two-date distributions).

As the tree shows, observed/, expected/ and each subdirectory of controls/ contain the same distribution subdirectories.
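
The regularity of this layout can be made explicit by generating the expected paths. The following Python sketch is only an illustration under stated assumptions: the three-digit zero-padded control names and the three distribution names are taken from the example tree, and the function names are hypothetical. It does not reflect how the program itself builds these paths.

```python
from pathlib import PurePosixPath

# Distribution subdirectories shown in the example tree.
DISTRIBUTIONS = ["birth", "birth-death", "death"]


def control_dir_name(index):
    """Return the directory name of a control group, e.g. control-001."""
    return f"control-{index:03d}"


def expected_layout(working_dir, n_controls):
    """List the paths of the example layout for n_controls control groups."""
    root = PurePosixPath(working_dir)
    paths = []
    # One directory per control group, each with the same distributions.
    for index in range(1, n_controls + 1):
        for name in DISTRIBUTIONS:
            paths.append(root / "controls" / control_dir_name(index) / name)
    # expected/ and observed/ repeat the distributions and add stats.csv.
    for group in ("expected", "observed"):
        for name in DISTRIBUTIONS:
            paths.append(root / group / name)
        paths.append(root / group / "stats.csv")
    paths.append(root / "data.csv.bz2")
    return [str(path) for path in paths]
```

Calling `expected_layout("var/studies/death-fr", 100)` enumerates the same entries as the tree above, with the elided control groups filled in.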

Variants of a study

A common need is to analyze different variants of a dataset.
For example, in the Deaths in France study, two variants were computed: the whole dataset, and the dataset excluding children who died before their first birthday.
The current version of the program does not handle this notion; a variant is created by defining a new study. If the structure of the data doesn't change, this doesn't involve new code.
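
Since a variant is simply another study, its configuration can be thought of as a copy of the original with a different slug and working directory. The Python sketch below illustrates that idea on an in-memory dictionary. The keys `slug`, `working-dir`, `date-precision` and `dates` come from this page, but all values, the list form of `dates`, and the function name are hypothetical; real configurations are YAML files and may contain other entries.

```python
import copy


def derive_variant(base_config, slug, working_dir):
    """Return a copy of base_config with its own slug and working directory."""
    variant = copy.deepcopy(base_config)  # leave the base study untouched
    variant["slug"] = slug
    variant["working-dir"] = working_dir
    return variant


# Hypothetical base configuration, for illustration only.
BASE = {
    "slug": "death-fr",
    "working-dir": "var/studies/death-fr",
    "date-precision": "day",
    "dates": ["birth", "death"],
}

VARIANT = derive_variant(
    BASE, "death-fr-variant", "var/studies/death-fr-variant"
)
```

Giving the variant its own working directory keeps its intermediate files separate from those of the original study.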