Studies

In the context of this program, a study is based on a dataset containing dates.
Its characteristics are described in a configuration file.
Performing a study with the observe program means generating distributions in the working directory of the study, then generating HTML pages in its output directory.

Computation steps

  • init
    Performs specific initializations needed by some studies.
  • import
    Converts the original dataset to a common format, a compressed file called data.csv.bz2, which contains the dates in the order specified in the configuration entry dates.
  • observed
    Computes the distributions of the original data.
  • control
    Builds the control groups.
  • expected
    Builds the expected distributions from the control groups.
  • stats
    Computes statistical information about the distributions, stored in a file called stats.csv.
  • dim2
    Adds distributions and statistics using dim2 arrays.
  • output
    Generates a presentation of the results.
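
The sequence of steps above can be sketched as a small runner. This is only an illustration, not the program's actual implementation: it assumes the steps run in the order listed, and the names steps_from, run_study and handlers are hypothetical.

```python
# Step names, in the order they are listed above (assumed to be the execution order).
STEPS = ["init", "import", "observed", "control", "expected", "stats", "dim2", "output"]


def steps_from(first_step):
    """Return first_step and every step that follows it in the list."""
    if first_step not in STEPS:
        raise ValueError(f"unknown step: {first_step!r}")
    return STEPS[STEPS.index(first_step):]


def run_study(handlers, first_step="init"):
    """Call the handler of each step in order, starting at first_step.

    handlers is a hypothetical mapping from step name to a zero-argument
    callable; steps without a handler are skipped.
    Returns the names of the steps that were actually executed.
    """
    done = []
    for step in steps_from(first_step):
        handler = handlers.get(step)
        if handler is not None:
            handler()
            done.append(step)
    return done
```

Restarting from an intermediate step (for example "expected") is convenient when the earlier, more expensive steps have already written their files to the working directory.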

Configuration file

The characteristics of a study are described in a YAML configuration file, located in directory config/.
See the comments in the sample configurations versioned with the program.
  • slug
    Configuration files can be organized in subdirectories, but each study must have a unique slug. Two different studies cannot have the same slug.
    A slug must be composed of lowercase letters, digits and hyphens ("-").
  • fqcn
    While most computations are common to all studies, some steps (init, import and control, see above) need a specific implementation because they depend on the dataset.
    For example, control group computation for the Deaths in France study queries a local database, while for the Births in France in 2000 study, the dataset is small enough to be loaded entirely in memory, without needing a database.
    This setting specifies the fully qualified name of the PHP class that handles these specific steps.
  • date-precision
    Currently, only "day" is implemented.
  • dates
    Like slugs, date names must be composed of lowercase letters, digits and hyphens ("-").
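
The naming rules above can be checked with a few lines of code. The sketch below is only an illustration: it assumes the configuration has already been parsed into a dictionary and that the dates entry is a plain list of names, which may not match the real configuration layout. The function names are hypothetical.

```python
import re
from collections import Counter

# Lowercase letters, digits and hyphens only, as required for slugs and date names.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_name(name):
    """Return True if name contains only lowercase letters, digits and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


def check_study_config(config):
    """Return a list of error messages for a parsed study configuration.

    Assumption: config is a dict, and config["dates"] is a list of date names.
    """
    errors = []
    if not is_valid_name(str(config.get("slug", ""))):
        errors.append("slug: only lowercase letters, digits and hyphens are allowed")
    if config.get("date-precision") != "day":
        errors.append('date-precision: only "day" is implemented')
    for date_name in config.get("dates", []):
        if not is_valid_name(str(date_name)):
            errors.append(f"dates: invalid date name {date_name!r}")
    return errors


def duplicate_slugs(slugs):
    """Return the slugs used by more than one study, sorted alphabetically."""
    return sorted(slug for slug, count in Counter(slugs).items() if count > 1)
```

An empty list returned by check_study_config means that none of the rules listed above is violated; it says nothing about study-specific entries.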

Specific configurations

As each study can have custom steps, additional configuration entries can be added, for example to specify the path to a raw file, or the characteristics of an intermediate database.

The working directory

This directory (specified in entry working-dir of the study configuration) contains all the intermediate files (mainly csv files containing distributions).
Taking the example of the "Deaths in France" study, the main subdirectories are:
var/studies/death-fr
    ├── controls
    │   ├── control-001
    │   │   ├── birth
    │   │   ├── birth-death
    │   │   └── death
    │   ├── ...
    │   └── control-100
    │       ├── birth
    │       ├── birth-death
    │       └── death
    ├── expected
    │   ├── birth
    │   ├── birth-death
    │   ├── death
    │   └── stats.csv
    ├── observed
    │   ├── birth
    │   ├── birth-death
    │   ├── death
    │   └── stats.csv
    └── data.csv.bz2
Directories named after a single date (birth, death) contain distributions of type distrib1 (single-date distributions).
Directories named after two dates (birth-death) contain distributions of type distrib2 (two-date distributions).

As you can see, the subdirectories of controls/, observed/ and expected/ all share the same structure.
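
Because the layout is regular, the distribution directories can be enumerated from a few parameters. The following sketch rebuilds the paths of the tree shown above; the function name and its parameters are hypothetical, and the three-digit control numbering is inferred from the example (control-001 to control-100).

```python
from pathlib import PurePosixPath


def distribution_dirs(working_dir, distribution_names, n_controls):
    """List the distribution directories of a study.

    The same distribution names are repeated under observed/, expected/
    and each controls/control-NNN/ directory.
    """
    root = PurePosixPath(working_dir)
    groups = [root / "observed", root / "expected"]
    groups += [
        root / "controls" / f"control-{number:03d}"
        for number in range(1, n_controls + 1)
    ]
    return [str(group / name) for group in groups for name in distribution_names]


if __name__ == "__main__":
    # Show the first few directories of the "Deaths in France" example.
    for path in distribution_dirs("var/studies/death-fr", ["birth", "birth-death", "death"], 100)[:6]:
        print(path)
```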

Variants of a study

A common need is to analyze different variants of a dataset.
For example, in the Deaths in France study, two variants were computed: the whole dataset, and the dataset without children who died before their first birthday.
The current version of the program does not handle this notion; a variant is created by defining a new study. If the structure of the data doesn't change, this doesn't involve new code.
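
Since a variant is just another study, its configuration can be derived from an existing one. The sketch below is a hypothetical helper, not part of the program: it assumes the configuration is available as a dictionary, and it only changes the slug and working-dir entries so that the variant does not collide with the original study. Any other entry that must differ (for example the output directory, or the setting that selects the data) would have to be adjusted by hand.

```python
import copy


def derive_variant(base_config, suffix):
    """Return a copy of base_config with a distinct slug and working directory.

    suffix must follow the slug rules (lowercase letters, digits, hyphens).
    """
    variant = copy.deepcopy(base_config)
    variant["slug"] = f"{base_config['slug']}-{suffix}"
    variant["working-dir"] = f"{base_config['working-dir'].rstrip('/')}-{suffix}"
    return variant


if __name__ == "__main__":
    base = {
        "slug": "death-fr",
        "working-dir": "var/studies/death-fr",
        "dates": ["birth", "death"],
    }
    # "no-infant" is an illustrative suffix, not a name used by the program.
    print(derive_variant(base, "no-infant"))
```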