Control groups

The distributions computed from the original data set are called the observed distributions.
The next step is to build the expected distributions.
Comparing the observed and expected distributions shows whether the observed distributions contain anomalies.
In this program, the expected distributions are built with control groups.

Definition

A control group is a fictional group built by shuffling the data of the observed group.
For example, consider a group of 500 persons with birth and death dates (B and D).
Figure: Construction of a control group by shuffling
A fictional person of the control group is built as follows:
  • take the birth date of a record,
  • randomly select another record,
  • build a record with the birth date of the first record and the death date of the second record.
Here, precautions are needed: for example, the birth date of the newly created record must be earlier than its death date.
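The shuffling steps above can be sketched in Python as follows. This is only an illustration, not the program's actual code: the function name `build_control_group`, the representation of a record as a `(birth, death)` tuple and the `max_tries` limit are assumptions made for the example. Redrawing until the birth date precedes the death date is one possible precaution among others, and is itself one of the arbitrary choices discussed below.

```python
import random
from datetime import date


def build_control_group(records, rng, max_tries=1000):
    """Build one control group from a list of (birth, death) tuples.

    Each fictional person keeps the birth date of one record and
    receives the death date of another, randomly selected record.
    (Hypothetical helper, written for illustration only.)
    """
    control = []
    n = len(records)
    for i, (birth, _) in enumerate(records):
        for _ in range(max_tries):
            j = rng.randrange(n)
            if j == i:
                continue  # "another record": never reuse the record itself
            death = records[j][1]
            if birth < death:  # precaution: birth must precede death
                control.append((birth, death))
                break
        else:
            raise ValueError(f"no valid death date found for record {i}")
    return control


# Usage with a tiny, made-up group of three persons
records = [
    (date(1900, 1, 1), date(1970, 5, 2)),
    (date(1910, 3, 4), date(1980, 6, 7)),
    (date(1920, 8, 9), date(1990, 10, 11)),
]
control = build_control_group(records, random.Random(42))
```

Passing a seeded `random.Random` instance makes a control group reproducible, which helps when comparing different construction rules.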
And questions arise:
  • Should the control groups keep properties of the observed group?
  • For example, should they respect the seasonal distribution of births?
  • Should they respect the distribution of age at death?
The answers to these questions affect the way the control groups are computed.
Can they also affect the results of the statistical tests?
More rigorous statistical knowledge is needed to build correct control groups and avoid unintentional bias due to arbitrary choices.

Expected distributions

To build the expected distributions:
  • several control groups are computed;
  • for each distribution, the average over all control groups is computed;
  • these average distributions are called the expected distributions.
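As a minimal sketch of the averaging step, assume each distribution is held in memory as a dictionary mapping a bin label to a count. This representation and the name `expected_distribution` are assumptions for the example, not something the program specifies. A bin missing from a control group is counted as zero.

```python
def expected_distribution(control_distributions):
    """Average several distributions bin by bin.

    control_distributions: list of dicts {bin_label: count}, one per
    control group (hypothetical in-memory representation).
    """
    if not control_distributions:
        raise ValueError("at least one control distribution is required")
    n = len(control_distributions)
    # Collect every bin that appears in at least one control group
    bins = set().union(*control_distributions)
    return {
        b: sum(d.get(b, 0) for d in control_distributions) / n
        for b in sorted(bins)
    }


# Usage: the same distribution computed on two control groups
expected = expected_distribution([
    {"0": 2, "1": 4},
    {"0": 4, "1": 0, "2": 3},
])
```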

File hierarchy

For each control group, the full hierarchy of the observed distributions is reproduced.
In the example of a group containing birth and death dates:
observed
    ├── birth
    │   ├── aspects
    │   │   ├── JU-NE.csv
    │   │   ├── ...
    │   │   └── VE-UR.csv
    │   ├── planets
    │   │   ├── JU.csv
    │   │   ├── ...
    │   │   └── UR.csv
    │   ├── day.csv
    │   └── year.csv
    ├── birth-death
    │   ├── interaspects
    │   │   ├── JU-JU.csv
    │   │   ├── ...
    │   │   └── VE-VE.csv
    │   └── age.csv
    └── death
        ├── aspects
        │   ├── JU-NE.csv
        │   ├── ...
        │   └── VE-UR.csv
        ├── planets
        │   ├── JU.csv
        │   ├── ...
        │   └── UR.csv
        ├── day.csv
        └── year.csv
Then the hierarchy of controls will be:
controls
    ├── control-001
    │   ├── birth
    │   ├── birth-death
    │   └── death
    ├── ...
    │
    └── control-100
        ├── birth
        ├── birth-death
        └── death
The expected distributions, containing the average of all control distributions, are stored in:
expected
    ├── birth
    ├── birth-death
    └── death
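The mapping between the three hierarchies can be sketched as below. The function name `control_paths` and its parameters are hypothetical. Only the directory names `controls` and `expected` and the `control-001` numbering pattern come from the layout shown above.

```python
from pathlib import PurePosixPath


def control_paths(observed_files, n_controls, root="controls"):
    """Mirror each observed file (relative path) under every control directory."""
    return [
        PurePosixPath(root) / f"control-{i:03d}" / rel
        for i in range(1, n_controls + 1)
        for rel in observed_files
    ]


# Usage with two files taken from the observed hierarchy
files = ["birth/day.csv", "birth-death/age.csv"]
paths = control_paths(files, 2)
# The expected distributions reuse the same relative paths
expected_paths = [PurePosixPath("expected") / rel for rel in files]
```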