Expected distributions

The distributions computed from the original data set are called the observed distributions.
The next step is to build the expected distributions, which are what we should observe "in theory".
Comparing the observed and expected distributions makes it possible to apply statistical tests and to see whether the observed distributions show anomalies.
In this program, the expected distributions are built either using control groups, or with the average method.

Control groups

A control group is a fictional group built by shuffling the data of the observed group.
For example, take a group of 500 persons with birth and death dates (B and D).
[Figure: Construction of a control group by shuffling]
A fictional person of the control group is built as follows:
  • take the birth date of a record,
  • randomly select another record,
  • build a record with the birth date of the first record and the death date of the second record.
Several fictional groups are built (typically between 100 and 1000) and their distributions are averaged to obtain the expected distribution.
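The steps above can be sketched in Python as follows. This is only an illustration, not the program's actual code: the function names, the seed parameter and the choice of distribution (a 12-cell array counting the month offset between birth and death) are assumptions made for the example. It follows the bullet list literally, so the same death date may be drawn for several fictional persons.

```python
import random
from datetime import date


def month_offset_distribution(records):
    """Hypothetical dim1 distribution: counts of (death month - birth month) mod 12."""
    counts = [0] * 12
    for birth, death in records:
        counts[(death.month - birth.month) % 12] += 1
    return counts


def build_control_group(records, rng):
    """Give each birth date the death date of another, randomly selected record."""
    n = len(records)
    if n < 2:
        raise ValueError("at least two records are needed")
    control = []
    for i, (birth, _) in enumerate(records):
        # Draw an index uniformly among all records except record i.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        control.append((birth, records[j][1]))
    return control


def expected_distribution(records, n_controls=100, seed=0):
    """Average the distributions of n_controls control groups, cell by cell."""
    rng = random.Random(seed)
    totals = [0] * 12
    for _ in range(n_controls):
        control = build_control_group(records, rng)
        for i, count in enumerate(month_offset_distribution(control)):
            totals[i] += count
    return [total / n_controls for total in totals]
```

Each control group has as many records as the observed group, so the cells of the averaged distribution still sum to the group size.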

Warning

Here, precautions are needed: for example, the birth date of a newly created record must be earlier than its death date.
There are also many ways to build the control groups: we can draw completely at random, or decide to keep some properties of the original group.
For example:
  • Should the control groups respect the seasonal distribution of births?
  • Should they respect the age-at-death distribution?
Answering these questions affects the way the control groups are computed.
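One possible way to enforce the first precaution is to redraw the death date until it falls after the birth date. The sketch below is an assumption about how this could be done, not a description of the program; the function name and the max_tries parameter are hypothetical. Rejecting invalid pairs in this way does not by itself preserve the seasonal or age-at-death properties mentioned in the questions above.

```python
import random
from datetime import date


def build_valid_control_group(records, rng, max_tries=1000):
    """Pair each birth date with a randomly drawn death date that comes after it."""
    deaths = [death for _, death in records]
    control = []
    for birth, _ in records:
        for _attempt in range(max_tries):
            death = rng.choice(deaths)
            if birth < death:
                # Valid fictional person: born before dying.
                control.append((birth, death))
                break
        else:
            # No valid death date was found within max_tries draws.
            raise ValueError(f"no death date found after {birth}")
    return control
```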

Usage

In this program, control groups are used only for distributions expressed by one-dimensional arrays (called dim1).
For distributions expressed by two-dimensional arrays (dim2), the expected distributions use the average method, described below.
The only reason is to save computation time and memory: a 360 x 360 table has 129 600 cells. Computations for N planets require N x N tables for interaspects and N x (N + 1) / 2 tables for aspects.
Nb of planets   Nb of tables   Nb of cells
11              187            24 235 200
15              345            44 712 000
16              392            50 803 200
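The figures in this table can be checked with a few lines of Python. The function name is chosen for the example only.

```python
CELLS_PER_TABLE = 360 * 360  # 129 600 cells


def table_count(n_planets):
    """Number of 360 x 360 tables needed for n_planets."""
    interaspects = n_planets * n_planets
    aspects = n_planets * (n_planets + 1) // 2
    return interaspects + aspects


for n in (11, 15, 16):
    tables = table_count(n)
    print(n, tables, tables * CELLS_PER_TABLE)
```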

Average method

This is the method described by Didier Castille in his article "A Link between Birth and Death" ("Un Lien entre la Naissance et le Décès"), also explained on en.wikipedia.org/wiki/Chi-squared_test.
For a two-dimensional array, the formula is:
expected value of a cell = sum of the row x sum of the column / total of the table
See a related question. This method might be abandoned and replaced by control groups for dim2 arrays.
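As an illustration, here is a minimal sketch of this computation on a small table. It assumes the formula uses the row and column sums, as in the chi-squared test of independence; the function name is hypothetical and the table is given as a list of rows.

```python
def expected_table(observed):
    """Expected value of each cell: row sum x column sum / total of the table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    if total == 0:
        raise ValueError("the table is empty")
    return [[r * c / total for c in col_sums] for r in row_sums]


observed = [[10, 20],
            [30, 40]]
expected = expected_table(observed)
# Row sums are 30 and 70, column sums are 40 and 60, the total is 100,
# so expected is [[12.0, 18.0], [28.0, 42.0]].
```

The expected table keeps the same row sums, column sums and total as the observed table.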

File hierarchy

For each control, the full hierarchy of the observed distributions is reproduced.
In the example of a group containing birth and death dates:
observed
    ├── birth
    ├── birth-death
    └── death
Then the hierarchy of controls will be:
controls
    ├── control-001
    │   ├── birth
    │   ├── birth-death
    │   └── death
    ├── ...
    │
    └── control-100
        ├── birth
        ├── birth-death
        └── death
And the expected distributions, containing the average of all control distributions:
expected
    ├── birth
    ├── birth-death
    └── death
Distributions of type distrib1 (single date) are in green.
Distributions of type distrib2 (two dates) are in yellow.
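The layout of the controls directory can be sketched as follows. The root directory "out" and the function name are assumptions made for the example; only the directory names come from the trees above. Nothing is written to disk: the function just lists the paths.

```python
from pathlib import Path

# Sub-directories reproduced for each control, as in the observed hierarchy.
DISTRIBUTIONS = ["birth", "birth-death", "death"]


def control_dirs(root, n_controls):
    """List the distribution directories of control-001 ... control-<n_controls>."""
    width = max(3, len(str(n_controls)))  # zero-padded on at least 3 digits
    return [
        Path(root) / "controls" / f"control-{i:0{width}d}" / name
        for i in range(1, n_controls + 1)
        for name in DISTRIBUTIONS
    ]


dirs = control_dirs("out", 100)
```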