For ethical and scientific reasons, biomedical research needs to become open to all.

This implies that protocols and statistical analysis plans must be transparent, and that datasets must be freely available. Unfortunately, this last condition is very often impossible to meet because it conflicts with other ethical constraints, in particular the right to privacy.

Technical presentation

How to keep good statistical properties?

The Open CESP project tries to overcome this paradox. Its ultimate goal is to provide, freely and without any conditions, most of the datasets used by CESP researchers.

To be clear, it is NOT the original datasets that are provided here, but synthetic (cloned) datasets.

From a formal point of view, these synthetic datasets are built to follow the same joint probability distribution as the original datasets they imitate.

This has been made possible by a rich ecosystem of open source libraries in R and Python.
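
To make the idea of "same joint probability distribution" concrete, here is a minimal sketch in plain Python. It is not what the libraries presented on this page do internally, and the variables (age, systolic blood pressure) and their values are hypothetical toy data. It assumes the simplest possible model, a bivariate normal distribution: the means, standard deviations and correlation are estimated from the "original" records, and new records are then drawn from the fitted distribution.

```python
import math
import random


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in rows) / (n - 1))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_bivariate_normal(params, n, rng):
    """Draw n new records from the fitted bivariate normal distribution."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 and z2 reproduces the estimated correlation rho.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        records.append((x, y))
    return records


rng = random.Random(0)

# Hypothetical "original" data: age and systolic blood pressure (toy values).
original = []
for _ in range(2000):
    age = rng.gauss(50, 10)
    sbp = 100 + 0.5 * age + rng.gauss(0, 8)
    original.append((age, sbp))

original_params = fit_bivariate_normal(original)
synthetic = sample_bivariate_normal(original_params, len(original), rng)
synthetic_params = fit_bivariate_normal(synthetic)

print("original  (mean_x, mean_y, sd_x, sd_y, rho):", original_params)
print("synthetic (mean_x, mean_y, sd_x, sd_y, rho):", synthetic_params)
```

No synthetic record is a copy of an original one, yet the two sets of summary statistics should be close. Real datasets mix numeric and categorical variables and are rarely normal, which is why dedicated tools rely on more flexible models.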

SynthPop

The synthpop package for R allows users to create synthetic versions of confidential individual-level data for use by researchers interested in making inferences about the population that the data represent. The synthetic data can be used to carry out statistical analyses, though we would usually recommend repeating the analysis on the original data to confirm the results. Synthetic data are also useful as teaching datasets.

https://synthpop.org.uk/

SDV

The Synthetic Data Vault (SDV) is an ecosystem of Python libraries for synthetic data generation. It lets users learn a model from single-table, multi-table and time series datasets, and then generate new synthetic data with the same format and statistical properties as the original dataset.

https://sdv.dev/

How does it work?

The synthetic observations are generated so that their level of similarity to the original observations is strictly less than the similarity between the original observations themselves. This is how privacy is guaranteed.
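
The criterion above can be sketched as a simple distance check. This is only one possible reading of it, under stated assumptions: similarity is measured with the Euclidean distance on numeric records (a smaller distance means a higher similarity), and a synthetic dataset is accepted only if its closest approach to any original record is strictly farther than the closest pair of distinct original records. The function names and the toy records are hypothetical; the metric actually used by the project is not stated here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric records of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def passes_similarity_check(original, synthetic):
    """Return True if every synthetic record is strictly farther from all
    original records than the two closest distinct original records are
    from each other (assumed reading of the similarity criterion)."""
    closest_original_pair = min(
        euclidean(a, b)
        for i, a in enumerate(original)
        for b in original[i + 1:]
    )
    closest_synthetic_to_original = min(
        euclidean(s, o) for s in synthetic for o in original
    )
    return closest_synthetic_to_original > closest_original_pair


# Hypothetical toy records with two numeric variables each.
original = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
safe_synthetic = [(2.0, 1.5), (-1.5, 1.0)]
leaky_synthetic = [(2.0, 1.5), (1.0, 0.1)]  # second record nearly copies (1.0, 0.0)

print(passes_similarity_check(original, safe_synthetic))
print(passes_similarity_check(original, leaky_synthetic))
```

In practice the variables would need to be put on comparable scales before computing distances, and categorical variables would need a suitable distance of their own.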

Of course, no scientific publication can be based on analyses of these synthetic data. If you feel that your analyses could be of scientific interest, your scripts will need to be applied to the original datasets. This can be arranged using the "contact us" option on this page.

A formal agreement will have to be drawn up to settle authorship positions for any future publications. In addition, a limited financial contribution may be requested to cover the data management and statistical analysis workload involved.