Welcome to Clover’s documentation! citadel ivado#

Synthetic Health Data Generation and Validation Library#

Advances in health research are constrained by the availability of data. Indeed, access to a large amount of data from different sources is a key factor to increase the generalizability of the machine learning algorithms and validate them and thus improve healthcare for the population.

Public and pre-processed data do not reflect the real-world. Synthetic data, which preserve the properties of the original dataset while overcoming privacy risks since the information is no longer personal, hold promise. However, the evidence regarding their utility and security remains unclear. For widespread adoption of synthetic data, both by the general public and by potential users, it is essential to establish best practices to mitigate the risks of privacy breach and information loss.

The goal of this project is therefore to provide means to perform a comprehensive study on synthetic data generation. The quality of the synthetic data and their generator will be evaluated on two criteria: the preservation of information and privacy. A trade-off between these two aspects is necessary in order to preserve the properties of the real data without compromising the privacy of the patients.

Current Features#

  • Synthetic data generators incorporating integrated differential privacy, supporting continuous and categorical variables (unique identifiers are not handled):
  • Utility report to assess the fidelity of the synthetic data:
    • Summary table

    • Detailed report with figures

  • The following utility metrics are implemented:
    • Univariate metrics
      • Continuous & categorical consistency

      • Continuous & categorical statistics

      • Hellinger distance

      • Kullback-Leibler divergence

    • Bivariate metrics
      • Pairwise Pearson Correlation Difference

      • Pairwise Chi-square correlation difference

    • Population metrics
      • Distinguishability

      • Cross learning (regression & classification)

    • Application metrics
      • Prediction (regression & classification)

      • F-Score for binary classification with continuous variables only

      • Feature importance

  • The following privacy metrics are implemented:
    • Reidentification metrics
      • Distance to Closest Record

      • Nearest Neighbor Distance Ratio

    • Membership inference attack (MIA)
      • GAN-Leaks

      • Monte Carlo membership inference attack

      • Logan

      • TableGan

      • Detector

      • Collision

  • Metareport to compare several synthetic datasets with respect to the metrics

See the documentation of each component in the User Guide section for more details.

Ongoing Work - Next Steps#

  • Improve data coverage (direct identifiers, missing data, etc.)

  • Improve the utility metrics (better discretization, learning algorithms, etc.)

  • Create a benchmark of the synthetic data generator in different settings