Welcome to Clover’s documentation!
#
Synthetic Health Data Generation and Validation Library#
Advances in health research are constrained by the availability of data. Indeed, access to a large amount of data from different sources is a key factor to increase the generalizability of the machine learning algorithms and validate them and thus improve healthcare for the population.
Public and pre-processed data do not reflect the real-world. Synthetic data, which preserve the properties of the original dataset while overcoming privacy risks since the information is no longer personal, hold promise. However, the evidence regarding their utility and security remains unclear. For widespread adoption of synthetic data, both by the general public and by potential users, it is essential to establish best practices to mitigate the risks of privacy breach and information loss.
The goal of this project is therefore to provide means to perform a comprehensive study on synthetic data generation. The quality of the synthetic data and their generator will be evaluated on two criteria: the preservation of information and privacy. A trade-off between these two aspects is necessary in order to preserve the properties of the real data without compromising the privacy of the patients.
Useful Links#
Current Features#
- Synthetic data generators incorporating integrated differential privacy, supporting continuous and categorical variables (unique identifiers are not handled):
- Utility report to assess the fidelity of the synthetic data:
Summary table
Detailed report with figures
- The following utility metrics are implemented:
- Univariate metrics
Continuous & categorical consistency
Continuous & categorical statistics
Hellinger distance
Kullback-Leibler divergence
- Bivariate metrics
Pairwise Pearson Correlation Difference
Pairwise Chi-square correlation difference
- Population metrics
Distinguishability
Cross learning (regression & classification)
- Application metrics
Prediction (regression & classification)
F-Score for binary classification with continuous variables only
Feature importance
- The following privacy metrics are implemented:
- Reidentification metrics
Distance to Closest Record
Nearest Neighbor Distance Ratio
- Membership inference attack (MIA)
GAN-Leaks
Monte Carlo membership inference attack
Logan
TableGan
Detector
Collision
Metareport to compare several synthetic datasets with respect to the metrics
See the documentation of each component in the User Guide section for more details.
Ongoing Work - Next Steps#
Improve data coverage (direct identifiers, missing data, etc.)
Improve the utility metrics (better discretization, learning algorithms, etc.)
Create a benchmark of the synthetic data generator in different settings