Privacy Metrics
While the generated synthetic data should preserve the properties of the real data, it is also crucial that the synthetic data does not reveal any information about the individuals in the real data. In Clover, several privacy metrics have been implemented to evaluate the privacy-preserving integrity of the synthetic data. These privacy metrics are summarized in Table 1.
| Category | Metric |
|---|---|
| Reidentification | Distance to closest record (DCR) |
| | Ratio match |
| | Nearest neighbor distance ratio (NNDR) |
| Membership inference attack (MIA) | GAN-Leaks |
| | Monte Carlo membership inference attack |
| | Logan |
| | TableGan |
| | Detector |
| | Collision |
Reidentification
Reidentification metrics use distance-based algorithms to assess the risk that a synthetic record can be linked back to a record in the real data. If the distance between real and synthetic data is too small, there may be a risk of revealing sensitive information from the training data. Conversely, if the distance is too large, the quality of the synthetic data might be poor. The reidentification metrics implemented in Clover include:
- Distance to closest record (DCR): For each synthetic record, the distance to its closest record in the real data.
- Ratio match: The proportion of synthetic records with a DCR below a predefined threshold.
- Nearest neighbor distance ratio (NNDR): For each synthetic record, the ratio of the Gower distance to its nearest neighbor in the real data to the Gower distance to its second nearest neighbor in the real data.
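The three reidentification metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not Clover's implementation: the function names (`dcr`, `ratio_match`, `nndr`) and the toy data are hypothetical, records are assumed to be purely numeric tuples, and Euclidean distance is swapped in for the Gower distance to keep the example dependency-free.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric records (stand-in for Gower)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dcr(synthetic, real, dist=euclidean):
    """Distance from each synthetic record to its closest real record."""
    return [min(dist(s, r) for r in real) for s in synthetic]


def ratio_match(synthetic, real, threshold, dist=euclidean):
    """Proportion of synthetic records whose DCR is below `threshold`."""
    distances = dcr(synthetic, real, dist)
    return sum(d < threshold for d in distances) / len(distances)


def nndr(synthetic, real, dist=euclidean):
    """Nearest / second-nearest real-neighbor distance ratio per synthetic record.

    Requires at least two records in `real`.
    """
    ratios = []
    for s in synthetic:
        nearest, second = sorted(dist(s, r) for r in real)[:2]
        # If both distances are zero the ratio is undefined; this sketch
        # reports 1.0 (no single real record stands out) as a convention.
        ratios.append(nearest / second if second > 0 else 1.0)
    return ratios


# Toy data (hypothetical): the first synthetic record copies a real record.
real = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

print(dcr(synthetic, real))
print(ratio_match(synthetic, real, threshold=1.0))
print(nndr(synthetic, real))
```

In this toy example the copied record has a DCR of 0 and an NNDR of 0, which is the pattern that signals a reidentification risk: the synthetic record sits on top of exactly one real record. The threshold passed to `ratio_match` is arbitrary here and would need to be chosen for the data at hand.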
Membership inference attack (MIA)
Membership disclosure occurs when an adversary who has access to records from the population underlying the real data can determine which of those records were used to train the synthetic data generator. A membership inference attack (MIA) is often carried out by training a machine learning model to infer whether a record in the population was used for training a synthetic data generator. In Clover, a range of state-of-the-art MIA models has been adapted and implemented:
- GAN-Leaks: Infers membership from the distance to closest record (DCR) between each record in the real data and the synthetic data.
- Monte Carlo membership inference attack: Infers membership from the number of neighbors that each record in the real data has within the synthetic data.
- Logan: Membership inference is achieved by training a model to distinguish between the 1st and 2nd generation synthetic data.
- TableGan: Membership inference is achieved by training a discriminator and a classifier.
- Detector: Membership inference is achieved by training a model to distinguish between the 1st generation synthetic data and real data that were not used in generating the synthetic data.
- Collision: Trains a model to classify whether each record in the synthetic data collides with a record in the real data that was used to generate the synthetic data.
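The two distance-based attacks in the list above (GAN-Leaks and Monte Carlo) can be illustrated with a simplified sketch. It is not Clover's implementation and omits details of the published attacks: the function names, the `radius` and `threshold` parameters, and the toy data are all hypothetical, records are assumed to be numeric tuples, and Euclidean distance is an assumed choice of metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric records (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gan_leaks_scores(candidates, synthetic, dist=euclidean):
    """GAN-Leaks-style score: negative distance to the closest synthetic record.

    A higher score (smaller distance) suggests the candidate was a training member.
    """
    return [-min(dist(c, s) for s in synthetic) for c in candidates]


def monte_carlo_scores(candidates, synthetic, radius, dist=euclidean):
    """Monte Carlo-style score: fraction of synthetic records within `radius`.

    A higher score (denser synthetic neighborhood) suggests membership.
    """
    return [
        sum(dist(c, s) <= radius for s in synthetic) / len(synthetic)
        for c in candidates
    ]


def predict_members(scores, threshold):
    """Flag a candidate as a member when its score reaches `threshold`."""
    return [score >= threshold for score in scores]


# Toy data (hypothetical): the first candidate lies close to synthetic records,
# the second lies far from all of them.
synthetic = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
candidates = [(0.0, 0.5), (9.0, 9.0)]

gl = gan_leaks_scores(candidates, synthetic)
mc = monte_carlo_scores(candidates, synthetic, radius=1.0)

print(gl, predict_members(gl, threshold=-1.0))
print(mc, predict_members(mc, threshold=0.5))
```

Both scores rank the first candidate as the more likely training member. In practice the decision threshold would be calibrated on records whose membership status is known, and the attack would be judged by how well its predictions separate members from non-members.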