How to generate a combined utility/privacy report?#

Create a combined report of the metrics, whether they are utility or privacy metrics. Note that only the summary is combined.#

This notebook assumes that the synthetic data has already been generated, based on the Wisconsin Breast Cancer Dataset (WBCD).
[1]:
# Standard library
import sys
import tempfile

sys.path.append("..")

# 3rd party packages
import pandas as pd

# Local packages
import config
import utils.draw
from metrics.report import Report

Load the real and synthetic Wisconsin Breast Cancer Datasets#

[2]:
df_real = {}
df_real["train"] = pd.read_csv("../data/WBCD_train.csv")
df_real["test"] = pd.read_csv("../data/WBCD_test.csv")
df_real["train"].shape
[2]:
(455, 10)

Choose the synthetic dataset#

[3]:
df_synth = {}
df_synth["train"] = pd.read_csv("../results/data/2024-02-15_Synthpop_455samples.csv")
df_synth["test"] = pd.read_csv("../results/data/2024-02-15_Synthpop_228samples.csv")
df_synth["2nd_gen"] = pd.read_csv(
    "../results/data/2024-02-15_Synthpop_455samples_2nd_gen.csv"
)
df_synth["test"].shape
[3]:
(228, 10)
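Before going further, it can be worth checking that the real and synthetic frames expose the same schema. The cell below is an optional sanity check added for illustration; it is not part of the original workflow.

[ ]:
# Illustrative sanity check: the real and synthetic splits should share
# the same columns before the report is computed.
for split in ("train", "test"):
    assert list(df_real[split].columns) == list(df_synth[split].columns), (
        f"Column mismatch in the '{split}' split"
    )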

Configure the metadata dictionary#

The continuous and categorical variables need to be specified, as well as the variable to predict#

[4]:
metadata = {
    "continuous": [
        "Clump_Thickness",
        "Uniformity_of_Cell_Size",
        "Uniformity_of_Cell_Shape",
        "Marginal_Adhesion",
        "Single_Epithelial_Cell_Size",
        "Bland_Chromatin",
        "Normal_Nucleoli",
        "Mitoses",
        "Bare_Nuclei",
    ],
    "categorical": ["Class"],
    "variable_to_predict": "Class",
}
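If the variable types are not known in advance, an equivalent dictionary can be assembled from the columns of the real training set. The cell below is an illustrative sketch (not from the original notebook); it assumes that Class is the only categorical variable and the prediction target.

[ ]:
# Illustrative sketch: rebuild the metadata dictionary from the columns
# of the real training set, assuming "Class" is the only categorical
# variable and also the variable to predict.
target = "Class"
metadata_auto = {
    "continuous": [c for c in df_real["train"].columns if c != target],
    "categorical": [target],
    "variable_to_predict": target,
}
# Both constructions should describe the same variables (order aside).
assert set(metadata_auto["continuous"]) == set(metadata["continuous"])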

Generate the report#

[5]:
parameters = {  # see the notebooks utility_report and privacy_report for more details
    "cross_learning": False,
    "num_repeat": 1,
    "num_kfolds": 3,
    "num_optuna_trials": 15,
    "use_gpu": True,
    "sampling_frac": 0.5,
}
[6]:
report = Report(
    dataset_name="Wisconsin Breast Cancer Dataset",
    df_real=df_real,
    df_synthetic=df_synth,
    metadata=metadata,
    figsize=(8, 6),  # will be automatically adjusted for larger or longer figures
    random_state=42,  # for reproducibility purposes
    report_folderpath=None,  # load computed utility and/or privacy reports if available
    report_filename=None,  # the name of the computed report (without the extension or the utility/privacy suffix), if available
    metrics=None,  # list of the metrics to compute. Can be utility or privacy metrics. If not specified, all the metrics are computed.
    params=parameters,  # the dictionary containing the parameters for both utility and privacy reports
)
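The metrics argument can also restrict the report to a subset of metrics instead of computing all of them. The cell below is a hypothetical example: the identifiers "classif" and "dcr" are assumptions based on the aliases shown in the summary table, so check the metrics module for the names it actually expects.

[ ]:
# Hypothetical example: compute only a couple of metrics.
# The metric identifiers below are assumptions; verify them against the
# metrics module before relying on this.
partial_report = Report(
    dataset_name="Wisconsin Breast Cancer Dataset",
    df_real=df_real,
    df_synthetic=df_synth,
    metadata=metadata,
    metrics=["classif", "dcr"],
    params=parameters,
)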
[7]:
report.compute()
LOGAN test set shape: (228, 10)
TableGan test set shape: (228, 10)
Detector test set shape: (228, 10)

Get the summary report as a pandas dataframe#

[8]:
report.specification()
----- Wisconsin Breast Cancer Dataset -----
Contains:
    - 455 instances in the train set,
    - 228 instances in the test set,
    - 10 variables, 9 continuous and 1 categorical.
[9]:
df_summary = report.summary()
[10]:
# Group by metric name and objective so the summary is easier to read
by = ["name", "objective", "min", "max"]
df_summary.groupby(by).apply(lambda x: x.drop(by, axis=1).reset_index(drop=True))
[10]:
| name | objective | min | max | alias | submetric | value |
| --- | --- | --- | --- | --- | --- | --- |
| Categorical Consistency | max | 0 | 1.0 | cat_consis | within_ratio | 1.000000 |
| Categorical Statistics | max | 0 | 1.0 | cat_stats | support_coverage | 1.000000 |
| Categorical Statistics | max | 0 | 1.0 | cat_stats | frequency_coverage | 0.975824 |
| Classification | min | 0 | 1.0 | classif | diff_real_synth | 0.001478 |
| Collision | max | 0 | 1.0 | collision | precision | 0.460317 |
| Collision | max | 0 | 1.0 | collision | recall | 0.966667 |
| Collision | max | 0 | 1.0 | collision | f1_score | 0.623656 |
| Collision | max | 0 | 1.0 | collision | recovery_rate | 0.186495 |
| Collision | max | 0 | inf | collision | avg_num_appearance_realtrain | 1.463023 |
| Collision | max | 0 | inf | collision | avg_num_appearance_realcontrol | 1.349112 |
| Collision | max | 0 | inf | collision | avg_num_appearance_synth | 1.552901 |
| Collision | max | 0 | inf | collision | avg_num_appearance_collision_real | 3.250000 |
| Collision | max | 0 | inf | collision | avg_num_appearance_collision_synth | 3.566667 |
| Continuous Consistency | max | 0 | 1.0 | cont_consis | within_ratio | 1.000000 |
| Continuous Statistics | min | 0 | inf | cont_stats | median_l1_distance | 0.000000 |
| Continuous Statistics | min | 0 | inf | cont_stats | iqr_l1_distance | 0.049383 |
| DCR | max | 0 | 1.0 | dcr | nndr_5th_percent_synthreal_train | 0.000000 |
| DCR | max | 0 | 1.0 | dcr | nndr_5th_percent_synthreal_control | 0.000000 |
| DCR | max | 0 | inf | dcr | dcr_5th_percent_synthreal_train | 0.000000 |
| DCR | max | 0 | inf | dcr | dcr_5th_percent_synthreal_control | 0.000000 |
| DCR | min | 0 | 1.0 | dcr | ratio_match_synthreal_train | 0.368421 |
| DCR | min | 0 | 1.0 | dcr | ratio_match_synthreal_control | 0.377193 |
| Detector | max | 0 | 1.0 | detector | precision_top1% | 1.000000 |
| Detector | max | 0 | 1.0 | detector | precision_top50% | 0.473684 |
| Detector | max | 0 | 1.0 | detector | precision | 0.478261 |
| Detector | max | 0 | 1.0 | detector | tpr_at_0.001%_fpr | 0.096491 |
| Detector | max | 0 | 1.0 | detector | tpr_at_0.1%_fpr | 0.096491 |
| Distinguishability | min | 0 | 1.0 | dist | prediction_mse | 0.025798 |
| Distinguishability | min | 0 | 1.0 | dist | prediction_mse_real | 0.032699 |
| Distinguishability | min | 0 | 1.0 | dist | prediction_mse_synth | 0.018898 |
| Distinguishability | min | 0 | 1.0 | dist | prediction_auc_rescaled | 0.033568 |
| FScore | min | 0 | inf | fscore | diff_f_score | 0.353215 |
| Feature Importance | min | 0 | inf | feature_imp | diff_permutation_importance | 0.014686 |
| GAN-Leaks | max | 0 | 1.0 | ganleaks | precision_top1% | 0.000000 |
| GAN-Leaks | max | 0 | 1.0 | ganleaks | precision_top50% | 0.850877 |
| Hellinger Categorical Univariate Distance | min | 0 | 1.0 | hell_cat_univ_dist | hellinger_distance | 0.018080 |
| Hellinger Continuous Univariate Distance | min | 0 | 1.0 | hell_cont_univ_dist | hellinger_distance | 0.053864 |
| KL Divergence Categorical Univariate Distance | min | 0 | inf | kl_div_cat_univ_dist | kl_divergence | 0.001300 |
| KL Divergence Continuous Univariate Distance | min | 0 | inf | kl_div_cont_univ_dist | kl_divergence | 0.012048 |
| LOGAN | max | 0 | 1.0 | logan | precision_top1% | 1.000000 |
| LOGAN | max | 0 | 1.0 | logan | precision_top50% | 0.543860 |
| LOGAN | max | 0 | 1.0 | logan | precision | 0.551181 |
| LOGAN | max | 0 | 1.0 | logan | tpr_at_0.001%_fpr | 0.017544 |
| LOGAN | max | 0 | 1.0 | logan | tpr_at_0.1%_fpr | 0.017544 |
| Monte Carlo Membership | max | 0 | 1.0 | mcmebership | precision_top1% | 0.500000 |
| Monte Carlo Membership | max | 0 | 1.0 | mcmebership | precision_top50% | 0.526316 |
| Pairwise Correlation Difference | min | 0 | inf | pcd | norm | 0.222504 |
| TableGan | max | 0 | 1.0 | tablegan | precision_top1% | 0.500000 |
| TableGan | max | 0 | 1.0 | tablegan | precision_top50% | 0.500000 |
| TableGan | max | 0 | 1.0 | tablegan | precision | 0.513158 |
| TableGan | max | 0 | 1.0 | tablegan | tpr_at_0.001%_fpr | 0.000000 |
| TableGan | max | 0 | 1.0 | tablegan | tpr_at_0.1%_fpr | 0.000000 |
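Since the summary is a regular pandas DataFrame, it can be filtered or exported like any other frame. The cell below is a short illustration; the output path is hypothetical.

[ ]:
# Keep only one metric, or export the full summary for later inspection.
df_summary[df_summary["name"] == "Classification"]
df_summary.to_csv("../results/WBCD_summary.csv", index=False)  # hypothetical path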

Save and load the report#

[ ]:
with tempfile.TemporaryDirectory() as temp_dir:
    report.save(savepath=temp_dir, filename="report")  # save
    new_report = Report(report_folderpath=temp_dir, report_filename="report")  # load
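Here the report is written to a temporary directory only to demonstrate the round trip. In practice, a persistent folder lets a later session reload the computed metrics instead of recomputing them; the sketch below uses a hypothetical folder and filename.

[ ]:
# Persist the report to a results folder (hypothetical path) so that a
# later session can reload it instead of recomputing every metric.
report.save(savepath="../results/reports", filename="wbcd_combined")
reloaded = Report(report_folderpath="../results/reports", report_filename="wbcd_combined")
# Depending on the implementation, the summary should be available
# directly from the reloaded report without calling compute() again.
df_summary_reloaded = reloaded.summary()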