Synthpop#

Introduction#

To explore the efficacy of synthetic data generation, the paper “Utility of synthetic microdata generated using tree-based methods” evaluates the utility of synthetic data produced with the tree-based methods implemented in the synthpop software. Comparing classification and regression trees (CART), bagging, and random forests, the authors assess the synthetic data against statistical disclosure control and utility metrics. The findings show that tree-based methods are a viable way to synthesize tabular data in contexts constrained by confidentiality concerns. Sequential decision trees take this idea further, offering a compelling way to address confidentiality constraints while preserving data utility. The paper “Bespoke Creation of Synthetic Data in R” introduces the synthpop package for R, a comprehensive tool that lets researchers generate synthetic versions of original datasets, mimicking the observed data while preserving the essential relationships between variables.

Algorithm#

Sequential Regression/Classification Trees (SRTs) are a machine learning approach used for data imputation and data generation, built on decision trees applied in sequence.

Here’s how SRTs work in the context of data generation (Emam et al., 2021). Let’s say we have five variables: A, B, C, D, and E. The generation is performed sequentially, so we need to define a sequence; various criteria can be used to choose one. For our example, we use the sequence A -> E -> C -> B -> D. The prime notation indicates that a variable has been synthesized: for example, A’ is the synthesized version of A. The generative process consists of two general steps, fitting and synthesis. The steps for sequential generation are as follows:

Input:
- Training source dataset

Output:
- Synthetic dataset

Procedure SequentialTrees:
    1. Construct Sequential Regression/Classification Trees (SRTs) using the training dataset
        - Build a model F1: E ~ A
        - Build a model F2: C ~ A + E
        - Build a model F3: B ~ A + E + C
        - Build a model F4: D ~ A + E + C + B

    2. Generate synthetic data
        - Sample from the A distribution to get A'
        - Synthesize E as E' = F1(A')
        - Synthesize C as C' = F2(A', E')
        - Synthesize B as B' = F3(A', E', C')
        - Synthesize D as D' = F4(A', E', C', B')
        - The four models (F1, F2, F3, and F4) make up the overall generative model

    3. Return the final synthetic dataset

The first variable to be synthesized, A, cannot have any predictors, so its synthetic values are generated by random sampling with replacement from its observed values. Then the distribution of E conditional on A is estimated, and the synthetic values of E are generated using the fitted model and the synthesized values of A. Next, the distribution of C conditional on A and E is estimated and used, together with the synthetic values of A and E, to generate the synthetic values of C, and so on. The distribution of the last variable, D, is conditional on all the other variables. Similar conditional specification approaches are used in most implementations of synthetic data generation.
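The sketch below illustrates this sequential procedure with scikit-learn regression trees on purely numeric data. The function name sequential_cart_synthesize, the donor-sampling details, and the commented usage are illustrative assumptions, not the synthpop implementation itself.

from typing import List, Optional

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def sequential_cart_synthesize(
    df: pd.DataFrame,
    order: Optional[List[str]] = None,
    min_samples_leaf: int = 5,
    max_depth: Optional[int] = None,
    random_state: int = 0,
) -> pd.DataFrame:
    """Sequentially synthesize numeric columns with CART-style donor sampling."""
    rng = np.random.default_rng(random_state)
    order = list(order or df.columns)
    synth = pd.DataFrame(index=df.index, columns=order, dtype=float)

    # The first variable has no predictors: sample with replacement from its
    # observed values.
    synth[order[0]] = rng.choice(df[order[0]].to_numpy(), size=len(df), replace=True)

    # Each later variable is modelled on the variables already visited, then
    # synthesized from the already-synthesized values of those predictors.
    for i, target in enumerate(order[1:], start=1):
        predictors = order[:i]
        tree = DecisionTreeRegressor(
            min_samples_leaf=min_samples_leaf,
            max_depth=max_depth,
            random_state=random_state,
        )
        tree.fit(df[predictors], df[target])

        # CART synthesis draws a donor value from the leaf each synthetic
        # record falls into, approximating the conditional distribution.
        donors = pd.Series(df[target].to_numpy()).groupby(tree.apply(df[predictors]))
        leaves = tree.apply(synth[predictors])
        synth[target] = [rng.choice(donors.get_group(leaf).to_numpy()) for leaf in leaves]

    return synth[df.columns]


# Example with the A -> E -> C -> B -> D sequence from the text:
# data = pd.DataFrame(np.random.default_rng(42).normal(size=(500, 5)), columns=list("ABCDE"))
# synthetic = sequential_cart_synthesize(data, order=list("AECBD"))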

Clover implementation#

"""
Wrapper of the Synthpop Python implementation https://github.com/hazy/synthpop.

:cvar name: the name of the generator
:vartype name: str

:param df: the data to synthesize
:param metadata: a dictionary containing the list of **continuous** and **categorical** variables
:param random_state: for reproducibility purposes
:param generator_filepath: the path of the generator to sample from if it exists
:param variables_order: the order of the variables used to construct the sequential trees
:param min_samples_leaf: the minimum number of samples required in a leaf to expand the tree further
:param max_depth: the maximum depth of the tree. If None, the tree expands until all leaves are pure or
    until they contain fewer than min_samples_leaf samples
"""

name = "Synthpop"

def __init__(
    self,
    df: pd.DataFrame,
    metadata: dict,
    random_state: int = None,
    generator_filepath: Union[Path, str] = None,
    variables_order: List[str] = None,
    min_samples_leaf: int = 5,
    max_depth: int = None,
):
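
As a usage illustration, a hypothetical instantiation of this wrapper might look like the following. The import path and the metadata keys ("continuous" and "categorical") are assumptions inferred from the docstring above; check the Clover documentation for the exact import and for the methods used to fit and sample from the generator.

import pandas as pd

from clover.generators import Synthpop  # hypothetical import path, adjust to your install

df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29],
        "income": [30_000, 45_000, 60_000, 52_000, 41_000, 36_000],
        "sex": ["F", "M", "F", "M", "F", "M"],
    }
)

# Metadata keys are assumed from the docstring: lists of continuous and
# categorical column names.
metadata = {
    "continuous": ["age", "income"],
    "categorical": ["sex"],
}

generator = Synthpop(
    df=df,
    metadata=metadata,
    random_state=42,
    variables_order=["age", "sex", "income"],  # visit sequence for the sequential trees
    min_samples_leaf=5,
    max_depth=None,
)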

How to optimize the hyperparameters?#

Synthpop has two types of hyperparameters: the order in which the variables are synthesized (the visit sequence) and the depth of the trees.

  • The default visit sequence is the order of the dataset columns. Another option is to treat the variable order as an optimization problem, similar to the Traveling Salesman Problem. The discrete Particle Swarm Optimization (PSO) implemented in the library can be used to solve it; see the discrete PSO section in the User Guide.

  • The tree depth can be tuned with the minimum number of samples per leaf (“min_samples_leaf”) and/or the maximum depth (“max_depth”) parameters. Optuna or Ray Tune can be used to find the best values; see the sketch below.

For more details, please refer to the notebook Tune hyperparameters to learn how to use the optimizers.
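
As a hedged sketch of the second option, the snippet below tunes min_samples_leaf and max_depth with Optuna. It reuses the sequential_cart_synthesize helper sketched in the Algorithm section and scores each trial with a simple propensity-style utility (a classifier that cannot separate real from synthetic rows gives an ROC AUC close to 0.5); both the helper and the metric are illustrative stand-ins, not Clover's built-in generator or utility metrics.

import numpy as np
import optuna
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy numeric dataset; replace with the real data to synthesize.
real_df = pd.DataFrame(
    np.random.default_rng(42).normal(size=(500, 5)), columns=list("ABCDE")
)


def utility_score(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Propensity-style utility: 1.0 when a classifier cannot tell real and
    # synthetic rows apart (ROC AUC close to 0.5), lower otherwise.
    X = pd.concat([real, synth], ignore_index=True)
    y = np.r_[np.zeros(len(real)), np.ones(len(synth))]
    auc = cross_val_score(
        LogisticRegression(max_iter=1000), X, y, cv=3, scoring="roc_auc"
    ).mean()
    return 1.0 - 2.0 * abs(auc - 0.5)


def objective(trial: optuna.Trial) -> float:
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 2, 50)
    max_depth = trial.suggest_int("max_depth", 2, 20)
    synth = sequential_cart_synthesize(
        real_df, min_samples_leaf=min_samples_leaf, max_depth=max_depth
    )
    return utility_score(real_df, synth)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)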

References#