Synthetic data are an increasingly important mechanism for sharing data among collaborators and with the public. Multiple methods for generating synthetic data have been proposed, but many have shortcomings with respect to maintaining the statistical properties of the original data. We propose a new method for fully synthetic data generation that leverages linear and integer mathematical programming models to match the moments of the original data in the synthetic data. This method has no inherent disclosure risk and does not require parametric or distributional assumptions. We demonstrate this methodology using the Framingham Heart Study, comparing our approach with existing synthetic data methods that use chained equations. We fitted Cox proportional hazards, logistic regression, and nonparametric models to synthetic data and compared them with models fitted to the original data. True coverage, the proportion of synthetic data parameter confidence intervals that include the original data's parameter estimate, was 100% for parametric models when up to four moments were matched, consistently outperforming the chained equations approach. The area under the curve and accuracy of nonparametric models trained on synthetic data differed only marginally when tested on the full original data. Models were also trained on synthetic data together with a partition of the original data and tested on a held-out portion of the original data. Fourth-order moment-matched synthetic data outperformed the others with respect to fitted parametric models but did not always do so with fitted nonparametric models; no single synthetic data method consistently outperformed the others when assessing the performance of nonparametric models. The performance of fourth-order moment-matched synthetic data in fitting parametric models suggests its use in these cases.
Our empirical results also suggest that the performance of synthetic data generation techniques, including the moment matching approach, is less stable when used with nonparametric models. The benefits of the moment matching approach should be weighed against its additional computational costs. In summary, our results demonstrate that the introduced moment matching approach may be considered an alternative to existing synthetic data generation methods.
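The true coverage metric defined above can be illustrated with a short sketch. All numbers, interval endpoints, and the `true_coverage` helper below are hypothetical and for illustration only; they do not come from the study's results:

```python
# Hypothetical parameter estimate obtained from the original data.
original_estimate = 1.25

# Hypothetical 95% confidence intervals for the same parameter, each fitted
# to a different synthetic data set, given as (lower, upper) pairs.
synthetic_cis = [
    (1.10, 1.40),
    (1.05, 1.30),
    (1.30, 1.60),  # this interval misses the original estimate
    (1.00, 1.50),
    (1.20, 1.45),
]

def true_coverage(estimate, intervals):
    """Proportion of synthetic-data confidence intervals that contain the
    original data's parameter estimate."""
    hits = [lo <= estimate <= hi for lo, hi in intervals]
    return sum(hits) / len(hits)

coverage = true_coverage(original_estimate, synthetic_cis)  # 4 of 5 -> 0.8
```

A value of 1.0 (100%), as reported for the parametric models with four matched moments, means every synthetic-data interval contained the original estimate.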
In many domains, disseminating data to potential collaborators is critical for developing innovative models and research opportunities. Sharing such information often helps justify the time-consuming task of collecting the data, and can be an important consideration when public funding is used for data collection. However, data are frequently kept confidential for various reasons; for example, the data may contain sensitive information, such as in healthcare and banking applications. In cases where participants have provided personal information, they may have consented to dissemination of their information only for specific uses.
Furthermore, permission-granting procedures for data access can be time consuming, requiring proposal submissions, establishing trusted partnerships with the data owners, or applying for Institutional Review Board approval. Many of these procedures require detailed plans for how the data will be used, sometimes inhibiting the process of discovery. After all of this effort, the acquired data sets may not be what the researcher originally envisioned or may have structural deficiencies, or requests may be approved only for limited versions of the original data. Researchers may even be required to physically visit data centers to access the data,1 and may, after all their research is completed, not be permitted to publish the results if the data owners require approval before submission.
Synthetic data, comprising plausible observations that replace true observations,2 have emerged as a viable alternative to providing access to original data sets. Partially synthetic data permit some values from the observed data to be disclosed. Fully synthetic data, where no values in the original data set appear in the disclosed data set, have become increasingly used to avoid disclosure risk. In recent years, there has been considerable discussion regarding the suitability of synthetic data as a replacement for collected data, which has led to the development of synthetic data methodologies.1,3
Computer science and engineering disciplines often acquire data to develop mathematical or algorithmic tools. The development of such tools relies on understanding the statistical and structural properties of the variables of interest, but may not require the precise values from the original data, so long as the tools perform similarly on the real data.1
In this work, we propose a new approach for generating synthetic data. This approach leverages the concept of moment matching and avoids distributional assumptions. We apply this approach to data from a well-known cardiovascular longitudinal study, the Framingham Heart Study (FHS). To evaluate its effectiveness, we compare the performance of the moment-matching-based method with that of chained-equations synthetic data methods on important classes of prediction models in healthcare: logistic regression and Cox proportional hazards models. We also test performance using the nonparametric models k-nearest neighbors (KNN) and bagged trees.
Disclosure control includes all methods for protecting private data, with a comprehensive review in Hundepool et al.4 and Domingo-Ferrer et al.5 The oldest class of disclosure control is matrix masking, where the observed data are changed in some way and the modified data are then released.4 Data swapping6 interchanges select values between records. Postrandomization for statistical disclosure control uses Markovian matrices to probabilistically change categorical variables.7 Binning groups observations prior to distribution. Data shuffling ensures that the marginal distributions of the original and disclosed data are the same.8
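The postrandomization idea can be made concrete with a small sketch. The transition matrix `P` below is hypothetical, chosen only to illustrate the mechanism: each row gives the probabilities with which a true category is released as each possible category, so large diagonal entries leave most records unchanged while giving every individual value plausible deniability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Markovian transition matrix for a 3-level categorical
# variable: entry P[i, j] is the probability that true category i is
# released as category j. Each row sums to 1.
P = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
])

def pram(values, P, rng):
    """Probabilistically perturb integer-coded categories using P."""
    return np.array([rng.choice(len(P), p=P[v]) for v in values])

original = rng.integers(0, 3, size=1000)
released = pram(original, P, rng)

# Most records are released unchanged, but any individual record may
# have been swapped, which is the source of the disclosure protection.
agreement = np.mean(original == released)  # expected near 0.9
```

As the text notes, the resulting disclosure risk and the distortion of joint distributions both depend entirely on the choice of `P`.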
Traditional disclosure control methods have several disadvantages. Tabular aggregate statistics and binning do not release individual observations for the user to analyze. Data swapping does not necessarily preserve relationships between variables.3 The disclosure risk of postrandomization for statistical disclosure control depends on the transition matrix, and its impact on joint distributions is unknown.
Encryption is another form of disclosure control. Homomorphic encryption permits operations on ciphertext (encrypted data) without decryption or access to the decryption key. The result of an operation can then be decrypted.9 Fully homomorphic encryption theoretically allows an arbitrary number of addition and multiplication operations to be performed on ciphertext and was first conceptualized by Rivest et al.10 A method of fully homomorphic encryption using ideal lattices which permits both addition and multiplication operations was first described by Gentry.11 Practically, this method is computationally burdensome for a large number of operations, and the number of operations must be fixed in advance to provide proper encryption.
Somewhat homomorphic schemes, in which a finite number and limited scope of operations can be performed, introduce noise to ciphertext, and after a number of operations the noise prohibits decryption. Bootstrapping, which is computationally costly, can then be used to reduce the noise in the ciphertext. Many somewhat homomorphic schemes have been developed; for a review of these and more efficient fully homomorphic schemes, see Silverberg.12
Leveled homomorphic schemes can perform operations up to a predetermined polynomial degree of complexity without bootstrapping to reduce noise; however, parameters for the encryption scheme must be carefully selected such that results of operations are correct.12 A practical implementation of leveled homomorphic encryption of logistic and Cox models on encrypted cardiovascular data is described by Bos et al.13 Graepel et al.14 demonstrated that some machine learning algorithms can be trained on encrypted data in small databases. Homomorphic encryption allows data to be securely stored and may permit predetermined operations by so-called untrusted users. The results can then be decrypted for these users without permitting them to access the true data. However, the user cannot view or manipulate ciphertext in the same manner, or with the same flexibility, as the true data.
Rubin15 first proposed fully synthetic data sets to control the disclosure of sensitive information. His multiple imputation method randomly and independently samples observations from a posterior distribution sampling frame based on parametric models to create multiple synthetic data sets. Then, a specified number of observations are randomly sampled to form a new data set. Multiple imputation for synthetic data creation is widely implemented with various extensions. Reiter16 demonstrated the use of different sampling frames in multiple imputation for synthetic data, including simple random sampling and two-stage cluster sampling. Abowd and Woodcock17 proposed a multiple imputation approach for partially synthetic data that replaced sensitive values, such as data values that are easily disclosed by an intruder. Little and Liu18 proposed selective multiple imputation of key variables by combining all sensitive records with some nonsensitive records using the posterior distribution.
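The core of this parametric synthesis idea can be sketched in a deliberately minimal univariate form. The example below assumes a normal model with the variance treated as known, and the `synthesize` helper is hypothetical; it is not the full multiple imputation machinery, only the draw-parameters-then-draw-observations pattern:

```python
import numpy as np

rng = np.random.default_rng(1)

# Original confidential sample (illustrative values only).
original = rng.normal(loc=10.0, scale=2.0, size=500)

def synthesize(data, n_synthetic, rng):
    """Draw one synthetic data set from a simple posterior-based model.

    With a flat prior and known variance, the posterior for the mean is
    approximately Normal(xbar, s / sqrt(n)); we draw a parameter value
    from it, then sample fresh observations from the implied model.
    """
    n = len(data)
    xbar, s = data.mean(), data.std(ddof=1)
    mu = rng.normal(xbar, s / np.sqrt(n))       # posterior draw for the mean
    return rng.normal(mu, s, size=n_synthetic)  # synthetic observations

# Multiple imputation releases several such independently drawn data sets,
# so that between-set variability reflects parameter uncertainty.
synthetic_sets = [synthesize(original, 500, rng) for _ in range(5)]
```

Because each released set is drawn from the fitted model rather than copied from the original records, no true observation appears in the output; the cost, as discussed below, is dependence on the parametric model being correct.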
Raghunathan et al.19 implemented an approximate Bayesian bootstrapping method. In this method, the original data are directly sampled to create a new data set; the released data may have a different distribution of observations but comprise observations exclusively from the original data set. Fienberg et al. used bootstrapping approaches for categorical data,20 where the empirical cumulative distribution function is computed and smoothed for sampling observations. Matthews and Harel used predictive means matching to generate synthetic observations,21 but this method selects values from the original data to place in the new data set.
An empirical study of synthetic data methods (logistic regression, multinomial logit, Bayesian bootstrapping, and linear regression, applied to individual variables according to variable type: continuous, binary, or categorical) was implemented using chained equations.22 Results demonstrated that the type of regression or modeling assumption used heavily impacted the quality of the estimates. Importantly, some of the estimated values obtained from the synthetic data deviated by a large percentage from those of the original data.
Nonparametric approaches for synthetic data attempt to avoid model mis-specification and distributional assumptions. A Bayesian hierarchical modeling approach generated fully synthetic data sets,23 but random effects were assumed to be multivariate normal, one step involved linear regression estimates, and the method is only applicable for continuous data. Drechsler24 created partially synthetic data sets for categorical variables using support vector machines. Woodcock and Benedetto25 used kernel density estimation to generate partially synthetic data sets to preserve characteristics for distributions with high skewness and had promising results for replicating moment characteristics across data sets. Reiter26 used classification and regression trees (CART) and Caiola and Reiter27 used random forests to create partially synthetic data sets.
These methods have significant limitations in the context of fully synthetic data. Multiple imputation uses a parametric model and may be invalid if the underlying distribution of the original data differs from the model. Matthews and Harel21 demonstrated that for binary data, categorical data, or continuous data that are not truly multivariate normal, using multivariate normal models and rounding significantly biases estimators. Approximate Bayesian bootstrapping results in a synthetic data set that is composed of original observations. Support vector machines do not use parametric assumptions, but, when tuned to classify well, may result in high disclosure risk.24 Kernel density estimates empirically seem to preserve higher order moments, but have only been tested for partially synthetic data and the user must specify a model for the transformation step.25 CART and random forest approaches were leveraged to create partially synthetic data only. In addition, variables are imputed sequentially using chained equations, such that dependency may be introduced and the imputation order can impact the imputation quality. For any chained equation approach, relationships between variables and the order of imputation must be specified, and these assumptions may impact synthesis quality.27
Our primary contribution to this body of literature is a new moment matching approach for the generation of fully synthetic data sets. This method uses mathematical optimization to select candidate and final observations for the synthetic data set. Unlike past work, this approach does not require distributional assumptions or parametric processes to generate synthetic data, and can create fully synthetic data sets with a prespecified number of records without restriction on the type of variable (e.g., continuous, categorical, and binary). The moments of the synthetic data are matched to the moments of the original data up to a specified order.
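The paper's actual linear and integer programming formulations are not reproduced here; the following toy stand-in only illustrates the underlying idea of selecting observations so that the sample moments of the synthetic set match those of the original data up to a specified order. The candidate pool, subset sizes, and the brute-force search (substituting for a proper integer program) are all illustrative assumptions:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Original confidential sample (illustrative) and its target moments
# up to fourth order.
original = rng.normal(5.0, 1.5, size=200)
ORDERS = (1, 2, 3, 4)
target = np.array([np.mean(original ** k) for k in ORDERS])

# Candidate pool of plausible synthetic values covering the support of
# the original data; none of these values is an original observation.
pool = rng.uniform(original.min(), original.max(), size=16)

def moment_gap(subset):
    """Total relative mismatch between a subset's moments and the target."""
    m = np.array([np.mean(subset ** k) for k in ORDERS])
    return np.sum(np.abs(m - target) / np.abs(target))

# Toy stand-in for the integer program: exhaustively pick the 8-point
# subset of the pool whose first four moments best match the original.
best = min(
    (np.array(c) for c in itertools.combinations(pool, 8)),
    key=moment_gap,
)
```

At realistic scale the exhaustive search is intractable, which is where the linear and integer programming models come in: they select the synthetic records subject to moment-matching constraints without enumerating subsets, and no distributional assumption is needed at any step.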