Model Selection and Multivariate Inference Using Data Multiply Imputed for Disclosure Limitation and Nonresponse
Abstract (Summary)
This thesis proposes some inferential methods for use with multiple
imputation for missing data and statistical disclosure limitation, and
describes an application of multiple imputation to protect data
confidentiality. A third component concerns model selection in random
effects models.
The use of multiple imputation to generate partially synthetic public
release files for confidential datasets has the potential to limit
unauthorized disclosure while allowing valid inferences to be made.
When confidential datasets contain missing values, it is natural to
use multiple imputation to handle the missing data simultaneously with
the generation of synthetic data. This is done in a two-stage process
so that the variability may be estimated properly. The combining rules
for data multiply imputed in this fashion differ from those developed
for multiple imputation in a single stage. Combining rules for scalar
estimands have been derived previously; here hypothesis tests for
multivariate components are derived.
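As a rough illustration of how two-stage combining differs from the single-stage case, the sketch below computes a point estimate and variance estimate for a scalar estimand from m nests of r synthetic datasets each. The function name, argument layout, and example numbers are hypothetical, and the variance formula is the previously derived scalar rule as it is commonly stated in the two-stage imputation literature; it is not the multivariate test derived in the thesis, which should be consulted for the exact results.

```python
from statistics import mean


def combine_two_stage(q, u):
    """Combine scalar estimates from two-stage multiple imputation.

    q[i][j] is the point estimate from synthetic dataset j nested within
    missing-data imputation i; u[i][j] is its estimated variance.
    Layout and names are illustrative assumptions, not from the thesis.
    """
    m = len(q)       # number of first-stage (missing-data) imputations
    r = len(q[0])    # number of second-stage (synthetic) datasets per nest
    if m < 2 or r < 2:
        raise ValueError("need at least two nests and two datasets per nest")

    nest_means = [mean(row) for row in q]
    q_bar = mean(nest_means)  # overall point estimate

    # Between-nest variance of the nest means.
    b = sum((qi - q_bar) ** 2 for qi in nest_means) / (m - 1)
    # Average within-nest variance of the point estimates.
    w = sum(
        (q[i][j] - nest_means[i]) ** 2 for i in range(m) for j in range(r)
    ) / (m * (r - 1))
    # Average of the within-dataset variance estimates.
    u_bar = mean([mean(row) for row in u])

    # Assumed scalar rule: T = (1 + 1/m) b - w / r + u_bar.
    # This can be negative in small samples; no adjustment is applied here.
    t = (1 + 1 / m) * b - w / r + u_bar
    return q_bar, t


# Made-up estimates: m = 2 nests, r = 2 synthetic datasets per nest.
q_example = [[1.0, 3.0], [5.0, 7.0]]
u_example = [[0.5, 0.5], [0.5, 0.5]]
q_hat, t_hat = combine_two_stage(q_example, u_example)
```

The subtraction of w / r is what distinguishes this from the usual single-stage rule: the within-nest variability is already reflected in the between-nest term, so it is removed rather than added.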
Longitudinal business data are widely desired by researchers, but
difficult to make available to the public because of confidentiality
constraints. An application of partially synthetic data to the U.S.
Census Longitudinal Business Database is described. This is a large,
complex economic census for which nearly the entire database must be
imputed in order for it to be considered for public release. The
methods used are described, and analytical results are presented for
synthetic data generated for a subgroup. Modifications to the multiple
imputation combining rules for population data are also developed.
Model selection is an area in which few methods have been developed
for use with multiply-imputed data. Careful consideration is given to
how Bayesian model selection can be conducted with multiply-imputed
data. The usual assumption of correspondence between the imputation
and analyst models is not amenable to model selection procedures.
Hence, the model selection procedure developed incorporates the
imputation model and assumes that the imputation model is known to the
analyst.
Lastly, a model selection problem outside the multiple imputation
context is addressed. A fully Bayesian approach for selecting fixed
and random effects in linear and logistic models is developed
utilizing a parameter-expanded stochastic search Gibbs sampling
algorithm to estimate the exact model-averaged posterior distribution.
This approach automatically identifies subsets of predictors having
nonzero fixed coefficients or nonzero random effects variance, while
allowing uncertainty in the model selection process.
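One common way to summarize uncertainty from a stochastic search sampler is through marginal inclusion probabilities. The sketch below assumes the sampler has already produced a list of 0/1 indicator vectors, one per retained draw, marking which effects are in the model; the function name and the example draws are hypothetical, and the sampler itself is not shown.

```python
def inclusion_probabilities(draws):
    """Estimate marginal posterior inclusion probabilities.

    draws is a list of equal-length 0/1 indicator vectors, one per
    retained MCMC draw (1 = effect included in the model at that draw).
    The input format is an illustrative assumption.
    """
    n = len(draws)
    p = len(draws[0])
    # Fraction of draws in which each candidate effect was included.
    return [sum(d[k] for d in draws) / n for k in range(p)]


# Made-up indicator draws for three candidate effects.
example_draws = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
probs = inclusion_probabilities(example_draws)
```

An effect with a probability near one is supported by most visited models, while intermediate values signal that the data do not clearly favor including or excluding it.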
Bibliographical Information:
Advisor: Reiter, Jerome P.
School: Duke University
School Location: USA - North Carolina
Source Type: Master's Thesis
Keywords: statistics
ISBN:
Date of Publication: 12/07/2007