
Estimation from sparse, multiply-imputed, multiway tables

Author

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: CPS Abstract

Keywords: missing-data, multiple imputation

Abstract

Multiple imputation is a standard method for dealing with missing values in data. Frequentist analysis of multiply imputed datasets is usually straightforward, using well-known formulas for combining estimates from each of the multiply imputed datasets. However, the normality assumption that underpins the combining formulas is suspect for the analysis of multiway tables with many small proportions or counts. As a result, applying the standard combining formulas to multiply imputed, sparse, multiway tabulations can yield confidence intervals with markedly less than nominal coverage. In some cases, lower limits for proportions and counts may be negative. When cross-classified by several categorical variables, even large datasets can yield multiway tables that are sparse in some areas of the table. This is particularly common when one or more dimensions of the table identify numerically small population groups, such as smaller ethnic groups.
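For reference, the combining formulas in question are presumably those commonly known as Rubin's rules; the abstract does not name them, so this is an assumption. The notation below is introduced here for illustration: $\hat{Q}_m$ is the estimate of a cell proportion (or count) from the $m$-th of $M$ imputed datasets and $U_m$ is its estimated complete-data variance.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m
  && \text{(pooled estimate)} \\
\bar{U} &= \frac{1}{M}\sum_{m=1}^{M}U_m
  && \text{(within-imputation variance)} \\
B &= \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)} \\
T &= \bar{U}+\Bigl(1+\frac{1}{M}\Bigr)B
  && \text{(total variance)} \\
\nu &= (M-1)\Bigl(1+\frac{\bar{U}}{(1+1/M)\,B}\Bigr)^{2}
  && \text{(degrees of freedom)} \\
\bar{Q} &\pm t_{\nu,\,1-\alpha/2}\sqrt{T}
  && \text{(confidence interval)}
\end{align*}
```

Because the interval is symmetric about $\bar{Q}$ on the original scale, its lower limit falls below zero whenever $\bar{Q} < t_{\nu,\,1-\alpha/2}\sqrt{T}$, which can readily happen for cells with very small proportions or counts.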

Multiway tables are a very common statistical output. For example, national statistical institutes typically produce a very large number of multiway tables as a primary means of publishing results of censuses, surveys and other data collections. In the presence of missing data, even census counts become uncertain, and it is therefore important to develop methodology that provides accurate confidence intervals from sparse, multiply imputed, multiway tables. In this paper we develop a straightforward approach to the analysis of sparse, multiply imputed, multiway tables that slightly smooths cell probabilities using a Dirichlet prior in conjunction with a logit transformation to provide complete-data statistics better suited to the application of the standard multiple-imputation combining formulas. Back transformation of confidence limits for the pooled logit-transformed probabilities yields confidence limits for cell probabilities and counts. Negative confidence limits are avoided. The methodology accommodates both general multiway tables, where all dimensions are viewed symmetrically, and conditional tables, where interest focuses on variation of some dimensions over levels of other dimensions (e.g., prevalence of smoking by age, sex and level of educational attainment). We present applications to census and survey data. The latter requires some modification of the basic framework. We also provide simulation evidence in support of the proposed methodology and provide comparisons with a fully Bayesian approach to analysis of multiply imputed data, which does not require the use of the multiple-imputation combining formulas and the attendant normality assumptions.
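The general idea of smoothing, transforming, pooling and back-transforming can be illustrated with a minimal Python sketch for a single cell. It is not the authors' implementation and rests on several assumptions not stated in the abstract: a symmetric Dirichlet prior with a hypothetical weight of 0.5 per cell, a delta-method variance for the logit under simple multinomial sampling, a fixed table total across imputations, and a normal quantile used in place of the t reference distribution of the usual combining rules. The function names `smoothed_logit` and `pool_cell` are illustrative only.

```python
import math
from statistics import NormalDist, mean, variance


def smoothed_logit(count, total, n_cells, prior=0.5):
    """Logit of a Dirichlet-smoothed cell probability and its approximate variance.

    prior is a hypothetical per-cell Dirichlet weight (an assumption).
    """
    n_eff = total + n_cells * prior
    p = (count + prior) / n_eff          # strictly inside (0, 1) when n_cells >= 2
    theta = math.log(p / (1.0 - p))
    # Delta-method variance of logit(p), assuming simple multinomial sampling.
    var = 1.0 / (n_eff * p * (1.0 - p))
    return theta, var


def pool_cell(counts, total, n_cells, prior=0.5, level=0.95):
    """Pool one cell across M imputed tables on the logit scale.

    counts holds the cell count from each imputed table (M >= 2).
    Returns (estimate, lower, upper) on the probability scale.
    """
    m = len(counts)
    stats = [smoothed_logit(c, total, n_cells, prior) for c in counts]
    thetas = [theta for theta, _ in stats]
    q_bar = mean(thetas)                         # pooled logit
    u_bar = mean(var for _, var in stats)        # within-imputation variance
    b = variance(thetas)                         # between-imputation variance (divisor M-1)
    t_var = u_bar + (1.0 + 1.0 / m) * b          # total variance
    # Simplification: normal quantile instead of the t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * math.sqrt(t_var)

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Back-transform; limits necessarily lie strictly between 0 and 1.
    return expit(q_bar), expit(q_bar - half), expit(q_bar + half)


# Example: a sparse cell observed in five imputed versions of a 20-cell table.
est, lo, hi = pool_cell([0, 1, 0, 2, 1], total=1000, n_cells=20)
count_limits = (lo * 1000, hi * 1000)  # limits for the cell count
```

Because the interval is formed on the logit scale and then mapped back through the inverse logit, the limits for the probability, and hence for the count, cannot be negative, even for a cell that is empty in some imputed tables.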