
Ton J. Cleophas and Aeilko H. Zwinderman, SPSS for Starters and 2nd Levelers, 10.1007/978-3-319-20600-4_19

19. Missing Data Imputation (35 Patients)

Ton J. Cleophas^{1, 2} and Aeilko H. Zwinderman^{2, 3}

(1) Department Medicine, Albert Schweitzer Hospital, Dordrecht, The Netherlands

(2) European College Pharmaceutical Medicine, Lyon, France

(3) Department Biostatistics, Academic Medical Center, Amsterdam, The Netherlands

1 General Purpose

In clinical research missing data are common. Compared to demographic research, clinical research generally produces smaller data files, so that a few missing values are more of a problem than they are in demographic files. As an example, a 35 patient data file of 3 variables consists of 3 × 35 = 105 values if the data are complete. With only 5 values missing (1 value missing per patient), 5 patients will not have complete data and are of little use for the analysis. This is not 5 % (5 of 105 values) but about 14 % (5 of 35 patients) of this small study population. An analysis of the remaining 86 % of patients is likely to lack the power to demonstrate the effects we wished to assess. This illustrates the necessity of data imputation.
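The arithmetic behind this example can be checked in a few lines. The following is only an illustrative sketch in Python (the variable names are ours, and it assumes, as the text does, that the 5 missing values fall in 5 different patients):

```python
n_patients = 35
n_variables = 3
n_missing = 5  # assumption from the text: one missing value in each of 5 patients

total_values = n_patients * n_variables       # 3 x 35 values in a complete file
share_of_values = n_missing / total_values    # fraction of all values that is missing
share_of_patients = n_missing / n_patients    # fraction of patients with incomplete data

print(f"{total_values} values, {share_of_values:.1%} of values missing")
print(f"{share_of_patients:.1%} of patients incomplete")
```

The share of incomplete patients is three times the share of missing values, because each missing value removes a whole patient (3 values) from a complete-case analysis.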

2 Schematic Overview of Type of Data File

3 Primary Scientific Question

Primary question: what is the effect of regression imputation and of multiple imputations on the sensitivity of testing in a study with missing data?

4 Data Example

The effects of an old laxative and of age on the efficacy of a novel laxative are studied. The data file with missing data is given below.

Outcome	Predictor 1	Predictor 2
Efficacy new laxative (stools/mth)	Efficacy old laxative (stools/mth)	Age (years)
24,00	8,00	25,00
30,00	13,00	30,00
25,00	15,00	25,00
35,00	10,00	31,00
39,00	9,00
30,00	10,00	33,00
27,00	8,00	22,00
14,00	5,00	18,00
39,00	13,00	14,00
42,00		30,00
41,00	11,00	36,00
38,00	11,00	30,00
39,00	12,00	27,00
37,00	10,00	38,00
47,00	18,00	40,00
	13,00	31,00
36,00	12,00	25,00
12,00	4,00	24,00
26,00	10,00	27,00
20,00	8,00	20,00
43,00	16,00	35,00
31,00	15,00	29,00
40,00	14,00	32,00
31,00		30,00
36,00	12,00	40,00
21,00	6,00	31,00
44,00	19,00	41,00
11,00	5,00	26,00
27,00	8,00	24,00
24,00	9,00	30,00
40,00	15,00
32,00	7,00	31,00
10,00	6,00	23,00
37,00	14,00	43,00
19,00	7,00	30,00

5 Regression Imputation

First we will perform a multiple linear regression analysis of the above data. For convenience the data file is in extras.springer.com, and is entitled “chapter19missingdata”. We will start by opening the data file in SPSS. For a linear regression the module Regression is required. It consists of at least ten different statistical models, such as linear modeling, curve estimation, binary logistic regression, ordinal regression etc. Here we will simply use the linear model.

Command:

Analyze....Regression....Linear....Dependent: Newlax....Independent(s): Bisacodyl, Age....click OK.

The software program will exclude the patients with missing data from the analysis. The analysis is given underneath.

Coefficients^a

Model		Unstandardized coefficients		Standardized coefficients	t	Sig.
		B	Std. error	Beta
1	(Constant)	,975	4,686		,208	,837
	Bisacodyl	1,890	,322	,715	5,865	,000
	age	,305	,180	,207	1,698	,101

^aDependent Variable: new lax

Using the cut-off level of p = 0,15 for statistical significance, both the efficacy of the old laxative and the patients’ age are significant predictors of the efficacy of the new laxative.
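The complete-case analysis that the software performs here can also be reproduced outside SPSS. The following is a minimal sketch in plain Python, assuming that the data listing in Sect. 4 matches the SPSS file; the helper `solve` and all variable names are illustrative and not part of SPSS. Patients with a missing value (`None`) are dropped, and the least-squares coefficients are obtained from the normal equations.

```python
# (new laxative, old laxative, age); None marks a missing value
DATA = [
    (24, 8, 25), (30, 13, 30), (25, 15, 25), (35, 10, 31), (39, 9, None),
    (30, 10, 33), (27, 8, 22), (14, 5, 18), (39, 13, 14), (42, None, 30),
    (41, 11, 36), (38, 11, 30), (39, 12, 27), (37, 10, 38), (47, 18, 40),
    (None, 13, 31), (36, 12, 25), (12, 4, 24), (26, 10, 27), (20, 8, 20),
    (43, 16, 35), (31, 15, 29), (40, 14, 32), (31, None, 30), (36, 12, 40),
    (21, 6, 31), (44, 19, 41), (11, 5, 26), (27, 8, 24), (24, 9, 30),
    (40, 15, None), (32, 7, 31), (10, 6, 23), (37, 14, 43), (19, 7, 30),
]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Listwise deletion: keep only patients without any missing value
complete = [row for row in DATA if None not in row]

# Design matrix with an intercept column, then the normal equations X'X beta = X'y
X = [(1.0, x1, x2) for _, x1, x2 in complete]
y = [row[0] for row in complete]
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
a, b, c = solve(xtx, xty)

print(f"n = {len(complete)}, a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

With the data as listed, 30 patients remain, and the three coefficients should come out close to the B-values in the table above (0,975, 1,890 and 0,305).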

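The regression equation is as follows, with y = efficacy of the new laxative, x₁ = efficacy of the old laxative (bisacodyl), and x₂ = age:

$\mathrm{y}=\mathrm{a}+\mathrm{b}\,{\mathrm{x}}_1+\mathrm{c}\,{\mathrm{x}}_2$

$\mathrm{y}=0,975+1,890\,{\mathrm{x}}_1+0,305\,{\mathrm{x}}_2$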

With this equation, the y-value and x₁-value of a patient are used to calculate a missing x₂-value. Similarly, the missing y- and x₁-values are calculated and imputed. The data file below contains the imputed values.

Newlax	Oldlax	Age
24,00	8,00	25,00
30,00	13,00	30,00
25,00	15,00	25,00
35,00	10,00	31,00
39,00	9,00	69,00
30,00	10,00	33,00
27,00	8,00	22,00
14,00	5,00	18,00
39,00	13,00	14,00
42,00	17,00	30,00
41,00	11,00	36,00
38,00	11,00	30,00
39,00	12,00	27,00
37,00	10,00	38,00
47,00	18,00	40,00
35,00	13,00	31,00
36,00	12,00	25,00
12,00	4,00	24,00
26,00	10,00	27,00
20,00	8,00	20,00
43,00	16,00	35,00
31,00	15,00	29,00
40,00	14,00	32,00
31,00	11,00	30,00
36,00	12,00	40,00
21,00	6,00	31,00
44,00	19,00	41,00
11,00	5,00	26,00
27,00	8,00	24,00
24,00	9,00	30,00
40,00	15,00	35,00
32,00	7,00	31,00
10,00	6,00	23,00
37,00	14,00	43,00
19,00	7,00	30,00
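The five imputed values in the data file above can be recomputed by solving the regression equation for whichever variable is missing. The following Python lines are a minimal sketch of that calculation; the function name `impute` is ours, and rounding to whole numbers is an assumption based on the imputed values shown in the data file.

```python
A, B, C = 0.975, 1.890, 0.305  # coefficients of the complete-case regression

def impute(y=None, x1=None, x2=None):
    """Fill in the single missing value of one patient from y = A + B*x1 + C*x2."""
    if y is None:
        y = A + B * x1 + C * x2
    elif x1 is None:
        x1 = (y - A - C * x2) / B
    elif x2 is None:
        x2 = (y - A - B * x1) / C
    # Rounding to whole numbers is assumed, to match the data file
    return round(y), round(x1), round(x2)

# The five incomplete patients: (new laxative, old laxative, age)
incomplete = [(39, 9, None), (42, None, 30), (None, 13, 31), (31, None, 30), (40, 15, None)]
imputed = [impute(*row) for row in incomplete]

for before, after in zip(incomplete, imputed):
    print(before, "->", after)
```

This gives ages of 69 and 35, old laxative values of 17 and 11, and a new laxative value of 35, as in the data file. Note that the imputed age of 69 years lies well outside the observed age range of 14 to 43 years.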

A multiple linear regression of the above data file, with the imputed data included, produced b-values (regression coefficients) equal to those of the non-imputed data file. However, the standard errors fell and, consequently, the sensitivity of testing increased, with the p-value for age falling from 0,101 to 0,005 (see the table in the next section).
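Why the b-values stay the same while the standard errors fall can be sketched with the standard least-squares formulas. These are textbook results and are not taken from this chapter. Here b_j stands for any of the coefficients a, b, c, n is the number of patients in the analysis, s² is the residual variance, ŷ_i is the value predicted by the regression equation, and X is the data matrix with a column of ones, x₁ and x₂.

$\mathrm{SE}\left({\mathrm{b}}_{\mathrm{j}}\right)=\sqrt{{\mathrm{s}}^2\,{\left[{\left({\mathrm{X}}^{\mathrm{T}}\mathrm{X}\right)}^{-1}\right]}_{\mathrm{jj}}}$

${\mathrm{s}}^2=\frac{\sum_{\mathrm{i}}{\left({\mathrm{y}}_{\mathrm{i}}-{\widehat{\mathrm{y}}}_{\mathrm{i}}\right)}^2}{\mathrm{n}-3}$

Values imputed from the regression equation lie, apart from rounding, on the fitted regression plane. They therefore add almost nothing to the sum of squared residuals and leave the coefficients virtually unchanged, while n rises from 30 to 35 (residual degrees of freedom from 27 to 32) and XᵀX grows. Both changes reduce the standard errors, although no new information has been observed.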

6 Multiple Imputations

Multiple imputations is probably a better device for missing data imputation than regression imputation. In order to perform the multiple imputation method, the SPSS add-on module “Missing Value Analysis” has to be used. First, the pattern of the missing data must be checked using the command “Analyze Pattern”. If the missing data are evenly distributed and no “islands” of missing data exist, the model will be appropriate. For the analysis, the statistical model Impute Missing Values in the module Multiple Imputations is required.

Command:

Analyze....Missing Value Analysis....Transform....Random Number Generators....Analyze....Multiple Imputations....Impute Missing Data....OK (the imputed data file must be given a new name, e.g., “study name imputed”).

The software program produces five or more files in which the missing values are replaced with simulated versions using the Monte Carlo method (see also Chaps. 27 and 50 for an explanation of the Monte Carlo method). In our example the variables are continuous and, thus, need no transformation.
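The idea of replacing one missing value with several simulated versions can be illustrated with a small Python sketch. This is not the algorithm that SPSS uses; it only shows the principle for one patient with a missing outcome. The coefficients are those of the complete-case regression in Sect. 5, while `RESIDUAL_SD`, the seed and the function name are hypothetical choices made for this illustration.

```python
import random

A, B, C = 0.975, 1.890, 0.305  # complete-case coefficients from Sect. 5
RESIDUAL_SD = 5.0              # hypothetical residual standard deviation, not from the chapter
M = 5                          # number of imputed data files

def draw_imputations(x1, x2, m=M, seed=1):
    """Draw m simulated outcome values: regression prediction plus random noise."""
    rng = random.Random(seed)  # fixed seed so that the draws are reproducible
    predicted = A + B * x1 + C * x2
    return [predicted + rng.gauss(0.0, RESIDUAL_SD) for _ in range(m)]

# Patient with a missing new laxative value (old laxative 13, age 31)
draws = draw_imputations(x1=13, x2=31)
print([round(d, 1) for d in draws])
```

Each of the five draws would go into a different imputed data file. The spread between the draws is what carries the uncertainty about the missing value into the later analysis.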

Command:

Split File....click OK.

If you subsequently run a usual linear regression of the summary of your “imputed” data files (commands as given above), the software will automatically produce pooled regression coefficients instead of the usual regression coefficients. In our example the multiple imputation method produced a much larger p-value for the predictor age than the regression imputation did, as demonstrated in the table below (p = 0,097 versus p = 0,005). The table also shows the result of testing after mean imputation and hot deck imputation, as reviewed in Chapter 3 of the e-book “Statistics on a Pocket Calculator Part 2”, Springer New York, 2012, by the same authors (B = regression coefficient, SE = standard error, t = t-value, Sig = p-value).

	Bisacodyl				Age
	B₁	SE₁	t	Sig	B₂	SE₂	t	Sig
Full data	1.82	0.29	6.3	0.0001	0.34	0.16	2.0	0.048
5 % Missing data	1.89	0.32	5.9	0.0001	0.31	0.19	1.7	0.101
Means imputation	1.82	0.33	5.6	0.0001	0.33	0.19	1.7	0.094
Hot deck imputation	1.77	0.31	5.7	0.0001	0.34	0.18	1.8	0.074
Regression imputation	1.89	0.25	7.6	0.0001	0.31	0.10	3.0	0.005
Multiple imputations	1.84	0.31	5.9	0.0001	0.32	0.19	1.7	0.097

The result of multiple imputations was, thus, less sensitive than that of regression imputation. Actually, the result was rather similar to that of mean and hot deck imputation. Why use it, then? The argument is that, with the multiple imputation method, the imputed values are not used as constructed real values, but rather as a device for representing missing data uncertainty. This approach is a safe and, probably, scientifically better alternative to the other methods.
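How pooling represents missing data uncertainty can be sketched with Rubin's rules, the commonly used formulas for combining the results of several imputed data files. The Python lines below are an illustration only: the five coefficients and standard errors are hypothetical numbers, not output from the chapter's data, and the function name `pool` is ours.

```python
import math
from statistics import mean, variance

def pool(estimates, std_errors):
    """Combine per-file coefficients and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                     # pooled regression coefficient
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                # variance between imputed files
    total = within + (1 + 1 / m) * between       # total variance
    return pooled, math.sqrt(total)

# Hypothetical results for the predictor age from five imputed data files
b_age = [0.28, 0.30, 0.32, 0.34, 0.36]
se_age = [0.18, 0.18, 0.18, 0.18, 0.18]

pooled_b, pooled_se = pool(b_age, se_age)
print(f"pooled B = {pooled_b:.2f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the standard error in any single imputed file, because the variation between the files is added to it. Regression imputation produces a single file and, therefore, has no such term.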

7 Conclusion

Regression imputation tends to overstate the certainty of the data testing. Multiple imputations is, probably, a better alternative to regression imputation. However, it is not in the basic SPSS program and requires the add-on module “Missing Value Analysis”.

8 Note

More background, theoretical, and mathematical information on missing data management is given in Statistics applied to clinical trials 5th edition, Chap. 22, Springer Heidelberg Germany, 2012, by the same authors.
