© Springer International Publishing Switzerland 2016
Ton J. Cleophas and Aeilko H. Zwinderman, SPSS for Starters and 2nd Levelers, 10.1007/978-3-319-20600-4_19

19. Missing Data Imputation (35 Patients)

Ton J. Cleophas (1, 2) and Aeilko H. Zwinderman (2, 3)
(1) Department Medicine, Albert Schweitzer Hospital, Dordrecht, The Netherlands
(2) European College Pharmaceutical Medicine, Lyon, France
(3) Department Biostatistics, Academic Medical Center, Amsterdam, The Netherlands
 

1 General Purpose

In clinical research missing data are common. Compared with demographic research, clinical research generally produces smaller files, so that a few missing data are more of a problem than they are in demographic files. As an example, a 35 patient data file of 3 variables consists of 3 × 35 = 105 values if the data are complete. With only 5 values missing (1 value missing per patient), 5 patients will not have complete data and are rather useless for the analysis. This is not about 5 % (5/105) but about 14 % (5/35) of this small study population of 35 patients. An analysis of the remaining 86 % of the patients is likely to lack the power to demonstrate the effects we wished to assess. This illustrates the necessity of data imputation.
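The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes that no patient has more than one missing value.

```python
# Illustrative arithmetic for a small data file with scattered missing values.
n_patients = 35
n_variables = 3
n_missing_values = 5  # assumption: at most one missing value per patient

total_values = n_patients * n_variables
share_values_missing = n_missing_values / total_values
share_patients_incomplete = n_missing_values / n_patients

print(f"total values: {total_values}")
print(f"values missing: {share_values_missing:.1%}")
print(f"patients with incomplete data: {share_patients_incomplete:.1%}")
```

A small share of missing values thus translates into a roughly three times larger share of patients lost to a complete-case analysis.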

2 Schematic Overview of Type of Data File

[Figure: schematic overview of the type of data file]

3 Primary Scientific Question

Primary question: what is the effect of regression imputation and multiple imputations on the sensitivity of testing a study with missing data?

4 Data Example

The effects of an old laxative and of age on the efficacy of a novel laxative are studied. The data file with missing data is given below.
Outcome                   Predictor 1               Predictor 2
Efficacy new laxative     Efficacy old laxative     Age
(stools/mth)              (stools/mth)              (years)
24,00                     8,00                      25,00
30,00                     13,00                     30,00
25,00                     15,00                     25,00
35,00                     10,00                     31,00
39,00                     9,00                      (missing)
30,00                     10,00                     33,00
27,00                     8,00                      22,00
14,00                     5,00                      18,00
39,00                     13,00                     14,00
42,00                     (missing)                 30,00
41,00                     11,00                     36,00
38,00                     11,00                     30,00
39,00                     12,00                     27,00
37,00                     10,00                     38,00
47,00                     18,00                     40,00
(missing)                 13,00                     31,00
36,00                     12,00                     25,00
12,00                     4,00                      24,00
26,00                     10,00                     27,00
20,00                     8,00                      20,00
43,00                     16,00                     35,00
31,00                     15,00                     29,00
40,00                     14,00                     32,00
31,00                     (missing)                 30,00
36,00                     12,00                     40,00
21,00                     6,00                      31,00
44,00                     19,00                     41,00
11,00                     5,00                      26,00
27,00                     8,00                      24,00
24,00                     9,00                      30,00
40,00                     15,00                     (missing)
32,00                     7,00                      31,00
10,00                     6,00                      23,00
37,00                     14,00                     43,00
19,00                     7,00                      30,00

5 Regression Imputation

First we will perform a multiple linear regression analysis of the above data. For convenience the data file is available at extras.springer.com, and is entitled “chapter19missingdata”. We will start by opening the data file in SPSS. For a linear regression the module Regression is required. It consists of at least ten different statistical models, such as linear modeling, curve estimation, binary logistic regression, ordinal regression etc. Here we will simply use the linear model.
Command:
  • Analyze....Regression....Linear....Dependent: Newlax....Independent(s): Bisacodyl, Age....click OK.
The software program will exclude the patients with missing data from the analysis. The results of the analysis are given below.
Coefficients (a)

Model 1        Unstandardized coefficients     Standardized coefficients     t         Sig.
               B           Std. error          Beta
(Constant)     ,975        4,686                                             ,208      ,837
Bisacodyl      1,890       ,322                ,715                          5,865     ,000
age            ,305        ,180                ,207                          1,698     ,101

(a) Dependent variable: new lax
At a cut-off level of p = 0,15 for statistical significance, both the efficacy of the old laxative and the patients’ age are significant predictors of the efficacy of the new laxative.
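The exclusion of patients with missing data (complete-case analysis) before fitting the linear model can be sketched in plain Python. This is a minimal illustration and not the SPSS procedure itself: the function name fit_ols and the toy rows are hypothetical, and the toy data were constructed to follow y = 1 + 2·x1 + 3·x2 exactly so that the result is easy to check.

```python
def fit_ols(rows):
    """Least-squares fit of y = a + b*x1 + c*x2 using complete rows only.

    rows: iterable of (y, x1, x2) tuples; None marks a missing value.
    Returns ((a, b, c), number_of_rows_used).
    """
    # Complete-case step: drop every row that has a missing value.
    complete = [r for r in rows if None not in r]

    # Build the normal equations (X'X) beta = X'y with X = [1, x1, x2].
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for y, x1, x2 in complete:
        x = (1.0, x1, x2)
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]

    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, 4):
                aug[r][k] -= factor * aug[col][k]

    # Back substitution.
    beta = [0.0] * 3
    for i in reversed(range(3)):
        s = sum(aug[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (aug[i][3] - s) / aug[i][i]
    return tuple(beta), len(complete)


# Hypothetical toy data (not the chapter's file): y = 1 + 2*x1 + 3*x2.
toy_rows = [
    (6, 1, 1),
    (8, 2, 1),
    (9, 1, 2),
    (19, 3, 4),
    (None, 2, 3),   # missing outcome, excluded
    (10, None, 1),  # missing predictor, excluded
]
coefs, n_used = fit_ols(toy_rows)
print(n_used, [round(c, 3) for c in coefs])
```

Only four of the six toy rows enter the fit, which mirrors how the 35 patient file shrinks to 30 patients in the analysis above.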
The regression equation is as follows:
$$ y = a + b\,x_1 + c\,x_2 $$
$$ y = 0{,}975 + 1{,}890\,x_1 + 0{,}305\,x_2 $$
Using this equation, a missing x2-value is calculated from the patient’s y-value and x1-value. Similarly, the missing y- and x1-values are calculated and imputed. The data file below contains the imputed values.
Newlax     Oldlax     Age
24,00      8,00       25,00
30,00      13,00      30,00
25,00      15,00      25,00
35,00      10,00      31,00
39,00      9,00       69,00
30,00      10,00      33,00
27,00      8,00       22,00
14,00      5,00       18,00
39,00      13,00      14,00
42,00      17,00      30,00
41,00      11,00      36,00
38,00      11,00      30,00
39,00      12,00      27,00
37,00      10,00      38,00
47,00      18,00      40,00
35,00      13,00      31,00
36,00      12,00      25,00
12,00      4,00       24,00
26,00      10,00      27,00
20,00      8,00       20,00
43,00      16,00      35,00
31,00      15,00      29,00
40,00      14,00      32,00
31,00      11,00      30,00
36,00      12,00      40,00
21,00      6,00       31,00
44,00      19,00      41,00
11,00      5,00       26,00
27,00      8,00       24,00
24,00      9,00       30,00
40,00      15,00      35,00
32,00      7,00       31,00
10,00      6,00       23,00
37,00      14,00      43,00
19,00      7,00       30,00
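The five imputed values in the above data file can be reproduced by rearranging the regression equation for whichever variable is missing. The following Python sketch uses the coefficients from the Coefficients table; the function name impute_row is hypothetical, and the rounding to whole numbers is an assumption inferred from the imputed values shown.

```python
# Coefficients as reported in the chapter's regression table.
A, B, C = 0.975, 1.890, 0.305  # constant, old laxative (x1), age (x2)


def impute_row(y, x1, x2):
    """Fill the single missing value (None) by solving y = A + B*x1 + C*x2."""
    if y is None:
        y = A + B * x1 + C * x2
    elif x1 is None:
        x1 = (y - A - C * x2) / B
    elif x2 is None:
        x2 = (y - A - B * x1) / C
    # Assumption: imputed values are rounded to whole numbers.
    return round(y), round(x1), round(x2)


# The five incomplete patients from the data example (y, x1, x2).
incomplete = [
    (39, 9, None),
    (42, None, 30),
    (None, 13, 31),
    (31, None, 30),
    (40, 15, None),
]
for row in incomplete:
    print(row, "->", impute_row(*row))
```

Because every imputed value lies exactly on the fitted regression plane, the imputed patients add no residual scatter, which is one way to see why the standard errors shrink after this kind of imputation.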
A multiple linear regression of the above data file with the imputed data included produced b-values (regression coefficients) equal to those of the non-imputed data file, but the standard errors fell, and, consequently, the sensitivity of testing increased, with the p-value for age falling from 0,101 to 0,005 (see the table in the section Multiple Imputations).

6 Multiple Imputations

Multiple imputations is probably a better device for missing data imputation than regression imputation. In order to perform the multiple imputation method the SPSS add-on module “Missing Value Analysis” has to be used. First, the pattern of the missing data must be checked using the command “Analyze Pattern”. If the missing data are equally distributed and no “islands” of missing data exist, the model will be appropriate. For analysis the statistical model Impute Missing Values in the module Multiple Imputations is required.
Command:
  • Analyze….Missing Value Analysis….Transform….Random Number Generators ….Analyze.…Multiple Imputations….Impute Missing Data.…OK (the imputed data file must be given a new name e.g. “study name imputed”).
The software program produces five or more files in which the missing values are replaced with simulated versions using the Monte Carlo method (see also Chaps. 27 and 50 for an explanation of the Monte Carlo method). In our example the variables are continuous and, thus, need no transformation.
Command:
  • Split File….click OK.
If you subsequently run an ordinary linear regression on your “imputed” data files (commands as given above), the software will automatically produce pooled regression coefficients instead of the usual regression coefficients. In our example the multiple imputation method produced a much larger p-value for the predictor age than the regression imputation did, as demonstrated in the table below (p = 0,097 versus p = 0,005). The table also shows the results of testing after mean imputation and hot deck imputation, as reviewed in Chapter 3 of the e-book “Statistics on a Pocket Calculator Part 2”, Springer New York, 2012, from the same authors (B = regression coefficient, SE = standard error, t = t-value, Sig = p-value).
 
                          Bisacodyl                          Age
                          B1      SE1     t       Sig        B2      SE2     t       Sig
Full data                 1.82    0.29    6.3     0.0001     0.34    0.16    2.0     0.048
5 % missing data          1.89    0.32    5.9     0.0001     0.31    0.19    1.7     0.101
Means imputation          1.82    0.33    5.6     0.0001     0.33    0.19    1.7     0.094
Hot deck imputation       1.77    0.31    5.7     0.0001     0.34    0.18    1.8     0.074
Regression imputation     1.89    0.25    7.6     0.0001     0.31    0.10    3.0     0.005
Multiple imputations      1.84    0.31    5.9     0.0001     0.32    0.19    1.7     0.097
The result of multiple imputations was, thus, less sensitive than that of regression imputation. Actually, the result was rather similar to that of mean and hot deck imputation. Why use it, then? The argument is that, with the multiple imputation method, the imputed values are not used as constructed real values, but rather as a device for representing missing data uncertainty. This approach is a safe and, scientifically, probably better alternative to the other methods.
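How pooling carries the missing data uncertainty into the standard error can be illustrated with Rubin's rules, the usual way of combining estimates across imputed files. The sketch below assumes that the pooled output follows these rules; the function name pool_rubin and the five coefficient values are hypothetical and are not taken from the chapter's data.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed files using Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    # Within-imputation variance: average of the squared standard errors.
    within = sum(se ** 2 for se in std_errors) / m
    # Between-imputation variance: spread of the estimates across files.
    between = sum((b - pooled) ** 2 for b in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical age coefficients and standard errors from five imputed files.
b_age = [0.30, 0.34, 0.32, 0.31, 0.33]
se_age = [0.18, 0.18, 0.18, 0.18, 0.18]
pooled_b, pooled_se = pool_rubin(b_age, se_age)
print(round(pooled_b, 3), round(pooled_se, 4))
```

The pooled standard error is never smaller than the average within-file standard error, because the disagreement between the imputed files is added on top. Single regression imputation has no such term, which is consistent with its smaller standard errors in the table above.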

7 Conclusion

Regression imputation tends to overstate the certainty of the data testing. Multiple imputations is, probably, a better alternative to regression imputation. However, it is not in the basic SPSS program and requires the add-on module “Missing Value Analysis”.

8 Note

More background, theoretical, and mathematical information on missing data management is given in Statistics Applied to Clinical Trials, 5th edition, Chap. 22, Springer Heidelberg Germany, 2012, from the same authors.