1 General Purpose
In clinical research missing data are common. Compared with
demographic research, clinical research generally produces smaller
data files, so that a few missing values are more of a problem than
they are in demographic files. As an example, a 35 patient data file
of 3 variables consists of 3 × 35 = 105 values if the data are
complete. With only 5 values missing (1 value missing per patient),
5 patients will not have complete data and are of little use for the
analysis. This is not 5 % (5 of 105 values) but about 14 % (5 of 35
patients) of this small study population. An analysis of the
remaining 86 % of the patients is unlikely to be powerful enough to
demonstrate the effects we wished to assess. This illustrates the
necessity of data imputation.
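The counting argument can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and do not come from the data file.

```python
# Illustration of the counting argument; all names are hypothetical.
n_patients = 35
n_variables = 3
n_missing_values = 5  # assumption from the text: one missing value per affected patient

n_values = n_patients * n_variables                       # values in a complete file
share_missing_values = n_missing_values / n_values         # share of all values
share_incomplete_patients = n_missing_values / n_patients  # share of patients lost

print(f"values in complete file: {n_values}")
print(f"missing values: {share_missing_values:.1%}")
print(f"incomplete patients: {share_incomplete_patients:.1%}")
```

The share of patients lost to a complete-case analysis is three times the share of missing values, because each missing value removes a whole patient with three values.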
2 Schematic Overview of Type of Data File

3 Primary Scientific Question
Primary question: what is the effect of
regression imputation and of multiple imputations on the sensitivity
of testing in a study with missing data?
4 Data Example
The effects of an old laxative and of
age on the efficacy of a novel laxative are studied. The data file
with missing data is given underneath.
| Outcome: efficacy new laxative (stools/mth) | Predictor 1: efficacy old laxative (stools/mth) | Predictor 2: age (years) |
|---|---|---|
| 24,00 | 8,00 | 25,00 |
| 30,00 | 13,00 | 30,00 |
| 25,00 | 15,00 | 25,00 |
| 35,00 | 10,00 | 31,00 |
| 39,00 | 9,00 | |
| 30,00 | 10,00 | 33,00 |
| 27,00 | 8,00 | 22,00 |
| 14,00 | 5,00 | 18,00 |
| 39,00 | 13,00 | 14,00 |
| 42,00 | | 30,00 |
| 41,00 | 11,00 | 36,00 |
| 38,00 | 11,00 | 30,00 |
| 39,00 | 12,00 | 27,00 |
| 37,00 | 10,00 | 38,00 |
| 47,00 | 18,00 | 40,00 |
| | 13,00 | 31,00 |
| 36,00 | 12,00 | 25,00 |
| 12,00 | 4,00 | 24,00 |
| 26,00 | 10,00 | 27,00 |
| 20,00 | 8,00 | 20,00 |
| 43,00 | 16,00 | 35,00 |
| 31,00 | 15,00 | 29,00 |
| 40,00 | 14,00 | 32,00 |
| 31,00 | | 30,00 |
| 36,00 | 12,00 | 40,00 |
| 21,00 | 6,00 | 31,00 |
| 44,00 | 19,00 | 41,00 |
| 11,00 | 5,00 | 26,00 |
| 27,00 | 8,00 | 24,00 |
| 24,00 | 9,00 | 30,00 |
| 40,00 | 15,00 | |
| 32,00 | 7,00 | 31,00 |
| 10,00 | 6,00 | 23,00 |
| 37,00 | 14,00 | 43,00 |
| 19,00 | 7,00 | 30,00 |
5 Regression Imputation
First we will perform a multiple linear
regression analysis of the above data. For convenience the data
file is available at extras.springer.com, and is entitled
“chapter19missingdata”. We will start by opening the data file in
SPSS. For a linear regression the module Regression is required. It
consists of at least ten different statistical models, such as
linear modeling, curve estimation, binary logistic regression, and
ordinal regression. Here we will simply use the linear
model.
Command:
- Analyze....Regression....Linear....Dependent: Newlax....Independent(s): Bisacodyl, Age....click OK.
The software program will exclude the
patients with missing data from the analysis. The analysis is given
underneath.
Coefficients

| Model | Unstandardized coefficients: B | Unstandardized coefficients: Std. error | Standardized coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|
| 1 (Constant) | ,975 | 4,686 | | ,208 | ,837 |
| Bisacodyl | 1,890 | ,322 | ,715 | 5,865 | ,000 |
| age | ,305 | ,180 | ,207 | 1,698 | ,101 |
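The exclusion of incomplete patients (listwise deletion) followed by a linear regression can also be sketched outside SPSS. The Python sketch below is not the SPSS implementation: it solves the ordinary least squares normal equations directly, using only the standard library. The data are transcribed from the data file in section 4, and the column in which each missing value sits is an assumption inferred from the imputed data file in section 5. The function names `solve` and `ols` are hypothetical.

```python
# Minimal sketch: listwise deletion followed by ordinary least squares.
# Rows are (newlax, oldlax, age); None marks a missing value (positions assumed).
data = [
    (24, 8, 25), (30, 13, 30), (25, 15, 25), (35, 10, 31), (39, 9, None),
    (30, 10, 33), (27, 8, 22), (14, 5, 18), (39, 13, 14), (42, None, 30),
    (41, 11, 36), (38, 11, 30), (39, 12, 27), (37, 10, 38), (47, 18, 40),
    (None, 13, 31), (36, 12, 25), (12, 4, 24), (26, 10, 27), (20, 8, 20),
    (43, 16, 35), (31, 15, 29), (40, 14, 32), (31, None, 30), (36, 12, 40),
    (21, 6, 31), (44, 19, 41), (11, 5, 26), (27, 8, 24), (24, 9, 30),
    (40, 15, None), (32, 7, 31), (10, 6, 23), (37, 14, 43), (19, 7, 30),
]

# Listwise deletion: keep only patients with all three values present.
complete = [row for row in data if None not in row]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows):
    """Return [constant, b_oldlax, b_age] from the normal equations X'X b = X'y."""
    X = [[1.0, old, age] for _, old, age in rows]
    y = [new for new, _, _ in rows]
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(3)] for i in range(3)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(3)]
    return solve(xtx, xty)


coef = ols(complete)
print(len(complete), [round(c, 3) for c in coef])
```

The printed coefficients can be compared with the B column of the SPSS table above; any difference would point to a transcription problem in the data rather than to a different method.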
Using a cut-off level of p = 0,15
for statistical significance, both the efficacy of the old laxative
and the patients’ age are significant predictors of the efficacy of
the new laxative.
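The regression equation, with the coefficients taken from the above table, is as follows:
y = 0,975 + 1,890 x1 + 0,305 x2
where y is the efficacy of the new laxative, x1 the efficacy of the
old laxative (bisacodyl), and x2 the patient's age.
Using this equation, we use the y-value and x1-value to
calculate the missing x2-value. Similarly, the missing
y- and x1-values are calculated and imputed. The
underneath data file has the imputed values.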


| Newlax | Oldlax | Age |
|---|---|---|
| 24,00 | 8,00 | 25,00 |
| 30,00 | 13,00 | 30,00 |
| 25,00 | 15,00 | 25,00 |
| 35,00 | 10,00 | 31,00 |
| 39,00 | 9,00 | 69,00 |
| 30,00 | 10,00 | 33,00 |
| 27,00 | 8,00 | 22,00 |
| 14,00 | 5,00 | 18,00 |
| 39,00 | 13,00 | 14,00 |
| 42,00 | 17,00 | 30,00 |
| 41,00 | 11,00 | 36,00 |
| 38,00 | 11,00 | 30,00 |
| 39,00 | 12,00 | 27,00 |
| 37,00 | 10,00 | 38,00 |
| 47,00 | 18,00 | 40,00 |
| 35,00 | 13,00 | 31,00 |
| 36,00 | 12,00 | 25,00 |
| 12,00 | 4,00 | 24,00 |
| 26,00 | 10,00 | 27,00 |
| 20,00 | 8,00 | 20,00 |
| 43,00 | 16,00 | 35,00 |
| 31,00 | 15,00 | 29,00 |
| 40,00 | 14,00 | 32,00 |
| 31,00 | 11,00 | 30,00 |
| 36,00 | 12,00 | 40,00 |
| 21,00 | 6,00 | 31,00 |
| 44,00 | 19,00 | 41,00 |
| 11,00 | 5,00 | 26,00 |
| 27,00 | 8,00 | 24,00 |
| 24,00 | 9,00 | 30,00 |
| 40,00 | 15,00 | 35,00 |
| 32,00 | 7,00 | 31,00 |
| 10,00 | 6,00 | 23,00 |
| 37,00 | 14,00 | 43,00 |
| 19,00 | 7,00 | 30,00 |
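The five imputed values in the above data file can be reproduced from the coefficients of the SPSS regression output (constant 0,975; bisacodyl 1,890; age 0,305). The Python sketch below is an illustration under two assumptions: the function names are hypothetical, and which value is missing for each incomplete patient is inferred by comparing the original and the imputed data files.

```python
# Coefficients from the SPSS coefficients table (decimal commas written as points).
B0, B1, B2 = 0.975, 1.890, 0.305  # constant, old laxative (x1), age (x2)


def impute_y(x1, x2):
    """Predict the outcome from both predictors."""
    return B0 + B1 * x1 + B2 * x2


def impute_x1(y, x2):
    """Solve the regression equation for the old-laxative value."""
    return (y - B0 - B2 * x2) / B1


def impute_x2(y, x1):
    """Solve the regression equation for age."""
    return (y - B0 - B1 * x1) / B2


# The five incomplete patients as (y, x1, x2); None marks the missing value.
incomplete = [
    (39, 9, None), (42, None, 30), (None, 13, 31), (31, None, 30), (40, 15, None),
]

imputed = []
for y, x1, x2 in incomplete:
    if y is None:
        y = round(impute_y(x1, x2))
    elif x1 is None:
        x1 = round(impute_x1(y, x2))
    else:
        x2 = round(impute_x2(y, x1))
    imputed.append((y, x1, x2))

print(imputed)
# [(39, 9, 69), (42, 17, 30), (35, 13, 31), (31, 11, 30), (40, 15, 35)]
```

Each imputed value lies exactly on the fitted regression plane (apart from rounding), which is why imputing in this way adds no scatter to the data.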
A multiple linear regression of the
above data file with the imputed data included produced b-values
(regression coefficients) equal to those of the non-imputed data
file, but the standard errors fell and, consequently, the sensitivity
of testing increased, with the p-value of the predictor age falling
from 0,101 to 0,005 (see the summary table in the next section).
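Why a smaller standard error raises the sensitivity of testing follows from the test statistic, t = B / SE. The short Python sketch below uses the B and SE values for the predictor age as printed in the summary table of section 6. Those values are rounded to two decimals, so the ratios only approximate the t-values printed there (1.7 and 3.0). The dictionary keys are labels chosen here.

```python
# (B, SE) for the predictor age, as printed in the summary table (rounded values).
results = {
    "complete cases only": (0.31, 0.19),
    "regression imputation": (0.31, 0.10),
}

# The t-value is the regression coefficient divided by its standard error.
t_values = {name: b / se for name, (b, se) in results.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

With an unchanged B, roughly halving the SE roughly doubles t, which is what moves the p-value from 0,101 to 0,005.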
6 Multiple Imputations
Multiple imputations is probably a
better device for missing data imputation than regression
imputation. In order to perform the multiple imputation method, the
SPSS add-on module “Missing Value Analysis” has to be used. First,
the pattern of the missing data must be checked using the command
“Analyze Pattern”. If the missing data are evenly distributed and
no “islands” of missing data exist, the model will be appropriate.
For the analysis, the statistical model Impute Missing Values in the
module Multiple Imputations is required.
Command:
- Analyze….Missing Value Analysis….Transform….Random Number Generators ….Analyze.…Multiple Imputations….Impute Missing Data.…OK (the imputed data file must be given a new name e.g. “study name imputed”).
The software program produces five or more files
in which the missing values are replaced
with simulated versions using the Monte Carlo method (see also
Chaps. 27 and 50 for an explanation of the Monte Carlo
method). In our example the variables are continuous and, thus,
need no transformation.
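The idea of replacing one missing value with several simulated versions can be illustrated with a small Python sketch. This is not the algorithm used by SPSS. It is a simplified stochastic regression imputation that adds a random residual to the regression prediction, and it ignores the uncertainty of the regression coefficients themselves, which a full multiple imputation procedure also takes into account. `RESIDUAL_SD` and the seed are hypothetical values chosen for illustration only.

```python
import random

# Coefficients from the SPSS coefficients table (decimal commas written as points).
B0, B1, B2 = 0.975, 1.890, 0.305
RESIDUAL_SD = 5.0   # hypothetical residual standard deviation, not from the chapter
N_IMPUTATIONS = 5   # the text mentions five or more imputed files

rng = random.Random(2012)  # fixed seed so that the sketch is reproducible


def draw_outcome(oldlax, age):
    """Return one simulated outcome: regression prediction plus random residual."""
    prediction = B0 + B1 * oldlax + B2 * age
    return prediction + rng.gauss(0.0, RESIDUAL_SD)


# Patient with a missing outcome: old laxative 13 stools/mth, age 31 years.
draws = [draw_outcome(13, 31) for _ in range(N_IMPUTATIONS)]
print([round(d, 1) for d in draws])
```

Each of the five draws would go into a separate imputed file, so that the spread between the files reflects how uncertain the missing value is.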
Command:
- Split File….click OK.
If you subsequently run a usual
linear regression of the summary of your “imputed” data files
(commands as given above), the software will automatically
produce pooled regression coefficients instead of the usual
regression coefficients. In our example the multiple imputation
method produced a much larger p-value for the predictor age than
the regression imputation did, as demonstrated in the underneath
table (p = 0,097 versus p = 0,005). The underneath table also shows
the results of testing after mean imputation and hot deck imputation,
as reviewed in Chapter 3 of the e-book “Statistics on a
Pocket Calculator Part 2”, Springer New York, 2012, from the same
authors (B = regression coefficient, SE = standard error,
t = t-value, Sig = p-value).
| Method | B1 bisacodyl | SE1 bisacodyl | t | Sig | B2 age | SE2 age | t | Sig |
|---|---|---|---|---|---|---|---|---|
| Full data | 1.82 | 0.29 | 6.3 | 0.0001 | 0.34 | 0.16 | 2.0 | 0.048 |
| 5 % Missing data | 1.89 | 0.32 | 5.9 | 0.0001 | 0.31 | 0.19 | 1.7 | 0.101 |
| Means imputation | 1.82 | 0.33 | 5.6 | 0.0001 | 0.33 | 0.19 | 1.7 | 0.094 |
| Hot deck imputation | 1.77 | 0.31 | 5.7 | 0.0001 | 0.34 | 0.18 | 1.8 | 0.074 |
| Regression imputation | 1.89 | 0.25 | 7.6 | 0.0001 | 0.31 | 0.10 | 3.0 | 0.005 |
| Multiple imputations | 1.84 | 0.31 | 5.9 | 0.0001 | 0.32 | 0.19 | 1.7 | 0.097 |
The result of multiple imputations
was, thus, less sensitive than that of regression imputation.
Actually, the result was rather similar to that of mean and hot
deck imputation. Why use it, then? The argument is that, with
the multiple imputation method, the imputed values are not used as
constructed real values, but rather as a device for representing
missing data uncertainty. This approach is a safe and,
scientifically, probably better alternative to the other methods.
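How missing data uncertainty ends up in a pooled result can be illustrated with Rubin's rules, the standard way of combining estimates from several imputed files. The Python sketch below assumes that the pooling follows these rules. The function name `pool` and the five coefficients and standard errors are hypothetical illustration values, not output from the chapter's data.

```python
from math import sqrt
from statistics import mean, variance


def pool(estimates, standard_errors):
    """Pool one coefficient over m imputed data files using Rubin's rules."""
    m = len(estimates)
    pooled_b = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                        # variance across imputations
    total = within + (1 + 1 / m) * between               # total variance
    return pooled_b, sqrt(total)


# Hypothetical age coefficients and standard errors from five imputed files.
b_age = [0.30, 0.32, 0.34, 0.31, 0.33]
se_age = [0.18, 0.18, 0.18, 0.18, 0.18]

pooled_b, pooled_se = pool(b_age, se_age)
print(f"pooled B = {pooled_b:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than the standard error from any single imputed file, because the disagreement between the files is added to the variance. Regression imputation has no such term, which is consistent with its much smaller standard error in the above table.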
7 Conclusion
Regression imputation tends to
overstate the certainty of the data testing. Multiple imputations
is, probably, a better alternative to regression imputation.
However, it is not in the basic SPSS program and requires the
add-on module “Missing Value Analysis”.
8 Note
More background, theoretical, and
mathematical information on missing data management is given in
Statistics applied to clinical trials 5th edition, Chap.
22, Springer Heidelberg Germany,
2012, from the same authors.