© Springer International Publishing Switzerland 2015
Lawrence M. Friedman, Curt D. Furberg, David L. DeMets, David M. Reboussin and Christopher B. Granger, Fundamentals of Clinical Trials, DOI 10.1007/978-3-319-18539-2_17

17. Statistical Methods Used in Interim Monitoring

Lawrence M. Friedman1, Curt D. Furberg2, David L. DeMets3, David M. Reboussin4 and Christopher B. Granger5
(1)
North Bethesda, MD, USA
(2)
Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
(3)
Department Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
(4)
Department of Biostatistics, Wake Forest School of Medicine, Winston-Salem, NC, USA
(5)
Department of Medicine, Duke University, Durham, NC, USA
 
The original version of this chapter was revised. An erratum can be found at DOI 10.1007/978-3-319-18539-2_23
In Chap. 16, we discussed the administrative structure for conducting interim analyses of data quality and of outcome data for benefit and potential harm to trial participants. Although statistical approaches for interim analyses may have design implications, we have deferred the details until this chapter because these methods focus on monitoring accumulating data. Even if sequential methods were not considered during the design of the trial, they can still be used to assist in data monitoring or decision-making. In this chapter, we review statistical methods for sequential analysis that are currently available and used for monitoring accumulating data in a clinical trial. These methods help in evaluating whether the interim data are so convincing that the trial should be terminated early for benefit, harm, or futility, or whether it should be continued to its planned termination. No single statistical test or monitoring procedure ought to be used as a strict rule for decision-making, but rather as one piece of evidence to be integrated with the totality of evidence [1–6]. Therefore, it is difficult to make a single recommendation about which method should be used. However, the following methods, when applied appropriately, can be useful guides in the decision-making process.
Classical sequential methods, a modification generally referred to as group sequential methods, and curtailed testing procedures are discussed below in some detail; other approaches are also briefly considered. Classical sequential methods are given more mathematical attention in several articles and texts, which can be referred to for further detail [7–20].

Fundamental Point

Although many statistical techniques are available to assist in monitoring, none of them should be used as the sole basis in the decision to stop or continue the trial.

Classical Sequential Methods

The aim of the classical sequential design is to minimize the number of participants that must be entered into a study. The decision to continue to enroll participants depends on results from those already entered. Most of these sequential methods assume that the outcome of the response variable is known in a short time relative to the duration of the trial. These methods are therefore applicable to many trials of acute illness. For studies involving chronic diseases, classical sequential methods have not been as useful. Detailed discussions of classical sequential methods are given, for example, by Armitage [20], Whitehead [18], and Wald [16].
The classical sequential analysis method as originally developed by Wald [16] and applied to the clinical trial by others such as Armitage [8, 9, 20] involves repeated testing of data in a single experiment. The method assumes that the only decision to be made is whether the trial should continue or be terminated because one of the groups is responding significantly better than the other. This classical sequential decision rule is called an "open plan" by Armitage [20] because there is no guarantee of when a decision to terminate will be reached. Strict adherence to the "open plan" would mean that the study could not have a fixed sample size. Very few clinical trials use the "open" or classical sequential design. The method also requires the data to be paired, one observation from each group. In many instances, pairing of participants is not appealing because the paired participants may be very different and not "well matched" on important prognostic variables. If stratification is attempted in order to obtain better matched pairs, each stratum with an odd number of participants would have one unpaired participant. Furthermore, the requirement to monitor the data after every pair may not be possible for many clinical trials. Silverman and colleagues [21] used an "open plan" in a trial of the effects of humidity on survival in infants with low birth weight. At the end of 36 months, 181 pairs of infants had been enrolled; 52 of the pairs had a discrepant outcome. Nine infants were excluded because they were unmatched, and 16 pairs were excluded because of a mismatch. The study had to be terminated without a clear decision because it was no longer feasible to continue the trial. This study illustrates the difficulties inherent in applying the classical sequential design to clinical trials.
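The mechanics of Wald's sequential test can be sketched in a few lines. The fragment below is an illustrative sequential probability ratio test on paired preferences (1 if the intervention member of a pair fares better, 0 otherwise); the design values p1, p2, alpha, and beta are hypothetical choices for illustration, not values from the trials cited above.

```python
from math import log

def sprt_paired(preferences, p1=0.5, p2=0.75, alpha=0.05, beta=0.10):
    """Wald's sequential probability ratio test on paired preferences.

    Tests H0: P(intervention member of a pair does better) = p1 against
    H1: p = p2, continuing while the log likelihood ratio stays between
    Wald's approximate boundaries."""
    upper = log((1 - beta) / alpha)   # crossing -> reject H0
    lower = log(beta / (1 - alpha))   # crossing -> accept H0
    llr = 0.0
    for n, x in enumerate(preferences, start=1):
        # log likelihood-ratio contribution of one pair
        llr += x * log(p2 / p1) + (1 - x) * log((1 - p2) / (1 - p1))
        if llr >= upper:
            return ("reject H0", n)
        if llr <= lower:
            return ("accept H0", n)
    return ("continue", len(preferences))
```

With a strong run of preferences for the intervention the test stops after a handful of pairs; with ambiguous data it keeps going, which is exactly the "open plan" behavior described above: there is no guaranteed stopping point, only data-driven boundaries.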
Armitage [8] introduced the restricted or "closed" sequential design to assure that a maximum limit is imposed on the number of participants (2N) to be enrolled. As with the "open plan," the data must be paired, using one observation from each study group. Criteria for early termination and for rejection of the hypothesis of no treatment effect are determined so that the design has specified levels of significance and power (α and 1 − β). This design was used in a comparison of two interventions in patients with ulcerative colitis [22]. In that trial, the boundary for rejecting no treatment effect was crossed, demonstrating short-term clinical benefit of corticosteroids over sulphasalazine therapy. The closed design was also used in an acute leukemia trial comparing 6-mercaptopurine with placebo (CALGB) [23]. This trial was terminated early, with the statistic comparing remission rates crossing the sequential boundary for benefit after 21 pairs of patients.
Another solution to the repeated testing problem, called "repeated significance tests," was proposed by McPherson and Armitage [24] and also described by Armitage [20]. Although different theoretical assumptions are used, this approach has features similar to the restricted sequential model. That is, the observed data must be paired, and the maximum number of pairs to be considered can be fixed. Other modifications to the Armitage restricted plan [25–27] have also been proposed. This methodology plays an important role in a method to be described below, referred to as group sequential design.
The methods described above can in some circumstances be applied to interim analyses of censored survival data [25, 28–36]. If participants simultaneously enter a clinical trial and there is no loss to follow-up, information from interim analyses is said to be "progressively censored." Sequential methods for this situation have been developed using, for example, modified rank statistics. In fact, most participants are not entered into a trial simultaneously, but in a staggered fashion. That is, participants enter over a period of time after which events of interest occur, subject to an independent censoring process. The log-rank statistic, described in Chap. 15, may also be used to monitor in this situation.
The classical sequential approach has not been widely used, even in clinical trials where the time to the event is known almost immediately. One major reason is that for many clinical trials, if the data are monitored by a committee with regularly scheduled meetings, it is neither feasible nor necessary for ethical reasons to perform an analysis after every pair of outcomes. In addition, classical sequential boundaries require an alternative hypothesis to be specified, a feature not demanded by conventional statistical tests for the rejection of the null hypothesis.

Group Sequential Methods

Because of limitations with classical sequential methods, other approaches to the repeated testing problem have been proposed. Ad hoc rules have been suggested that attempt to ensure a conservative interpretation of interim results. One such method is to use a critical value of 2.6 at each interim look as well as in the final analyses [1]. Another approach [37, 38] referred to as the Haybittle–Peto procedure, favors using a large critical value, such as Z i  = +3.0, for all interim tests (i < K). Then any adjustment needed for repeated testing at the final test (i = K) is negligible and the conventional critical value can be used. These methods are ad hoc in the sense that no precise Type I error level is guaranteed. They might, however, be viewed as precursors of the more formal procedures to be described below.
Pocock [39–41] modified the repeated testing methods of McPherson and Armitage [24] and developed a group sequential method for clinical trials which avoids many of the limitations of classical methods. He discusses two cases of special interest: one for comparing two proportions and another for comparing mean levels of response. Pocock's method divides the participants into a series of K equal-sized groups with 2n participants in each, n assigned to intervention and n to control. K is the number of times the data will be monitored during the course of the trial, and the total expected sample size is 2nK. The test statistic used to compare control and intervention is computed as soon as data for the first group of 2n participants are available, and recomputed when data from each successive group of 2n participants become known. Under the null hypothesis, the distribution of the test statistic, Z i, is assumed to be approximately normal with zero mean and unit variance, where i indicates the number of groups (i ≤ K) which have complete data. The statistic Z i is compared to the stopping boundaries ±Z K, where Z K has been determined so that for up to K repeated tests, the overall (two-sided) significance level for the trial will be α. For example, if K = 5 and α = 0.05 (two-sided), Z K = 2.413. This critical value is larger than the value of 1.96 used in a single test of hypothesis at α = 0.05. If the statistic Z i falls outside the boundaries at the ith repeated test, the trial should be terminated, rejecting the null hypothesis. If the statistic never falls outside the boundaries, the trial should be continued until i = K (the maximum number of tests). When i = K, the trial would stop and the investigator would "accept" H 0.
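The Pocock critical value can be checked by simulation under the independent-increments model assumed above. The sketch below (function and parameter names are ours) estimates, by Monte Carlo, the probability under the null hypothesis that the standardized statistic exceeds a fixed boundary at any of K = 5 equally spaced looks.

```python
import numpy as np

def crossing_probability(boundary, K=5, n_sim=200_000, seed=1):
    """Estimate P(|Z_i| > boundary for some i <= K) under H0, where Z_i
    is built from cumulative sums of independent standard normal group
    increments, as in the group sequential model."""
    rng = np.random.default_rng(seed)
    # cumulative sums over K looks for n_sim simulated trials
    s = np.cumsum(rng.standard_normal((n_sim, K)), axis=1)
    z = s / np.sqrt(np.arange(1, K + 1))   # standardized statistic at look i
    return float((np.abs(z) > boundary).any(axis=1).mean())
```

Here `crossing_probability(2.413)` returns approximately 0.05, while using the conventional 1.96 at every look inflates the overall Type I error to roughly 0.14, which is precisely the repeated-testing problem the enlarged boundary corrects.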
O’Brien and Fleming [42] also discuss a group sequential procedure. Using the above notation, their stopping rule compares the statistic Z i with Z* √(K/i), where Z* is determined so as to achieve the desired significance level. For example, if K = 5 and α = 0.05, Z* = 2.04. If K ≤ 5, Z* may be approximated by the usual critical values for the normal distribution. One attractive feature is that the critical value used at the last test (i = K) is approximately the same as that used if a single test were done.
In Fig. 17.1, boundaries for the three methods described are given for K = 5 and α = 0.05 (two-sided). If for i < 5 the test statistic falls outside the boundaries, the trial is terminated and the null hypothesis rejected. Otherwise, the trial is continued until i = 5, at which time the null hypothesis is either rejected or “accepted”. The three boundaries have different early stopping properties. The O’Brien–Fleming model is unlikely to lead to stopping in the early stages. Later on, however, this procedure leads to a greater chance of stopping prior to the end of the study than the other two. Both the Haybittle–Peto and the O’Brien–Fleming boundaries avoid the awkward situation of accepting the null hypothesis when the observed statistic at the end of the trial is much larger than the conventional critical value (i.e., 1.96 for a two-sided 5% significance level). If the observed statistic in Fig. 17.1 is 2.3 when i = 5, the result would not be significant using the Pocock boundary. The large critical values used at the first few analyses for the O’Brien–Fleming boundary can be adjusted to some less extreme values (e.g., 3.5) without noticeably changing the critical values used later on, including the final one.
Fig. 17.1
Three group sequential stopping boundaries for the standardized normal statistic (Zi) for up to five sequential groups with two-sided significance level of 0.05 [64]
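The boundary values plotted in Fig. 17.1 can be reproduced directly from the formulas above. A short sketch (the modified Haybittle–Peto boundary uses 3.0 at interim looks and the conventional 1.96 at the final look; variable names are ours):

```python
import math

K = 5
pocock = [2.413] * K                                         # constant boundary
obrien = [2.04 * math.sqrt(K / i) for i in range(1, K + 1)]  # Z* sqrt(K/i)
haybittle = [3.0] * (K - 1) + [1.96]                         # 3.0, then conventional

for i, (p, o, h) in enumerate(zip(pocock, obrien, haybittle), start=1):
    print(f"look {i}: Pocock {p:.2f}  O'Brien-Fleming {o:.2f}  Haybittle-Peto {h:.2f}")
```

The O’Brien–Fleming values 4.56, 3.23, 2.63, 2.28, 2.04 show the extreme early conservatism noted in the text, with a final critical value close to the conventional 1.96.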
Many monitoring committees wish to be somewhat conservative in their interpretation of early results because of the uncertainties discussed earlier and because a few additional events can alter the results substantially. Yet most investigators would like to use conventional critical values in the final analysis, without paying any penalty for the interim analyses; that is, the critical value used in a conventional fixed sample design would be the same as that used in the sequential plan, resulting in no increase in sample size. With that in mind, the O’Brien–Fleming procedure has considerable appeal, perhaps with the adjusted or modified boundary described above: the final critical value at the scheduled end of the trial is very close to the conventional critical value (e.g., 2.05 instead of 1.96) if the number of interim analyses is not excessive (e.g., no more than 10). Group sequential methods have an advantage over the classical methods in that the data do not have to be tested continuously and individual participants do not have to be "paired." This suits the data review activity of most large clinical trials, where monitoring committees meet periodically. Furthermore, in many trials constant consideration of early stopping is unnecessary. Pocock [39–41] discusses the benefits of the group sequential approach in more detail, and other authors describe variations [43–47].
In many trials, participants are entered over a period of time and followed for a relatively long period. Frequently, the primary outcome is time to some event. Instead of adding participants between interim analyses, new events are added. As discussed in Chap. 15, survival analysis methods can be used to compare the experience of the intervention and control arms. Given their general appeal, it would be desirable to use group sequential methods in combination with survival analyses. It has been established for large studies that the log-rank or Mantel–Haenszel statistic [48–53] can be used, and even for small studies the log-rank procedure is quite robust. The Gehan, or modified Wilcoxon, test [54, 55], as defined in Chap. 15, does not always produce interim values with independent increments and so cannot easily be incorporated into the usual group sequential procedures. A generalization of the Wilcoxon procedure for survival data, though, is appropriate [56], and survival methods of analysis can in general be applied in group sequential monitoring. Instead of looking at equal-sized participant groups, the group sequential methods described strictly require that interim analyses be done after each additional equal number of events has been observed. Since monitoring committees usually meet at fixed calendar times, the condition of equal numbers of events may not be met exactly. However, the methods applied under these circumstances are approximately correct [57] if the increments are not too disparate. Other authors have also described the application of group sequential methods to survival data [58–61].
Interim log-rank tests in the Beta-Blocker Heart Attack Trial [62, 63] were evaluated using the O’Brien–Fleming group sequential procedure [42]. Seven meetings had been scheduled to review interim data. The trial was designed for a two-sided 5% significance level. These specifications produce the group sequential boundary shown in Fig. 17.2. In addition, the interim results of the log-rank statistic are also shown for the first six meetings. From the second analysis on, the conventional significance value of 1.96 was exceeded. Nevertheless, the trial was continued. At the sixth meeting, when the O’Brien–Fleming boundary was crossed, a decision was made to terminate the trial with the final mortality curves as seen earlier in Fig. 16.5. However, it should be emphasized that crossing the boundary was not the only factor in this decision.
Fig. 17.2
Six interim log rank statistics plotted for the time of data monitoring committee meetings with a two-sided O’Brien-Fleming significance level boundary in the Beta-Blocker Heart Attack Trial. Dashed line represents Z = 1.96 [63]

Flexible Group Sequential Procedures: Alpha Spending Functions

While the group sequential methods described are an important advance in data monitoring, the Beta-Blocker Heart Attack Trial (BHAT) [62, 63] experience suggested two limitations. One was the need to specify the number K of planned interim analyses in advance. The second was the requirement for equal numbers of either participants or events between each analysis, which also means that the exact times of the interim analyses must be prespecified. As indicated in the BHAT example, the numbers of deaths between analyses were not equal, and exactly seven analyses of the data had been specified. If the monitoring committee had requested an additional analysis between the fifth and sixth scheduled meetings, the O’Brien–Fleming group sequential procedure would not have directly accommodated such a modification. Yet such a request could easily have happened. In order to accommodate unequal numbers of participants or events between analyses and the possibility of more or fewer interim analyses than prespecified, flexible procedures that eliminate those restrictions were developed [64–71]. The authors proposed a so-called alpha spending function which allows investigators to determine how they want to allocate or "spend" the Type I error, or alpha, during the course of the trial. This function guarantees that at the end of the trial the overall Type I error will equal the prespecified value of α. As will be described, this approach is a generalization of the previous group sequential methods, so that the Pocock [39] and O’Brien–Fleming [42] monitoring procedures become special cases.
We must first distinguish between calendar time and information fraction [70, 71]. The information expected from all participants at the planned end of the trial is the total information. At any particular calendar time t during the study, a certain fraction t* of the total information is observed. That fraction may be approximated by the number of participants randomized at that point, n, divided by the total number expected, N, or, in survival studies, by the number of events observed so far, d, divided by the total number expected, D. Thus, the value of t* must be between 0 and 1. More generally, the information fraction is defined as the ratio of the inverse variance of the test statistic at the interim analysis to that at the final analysis. The alpha spending function, α(t*), determines how the prespecified α is allocated at each interim analysis as a function of the information fraction. At the beginning of a trial, t* = 0 and α(t*) = 0, while at the end of the trial, t* = 1 and α(t*) = α. Alpha-spending functions that correspond to the Pocock and O’Brien–Fleming boundaries shown in Fig. 17.1 are indicated in Fig. 17.3 for a two-sided 0.05 α level and five interim analyses. These spending functions correspond to interim analyses at information fractions 0.2, 0.4, 0.6, 0.8, and 1.0, although in practice the information fractions need not be equally spaced. We chose these information fractions to indicate the connection between the earlier discussion of group sequential boundaries and the α spending function. The Pocock-type spending function allocates the alpha more rapidly than the O’Brien–Fleming-type spending function. For the O’Brien–Fleming-type spending function at t* = 0.2, α(0.2) is less than 0.0001, which corresponds approximately to the very large critical or boundary value of 4.56 in Fig. 17.1.
At t* = 0.4, the amount of α which can be spent is α(0.4) − α(0.2) which is approximately 0.0006, corresponding to the boundary value 3.23 in Fig. 17.1. That is, the difference in α(t*) at two consecutive information fractions, t* and t** where t* is less than t**, α(t**) − α(t*), determines the boundary or critical value at t**. Obtaining these critical values consecutively requires numerically integrating a distribution function similar to that for the Pocock boundary and is described elsewhere in detail [68]. Because these spending functions are only approximately equivalent to the Pocock or O’Brien–Fleming boundaries, the actual boundary values will be similar but not exactly the same. However, the practical operational differences are important in allowing greater flexibility in the monitoring process. Programs are available for these calculations [72, 73].
Fig. 17.3
Alpha-spending functions for K = 5, two-sided α = 0.05 at information fractions 0.2, 0.4, 0.6, 0.8, and 1.0. α1(t*) ~ O’Brien-Fleming; α2(t*) ~ Pocock; α3(t*) ~ uniform [74]
Many different spending functions can be specified. The O’Brien–Fleming-type α 1(t*) and Pocock-type α 2(t*) spending functions, together with a power family α 3(t*), are specified as follows:
 $$ \begin{array}{ll}{\alpha}_1\left(t^{*}\right)=2-2\Phi \left({Z}_{\alpha / 2}/\sqrt{t^{*}}\right)\hfill & \sim \mathrm{O}'\mathrm{Brien}\hbox{-} \mathrm{Fleming}\hfill \\ {}{\alpha}_2\left(t^{*}\right)=\alpha\ \ln \left(1+\left(e-1\right)t^{*}\right)\hfill & \sim \mathrm{Pocock}\hfill \\ {}{\alpha}_3\left(t^{*}\right)=\alpha\ {t^{*}}^{\theta}\hfill & \mathrm{for} \ \theta >0\hfill \end{array} $$ 
The spending function α 3(t*) spends alpha uniformly during the trial for θ = 1, at a rate somewhat between α 1(t*) and α 2(t*). Other spending functions have also been defined [75, 76].
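As a check on these definitions, the three spending functions can be evaluated at the information fractions of Fig. 17.3. A minimal sketch (using Python's standard normal distribution; all names are ours):

```python
from math import sqrt, log, e
from statistics import NormalDist

N = NormalDist()

def alpha1(t, alpha=0.05):
    """O'Brien-Fleming-type spending: almost no alpha spent early."""
    return 2.0 - 2.0 * N.cdf(N.inv_cdf(1.0 - alpha / 2.0) / sqrt(t))

def alpha2(t, alpha=0.05):
    """Pocock-type spending: alpha spent more rapidly."""
    return alpha * log(1.0 + (e - 1.0) * t)

def alpha3(t, alpha=0.05, theta=1.0):
    """Power family; theta = 1 spends alpha uniformly."""
    return alpha * t ** theta

for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t*={t:.1f}  OBF={alpha1(t):.5f}  Pocock={alpha2(t):.5f}  uniform={alpha3(t):.5f}")
```

All three functions reach 0.05 at t* = 1, but α1(0.2) is below 0.0001 while α2(0.2) is already about 0.015, matching the contrast drawn above between the two types.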
The advantage of the alpha-spending function is that neither the number nor the timing of the interim analyses needs to be specified in advance. Once the particular spending function is selected, the information fractions t 1*, t 2*, … determine the critical or boundary values exactly. In addition, the frequency of the interim analyses can be changed during the trial while still preserving the prespecified α level. Even if the rationale for changing the frequency depends on the emerging trends, the impact on the overall Type I error rate is almost negligible [77, 78]. These advantages give the spending function approach to group sequential monitoring the flexibility in analysis times that is often required in actual clinical trial settings [79]. It must be emphasized, however, that no change of the spending function itself is permitted during the trial. Other authors have discussed additional aspects of this approach [80–82].
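The numerical integration that converts increments of spent alpha into boundary values can be sketched as a grid-based recursion. This is an illustrative implementation under the independent-increments model, not the validated software of [72, 73]; the grid size, bisection tolerance, and all names are our choices. At each look it solves for the boundary that spends the allotted alpha increment, then updates the sub-density of the score-scale statistic restricted to the continuation region.

```python
import numpy as np
from math import erf, sqrt, pi, log, e

_erf = np.vectorize(erf)

def Phi(x):
    """Standard normal CDF, vectorized over numpy arrays."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / sqrt(2.0)))

def trap(y, dx):
    """Trapezoidal rule on a uniform grid."""
    return float((0.5 * (y[0] + y[-1]) + y[1:-1].sum()) * dx)

def spending_boundaries(times, spend, n_grid=2001):
    """Two-sided boundaries b_k such that the cumulative exit probability
    under H0 at information fraction t_k equals spend(t_k).  Works with
    the score-scale statistic S_k (variance t_k), whose increments
    between looks are independent normals."""
    bounds = []
    grid = dens = None               # continuation sub-density of S_{k-1}
    spent = prev_t = 0.0
    for t in times:
        target = spend(t) - spent    # alpha increment to spend at this look
        dt = t - prev_t

        def exit_prob(b):
            if dens is None:         # first look: Z_1 ~ N(0, 1)
                return float(2.0 * (1.0 - Phi(b)))
            c = b * sqrt(t)          # boundary on the score scale
            tail = (1.0 - Phi((c - grid) / sqrt(dt))
                    + Phi((-c - grid) / sqrt(dt)))
            return trap(tail * dens, grid[1] - grid[0])

        lo, hi = 0.0, 12.0           # bisection: exit_prob decreases in b
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if exit_prob(mid) > target:
                lo = mid
            else:
                hi = mid
        b = 0.5 * (lo + hi)

        c = b * sqrt(t)              # update the continuation density
        new_grid = np.linspace(-c, c, n_grid)
        if dens is None:
            new_dens = np.exp(-new_grid**2 / (2.0 * t)) / sqrt(2.0 * pi * t)
        else:
            dx = grid[1] - grid[0]
            new_dens = np.array([
                trap(dens * np.exp(-(s - grid)**2 / (2.0 * dt))
                     / sqrt(2.0 * pi * dt), dx)
                for s in new_grid])
        grid, dens = new_grid, new_dens
        bounds.append(b)
        spent, prev_t = spend(t), t
    return bounds

# Spending functions from the text, two-sided alpha = 0.05
obf_spend = lambda t: float(2.0 - 2.0 * Phi(1.959964 / sqrt(t)))
pocock_spend = lambda t: 0.05 * log(1.0 + (e - 1.0) * t)

fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
obf_bounds = spending_boundaries(fractions, obf_spend)
pocock_bounds = spending_boundaries(fractions, pocock_spend)
print([round(b, 3) for b in obf_bounds])
print([round(b, 3) for b in pocock_bounds])
```

For the O’Brien–Fleming-type function the first boundary is 1.96/√0.2 ≈ 4.38, close to the 4.56 of the classical boundary in Fig. 17.1, and the Pocock-type boundaries come out nearly constant around 2.4, illustrating the "similar but not exactly the same" point made above.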

Applications of Group Sequential Boundaries

As indicated in the BHAT example [62, 63], the standardized logrank test can be compared to the standardized boundaries provided by the O’Brien–Fleming, Pocock, or α spending function approach. These group sequential methods are, however, widely applicable across statistical tests. Under very general conditions, any statistic testing a single parameter from a parametric or semiparametric model is normally or asymptotically normally distributed with independent increments of information between interim analyses, which is sufficient for this approach [83, 84]. Many of the test statistics commonly used in clinical trials have this feature. Besides the logrank and other survival tests, comparisons of means, comparisons of proportions [39, 85], and comparisons of linear regression slopes [86–91] can be monitored using this approach. For means and proportions, the information fraction can be approximated by the ratio of the number of participants observed to the total expected. For regression slopes, the information fraction is best determined from the ratio of the inverse variance of the regression slope difference computed for the current estimate to that expected for the final estimate [86, 90, 91]. Considerable work has extended the group sequential methodology to more general linear and nonlinear random effects models for continuous data and to repeated measures methods for categorical data [83, 84, 92]. Thus, for most of the statistical tests that would be applied to common primary outcome measures in a clinical trial setting, the flexible group sequential methods can be used directly.
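The two ways of approximating the information fraction mentioned above can be written down directly. A small sketch (names are ours; the inputs in the comments are hypothetical):

```python
def information_fraction_counts(observed, expected):
    """Means or proportions (or survival events): number observed so far
    divided by the total expected at the planned end of the trial."""
    if expected <= 0:
        raise ValueError("expected total must be positive")
    return min(observed / expected, 1.0)

def information_fraction_variance(var_current, var_final):
    """General definition: statistical information (inverse variance of the
    effect estimate) at the interim analysis relative to the final analysis."""
    return min((1.0 / var_current) / (1.0 / var_final), 1.0)

# e.g., 150 of an expected 600 events, or a current variance four times
# the anticipated final variance, both give t* = 0.25
print(information_fraction_counts(150, 600))
print(information_fraction_variance(0.04, 0.01))
```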
If the trial continues to the scheduled termination point, a p value is often computed to indicate the extremeness of the result. If the standardized statistical test exceeds the critical value, the p value will be less than the corresponding significance level. If a trial is terminated early, or continues to the end with the standardized test exceeding or crossing the boundary value, a p value can also be computed [93]. These p values cannot be the nominal p values corresponding to the standardized test statistic; they must be adjusted to account for the repeated statistical testing of the outcome measure and for the particular monitoring boundary employed. Calculation of the p value is relatively straightforward with existing software packages [72, 73].
Statistical tests of hypotheses are but one of the methods used to evaluate the results of a clinical trial. Once trials are terminated, either on schedule or earlier, confidence intervals (CIs) are often used to give some sense of the uncertainty in the estimated treatment or intervention effect. For a fixed sample study, CIs are typically constructed as
 $$ \left(\mathrm{effect}\ \mathrm{estimate}\right)\pm Z\left(\alpha \right)\ \mathrm{S}\mathrm{E}\left(\mathrm{estimate}\right) $$
where SE is the standard error of the estimate.
In the group sequential monitoring setting, this CI will be referred to as the naive estimate, since it does not take into account the sequential testing aspects. In general, construction of CIs following the termination of a clinical trial is not as straightforward [94–107], but software exists to aid in the computations [72]. The major problem with naive CIs is that they may not give proper coverage of the unknown but estimated treatment effect. That is, CIs constructed in this way may not include the true effect with the specified frequency (e.g., 95%); for example, the width of the CI may be too narrow. Several methods have been proposed for constructing a more proper CI [94–107], typically by ordering the possible outcomes in different ways. That is, a method is needed to determine whether a treatment effect at one time is more or less extreme than a difference at another time. None of the proposed methods appears to be universally superior, but the ordering originally suggested by Siegmund [104] and adopted by Tsiatis et al. [105] appears to be quite adequate in most circumstances. In this ordering, any treatment comparison statistic which exceeds the group sequential boundary at one time is considered to be more extreme than any result which exceeds the sequential boundary at a later time. While construction of CIs using this ordering of possible outcomes can break down, the circumstances are almost always quite unusual and not likely to occur in practice [107]. It is also interesting that for conservative monitoring boundaries such as the O’Brien–Fleming method, the naive CI does not perform that poorly, due primarily to the extreme early conservatism of the boundary [103]. While more exact CIs can be computed for this case, the naive estimate may still prove useful as a quick estimate to be recalculated later using the method described [105].
Pocock and Hughes [102] have suggested that the point estimate of the effect of the intervention should also be adjusted, since trials that are terminated early tend to exaggerate the size of the true treatment difference. Others have also pointed out the bias in the point estimate [96, 101]. Kim [101] suggested that an estimate of the median is less biased.
CIs can also be used in another manner in the sequential monitoring of interim data. At each interim analysis, a CI could be constructed for the parameter summarizing the intervention effect, such as a difference in means or proportions, or a hazard ratio. These are referred to as repeated confidence intervals (RCIs) [95, 98, 99]. If the RCI excludes a null difference, or no intervention effect, the trial might be stopped, claiming a significant effect, either beneficial or harmful. It is also possible to continue the trial unless the CI excludes not only no difference but also minimal or clinically unimportant differences. On the other hand, if all values of clinically meaningful treatment differences are ruled out or fall outside the CI, the trial might be stopped, claiming that no useful clinical effect is likely. This approach is useful for non-inferiority designs, as described earlier in Chap. 5. Here, as for CIs following termination, the naive CI is not appropriate. Jennison and Turnbull [98, 99] have suggested one method for RCIs that basically inverts the group sequential test. That is, the CI has the same form as the naive estimate, but the coefficient is the standardized boundary value as determined, for example, by the spending function. The RCI then has the following form:
 $$ \left(\mathrm{treatment}\ \mathrm{difference}\right)\pm Z(k)\mathrm{S}\mathrm{E}\left(\mathrm{difference}\right) $$
where Z(k) is the sequential boundary value at the kth interim analysis. For example, using the O’Brien–Fleming boundaries shown in Fig. 17.1, we would have a coefficient of 4.56 at k = 1,  $$ {t}_1^{*}=0.2 $$ and 3.23 at k = 2,  $$ {t}_2^{*}=0.4 $$ . Used in this manner, the RCI and the sequential test of the null hypothesis will yield the same conclusions.
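This computation can be sketched directly from the form above, assuming the classical O’Brien–Fleming coefficients Z(k) = Z*√(K/k) with Z* = 2.04 and K = 5 from Fig. 17.1; the treatment difference and standard error below are hypothetical numbers for illustration.

```python
import math

def rci(difference, se, k, K=5, z_star=2.04):
    """Repeated confidence interval at the k-th of K analyses using the
    classical O'Brien-Fleming coefficient Z(k) = z_star * sqrt(K / k)."""
    z_k = z_star * math.sqrt(K / k)
    return (difference - z_k * se, difference + z_k * se)

# Hypothetical interim results: same point estimate, shrinking standard error
print(rci(0.10, 0.05, k=1))   # wide early interval, coefficient ~4.56
print(rci(0.10, 0.04, k=5))   # final interval, coefficient 2.04
```

Early on the interval easily covers zero; by the final look the same point estimate may exclude it, reproducing the agreement between the RCI and the sequential test noted above.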
One particular application of the RCI is for trials whose goal is to demonstrate that two interventions or treatments are essentially equivalent, that is, have an effect that is considered to be within a specified acceptable range and might be used interchangeably. As indicated in Chap. 5, clinicians might select the cheaper, less toxic or less invasive intervention if the effects were close enough. One suggestion for “close enough” or “equivalence” would be treatments whose effects are within 20% [108, 109]. Thus, RCIs that are contained within a 20% range would suggest that the results are consistent with this working definition of equivalence. For example, if the relative risks were estimated along with a RCI, the working range of equivalence would be from 0.8 to 1.2, where large values indicate inferiority of the intervention being tested. The trial would continue as long as the upper limit of the RCI exceeded 1.2 since we would not have ruled out a treatment worsening by 20% or more. Depending on the trial and the interventions, the trial might also continue until the lower limit of the RCI was larger than 0.8, indicating no improvement by 20% or greater.
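The monitoring logic just described for the ±20% working range of equivalence can be sketched as a simple decision rule applied to each interim RCI for the relative risk (values above 1 indicating inferiority of the tested intervention); the interval endpoints in the example are hypothetical.

```python
def equivalence_decision(rr_lower, rr_upper, margin_low=0.8, margin_high=1.2):
    """Interpret a repeated confidence interval for a relative risk against
    a working range of equivalence (margin_low, margin_high)."""
    if margin_low <= rr_lower and rr_upper <= margin_high:
        return "stop: consistent with equivalence"
    if rr_lower > margin_high:           # even the lower limit exceeds the margin
        return "stop: inferior beyond margin"
    if rr_upper < margin_low:            # even the upper limit is below the margin
        return "stop: superior beyond margin"
    return "continue"

print(equivalence_decision(0.85, 1.15))  # RCI inside the working range
print(equivalence_decision(0.70, 1.30))  # cannot yet rule out a 20% difference
```

Whether a trial would actually stop on the "superior" branch depends on its goals, as the text notes; the rule only illustrates how the RCI limits map onto the margins.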
As described in Chap. 5, there is a fundamental difference between an “equivalence” design and a noninferiority design. The former is a two-sided test, with the aim of establishing a narrow range of possible differences between the new intervention and the standard, or that any difference is within a narrow range. The noninferiority design aims to establish that the new intervention is no worse than the standard by some prespecified margin. It may be that the margins in the two designs are set to the same value. From a data monitoring point of view, both of these designs are best handled by sequential CIs [99]. As data emerge, the RCI takes into consideration the event rate or variability, the repeated testing aspects, and the level of the CI. The upper and lower boundaries can address either the “equivalence” point of view or the noninferiority margin of indifference.

Asymmetric Boundaries

In most trials, the main purpose is to test whether the intervention is superior to the control. It is rarely ethical to continue a study in order to prove, at the usual levels of significance, that the intervention is harmful relative to a placebo or standard control. This point has been made by authors [110, 111] who discuss methods for group sequential designs in which the hypothesis to be tested is one-sided, that is, whether the intervention is superior to the control. They proposed retaining the group sequential upper boundaries of methods such as Pocock, Haybittle–Peto, or O’Brien–Fleming for rejection of H 0 while suggesting various forms of a lower boundary which would imply “acceptance” of H 0. One simple approach is to set the lower boundary at an arbitrary value of Z i , such as −1.5 or −2.0. If the test statistic goes below that value, the data may be sufficiently suggestive of a harmful effect to justify terminating the trial. This asymmetric boundary attempts to reflect the behavior or attitude of members of many monitoring committees, who recommend stopping a study once the intervention shows a strong, but nonsignificant, trend in an adverse direction for major events. Emerson and Fleming [112] recommend a lower boundary for acceptance of the null hypothesis which allows the upper boundary to be changed in order to preserve the Type I error exactly. Work by Gould and Pecore [113] suggests ways for early acceptance of the null hypothesis while also incorporating costs. For new interventions, trials might well be terminated when the chances of a positive or beneficial result seem remote (discussed in the next section). However, if the intervention arm is being compared to a standard but the intervention is already in widespread use, it may be important to distinguish between lack of benefit and harm [114].
For example, if the intervention is not useful for the primary outcome, and also not harmful, it may still have benefits such as on other secondary clinical outcomes, quality of life, or fewer adverse events that would still make it a therapeutic option. In such cases, a symmetric boundary for the primary outcome might be appropriate.
An example of asymmetric group sequential boundaries is provided by the Cardiac Arrhythmia Suppression Trial (CAST). Two arms of the trial (encainide and flecainide, each vs. placebo) were terminated early using a symmetric two-sided boundary, although the lower boundary for harm was described as advisory by the authors [115–117]. The third comparison (moricizine vs. placebo) continued. However, due to the experience with the encainide and flecainide arms, the lower boundary for harm was revised to be less stringent than originally planned, i.e., an asymmetric boundary was used [115].
MERIT-HF used a modified version of the Haybittle–Peto boundary for benefit, requiring a critical value near +3.0, and a similar but asymmetric boundary, close to a critical Z value of −2.5, for harm, as shown in Fig. 17.4. In addition, at least 50% of the designed person-years of exposure were to be observed before early termination could be recommended. The planned interim analyses to consider benefit were at 25, 50, and 75% of the expected target number of events. Because there was a concern that treating heart failure with a beta-blocker might be harmful, the monitoring committee was required to evaluate safety on a monthly basis using the lower sequential boundary as a guide. At the 25% interim analysis, the statistic for the logrank test was +2.8, just short of the boundary for benefit. At the 50% interim analysis, the observed logrank statistic was +3.8, clearly exceeding the sequential boundary for benefit; the desired person-years of exposure had also been met, as plotted in Fig. 17.4. Details of this experience are described elsewhere [118]. A more detailed presentation of group sequential methods for interim analysis of clinical trials may be found in books by Jennison and Turnbull [119] and Proschan, Lan, and Wittes [120].
Fig. 17.4
MERIT-HF group sequential monitoring bounds for mortality [118]

Curtailed Sampling and Conditional Power Procedures

During the course of monitoring accumulating data, one question often posed is whether the current trend in the data is so impressive that “acceptance” or rejection of H 0 is already determined, or at least close to being determined. If the results of the trial are such that the conclusions are known for certain, no matter what the future outcomes might be, then consideration of early termination is in order. A helpful sports analogy is a baseball team “clinching the pennant” after winning a specific game. At that time, it is known for certain who has won and who has not won the pennant or league championship, regardless of the outcome of the remaining games. Playing the remaining games is done for reasons (e.g., fiscal) other than deciding the winner. This idea has been developed for clinical trials and is often referred to as deterministic curtailed sampling. It should be noted that group sequential methods focus on existing data while curtailed sampling in addition considers the data which have not yet been observed.
Alling [121, 122] developed a closely related approach when he considered the early stopping question and compared the survival experience in two groups. He used the Wilcoxon test for two samples, a frequently used non-parametric test which ranks survival times and which is the basis for one of the primary survival analysis techniques. Alling’s method allows stopping decisions to be based on data available during the trial. The trial would be terminated if future data could not change the final conclusion about the null hypothesis. The method is applicable whether all participants are entered at the same time or recruitment occurs over a longer period of time. However, when the average time to the event is short relative to the time needed to enroll participants, the method is of limited value. The repeated testing problem is irrelevant, because any decision to reject the null hypothesis is based on what the significance test will be at the end of the study. Therefore, frequent use of this procedure during the trial causes no problem with regard to significance level and power.
Many clinical trials with survival time as a response variable have observations that are censored; that is, participants are followed for some length of time and then at some point, no further information about the participant is known or collected. Halperin and Ware [123] extended the method of Alling to the case of censored data, using the Wilcoxon rank statistic. With this method, early termination is particularly likely when the null hypothesis is true or when the expected difference between groups is large. The method is shown to be more effective for small sample sizes than for large studies. The Alling approach to early stopping has also been applied to another commonly used test, the Mantel–Haenszel statistic. However, the Wilcoxon statistic appears to have better early stopping properties than the Mantel–Haenszel statistic.
A deterministic curtailed procedure has been developed [124] for comparing the means of two bounded random variables using the two sample t-test. It assumes that the response must be between two values, A and B (A < B). An approximate solution is an extreme case approach. First, all the estimated remaining responses in one group are given the maximum favorable outcome and all the remaining responses in the other take on the worst response. The statistic is then computed. Next, the responses are assigned in the opposite way and a second statistic is computed. If neither of these two extreme results alters the conclusion, no additional data are necessary for testing the hypothesis. While this deterministic curtailed approach provides an answer to an interesting question, the requirement for absolute certainty results in a very conservative test and allows little opportunity for early termination.
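The extreme-case approach can be made concrete. Below is a minimal Python sketch, assuming responses bounded in [lo, hi], a pooled two-sample t statistic, and a fixed critical value in place of the exact curtailed test; the function names and return labels are ours.

```python
from math import sqrt

def two_sample_t(x, y):
    """Pooled two-sample t statistic comparing means of x and y."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx2 = sum((v - mx) ** 2 for v in x) / (nx - 1)
    sy2 = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2)
    return (mx - my) / sqrt(sp2 * (1 / nx + 1 / ny))

# Extreme-case check: complete the m_x and m_y remaining responses once in
# the most favorable way for group x and once in the least favorable way;
# if both extremes give the same conclusion, future data cannot change it.
def verdict_fixed(x, y, m_x, m_y, lo, hi, t_crit=1.96):
    t_best = two_sample_t(x + [hi] * m_x, y + [lo] * m_y)   # favors x
    t_worst = two_sample_t(x + [lo] * m_x, y + [hi] * m_y)  # favors y
    if t_worst > t_crit:
        return "reject H0 regardless of future data"
    if t_best < t_crit:
        return "cannot reject H0 regardless of future data"
    return "continue: conclusion still depends on future data"
```

As the text notes, requiring certainty under both extremes is very conservative, so in practice the “continue” branch is by far the most common outcome until very late in a trial.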
In some clinical trials, the final outcome may not be absolutely certain, but almost so. To use the baseball analogy again, a first place team may not have clinched the pennant but is so many games in front of the second place team that it is highly unlikely that it will not, in fact, end up the winner. Another team may be so far behind that it cannot “realistically” catch up. In clinical trials, this idea is often referred to as stochastic curtailed sampling or conditional power. It is identical to the concept of conditional power discussed in the section on extending a trial.
One of the earliest applications of the concept of conditional power was in the CDP [1, 125]. In this trial, several treatment arms for evaluating cholesterol lowering drugs produced negative trends in the interim results. Through simulation, the probability of achieving a positive or beneficial result was calculated given the observed data at the time of the interim analysis. Unconditional power is the probability at the beginning of the trial of achieving a statistically significant result at a prespecified alpha level and with a prespecified alternative treatment effect. Ideally, trials should be designed with a power of 0.80–0.90 or higher. However, once data begin to accumulate, the probability of attaining a significant result increases or decreases with emerging positive or negative trends. Calculating the probability of rejecting the null hypothesis of no effect once some data are available is conditional power.
Lan et al. [126] considered the effect of stochastic curtailment or conditional power procedures on Type I and Type II error rates. If the null hypothesis, H 0, is tested at time t using a statistic, S(t), then at the scheduled end of a trial at time T, the statistic would be S(T). Two cases are considered. First, suppose a trend in favor of rejecting H 0 is observed at time t < T, with intervention doing better than control. One then computes the conditional probability, γ0, of rejecting H 0 at time T; that is, of S(T) > Z α , assuming H 0 to be true and given the current data, S(t). If this probability is sufficiently large, one might argue that the favorable trend is not going to disappear. Second, suppose a negative trend, or data consistent with the null hypothesis of no difference, is observed at some time t. Then one computes the conditional probability, γ1, of rejecting H 0 at the end of the trial, time T, given that some alternative H 1 is true, for a sample of reasonable alternatives. This essentially asks how large the true effect must be before the current “negative” trend is likely to be reversed. If a trend reversal is highly unlikely for a realistic range of alternative hypotheses, trial termination might be considered.
Because there is a small probability that the results will change, a slightly greater risk of a Type I or Type II error will exist than there would be if the trial continued to the scheduled end [127]. However, it has been shown that the Type I error is bounded very conservatively by α/γ0 and the Type II error by β/γ1. For example, if the probability of rejecting the null hypothesis, given the existing data, were 0.85, then the actual Type I error would be no more than 0.05/0.85 or 0.059, instead of 0.05. The actual upper limit is considerably closer to 0.05, but that calculation requires computer simulation. Calculation of these probabilities is relatively straightforward and the details have been described by Lan and Wittes [128]. A summary of these methods, using the approach of DeMets [74], follows.
Let Z(t) represent the standardized statistic at information fraction t. The information fraction may be defined, for example, as the proportion of expected participants or events observed so far. The conditional power, CP, for some alternative intervention effect θ, using a critical value of Z α for a Type I error of alpha, can be calculated as
 $$ P\left[Z(1)\ge {Z}_{\alpha}\Big|Z(t),\theta \right]=1-\varPhi \left\{\left[{Z}_{\alpha }-Z(t)\sqrt{t}-\theta \left(1-t\right)\right]/\sqrt{1-t}\right\} $$
where θ = E(Z(t = 1)), the expected value of the test statistic at the full completion of the trial.
The alternative θ is defined for various outcomes as follows:
  1. Survival outcome (D = total events)
     $$ \theta =\sqrt{D/4}\; \log \left({\lambda}_C/{\lambda}_T\right) $$
    λ C  and λ T are the hazard rates in the control and intervention arms, respectively.
     
  2. Binomial outcome (2n = N, n per arm, N = total sample size)
     $$ \theta =\frac{{P}_C-{P}_T}{\sqrt{2\overline{p}\left(1-\overline{p}\right)/\left(N/2\right)}}=\frac{\left({P}_C-{P}_T\right)\sqrt{N/4}}{\sqrt{\overline{p}\left(1-\overline{p}\right)}}=\frac{1}{2}\,\frac{\left({P}_C-{P}_T\right)\sqrt{N}}{\sqrt{\overline{p}\,\overline{q}}} $$
    where P C and P T are the event rates in the control arm and intervention arm respectively and  $$ \overline{p} $$ is the common event rate.
     
  3. Continuous outcome (means) (N = total sample size)
     $$ \theta =\left(\frac{{\mu}_C-{\mu}_T}{\sigma}\right)\sqrt{N/4}=\frac{1}{2}\left(\frac{{\mu}_C-{\mu}_T}{\sigma}\right)\sqrt{N} $$
    where μ C and μ T are the mean response levels for the control and the intervention arms, respectively, and σ is the common standard deviation.
     
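The conditional power formula and the three drift parameters above can be collected into a few lines of Python. This is a sketch using the standard library’s normal distribution; the function names are ours, and the common event rate is approximated by the average of the two arms’ rates.

```python
from math import sqrt, log
from statistics import NormalDist

# Conditional power of rejecting H0 at the end of the trial (information
# fraction t = 1), given the interim statistic z_t at fraction t and a
# drift parameter theta = E[Z(1)] under the assumed effect.
def conditional_power(z_t, t, theta, z_alpha=1.96):
    num = z_alpha - z_t * sqrt(t) - theta * (1 - t)
    return 1 - NormalDist().cdf(num / sqrt(1 - t))

# Drift parameter theta for the three outcome types in the text.
def theta_survival(d_total, hazard_c, hazard_t):
    return sqrt(d_total / 4) * log(hazard_c / hazard_t)

def theta_binomial(n_total, p_c, p_t):
    p_bar = (p_c + p_t) / 2          # common event rate (simple average)
    return 0.5 * (p_c - p_t) * sqrt(n_total) / sqrt(p_bar * (1 - p_bar))

def theta_means(n_total, mu_c, mu_t, sigma):
    return 0.5 * (mu_c - mu_t) / sigma * sqrt(n_total)
```

For example, with D = 400 total events and a hazard ratio of 0.75, θ is about 2.88, and an interim Z of 1.0 at t = 0.5 gives a conditional power of roughly 0.60 under that alternative.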
If we specify a particular value of the conditional power as γ, a boundary can be produced such that, if the test statistic falls below it, the chance of finding a significant result at the end of the trial is less than γ [127]. For example, in Fig. 17.5 the lower futility boundaries are based on specified conditional powers γ, ranging from 10 to 30%, for finding a positive beneficial result at the end of the trial. If the standardized statistic crosses the 20% lower boundary, for instance, the conditional power for a beneficial result at the end of the trial is less than 0.20 for the specified alternative.
Fig. 17.5
Conditional power boundaries: outer boundaries represent symmetric O’Brien-Fleming type sequential boundaries (α = 0.05). Three lower boundaries represent boundaries for 10%, 20% and 30% conditional power to achieve a significant (P < 0.05) result of the trial conclusion [74]
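Boundaries like those in Fig. 17.5 can be generated by inverting the conditional power formula: the boundary at information fraction t is the interim Z for which conditional power equals γ. A Python sketch, assuming Z α = 1.96 and a design drift of θ = 3.24 (roughly 90% power); these inputs are illustrative and not the exact values behind the figure.

```python
from math import sqrt
from statistics import NormalDist

# Interim Z below which conditional power (for drift theta) falls under
# gamma; i.e., a lower futility boundary of the kind shown in Fig. 17.5.
# Obtained by solving CP(z, t, theta) = gamma for z.
def futility_boundary(t, theta, gamma, z_alpha=1.96):
    z_gamma = NormalDist().inv_cdf(1 - gamma)
    return (z_alpha - theta * (1 - t) - z_gamma * sqrt(1 - t)) / sqrt(t)

# Illustrative 20% boundary at several information fractions, assuming a
# design drift theta = 3.24.
bounds_20 = [futility_boundary(t, 3.24, 0.20) for t in (0.25, 0.50, 0.75)]
```

The boundary rises toward Z α as t approaches 1, and a larger γ gives a higher boundary, mirroring the ordering of the three lower curves in the figure.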
Conditional power calculations are done for a specific alternative but in practice, a monitoring committee would likely consider a range of possibilities. These specified alternatives may range between the null hypothesis of no effect and the prespecified design based alternative treatment effect. In some cases, a monitoring committee may consider even more extreme beneficial effects to determine just how much more effective the treatment would have to be to raise the conditional power to desired levels. These conditional power results can be summarized in a table or a graph, and then monitoring committee members can assess whether they believe recovery from a substantial negative trend is likely.
Conditional power calculations were utilized in the Vesnarinone in Heart Failure Trial (VEST) [129]. In Table 17.1, the test statistics for the logrank test are provided for the information fractions at a series of monitoring committee meetings. Table 17.2 provides conditional power for VEST at three of the interim analyses. A range of intervention effects was used, from the beneficial effect (relative risk less than 1) seen in a previous vesnarinone trial to the observed negative trend (relative risks of 1.3 and 1.5). It is clear that the conditional power for a beneficial effect was very low by the midpoint of this trial for a null effect or worse. In fact, the conditional power was not encouraging even for the originally assumed effect. As described by DeMets et al. [114], the trial continued beyond this point due to the existence of a previous trial that indicated a large reduction in mortality, rather than the harmful effect observed in VEST.
Table 17.1
Accumulating results for the Vesnarinone in Heart Failure Trial (VEST) [129]

Information fraction   Log-rank Z-value (high dose)
0.043                  +0.99
0.19                   −0.25
0.34                   −0.23
0.50                   −2.04
0.60                   −2.32
0.67                   −2.50
0.84                   −2.22
0.90                   −2.43
0.95                   −2.71
1.0                    −2.41
Table 17.2
Conditional power for the Vesnarinone in Heart Failure Trial (VEST) [129]

        Information fraction
RR      0.50     0.67     0.84
0.50    0.46     <0.01    <0.01
0.70    0.03     <0.01    <0.01
1.0     <0.01    <0.01    <0.01
1.3     <0.01    <0.01    <0.01
1.5     <0.01    <0.01    <0.01

RR = relative risk
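The RR = 1.0 row of Table 17.2 can be checked directly from the interim statistics in Table 17.1, since under the null hypothesis the drift θ is zero and no event-count assumption is needed. A short Python sketch:

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_t, t, theta, z_alpha=1.96):
    num = z_alpha - z_t * sqrt(t) - theta * (1 - t)
    return 1 - NormalDist().cdf(num / sqrt(1 - t))

# Interim logrank statistics (information fraction, Z) from Table 17.1.
interims = [(0.50, -2.04), (0.67, -2.50), (0.84, -2.22)]
# Under RR = 1.0 the drift theta is 0, so conditional power for a
# beneficial result depends only on the observed statistic and t.
results = {t: conditional_power(z, t, theta=0.0) for t, z in interims}
```

All three values are far below 0.01, consistent with the table; reproducing the beneficial-alternative rows would additionally require the trial’s target number of events to translate a relative risk into θ.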
The Beta-Blocker Heart Attack Trial [62, 63] made considerable use of this approach. As discussed, the interim results were impressive with 1 year of follow-up still remaining. One question posed was whether the strong favorable trend (Z = 2.82) could be lost during that year. The probability of rejecting H 0 at the scheduled end of the trial, given the existing trend (γ 0), was approximately 0.90. This meant that the false positive or Type I error was no more than α/γ 0 = 0.05/0.90 or 0.056.

Other Approaches

Other techniques for interim analysis of accumulating data have also received attention. These include binomial sampling strategies [15], decision theoretic models [130], and likelihood or Bayesian methods [131–140]. Bayesian methods require specifying a prior probability on the possible values of the unknown parameter. The experiment is performed and, based on the data obtained, the prior probability is adjusted. If the adjustment is large enough, the investigator may change his opinion (i.e., his prior belief). Spiegelhalter et al. [139] and Freedman et al. [135] have implemented Bayesian methods that have frequentist properties very similar to boundaries of either the Pocock or O’Brien–Fleming type. It is somewhat reassuring that two methodologies, even from different theoretical frameworks, can provide similar monitoring procedures. While the Bayesian view is critical of hypothesis testing methods because of the arbitrariness involved, the Bayesian approach is perhaps hampered mostly by the requirement that the investigator formally specify a prior probability. However, if a person uses all of the factors and methods discussed in this chapter during the decision-making process, a Bayesian approach is involved, although in a very informal way.
One Bayesian method to assess futility that has been used extensively is referred to as predictive power and is related to the concept of conditional power. In this case, the series of possible alternative intervention effects, θ, are represented by a prior distribution for θ, distributing the probability across the alternatives. The prior probability distribution can be modified by the current trend to give an updated posterior for θ. The conditional power is calculated as before for a specific value of θ. Then a predictive or “average” power is calculated by integrating the conditional power over the posterior distribution for θ:
 $$ P\left({X}_f\in R\mid {x}_0\right)={\displaystyle \int P\left({X}_f\in R\mid \theta \right)\,p\left(\theta \mid {x}_0\right)\;d\theta } $$
This can then be utilized by the monitoring committee to assess whether the trial is still viable, as was computed for the interim analyses conducted in VEST [129] as shown in Table 17.3. In this case, the prior was taken from an earlier trial of vesnarinone where the observed reduction in mortality was over 60% (relative risk = 0.40). For these calculations, the prior was first set at the point estimate of the hazard ratio equal to 0.40. Using this approach, it is clear that VEST would not likely have shown a benefit at the end of the trial.
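Predictive power can be computed by averaging the conditional power over the posterior for θ, here by simple numerical integration assuming a normal posterior. The posterior mean and standard deviation used below are placeholders for illustration, not the VEST values.

```python
from math import sqrt, exp, pi
from statistics import NormalDist

def conditional_power(z_t, t, theta, z_alpha=1.96):
    num = z_alpha - z_t * sqrt(t) - theta * (1 - t)
    return 1 - NormalDist().cdf(num / sqrt(1 - t))

# Predictive ("average") power: integrate conditional power against a
# normal posterior for theta with mean mu and standard deviation tau,
# using a simple grid approximation to the integral in the text.
def predictive_power(z_t, t, mu, tau, n_grid=2001, width=6):
    lo, hi = mu - width * tau, mu + width * tau
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        theta = lo + i * step
        dens = exp(-0.5 * ((theta - mu) / tau) ** 2) / (tau * sqrt(2 * pi))
        total += conditional_power(z_t, t, theta) * dens * step
    return total
```

As a sanity check, when the posterior is very concentrated (small tau) the predictive power collapses to the conditional power at the posterior mean.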
Table 17.3
Predictive probability for the Vesnarinone in Heart Failure Trial (VEST) [129]

Date       Ta      Probability (hazard rate = 0.40)
2/7/96     0.50    0.28
3/7/96     0.60    0.18
4/10/96    0.67    <0.0001
5/19/96    0.84    <0.0001
6/26/96    0.90    <0.0001

aT = information fraction
We have stated that the monitoring committee should be aware of all the relevant information in the use of the intervention which existed before the trial started and which emerges during the course of a trial. Some have argued that all of this information should be pooled or incorporated and updated sequentially in a formal statistical manner [141]. This is referred to as cumulative meta-analysis (see Chap. 18). We do not generally support cumulative or sequential meta-analysis as a primary approach for monitoring a trial. We believe that the results of the ongoing trial should be first presented alone, in detail, including baseline comparisons, primary and secondary outcomes, adverse events and relevant laboratory data (see Chap. 16). As supportive evidence for continuation or termination, results or other analysis from external completed trials may be used, including a pooled analysis of all available external data.

Trend Adaptive Designs and Sample Size Adjustments

Sample size adjustments based on overall event rates or outcome variability, without knowledge of interim trends, have long been performed to regain trial power, with no impact on the Type I error or other design concerns. However, while sample size adjustments based on comparing emerging trends in the intervention and control groups were initially discouraged, statistical methodology now allows trialists to adjust the sample size and maintain the Type I error while regaining power [142–164]. It is possible to have a statistically efficient or nearly efficient design if the adaptations are prespecified [154]. While multiple adjustments over the course of follow-up are possible, the biggest gain comes from a single adaptive adjustment.
These methods must be implemented by some individual or party that is aware of the emerging trend. In general, we do not recommend that the monitoring committee perform this function because it may be aware of other factors that would mitigate any sample size increase but cannot share those issues with the trial investigators or sponsors. This can present an awkward if not an ethical dilemma for the monitoring committee. Rather, someone who only knows the emerging trend should make the sample size adjustment recommendation to the investigators. Whatever trend adaptive method is used must also take into account the final analyses as discussed briefly in Chap. 18, because it can affect the final critical value. We will briefly describe a few of these methods [145, 147, 159].
As proposed by Cui et al. [146, 147] for adaptive adjustments in a group sequential setting, suppose we measure an outcome variable denoted as X, where X has a N(0,1) distribution, and let n be the current sample size, N 0 the initial total sample size, N the new target sample size, θ the hypothesized intervention effect, and t = n/N 0. In this case, we have an estimate of the intervention effect and a test statistic based on n observations.
 $$ \widehat{\theta}={\displaystyle \sum_{i=1}^n{x}_i/n},\qquad {Z}^{(n)}={\displaystyle \sum_{i=1}^n{x}_i/\sqrt{n}} $$
We then compute a revised sample size N based on the current trend, assuming the same initial Type I error and desired power. A new test statistic is defined that combines the already observed data and the yet to be obtained data.
 $$ {Z}_W^{(N)}=\sqrt{t}{Z}^{(n)}+\sqrt{1-t}{\left(N-n\right)}^{-\frac{1}{2}}{\displaystyle \sum_{n+1}^N}{x}_{\mathrm{i}} $$
In this setting, we would reject the null hypothesis H 0 of no treatment effect if  $$ {Z}_W^{(N)}>{Z}_{\alpha } $$ . This revised test statistic controls the Type I error at the desired level α. However, less weight is assigned to the new or additional observations, yielding what is referred to as a weighted Z statistic. That is, the weight given to each trial participant prior to any sample size adjustment is greater than weight given to participants after the adjustment, violating a “one participant—one vote” principle. This discounting may not be acceptable for scientific and ethical reasons [144, 164].
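A quick Monte Carlo check illustrates why the weighted statistic controls the Type I error: under H0 its two pieces remain independent normal components with variances t and 1 − t, whatever rule picks the new N. The re-estimation rule below is arbitrary and purely for illustration.

```python
import random
from math import sqrt

# Simulate the weighted (Cui et al. style) statistic under H0 with a
# data-dependent sample size increase; the rejection rate should stay
# near the one-sided alpha = 0.05 despite the adaptation.
random.seed(1)
N0, t = 100, 0.5
n = int(N0 * t)
alpha_crit, reps, rejects = 1.645, 20000, 0
for _ in range(reps):
    first = [random.gauss(0, 1) for _ in range(n)]   # stage 1 under H0
    z_n = sum(first) / sqrt(n)
    # illustrative rule: double N when the interim trend looks weak
    N = N0 * 2 if z_n < 1.0 else N0
    second = [random.gauss(0, 1) for _ in range(N - n)]
    z_w = sqrt(t) * z_n + sqrt(1 - t) * sum(second) / sqrt(N - n)
    rejects += z_w > alpha_crit
rate = rejects / reps  # close to 0.05
```

Replacing z_w with the naive unweighted statistic on all N observations would not enjoy this guarantee, which is exactly the trade-off the weighting makes.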
Other approaches have also been proposed. A modification proposed by Chen et al. [145] requires that both the weighted and un-weighted test statistics exceed the standard critical value.
 $$ {Z}^{(N)}>{Z}_{\alpha}\quad \mathrm{and}\quad {Z}_W^{(N)}>{Z}_{\alpha } $$
In this case, the Type I error is less than α and there is no loss of power. Another approach, an adjusted p value method proposed by Proschan and colleagues [159, 160], requires a “promising” p value before allowing an increase in sample size. However, this approach requires stopping if the first stage p value is not promising. It also requires a larger critical value at the second stage to control the Type I error. As an example, consider a one-sided significance level α = 0.05, which would ordinarily have a critical value of 1.645 for the final test statistic. In this case the promising p values, p′, and the final critical values, Z′, are as follows, regardless of the sample size in the second stage:
p′:   0.10   0.15   0.20   0.25   0.50
Z′:   1.77   1.82   1.85   1.875  1.95
This simple method will control the Type I error but in fact may make the Type I error substantially less than 0.05. A method can be developed to obtain an exact Type I error as a function of Z(t) and the adjusted sample size N, using a conditional power type calculation [127] as described below.
Conditional power, CP, is a useful calculation to assess the likelihood of exceeding a critical value at the scheduled end of a trial, given the current data or value of the interim test statistic, and making assumptions about the future intervention effect, as described earlier in this chapter [67, 126, 128]. The computation of conditional power in this case is relatively simple. Let θ be a function of the intervention effect, as described earlier; then
 $$ CP\left(Z(t),\theta \right)=P\left[Z(1)\ge {Z}_{\alpha}\Big|Z(t),\theta \right]=1-\varPhi \left\{\left[{Z}_{\alpha }-Z(t)\sqrt{t}-\theta \left(1-t\right)\right]/\sqrt{1-t}\right\} $$
Applying the idea of conditional power to the trend adaptive design, we can define an algorithm to adjust the sample size and still control the Type I error [146]. For example,
Let Δ denote the observed effect and δ the assumed effect, and let θ(Δ) and θ(δ) be the corresponding values of θ. Then:
 $$ \begin{array}{cc}\hfill \mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\varDelta \right)\right)>1.2\,\mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\delta \right)\right),\hfill & \hfill \mathrm{decrease}\ N\hfill \\ {}\hfill \mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\varDelta \right)\right)<0.8\,\mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\delta \right)\right),\hfill & \hfill \mathrm{increase}\ N\hfill \end{array} $$
where N is the final targeted sample size. The properties of this procedure have not been well investigated, but the idea is related to other conditional power approaches [153]. These conditional power procedures adjust the sample size if the computed conditional power for the current trend is marginal, with only a trivial impact on Type I error. For example, define a lower limit (c ℓ) and an upper limit (c u) such that for the current trend θ(Δ):
if  $$ \mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\varDelta \right)\right)<{c}_{\mathrm{\ell}} $$ , then terminate for futility and accept the null (required),
if  $$ \mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\varDelta \right)\right)>{c}_{\mathrm{u}} $$ , then continue with no change in sample size, or
if  $$ {c}_{\mathrm{\ell}}<\mathrm{C}\mathrm{P}\left(Z(t),\theta \left(\varDelta \right)\right)<{c}_{\mathrm{u}} $$ , then increase sample size from N 0, to N to get conditional power to the desired level.
Chen et al. [145] suggested a modest alternative: increase the sample size only if the interim result is “promising,” defined as conditional power of 50% or greater for the current trend, with the increase in N 0 capped at 1.75-fold. Under these conditions, the Type I error is not increased and there is no practical loss in power. We favor this approach since it is simple to implement, easy to understand, and preserves the design characteristics.
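A sketch of this promising-zone style rule in Python, assuming a single interim look with Z α = 1.96, estimating the drift from the current trend as θ̂ = Z(t)/√t, and sizing the increase with the usual power formula; the 1.75 cap follows the text, while the other details (function name, target power) are our assumptions.

```python
from math import sqrt, ceil
from statistics import NormalDist

def conditional_power(z_t, t, theta, z_alpha=1.96):
    num = z_alpha - z_t * sqrt(t) - theta * (1 - t)
    return 1 - NormalDist().cdf(num / sqrt(1 - t))

# Promising-zone re-estimation: increase N only when conditional power
# under the current trend exceeds 50%, capping the increase at cap * n0.
def reestimate_n(z_t, t, n0, target_power=0.90, cap=1.75, z_alpha=1.96):
    theta_hat = z_t / sqrt(t)          # drift implied by the current trend
    cp = conditional_power(z_t, t, theta_hat, z_alpha)
    if cp <= 0.50:                     # not "promising": leave N alone
        return n0, cp
    z_beta = NormalDist().inv_cdf(target_power)
    n_new = n0 * ((z_alpha + z_beta) / theta_hat) ** 2
    n_new = max(n0, min(ceil(n_new), ceil(cap * n0)))   # apply the cap
    return n_new, cp
```

For instance, an interim Z of 2.0 at t = 0.5 is promising and triggers a modest increase, whereas a weak trend (Z = 0.5 at t = 0.5) leaves the sample size unchanged.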
Adaptive designs have appeal because the assumptions made during protocol development often fail to hold precisely for the implemented trial, making adjustments useful or even necessary for the study to succeed. However, adaptive designs also rely on assumptions which may prove to be unmet in practice, so that theoretical gains are not necessarily realized. For example, it is often found that the observed event rate is less than expected, or the intervention effect not as great as had been assumed. Tsiatis and Mehta [163] have provided conditions under which a properly designed group sequential trial is more efficient than these adaptive designs, though Mehta has also argued that factors such as allocation of financial and participant resources may be as important as statistical efficiency [157]. In any case, a clear need exists for adaptive designs, including trend adaptive designs. We are fortunate that technical advances have been made through several new methods. Research continues on finding methods which can be applied to different trial settings [143, 150–152, 154–158, 161, 164].
Perhaps the largest challenge is how to implement the trend adaptive design without introducing bias or leaving the door open for bias. If one utilizes one of the described trend adaptive designs, anyone who knows the details of the method can “reverse engineer” the implementation and obtain a reasonable estimate of what the current trend (Z(t)) must have been to generate the adjusted sample size (N). Given that these trend adaptive designs have not yet been widely used, there is not enough experience to recommend what can be done to best minimize bias. However, as suggested earlier, a third party who knows only the emerging trend and none of the other secondary or safety data is probably best suited to make these calculations and provide them to the investigators.
References
1.
Canner PL. Practical Aspects of Decision-Making In Clinical Trials—The Coronary Drug Project as a Case-Study. Control Clin Trials 1981;1:363–376.
2.
DeMets DL. Data monitoring and sequential analysis—An academic perspective. J Acquir Immune Defic Syndr 1990;3:S124–S133.
3.
DeMets DL, Furberg C, Friedman LM. Data monitoring in clinical trials: a case studies approach. New York, NY, Springer, 2006.
4.
Ellenberg SS, Fleming TR, DeMets DL. Data Monitoring Committees in Clinical Trials: A Practical Perspective. Wiley, 2003.
5.
Fisher MR, Roecker EB, DeMets DL. The role of an independent statistical analysis center in the industry-modified National Institutes of Health model. Drug Inf J 2001;35:115–129.
6.
Fleming TR, DeMets DL. Monitoring of clinical trials: issues and recommendations. Control Clin Trials 1993;14:183–197.
7.
Anscombe FJ. Sequential medical trials. J Am Stat Assoc 1963;58:365–383.
8.
Armitage P. Restricted sequential procedures. Biometrika 1957;9–26.
9.
Armitage P, McPherson CK, Rowe BC. Repeated Significance Tests on Accumulating Data. J R Stat Soc Ser A 1969;132:235–244.
10.
Bross I. Sequential medical plans. Biometrics 1952;8:188–205.
11.
Cornfield J. Sequential trials, sequential analysis and the likelihood principle. Am Stat 1966;20:18–23.
12.
DeMets DL, Lan KKG. An Overview of Sequential-Methods and Their Application in Clinical-Trials. Commun Stat Theory Methods 1984;13:2315–2338.
13.
Robbins H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 1952;58:527–535.
14.
Robbins H. Statistical methods related to the law of the iterated logarithm. Ann Math Stat 1970;1397–1409.
15.
Simon R, Weiss GH, Hoel DG. Sequential Analysis of Binomial Clinical Trials. Biometrika 1975;62:195–200.MathSciNetMATH
16.
Wald A. Sequential Analysis. Dover Publications, 2013.
17.
Whitehead J, Stratton I. Group Sequential Clinical Trials with Triangular Continuation Regions. Biometrics 1983;39:227–236.
18.
Whitehead J. The Design and Analysis of Sequential Clinical Trials. Wiley, 1997.
19.
Whitehead J, Jones D. The analysis of sequential clinical trials. Biometrika 1979;66:443–452.
20.
Armitage P. Sequential medical trials, ed 2. New York, Wiley, 1975.
21.
Silverman WA, Agate FJ, Fertig JW. A sequential trial of the nonthermal effect of atmospheric humidity on survival of newborn infants of low birth weight. Pediatrics 1963;31:719–724.
22.
Truelove SC, Watkinson G, Draper G. Comparison of corticosteroid and sulphasalazine therapy in ulcerative colitis. Br Med J 1962;2:1708.
23.
Freireich EJ, Gehan E, Frei E, et al. The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia: A model for evaluation of other potentially useful therapy. Blood 1963;21:699–716.
24.
McPherson CK, Armitage P. Repeated significance tests on accumulating data when the null hypothesis is not true. J R Stat Soc Ser A 1971;15–25.
25.
Chatterjee SK, Sen PK. Nonparametric testing under progressive censoring. Calcutta Statist Assoc Bull 1973;22:13–50.
26.
Dambrosia JM, Greenhouse SW. Early stopping for sequential restricted tests of binomial distributions. Biometrics 1983;695–710.
27.
Whitehead J, Jones DR, Ellis SH. The analysis of a sequential clinical trial for the comparison of two lung cancer treatments. Statist Med 1983;2:183–190.
28.
Breslow NE, Haug C. Sequential comparison of exponential survival curves. J Am Stat Assoc 1972;67:691–697.
29.
Canner PL. Monitoring treatment differences in long-term clinical trials. Biometrics 1977;603–615.
30.
Davis CE. A two sample Wilcoxon test for progressively censored data. Commun Stat Theory Methods 1978;7:389–398.
31.
Joe H, Koziol JA, Petkau AJ. Comparison of procedures for testing the equality of survival distributions. Biometrics 1981;327–340.
32.
Jones D, Whitehead J. Sequential forms of the log rank and modified Wilcoxon tests for censored data. Biometrika 1979;66:105–113.
33.
Koziol JA, Petkau AJ. Sequential testing of the equality of two survival distributions using the modified Savage statistic. Biometrika 1978;65:615–623.
34.
Muenz LR, Green SB, Byar DP. Applications of the Mantel-Haenszel statistic to the comparison of survival distributions. Biometrics 1977;617–626.
35.
Nagelkerke NJD, Hart AAM. The sequential comparison of survival curves. Biometrika 1980;67:247–249.
36.
Sellke T, Siegmund D. Sequential analysis of the proportional hazards model. Biometrika 1983;70:315–326.
37.
Haybittle JL. Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol 1971;44:793–797.
38.
Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585.
39.
Pocock SJ. Group Sequential Methods in Design and Analysis of Clinical-Trials. Biometrika 1977;64:191–200.
40.
Pocock SJ. Size of cancer clinical trials and stopping rules. Br J Cancer 1978;38:757.
41.
Pocock SJ. Interim analyses for randomized clinical trials: the group sequential approach. Biometrics 1982;38:153–162.
42.
O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979;35:549–556.
43.
DeMets DL. Practical aspects in data monitoring: a brief review. Stat Med 1987;6:753-760.
44.
Emerson SS, Fleming TR. Interim analyses in clinical trials. Oncology (Williston Park, NY) 1990;4:126.
45.
Fleming TR, Watelet LF. Approaches to monitoring clinical trials. J Natl Cancer Inst 1989;81:188–193.
46.
Freedman LS, Lowe D, Macaskill P. Stopping rules for clinical trials. Statist Med 1983;2:167–174.
47.
Jennison C, Turnbull BW. Statistical approaches to interim monitoring of medical trials: a review and commentary. Stat Sci 1990;299–317.
48.
Gail MH, DeMets DL, Slud EV. Simulation studies on increments of the two-sample logrank score test for survival time data, with application to group sequential boundaries. Lecture Notes-Monograph Series 1982;2:287–301.
49.
Harrington DP, Fleming TR, Green SJ. Procedures for serial testing in censored survival data; in Crowley J, Johnson RA, Gupta SS (eds): Survival Analysis. Hayward, CA, Institute of Mathematical Statistics, 1982, pp 269–286.
50.
Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer chemotherapy reports Part 1 1966;50:163–170.
51.
Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika 1981;68:311–315.
52.
Tsiatis AA. Group sequential methods for survival analysis with staggered entry. Lecture Notes-Monograph Series 1982;2:257–268.
53.
Tsiatis AA. Repeated Significance Testing for a General Class of Statistics Used in Censored Survival Analysis. J Am Stat Assoc 1982;77:855–861.
54.
Gehan EA. A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples. Biometrika 1965;52:203–223.
55.
Slud E, Wei LJ. Two-Sample Repeated Significance Tests Based on the Modified Wilcoxon Statistic. J Am Stat Assoc 1982;77:862–868.
56.
Peto R, Peto J. Asymptotically Efficient Rank Invariant Test Procedures. J R Stat Soc Ser A 1972;135:185–207.
57.
DeMets DL, Gail MH. Use of logrank tests and group sequential methods at fixed calendar times. Biometrics 1985;41:1039–1044.
58.
George SL. Sequential methods based on the boundaries approach for the clinical comparison of survival times: Discussion. Statist Med 1994;13:1369–1370.
59.
Kim K, Tsiatis AA. Study Duration for Clinical Trials with Survival Response and Early Stopping Rule. Biometrics 1990;46:81–92.
60.
Kim K. Study duration for group sequential clinical trials with censored survival data adjusting for stratification. Statist Med 1992;11:1477–1488.
61.
Whitehead J. Sequential methods based on the boundaries approach for the clinical comparison of survival times. Statist Med 1994;13:1357–1368.
62.
Beta-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction: I. mortality results. JAMA 1982;247:1707–1714.
63.
DeMets DL, Hardy R, Friedman LM, Gordon Lan KK. Statistical aspects of early termination in the Beta-Blocker Heart Attack Trial. Control Clin Trials 1984;5:362–372.
64.
DeMets DL, Lan KK. Interim analysis: the alpha spending function approach. Stat Med 1994;13:1341–1352.
65.
Kim K, DeMets DL. Design and Analysis of Group Sequential Tests Based on the Type I Error Spending Rate Function. Biometrika 1987;74:149–154.
66.
Lan KKG, Rosenberger WF, Lachin JM. Use of spending functions for occasional or continuous monitoring of data in clinical trials. Stat Med 1993;12:2219–2231.
67.
Lan KKG, Zucker DM. Sequential monitoring of clinical trials: the role of information and Brownian motion. Stat Med 1993;12:753–765.
68.
Lan KKG, DeMets DL. Discrete Sequential Boundaries for Clinical Trials. Biometrika 1983;70:659–663.
69.
Lan KKG, DeMets DL, Halperin M. More Flexible Sequential and Non-Sequential Designs in Long-Term Clinical Trials. Commun Stat Theory Methods 1984;13:2339–2353.
70.
Lan KKG, Reboussin DM, DeMets DL. Information and Information Fractions for Design and Sequential Monitoring of Clinical Trials. Commun Stat Theory Methods 1994;23:403–420.
71.
Lan KKG, DeMets D. Group sequential procedures: calendar versus information time. Stat Med 1989;8:1191–1198.
72.
Reboussin DM, DeMets DL, Kim K, Lan KKG. Lan-DeMets Method: Statistical Programs for Clinical Trials, version 2.1, 2003.
73.
Reboussin DM, DeMets DL, Kim K, Lan KKG. Computations for group sequential boundaries using the Lan-DeMets spending function method. Control Clin Trials 2000;21:190–207.
74.
DeMets DL. Futility approaches to interim monitoring by data monitoring committees. Clin Trials 2006;3:522–529.
75.
Hwang IK, Shih WJ, De Cani JS. Group sequential designs using a family of type I error probability spending functions. Statist Med 1990;9:1439–1445.
76.
Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987;193–199.
77.
Lan KKG, DeMets DL. Changing frequency of interim analysis in sequential monitoring. Biometrics 1989;45:1017–1020.
78.
Proschan MA, Follmann DA, Waclawiw MA. Effects of assumption violations on type I error rate in group sequential monitoring. Biometrics 1992;1131–1143.
79.
Geller NL. Discussion of “Interim analysis: the alpha spending approach”. Statist Med 1994;13:1353–1356.
80.
Falissard B, Lellouch J. A new procedure for group sequential analysis in clinical trials. Biometrics 1992;373–388.
81.
Lan KKG, Lachin JM. Implementation of group sequential logrank tests in a maximum duration trial. Biometrics 1990;46:759–770.
82.
Li ZQ, Geller NL. On the Choice of Times for Data Analysis in Group Sequential Clinical Trials. Biometrics 1991;47:745–750.
83.
Jennison C, Turnbull BW. Group-sequential analysis incorporating covariate information. J Am Stat Assoc 1997;92:1330–1341.
84.
Scharfstein DO, Tsiatis AA, Robins JM. Semiparametric Efficiency and Its Implication on the Design and Analysis of Group-Sequential Studies. J Am Stat Assoc 1997;92:1342–1350.
85.
Kim K, DeMets DL. Sample size determination for group sequential clinical trials with immediate response. Stat Med 1992;11:1391–1399.
86.
Lee JW, DeMets DL. Sequential Comparison of Changes with Repeated Measurements Data. J Am Stat Assoc 1991;86:757–762.
87.
Lee JW, DeMets DL. Sequential Rank Tests with Repeated Measurements in Clinical Trials. J Am Stat Assoc 1992;87:136–142.
88.
Lee JW. Group sequential testing in clinical trials with multivariate observations: a review. Statist Med 1994;13:101–111.
89.
Su JQ, Lachin JM. Group Sequential Distribution-Free Methods for the Analysis of Multivariate Observations. Biometrics 1992;48:1033–1042.
90.
Wei LJ, Su JQ, Lachin JM. Interim Analyses with Repeated Measurements in a Sequential Clinical Trial. Biometrika 1990;77:359–364.
91.
Wu MC, Lan KKG. Sequential Monitoring for Comparison of Changes in a Response Variable in Clinical Studies. Biometrics 1992;48:765–779.
92.
Gange SJ, DeMets DL. Sequential monitoring of clinical trials with correlated responses. Biometrika 1996;83:157–167.
93.
Fairbanks K, Madsen R. P values for tests using a repeated significance test design. Biometrika 1982;69:69–74.
94.
Chang MN, O’Brien PC. Confidence intervals following group sequential tests. Control Clin Trials 1986;7:18–26.
95.
DeMets DL, Lan KKG. Discussion of: Interim analyses: The repeated confidence interval approach by C. Jennison and BW Turnbull. J R Stat Soc Series B Stat Methodol 1989;51:344.
96.
Emerson SS, Fleming TR. Parameter Estimation Following Group Sequential Hypothesis Testing. Biometrika 1990;77:875–892.
97.
Hughes MD, Pocock SJ. Stopping rules and estimation problems in clinical trials. Stat Med 1988;7:1231–1242.
98.
Jennison C, Turnbull BW. Repeated Confidence-Intervals for Group Sequential Clinical-Trials. Control Clin Trials 1984;5:33–45.
99.
Jennison C, Turnbull BW. Interim Analyses: The Repeated Confidence Interval Approach. J R Stat Soc Series B Stat Methodol 1989;51:305–361.
100.
Kim K, DeMets DL. Confidence Intervals Following Group Sequential Tests in Clinical Trials. Biometrics 1987;43:857–864.
101.
Kim K. Point Estimation Following Group Sequential Tests. Biometrics 1989;45:613–617.
102.
Pocock SJ, Hughes MD. Practical problems in interim analyses, with particular regard to estimation. Control Clin Trials 1989;10:209S–221S.
103.
Rosner GL, Tsiatis AA. Exact confidence intervals following a group sequential trial: A comparison of methods. Biometrika 1988;75:723–729.
104.
Siegmund D. Estimation following sequential tests. Biometrika 1978;65:341–349.
105.
Tsiatis AA, Rosner GL, Mehta CR. Exact confidence intervals following a group sequential test. Biometrics 1984;797–803.
106.
Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986;73:573–581.
107.
Whitehead J, Facey KM. Analysis after a sequential trial: A comparison of orderings of the sample space. Joint Society for Clinical Trials/International Society for Clinical Biostatistics, Brussels 1991.
108.
Fleming TR. Treatment evaluation in active control studies. Cancer Treat Rep 1987;71:1061–1065.
109.
Fleming TR. Evaluation of active control trials in AIDS. J Acquir Immune Defic Syndr 1990;3:S82–S87.
110.
DeMets DL, Ware JH. Group Sequential Methods for Clinical Trials with a One-Sided Hypothesis. Biometrika 1980;67:651–660.
111.
DeMets DL, Ware JH. Asymmetric Group Sequential Boundaries for Monitoring Clinical Trials. Biometrika 1982;69:661–663.
112.
Emerson SS, Fleming TR. Symmetric Group Sequential Test Designs. Biometrics 1989;45:905–923.
113.
Gould AL, Pecore VJ. Group sequential methods for clinical trials allowing early acceptance of Ho and incorporating costs. Biometrika 1982;69:75–80.
114.
DeMets DL, Pocock SJ, Julian DG. The agonising negative trend in monitoring of clinical trials. Lancet 1999;354:1983–1988.
115.
Cardiac Arrhythmia Suppression Trial Investigators. Effect of the antiarrhythmic agent moricizine on survival after myocardial infarction. N Engl J Med 1992;327:227–233.
116.
Friedman LM, Bristow JD, Hallstrom A, et al. Data monitoring in the cardiac arrhythmia suppression trial. Online Journal of Current Clinical Trials 1993;79.
117.
Pawitan Y, Hallstrom A. Statistical interim monitoring of the cardiac arrhythmia suppression trial. Statist Med 1990;9:1081–1090.
118.
Feyzi J, Julian DG, Wikstrand J, Wedel H. Data monitoring experience in the Metoprolol CR/XL randomized intervention trial in chronic heart failure: Potentially high-risk treatment in high-risk patients; in Data Monitoring in Clinical Trials: Springer, 2006, pp 136–147.
119.
Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Taylor & Francis, 1999.
120.
Proschan MA, Lan KKG, Wittes JT. Statistical Monitoring of Clinical Trials: A Unified Approach. Springer New York, 2006.
121.
Alling DW. Early decision in the Wilcoxon two-sample test. J Am Stat Assoc 1963;58:713–720.
122.
Alling DW. Closed sequential tests for binomial probabilities. Biometrika 1966;73–84.
123.
Halperin M, Ware J. Early decision in a censored Wilcoxon two-sample test for accumulating survival data. J Am Stat Assoc 1974;69:414–422.
124.
DeMets DL, Halperin M. Early stopping in the two-sample problem for bounded random variables. Control Clin Trials 1982;3:1–11.
125.
Canner PL. Monitoring of the data for evidence of adverse or beneficial treatment effects. Control Clin Trials 1983;4:467–483.
126.
Lan KKG, Simon R, Halperin M. Stochastically curtailed tests in long-term clinical trials. Seq Anal 1982;1:207–219.
127.
Halperin M, Gordon Lan KK, Ware JH, et al. An aid to data monitoring in long-term clinical trials. Control Clin Trials 1982;3:311–323.
128.
Lan KKG, Wittes J. The B-value: a tool for monitoring data. Biometrics 1988;44:579–585.
129.
Cohn JN, Goldstein SO, Greenberg BH, et al. A dose-dependent increase in mortality with vesnarinone among patients with severe heart failure. N Engl J Med 1998;339:1810–1816.
130.
Colton T. A model for selecting one of two medical treatments. J Am Stat Assoc 1963;58:388–400.
131.
Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis, Second Edition. Taylor & Francis, 2000.
132.
Choi SC, Pepple PA. Monitoring Clinical Trials Based on Predictive Probability of Significance. Biometrics 1989;45:317–323.
133.
Cornfield J. A Bayesian test of some classical hypotheses—with applications to sequential clinical trials. J Am Stat Assoc 1966;61:577–594.
134.
Cornfield J. Recent methodological contributions to clinical trials. Am J Epidemiol 1976;104:408–421.
135.
Freedman LS, Spiegelhalter DJ, Parmar MK. The what, why and how of Bayesian clinical trials monitoring. Statist Med 1994;13:1371–1383.
136.
George SL, Li C, Berry DA, Green MR. Stopping a clinical trial early: Frequentist and bayesian approaches applied to a CALGB trial in non-small-cell lung cancer. Statist Med 1994;13:1313–1327.
137.
Grieve AP, Choi SC, Pepple PA. Predictive Probability in Clinical Trials. Biometrics 1991;47:323–330.
138.
Machin D. Discussion of “The what, why and how of Bayesian clinical trials monitoring”. Statist Med 1994;13:1385–1389.
139.
Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: Conditional or predictive power? Control Clin Trials 1986;7:8–17.
140.
Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Statist Med 1986;5:421–433.
141.
Lau J, Antman EM, Jimenez-Silva J, et al. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med 1992;327:248–254.
142.
Bauer P, Kohne K. Evaluation of Experiments with Adaptive Interim Analyses. Biometrics 1994;50:1029–1041.
143.
Berry DA. Adaptive clinical trials: the promise and the caution. J Clin Oncol 2011;29:606–609.
144.
Burman CF, Sonesson C. Are Flexible Designs Sound? Biometrics 2006;62:664–669.
145.
Chen YH, DeMets DL, Lan KK. Increasing the sample size when the unblinded interim result is promising. Stat Med 2004;23:1023–1038.
146.
Cui L, Hung HMJ, Wang SJ. Impact of changing sample size in a group sequential clinical trial. Proceedings of the Biopharmaceutical Section, American Statistical Association 1997;52–57.
147.
Cui L, Hung HMJ, Wang SJ. Modification of Sample Size in Group Sequential Clinical Trials. Biometrics 1999;55:853–857.
148.
Fisher LD. Self-designing clinical trials. Statist Med 1998;17:1551–1562.
149.
Fleming TR. Standard versus adaptive monitoring procedures: a commentary. Statist Med 2006;25:3305–3312.
150.
Hu F, Zhang LX, He X. Efficient randomized-adaptive designs. Ann Stat 2009;2543–2560.
151.
Hung HMJ, Wang SJ. Sample Size Adaptation in Fixed-Dose Combination Drug Trial. J Biopharm Stat 2012;22:679–686.
152.
Irle S, Schafer H. Interim design modifications in time-to-event studies. J Am Stat Assoc 2012;107:341–348.
153.
Lan KKG, Trost DC. Estimation of parameters and sample size re-estimation. Proceedings of the Biopharmaceutical Section, American Statistical Association 1997;48–51.
154.
Levin GP, Emerson SC, Emerson SS. Adaptive clinical trial designs with pre-specified rules for modifying the sample size: understanding efficient types of adaptation. Statist Med 2013;32:1259–1275.
155.
Lui KJ. Sample size determination under an exponential model in the presence of a confounder and type I censoring. Control Clin Trials 1992;13:446–458.
156.
Luo X, Li M, Shih WJ, Ouyang P. Estimation of Treatment Effect Following a Clinical Trial with Adaptive Design. J Biopharm Stat 2012;22:700–718.
157.
Mehta CR. Adaptive clinical trial designs with pre-specified rules for modifying the sample size: a different perspective. Statist Med 2013;32:1276–1279.
158.
Posch M, Proschan MA. Unplanned adaptations before breaking the blind. Statist Med 2012;31:4146–4153.
159.
Proschan MA, Hunsberger SA. Designed Extension of Studies Based on Conditional Power. Biometrics 1995;51:1315–1324.
160.
Proschan MA, Liu Q, Hunsberger S. Practical midcourse sample size modification in clinical trials. Control Clin Trials 2003;24:4–15.
161.
Proschan MA. Sample size re-estimation in clinical trials. Biom J 2009;51:348–357.
162.
Shen Y, Fisher LD. Statistical Inference for Self-Designing Clinical Trials with a One-Sided Hypothesis. Biometrics 1999;55:190–197.
163.
Tsiatis AA, Mehta C. On the Inefficiency of the Adaptive Design for Monitoring Clinical Trials. Biometrika 2003;90:367–378.
164.
van der Graaf R, Roes KC, van Delden JJ. Adaptive trials in clinical research: scientific and ethical issues to consider. JAMA 2012;307:2379–2380.