Weighting strategies in the meta-analysis of single-case studies

  • Published: 01 February 2014
  • Volume 46 , pages 1152–1166, ( 2014 )

Cite this article

  • Rumen Manolov 1 , 2 , 4 ,
  • Georgina Guilera 2 , 3 &
  • Vicenta Sierra 1  

1369 Accesses

9 Citations

2 Altmetric

Explore all metrics

Establishing the evidence base of interventions taking place in areas such as psychology and special education is one of the research aims of single-case designs, in conjunction with the aim of improving the well-being of participants in the studies. The scientific criteria for solid evidence focus on the internal and external validity of the studies, and for both types of validity, replicating studies and integrating the results of these replications (i.e., meta-analyzing) is crucial. In the present study, we deal with one of the aspects of meta-analysis—namely, the weighting strategy used when computing an average effect size across studies. Several weighting strategies suggested for single-case designs are discussed and compared in the context of both simulated and real-life data. The results indicated that there are no major differences between the strategies, and thus, we consider that it is important to choose weights with a sound statistical and methodological basis, while scientific parsimony is another relevant criterion. More empirical research and conceptual discussion are warranted regarding the optimal weighting strategy in single-case designs, alongside investigation of the optimal effect size measure in these types of designs.

Similar content being viewed by others

single case research and meta analysis

The Power to Explain Variability in Intervention Effectiveness in Single-Case Research Using Hierarchical Linear Modeling

Mariola Moeyaert, Panpan Yang & Xinyun Xu

single case research and meta analysis

A Priori Justification for Effect Measures in Single-Case Experimental Designs

Rumen Manolov, Mariola Moeyaert & Joelle E. Fingerhut

Critical Assumptions and Distribution Features Pertaining to Contemporary Single-Case Effect Sizes

Benjamin G. Solomon, Taylor K. Howard & Brit’ny L. Stein

Avoid common mistakes on your manuscript.

The evidence-based movement has now been salient for several years in a variety of disciplines, including psychology (APA Presidential Task Force on Evidence-Based Practice, 2006 ), medicine (Sackett, Rosenberg, Gray, Hayness, & Richardson, 1996 ), and special education (Odom et al., 2005 ). In this context, single-case designs (SCDs) Footnote 1 have been considered one of the viable options for obtaining evidence that will serve as a support for interventions and practices (Horner et al., 2005 ; Schlosser, 2009 ). Accordingly, randomized single-case trials have been included in the new version of the classification elaborated by the Oxford Centre for Evidence-Based Medicine regarding the methodologies providing solid evidence (Howick et al., 2011 ). Thus, it is clear that one of the ways of improving methodological rigor and scientific credibility is by incorporating randomization into the design (Kratochwill & Levin, 2010 ), given the importance of demonstrating causal relations (Lane & Carter, 2013 ). Demonstrating cause–effect relations is central to SCDs, provided that they are “experimental” in essence (Kratochwill et al., 2013 ; Sidman, 1960 ), and, apart from using random assignment of conditions to measurement times, it is also favored by replication of the behavioral change contiguous with the change in conditions (Kratochwill et al., 2013 ; Wolery, 2013 ). On the other hand, replication is also related to generalization (Sidman, 1960 ), which benefits from research synthesis and meta-analysis. In that sense, the evidence-based movement has also paid attention to the meta-analytical integration of replications or studies on the same topic (Beretvas & Chung, 2008b ; Jenson, Clark, Kircher, & Kristjansson, 2007 ). The quantitative integration is deemed especially useful when moderator variables are included in the meta-analyses (Burns, 2012 ; Wolery, 2013 ). Finally, it has been stressed that meta-analysis and the assessment of internal and external validity should not be considered separately (Burns, 2012 ), given that the assessment of the methodological quality of a study is an essential part of the process of carrying out research syntheses (Cooper, 2010 ; Littell, Corcoran, & Pillai, 2008 ; What Works Clearinghouse, 2008 )—for instance, using the methodological quality scale in SCD (Tate et al., 2013 ) or the Study DIAD (Valentine & Cooper, 2008 ) as more general tools.

Despite the current prominence of hierarchical linear models (Gage & Lewis, 2012 ; Owens & Ferron, 2012 ), more research and debate is needed regarding the optimal way in which research synthesis ought to take place in the context of SCDs (Lane & Carter, 2013 ; Maggin & Chafouleas, 2013 ). The present study represents an effort to discuss and obtain evidence regarding the meta-analysis of single-case studies; its focus is on weighting strategies, rather than on the effect size measures that summarize the results. In that sense, it should be stressed that we do not advocate here for or against specific procedures for SCD data analysis. We consider that, while the debate on the optimal analytical techniques is still ongoing, the methodological and statistical progress in SCDs will benefit from parallel research on the meta-analysis of SCD data. That is, it seems reasonable to try to solve the issue of how to combine the effect sizes from multiple studies, while also dealing with the question of which effect size measure is optimal, especially given that meta-analyses of SCD data are already taking place.

The purpose of the present study was to extend existing research on the meta-analysis of single-case data, focusing on weighting strategies. After discussing the different weights suggested, a comparison is performed to explore whether the choice of a weighting strategy is critical. One of the weighting strategies studied is a proposal made here, based on considering baseline length and variability together.

The comparison was carried out in two different contexts. We used data with known characteristics (i.e., simulation) in order to study the influence of baseline and series length, data variability, serial dependence, and trend. Simulation has already been used to compare weighting strategies in the context of group designs (e.g., Marín-Martínez & Sánchez-Meca, 2010 ) and in SCDs (e.g., Van den Noortgate & Onghena, 2003a ). Additionally, we applied the weighting strategies to real data sets already meta-analyzed in a previously published study (Burns, Zaslofsky, Kanive, & Parker, 2012 ).

Weighting strategies

Weighting the individual studies’ effect sizes is an inherent part of meta-analysis. When choosing a weighting strategy, two aspects need to be taken into account: their underlying rationale and their performance. Regarding the former aspect, in group designs, the variance of the effect size index is considered optimal (Hedges & Olkin, 1985 ; Whitlock, 2005 ), given that it quantifies the precision of the summary measure and is, thus, related to the confidence that a researcher can have in the effect size value obtained. However, the choice of an effect size index is not as straightforward in SCDs as it is in group designs. Moreover, the variance has not been derived for all effect size indices (see Hedges, Pustejovsky, & Shadish, 2012 , for an example of the complexities related to deriving the variance of a standardized mean difference). Finally, deriving the variance of the effect size index involves assumptions such as those mentioned in the Data Analysis subsection for the indices included in this study. More discussion is necessary in the SCD context on whether the same weighting strategy should be considered optimal, although such practice has been recommended (Beretvas & Chung, 2008b ).

Other suggested weighting strategies also relate to the degree to which a summary measure is representative of the real level of behavior. On the one hand, greater data variability means that a summary measure represents all the data less well; Parker and Vannest ( 2012 ) suggested the inverse of data variability as a possible weight. On the other hand, when a summary measure is obtained from a longer series, the researcher can be more confident that the data gathered represent the actual (change in) behavior well and that the effects are not only temporary. Accordingly, Horner and Kratochwill ( 2012 ) and Kratochwill et al. ( 2010 ) mentioned the possibility of using series length as a weight, although its appropriateness is not beyond doubt (Kratochwill et al., 2010 ; Shadish, Rindskopf, & Hedges, 2008 ). For instance, multiple probe designs (unlike multiple baseline designs) are specifically intended to produce fewer baseline phase measurements, when the preintervention level is stable or in the specific case of zero frequency of the behavior to be learned (Gast & Ledford, 2010 ). In the case of multiple probe designs, the aim is to reduce the unethical withholding of a potentially helpful intervention. Moreover, the intervention phase measurements are continuous only until a criterion is reached. Thus, studies using this design structure might be (unfairly) penalized (i.e., treated as quantitatively less important) by weighting strategies based on baseline or series length.

Another possible weight related to the amount of information available is the number of participants in a study, suggested by Kratochwill et al. ( 2010 , 2013 ) and used, for instance, by Burns ( 2012 ). Nonetheless, its proponents (Kratochwill et al., 2010 ) state that there is no “strong statistical justification” (p. 24) for its use. Finally, using unweighted averages has also been considered (Kratochwill et al., 2010 , 2013 ) and appears to be a common practice (Schlosser, Lee, & Wendt, 2008 ).

The proposal we make here is that, when considering the importance of data variability and the number of measurements available, the focus should be on the baseline, consistent with the attention paid to it by applied researchers and methodologists. In SCDs, this phase is used for gathering information on the initial situation and is necessary for establishing a criterion against which the effectiveness of a treatment is evaluated. On the one hand, longer baselines show more clearly what the preintervention level of behavior is, and this level (including any existing trends) can be projected with a greater degree of confidence into the treatment phases and compared with the actual measurements. Baseline length is explicitly mentioned in several SCD appraisal tools (Wendt & Miller, 2012 ), with a minimum of five measurements for a study to receive a high score in the standards elaborated by the What Works Clearinghouse team (Kratochwill et al., 2010 ) and in the methodological quality scale for SCDs (Tate et al., 2013 ).

On the other hand, baseline stability is critical for any further assessment of intervention effectiveness (Kazdin, 2001 ; Kratochwill et al., 2010 ; Smith, 2012 ), given that consistent responding is key to predicting how the behavior would continue in the absence of intervention (Horner et al., 2005 ). Finally, the focus on the baseline, rather than on the whole series, is warranted, given that if the data series are considered as a whole, any potential effect will introduce variability, since the preintervention and the postintervention measurements will not share the same level or trend. Thus, whole-series variability is not an appropriate weight, given that it is confounded with intervention effectiveness. Besides the justification of the weight chosen, it is relevant to explore the effect of using different weights when integrating SCD studies, and this is dealt with in the remainder of the article.

A comparison of weighting strategies: simulation study

Data generation: design.

The simulation study presented here is based on multiple baseline designs (MBDs) for three reasons. First, previous reviews (Hammond & Gast, 2010 ; Shadish & Sullivan, 2011 ; Smith, 2012 ) suggest that this is the SCD structure used with greatest frequency in published studies (around 50 % in the former two and 69 % in the latter). Second, in the meta-analysis carried out by Burns et al. ( 2012 ) (and rerun here), most of the studies included in the quantitative integration are MBD. Third, MBDs meet the replication criteria suggested by Kratochwill et al. ( 2013 ) for designs allowing solid scientific evidence to be obtained. Subsequent quantifications are based on the idea that the comparisons should be made between adjacent phases (Gast & Spriggs, 2010 ; Parker & Vannest, 2012 )—that is, within each of the three tiers simulated—and, afterward, that averages are obtained across tiers.

Data generation: model and data features

Data were generated using Monte Carlo methods via the following model, presented by Huitema and McKean ( 2000 ), and were used previously in other SCD simulation studies (e.g., Beretvas & Chung, 2008b ; Ferron & Sentovich, 2002 ; Ugille et al., 2012 ):

The following variables are used in the model: T refers to time, taking the values 1, 2, . . . , n A + n B (where the latter are the phase lengths), D is a dummy variable reflecting the phase (0 for baseline and 1 for intervention) and used for modeling level change, whereas the interaction between D and T models slope change. In this model, serial dependence can be specified via the first-order autoregressive model for the error term ε t = φ 1 ∙ ε t –1 + u t , with φ 1 being set to 0 (independent data), .3, or .6, and u t being a normally distributed random disturbance. These autocorrelation values cover those reported by Shadish and Sullivan ( 2011 ) for 531 MBD studies reviewed: A random effects meta-analytic mean of these autocorrelations was .145, which, when corrected for bias, was equal to .320. In order to cover a greater range of possibilities, in some conditions, the degree of autocorrelation was homogeneous for the whole series, whereas in others there was nonzero autocorrelation only for the baseline data (see Fig.  1 for a graphical representation of the experimental conditions of the simulation study).

Experimental conditions included in the Monte Carlo study. Level change: β 2 = 11 or 26 and β 2 = 16.5 or 39 for small-effect-size (Small ES) studies and large-effect-size (Large ES) studies, respectively, according to the metric (0–30 or 0–100). Slope change: β 3 = 1 or 3 and β 3 = 1.5 or 4.5 for small-effect-size (Small ES) studies and large-effect-size (Large ES) studies, respectively, according to the metric (0–30 or 0–100)

Regarding the remaining simulation parameters ( β 0 , β 1 , β 2 , and β 3 ), we wanted their selection to be based on the characteristics of real behavioral data, rather than selecting completely arbitrary values. Therefore, we focused on the studies included in the Burns et al. ( 2012 ) meta-analysis. Nevertheless, we are aware that any selection of parameters is necessarily limited. In order to make the simulation study match more closely to real situations, we chose to include two different metrics, one representing the percentage of time intervals on task (as in Beck Burns, & Lau, 2009 ), a metric varying from 0 to 100, and another one representing the number of digits correct per minute (ranging up to 30 in Burns, 2005 ). On the basis of the data in these two studies, we also chose the baseline level β 0 (set to 40 and 7, respectively) and the standard deviation of the random normal disturbance u t with zero mean (set to 7 and 3, respectively). The level change parameter β 2 was set to 26 and 11 for the percentage and the count metrics, respectively, on the basis of the effects found in the abovementioned studies. The slope change parameter β 3 was set to 1 for the 0–30 metric, approximately equal to the difference in slopes in the Burns ( 2005 ) data, whereas for the 0–100 metric it was set to 3 in order to represent roughly the ratio between the scales (100:30 ≈ 3:1). Finally, baseline trend ( β 1 ) was set to 0 in the reference condition. In the conditions with change in slope, β 1 was set to 1 for the 0–30 metric, given that in the only MBD tier of the Burns ( 2005 ) study in which there was some indication of baseline trend (for student 2), the ordinary least squares slope coefficient was equal to 1.1; analogously, β 1 was set to 3 for the 0–100 metric.

Table  1 contains these simulation parameters, as well as the standardized change in level ( β 2 ) and change in slope ( β 3 ) effects for the different conditions. Standardizing shows that the effect sizes for the two metrics are very similar, for both change in level and change in slope. For slope change, Table  1 includes the corresponding mean difference between phases: Since β 3 represents the increment between two successive points in the treatment phase, the average change between phases can be expressed as \( {\displaystyle \sum_{i=0}^{n_B-1}i{\beta}_3}/{n}_B \) , where n B is either 5 or 10.

Data generation: phase lengths

Using the model presented above, 10 three-tier MBD data sets ( k = 10) were simulated for each iteration and later integrated quantitatively. In previous simulation studies related to single-case meta-analysis (Owens & Ferron, 2012 ; Ugille et al., 2012 ), k = 10 was also one of the conditions studied. However, given that, in those studies, the estimation of effects was the object, k was more relevant than in the present study where weighting strategies are being compared.

The basic MBD data set, used as a reference, contained 20 measurements ( n A = n B = 10) in each tier, following two pieces of evidence. On the one hand, Shadish and Sullivan ( 2011 ) reported that the median and modal data points in the SCD studies included in their review was 20. On the other hand, Smith ( 2012 ) reported a mean of 10.4 baseline data points in MBD, which is consistent with the Shadish and Sullivan data that 54.7 % of the SCDs had five or more points in the first baseline.

Each generation of 10 studies and posterior meta-analytical integration was iterated 1,000 times using R (R Core Team, 2013 ), and thus, 1,000 weighted averages were obtained for each weighting strategy and each experimental condition (i.e., for each combination of phase lengths, type of effect, data variability, degree of serial dependence, and trend).

Data generation: additional conditions for studying the effect of data variability and phase length

In the simulation study, we wanted to explore the effect of data variability and phase lengths as potentially important factors for the weighting strategies (see Fig.  1 ). In order to study how more variability or more data points affect the weighted average, it was necessary to set different effect sizes in the different studies being integrated. Footnote 2 We decided that half of the k = 10 studies should have the effect previously presented ( β 2 = 11 and β 3 = 1 for the 0–30 metric, β 2 = 26 and β 3 = 3 for the 0–100 metric), whereas for the other half, the effects were multiplied by the arbitrarily chosen value of 1.5 (thus, β 2 = 16.5 and β 3 = 1.5 for the 0–30 metric, β 2 = 39 and β 3 = 4.5 for the 0–100 metric). The effects and their standardized versions are available in Table  1 .

In order to study the effect of data variability, we doubled the standard deviation of the random normal disturbance u t to 6 (for the 0–30 metric) and to 14 (for the 0–100 metric) for the five studies with larger effects. Thus, we expected the weighted average to decrease. It should be stressed that with the simulation parameters specified in this way, the simulated data were expected to be generally within the range of possible values, for both metrics. Footnote 3 The standardized values in Table  1 are computed, on the one hand, considering the variability in the reference condition and, on the other hand, for the conditions with greater variability.

To study the effect of phase lengths, we divided by two the number of data points in the baseline ( n A = 5) or in the whole MBD tier ( n A = n B = 10) for the studies with larger effects, expecting once again a reduction in the weighted average. Note that the multiplication factor was the same as when studying the effect of data variability, given that the aim was to be able to compare the changes in the weighted averages as a result of the smaller-effect-size studies containing more measurements or presenting lower variability.

Data analysis: effect size measures

Our choice of effect size measures to include in the present study was based on two criteria: knowledge of the expression of the index variance (under certain assumptions) and actual use in SCDs. Given the considerable lack of consensus on which is the most appropriate effect size measure (Burns, 2012 ; Kratochwill et al., 2013 ; Smith, 2012 ), we are aware that any choice of an analytical technique can be criticized, and, in the following, we explain our choice for this particular study, although we do not claim that the measures included here are always the most appropriate ones.

In the review of single-case meta-analyses performed by Beretvas and Chung ( 2008b ), the percentage of nonoverlapping data (PND; Scruggs, Mastropieri, & Casto, 1987 ) and the standardized mean difference were the most frequently used procedures for meta-analyzing single-case data. Taking this into account, we chose two effect size measures for inclusion.

First, for the nonoverlap measure, we chose the nonoverlap of all pairs (NAP; Parker & Vannest, 2009 ), rather than the PND, for several reasons, despite the fact that the PND has a long history of use and its quantifications have been validated by the researcher’s judgments on which interventions are effective (Scruggs & Mastropieri, 2013 ), apart from the agreement with visual analysis in the absence of an effect (Wolery, Busick, Reichow, & Barton, 2010 ). The reasons for preferring the NAP are the following: (1) It does not depend on a single extreme baseline measure; (2) in simulation studies, the NAP has also been shown to perform well in presence of autocorrelation (Manolov, Solanas, Sierra, & Evans, 2011 ), in contrast with the PND (Manolov, Solanas, & Leiva, 2010 ); (3) the NAP and the PND show similar distributions of typical values, according to the review by Parker, Vannest, and Davis ( 2011 ) using real behavioral data; and (4) the critical reason for selecting the NAP was the fact that the PND does not have a known sampling distribution (Parker et al., 2011 ), which makes impossible using the most widely accepted weight for group-design studies; in contrast, there is an expression for the variance of the NAP as shown below. The NAP is a measure obtained as the percentage of pairwise comparisons for which the result is an improvement after the intervention (e.g., the intervention measurement is greater than the baseline measurement when the aim is to increase behavior). It is equivalent to an indicator called probability of superiority (Grissom, 1994 ), which is related to the common language effect size (McGraw & Wong, 1992 ). Grissom and Kim ( 2001 ) provided a formula to estimate the variance of the probability of superiority, which is also applicable to the NAP: \( {\widehat{\sigma}}_{\mathrm{NAP}}^2=\left(1/{n}_{\mathrm{A}}+1/{n}_{\mathrm{B}}+1/{n}_{\mathrm{A}}{n}_{\mathrm{B}}\right)/12 \) . Note that the probability of superiority was originally intended to compare two independent samples in the same way as the Mann–Whitney U test and, extending this logic to SCD, it would be assumed that the data are independent and also that the variances are equal. The reader should consider whether these assumptions are plausible. The NAP has been used in single-case meta-analyses (e.g., Burns et al., 2012 ; Petersen-Brown, Karich, & Symons, 2012 ).

Second, regarding the standardized mean difference index, according to Beretvas and Chung ( 2008b ), the most commonly applied version Footnote 4 was the one using the standard deviation of the baseline measurements ( s A ) in the denominator, which in group designs comparing a treatment mean \( \left({\overline{X}}_B\right) \) and a control group mean \( \left({\overline{X}}_A\right) \) would be Glass’s Δ (Glass, McGaw, & Smith, 1981 ). The index is thus defined as \( \varDelta =\left({\overline{X}}_B-{\overline{X}}_A\right)/{s}_A \) and its variance is given by Rosenthal ( 1994 ) as being equal to \( {\widehat{\sigma}}_{\varDelta}^2=\frac{n_{\mathrm{A}}+{n}_{\mathrm{B}}}{n_{\mathrm{A}}{n}_{\mathrm{B}}}+\frac{\varDelta^2}{2\left({n}_{\mathrm{A}}-1\right)} \) . Note that Δ was originally used to compare two independent groups and is based on the assumption that the sampling distribution of Δ tends asymptotically to normality and, thus, this formula is only an approximation. Moreover, although it is a standardized measure of the average difference between phases, its application to SCD data does not lead to a measure comparable to the d -statistic obtained in studies based on group designs (see Hedges et al., 2012 , for a more complete explanation). This is also a reason for not using Cohen’s benchmarks for interpreting the index’s values (Beretvas & Chung, 2008a , b ). Once again, we stress that we do not advocate for the use of this measure for quantifying intervention effectiveness in all SCD data.

Three aspects should be considered with regard to these two effect size measures. First, the fact that the first measure is expressed as a percentage of nonoverlap and the second measure is standardized implies that they can be applied to data measured in different metrics (which is the case for both the simulated and the real data used here). Second, the expressions for the variances of these indices do not take into account the fact that single-case data may be autocorrelated; so, (1) they should be used with caution when applied to real data for which it is difficult to estimate autocorrelation precisely (Huitema & McKean, 1991 ; Solanas, Manolov, & Sierra, 2010 ) and (2) it would be interesting to explore the effect of serial dependence on the weighted averages by computing the inverse of the indices’ variance as a weight.

The third noteworthy aspect is related to situations in which the data do not show stability. It has to be mentioned that neither the NAP nor Δ are suitable for data that present a baseline trend not related to the intervention, as was pointed out by Parker, Vannest, Davis, and Sauber ( 2011 ) and Beretvas and Chung ( 2008b ), respectively. This is why we did not apply these indices to conditions with β 1 ≠ 0. In fact, there are several methods for dealing with trend (e.g., Allison & Gorman, 1993 ; Maggin, Swaminathan, et al., 2011 ; Manolov & Solanas, 2009 ; Parker, Vannest, & Davis, 2012 ). However, modeling trend is not an easy issue, given that it is necessary to consider aspects such as phase length (Van den Noortgate & Onghena, 2003b ) and reasonable limits within which data can be projected (Parker, Vannest, Davis, & Sauber, 2011 ). Moreover, the issue of baseline trend is probably more critical for the effect size indices than for the weighting strategies used to assign quantitative “importance” to these indices.

Another aspect related to the effect size measures and the lack of data stability is that NAP and Δ are not specifically designed to quantify changes in slope. Therefore, a different type of summary measure was computed here for this specific situation: the difference between the standardized ordinary least squares slope coefficients estimated separately for the treatment phase and for the baseline phase (with T as predictor in both cases). This third summary measure can be defined as β diff = β B − β A .

The NAP, ∆, and β diff were computed for each generated data set. The quantifications of the ten studies ( i = 1, 2, . . . 10) were then integrated via a weighted average, \( \overline{ NAP}={\displaystyle \sum_{i=1}^{10}{w}_i NA{P}_i}/{\displaystyle \sum_{i=1}^{10}{w}_i},\overline{\varDelta}={\displaystyle \sum_{i=1}^{10}{w}_i{\varDelta}_i}/{\displaystyle \sum_{i=1}^{10}{w}_i}, \) or \( {\beta}_{diff}={\displaystyle \sum_{i=1}^{10}{w}_i{\beta_{diff}}_{{}_i}}/{\displaystyle \sum_{i=1}^{10}{w}_i}, \) where w i denotes a weight in the respective study I , based on either of the five strategies studied here.

Data analysis: weighting strategies

The weighting strategies included here were the variance of the effect size indices, series length, baseline length, baseline variability, and a proposal based on both baseline length and variability. It was expected that the data variability of the whole series might be confounded with an intervention effect, given that a mean shift or a change in slope both entail greater scatter. This is why it was not included as a weight. Another possible weight not included here is the number of participants, since it is not strongly supported by its proponents (Kratochwill et al., 2010 ) and raises the question of what weight should be used when there is only one participant in the study, for instance, when an ABAB design is used or whether, in MBD across behaviors or settings, the number of tiers should also be used as a weight.

It is important to distinguish between the weighting strategies that involve computing a measure of variability. On the one hand, the classical option is related to the effect size index variance (that is, the variance of its sampling distribution). In this case, the weight is the inverse of this variance, so that a greater weight is related to greater precision of the effect size estimate. On the other hand, the variability of the data (and not of the summary measure) is considered, here focusing on the baseline phase. In this case, the weight is the inverse of the coefficient of variation of the baseline measurements. The coefficient of variation is used to eliminate the influence of the measurement units. In this way, studies with more stable data contribute more to the average effect size.

Regarding series and baseline phase lengths, the weights are n and n A , respectively, giving greater numerical importance to studies in which more measurements are available. The proposal presented here is based on both baseline length and data variability, given that the two aspects are related and should not be assessed separately: Longer baselines are desirable given that they provide more information and confidence about the actual initial situation, but even shorter baselines might be sufficiently informative if the data are stable. The weight in the proposal was defined as n A + 1/ CV (A), a direct function of baseline length and inverse function of the baseline data variability measured in terms of the coefficient of variation (a nondimensional measure that makes data expressed in different units comparable). The proposal is well aligned with Kratochwill et al.’s ( 2010 ) suggestion that the first step of assessing the usefulness of the single-case data at hand for proving scientific evidence is to check whether the baseline pattern “has sufficiently consistent level and variability.” Moreover, the same authors state that “[h]ighly variable data may require a longer phase to establish stability” (p. 19).

The main numerical results are presented in Table  2 for the NAP and Table  3 for Δ for conditions in which level change was simulated and in Table  4 for β diff for conditions including slope change. In the following sections, the results are presented in relation to each data feature whose effect was studied via simulation.

Reference condition

The reference condition included MBD data series with 10 measurements in the two phases of each tier, with no autocorrelation or trend and variability being equal for all studies. It can be seen that the weighted averages were very similar; the only difference being the Δ value observed for the weight based on baseline data variability (and thus, also present in the proposal). Thus, the choice of a weighting strategy does not seem critical. Next, we explore whether specific data features have a differential influence on either of these strategies.

Effect of phase lengths

For the NAP and β diff , there were practically no differences between the weighting strategies. For the NAP, there was no difference with respect to the reference condition. For Δ, the pattern of results was more complex: The unweighted average was close to the index variance only when the whole large-effect-size series were shorter. However, when only the baseline phases were shorter, the results of the Δ variance as a weight were closer to those for n A . Nonetheless, whether the index variance is an optimal weight given the issues related to its derivation should be discussed. For both types of conditions studied, the values for the proposal were in the middle of the ranges observed and, thus, represent less extreme quantifications of the average effect size.

Effect of data variability

Greater data variability is related to reducing the weighted averages for all three effect size indices, although for the NAP this reduction was only slight. The results obtained with the different weighting strategies showed considerable similarity, the only noteworthy differences were observed for Δ when using baseline variability as a weight. Once again, the results for the proposal were less extreme than all other weighted averages.

Effect of serial dependence

The presence of a positive autocorrelation in the data had the effect of reducing the weighted averages obtained, although this was not as marked for the NAP. In general, φ 1 = .6 leads to underestimating the effect size when it is computed via Δ or β diff , and when a larger proportion of the data is autocorrelated (i.e., both phases of a tier, both large- and small-effect-size studies), this underestimation is more pronounced. In any case, what is central to the comparison of the weighting strategies is that for all three effect size measures, the results were very similar.

Effect of trend

When an improving baseline trend is present in the data and a procedure is not specifically designed to deal with it, this data feature can affect the quantification of the effect size, as shown once again here. For the NAP and for Δ, such a trend leads to overestimating the effect size, given that the initial improvement (and its projection into the treatment phase) is not controlled for; the results for β diff differ because an already positive slope means that the change in slope after the intervention is compared with steeper (not stable) baseline data. However, given that the present work is focused on weighting strategies and not on the performance of the effect size indices, it is important to explore whether this distortion in the estimates is similar across weights or not. In the experimental conditions studied here, the similarity is notable. Once again, there were no major differences among the weighting strategies.

A comparison of weighting strategies: real data meta-analysis

Characteristics of the meta-analysis.

The meta-analysis presented here is based on the meta-analysis carried out by Burns et al. ( 2012 ), Footnote 5 which integrated 10 studies ( k = 10; the articles marked with an asterisk in the reference list were those included in the meta-analysis). However, the present reanalysis is not a direct replication of the Burns et al. study, given that we did not use median NAP values or convert NAP to Pearson’s phi. Most of the studies included in the meta-analysis used multiple baseline designs and focused on an intervention called incremental rehearsal , which is used for several teaching purposes (e.g., words, mathematics) both for children with and for those without disabilities.

Dealing with dependence of outcomes

More than one outcome can be computed for most of the single-case studies included in the meta-analysis, and it does not seem appropriate to treat each outcome as independent (Beretvas & Chung, 2008b ). Here, we chose to average the effect sizes within a study, which is one of the options used in group-designs meta-analysis (Borenstein, Hedges, Higgins, & Rothstein, 2009 ). However, it is also possible to choose one of the several effect sizes reported per study according to a substantive criterion or at random (Lipsey & Wilson, 2001 ).

Another issue that requires consideration is how weights are computed in order to have a single weight per study accompanying the corresponding effect size measure. Borenstein et al. ( 2009 ) discussed the possibility of calculating a variance of an average of effect sizes within a study. However, their formulae require knowing or, at least, assuming plausible values for the correlations between the different study outcomes. Given that we did not want to make an assumption with no basis, we chose to obtain the average of the weights for each outcome in order to have a single weight per study. This approach has been deemed a conservative solution (Borenstein et al., 2009 ).

For instance, for multiple baseline designs (e.g., Burns, 2005 ) or multiple probe designs (e.g., Codding, Archer, & Connell, 2010 ), there is one outcome for each baseline. In such cases, it has been suggested (Schlosser et al., 2008 ) that an effect size should be computed for each baseline before computing the average of these baselines; Burns et al. ( 2012 ) also computed the NAP for each baseline and then aggregated them. For designs with multiple treatments (e.g., Burns, 2007 ), the optimal practice is not clear, but comparing each treatment with the immediately preceding baseline seems to be the logical choice (Schlosser et al., 2008 ). However, given that in the Burns ( 2007 ) study there was only one baseline (the design can be designated as ACBC) and considering the possibility of sequence effects (Schlosser et al., 2008 ), we chose to include only the comparison of this baseline with the first intervention. For the Volpe, Mulé, Briesch, Joseph, and Burns ( 2011 ) study, each measurement obtained under the incremental rehearsal conditions was compared with the corresponding measurement under the traditional drill and practice condition, which was considered the reference, although it is not strictly speaking a baseline condition.

The effect sizes and the different weights for each of the 10 studies are presented in Table  5 . Some aspects of the results should be commented upon, before discussing the weighted averages across studies. For the Bunn, Burns, Hoffman, and Newman ( 2005 ) study, a perfectly stable baseline (i.e., a complete lack of variability) precluded computing β diff and also ∆, its variance, or the weight related to baseline variability. Additionally, given that only 10 studies were integrated, an extreme effect size in any of them and/or a measure with an extremely high weight may have affected the results of the weighted average across studies. For instance, the rather unfavorable results for the incremental rehearsal for the Volpe Mulé, et al. ( 2011 ) study potentially decreased the weighted average, especially for the weighting strategies based on baseline or series length and for the NAP variance. Another example of a study whose results are potentially influential was conducted by Matchett and Burns ( 2009 ). In the present meta-analysis, the effect size for the Matchett and Burns study was given greater weight for baseline variability and also for the proposal as weighting strategies, given that their data showed very low relative dispersion (e.g., the values for the first tier ranged between 47 and 50). The influence of the Matchett and Burns study on the average effect size is especially salient for β diff .

The values and weights in Table  5 were used to obtain the mean effect sizes for the 10 studies according to each weighting strategy; the unweighted average was also computed. The results obtained following the quantitative integration of the studies are presented in Fig.  2 . For both the NAP and Δ, the proposal’s results were close to the unweighted average. In contrast, the NAP variance result was closer to that obtained when n A was used as a weight and the Δ variance result was more similar to the series length weight. However, the weighted average using baseline variability as a weight yielded a somewhat different result. The latter finding is especially salient for β diff , due to the influence of the Matchett and Burns ( 2009 ) study.

Weighted averages for the nonoverlap of all pairs (NAP; upper panel), Glass’s ∆ (delta; middle panel), and the difference between standardized slope coefficients (beta difference; lower panel), computed by means of the different weighting strategies, resulting from the quantitative integration of the 10 single-case studies included in Burns, Zaslofsky, Kanive, and Parker’s ( 2012 ) meta-analysis. (CV denotes the coefficient of variation)

Results and implications

The present study is, to the best of our knowledge, the first one based simultaneously on simulation and real data comparing several weighting strategies in the context of SCDs’ meta-analysis. The results obtained here are restricted to the experimental conditions studied, and more extensive research and discussion are required. However, various aspects of this work will fuel further discussion and testing with published data or via simulation.

First, the issue of whether weighting is necessary when an average effect size summarizing the results of several studies is obtained should be considered. On substantive grounds, it seems logical to treat an outcome of a study as numerically more important (i.e., contributing to a greater extent) when this outcome is based on a larger amount of data and/or on a clear data pattern (i.e., with less unexplained variability). On empirical grounds, on the basis of the results presented here, there is not enough evidence that weighting yields markedly different results. An implication of these findings (which should be considered taking into account the limitations discussed below) is that series length alone may not be a critical feature for giving more or less weight to the results. In that sense, multiple probe designs characterized by a reduced amount of measurements may not be treated as providing less evidence. However, note that the length of the phases is also considered in the expressions for approximating the variance of the indices included in this study.

Second, for the cases in which certain differences are observed in the weighted averages, it is important to establish the gold standard, so that a result can be judged as more or less desirable. It that sense, whether the variance of the effect size measure is that gold standard and whether it can be derived for single-case data, considering potential serial dependence and/or a baseline trend, should be debated. Even in the context of simulation data, it is not easy to determine which results show the best match for the simulation parameters, given that the question is “what are the optimal weights?” and, thus, “how different from an unweighted average should a weighted average be?”

Third, we consider that the discussion on the theoretically most appropriate weight (i.e., the one that has the most solid statistical justification in the context of SCD data) can take place in parallel with empirical testing, carried out with real or simulated data. With the results presented here, the door for a substantive discussion appears to remain wide open, given that no major differences were obtained across the weighting strategies.

Fourth, some methodological implications of the results should be mentioned, taking into account the limitations discussed below. First, it might not be necessary to derive the sampling distribution of an effect-size index analytically (e.g., Hedges et al., 2012 ) or via simulation (e.g., Manolov & Solanas, 2012 ) in order to be able to obtain its variance and then use it as a weighting factor. Regarding the variance of standardized mean difference measures such as Δ, it has been claimed that the presence of serial dependence in the data makes the sampling distribution unknown and, thus, the formulae for the variances might not be correct (Beretvas & Chung, 2008b ), which is the one of the reasons for the current developments in the field by Hedges and colleagues.

This being said, we consider that until more evidence is available, two approaches seem to be logically and empirically supported. The first approach consists of using the weighting strategy whose underlying statistical foundations are more solid: the index variance. The work of Hedges et al. ( 2012 ) is an important step in this direction in order to have available measurements and weights appropriate for SCD, avoiding the need to make assumptions about the data so that they would fit the measures and weights used in group-design studies. Using a weight based on widely accepted statistical theory can be useful for enhancing the scientific credibility of the meta-analyses of SCD data. Nonetheless, issues such as estimating autocorrelation (so that it can be accounted for) still need to be solved, whereas future developments more closely related to the d -statistic are also expected (Shadish, Hedges, et al., 2013 ).

The second approach consists of simplifying the weighting strategy to using either baseline length only or baseline length and variability—two widely available and relevant pieces of information. The main reason for such an option would be the lack of difference in performance (considering the limitations of the current evidence), as compared with the index variance weight. That is, following this approach would be based on the principle of scientific parsimony (also known as Occam’s razor), according to which a simpler solution might be useful until it is demonstrated to be inferior. We consider that, subject to further testing and discussion, this approach is well aligned with the requirement of being “scientifically sound yet practical” (Schlosser & Sigafoos, 2008 , p. 118). The first option would be to use only baseline phase length as a weight, given that it actually is a special case of the variance estimate presented by Hedges and colleagues ( 2012 , Equation 5): It is the case in which autocorrelation is not taken into account and the focus is put solely on the baseline phase. Regarding the assumption of no autocorrelation, it might be justifiable considering the autocorrelations reported by Shadish and Sullivan ( 2011 ): the bias-corrected values ranged from −.010 for alternating treatment designs to .320 for MBD. The second option in the context of this parsimonious approach would be to use baseline length and the inverse of baseline data variability as weight. The rationale for such a weight would be to avoid penalizing excessively multiple probe designs in which few preintervention measurements are obtained but they show stability. Choosing either of the two approaches can be a question of further debate.

Fifth, we would like to encourage applied researchers not only to publish their raw data in a graphical format, but also to compute the primary summary measures such as means, medians, and standard deviations for each phase, given that this information is useful for computing the weights that are necessary for meta-analysis. This would help avoid any lack of precision due to imperfect data-retrieval procedures. Meta-analysis and the identification of the conditions under which interventions are useful would also benefit from reporting the details about the participants, the settings, the procedures, and the operative definitions of the main study variables (Maggin & Chafouleas, 2013 ).

Finally, researchers carrying out meta-analyses are encouraged to report both an unweighted average and a weighted average based on the strategy they consider optimal. In that way, each meta-analysis would serve as evidence based on real data regarding the impact of using weighting in meta-analytical integrations. Furthermore, each meta-analysis not only would contribute to substantive knowledge, but also would give added value in terms of the methodological discussion on how to perform research synthesis in SCDs.

Limitations and future research

The results of the present study are limited to the weighting strategies and the effect size measures included. Regarding the limitations of the meta-analysis of published data, we should mention the relatively small number of studies included and the inability to calculate variances due to flat baselines. The outlying weights due to lower baseline variability in some data sets can also be seen as a limitation. However, perfectly stable measurements can be obtained in behavioral studies (e.g., Costigan & Light, 2010 ), especially when the desired effect is to eliminate the behavior studied (e.g., Friedman & Luiselli, 2008 ) or, regarding the baseline phase, when the initial level is zero (e.g., Drager et al., 2006 ). The data meta-analyzed also reflect the fact that, in some cases but not in others, there might be lower baseline variability (e.g., for one of the behaviors of only 1 of the 4 participants, studied by Dolezal, Weber, Evavold, Wylie, & McLaughlin, 2007 ).

As limitations specific to the simulation study, it focused only on MBD, and it is not clear whether the results would have been different if a variety of design structures were simulated for the data sets to be integrated; for instance, in the Burns ( 2012 ) meta-analysis, not all studies followed an MBD. Although this is the most common design structure, there are other designs that can provide strong evidence for intervention effectiveness according to the criteria presented by Kratochwill and colleagues (2010) and Tate et al. ( 2013 ), such as ABAB (used in 21 % of the empirical studies according to the Hammond & Gast, 2010 , review, 17 % in Shadish & Sullivan, 2011 , and 8 % in Smith, 2012 ) and alternating treatment designs (used in 8 % of the studies in Shadish & Sullivan, 2011 , review, and 10 % as a combination of MBD and ATD, and in Smith’s, 2012 , review alternating and simultaneous treatment designs represented 6 % of the studies). Moreover, a restricted set of phase lengths was studied, and the data were generated on the basis of a continuous (normal) model, as is common in single-case simulation studies; but in many cases, the behavior of interest in real single-case studies is measured on a discrete ratio scale (e.g., frequency of occurrence). Additionally, more extreme conditions (e.g., greater degrees of heteroscedasticity) could have been studied, but we decided to constrain the simulation data to realistic values, obtained in the published studies. Finally, the meta-analysis of real-life data was carried out using only 10 studies, and thus, the generalization of the findings requires further field testing.

Apart from empirical comparisons between the procedures, we consider that a more thorough discussion of which is the most appropriate weight from a conceptual perspective is required. Additionally, more discussion is necessary on how to proceed with dependent outcomes within studies in order to obtain a single effect size per study, before carrying out any integration across studies.

We chose the term single-case designs (SCDs) in order to be consistent with the labeling used in the articles recently published in this journal by, for instance, Baek and Ferron ( 2013 ), Shadish and Sullivan ( 2011 ), and Shadish, Rindskopf, Hedges, and Sullivan ( 2013 ), although these designs are also referred to as single-subject experimental designs (e.g., Ugille, Moeyaert, Beretvas, Ferron, & Van den Noortgate, 2012 ). In any case, SCDs are experimental in nature and not to be confused with case studies (Blampied, 2000 ).

Otherwise, it would not be possible to study the effect of these two data features . Consider the following example, with two studies being integrated and with the raw mean difference in both being equal to 11. If the first study is given weight 2 (due to twice as many data points) and the second study is given weight 1, the weighted average is still twice 11 + once 11 divided by 3, equal to 11; the same as the unweighted average. Therefore, it is necessary to have different magnitudes of effect in order to explore to what extent the weighted average moves closer to the effect size of the study given greater weight.

For instance, for the 0–30 metric, in the treatment phase, the level of behavior expected when a large effect is simulated is 7 (baseline level) + 16.5 (mean shift) = 23.5. Adding one standard deviation of 6 (condition of greater variability), the greatest value expected is 29.5, which is consistent with the fact that the highest value observed in the Burns ( 2005 ) study used as a reference was 30. For the 0–100 metric, in the treatment phase, the level of behavior expected when there is a large effect simulated is 41 (baseline level) + 39 (mean shift) = 80. Adding one standard deviation of 14 (condition of greater variability), the highest value expected is 94, which is consistent with the highest possible percentage value, 100.

However, note that in the review by Maggin, O’Keefe, and Johnson ( 2011 ), this measure was used only in 19 % of SSED meta-analyses.

We would like to thank Matthew K. Burns for kindly sharing his data for the analyses presented here.

Articles included in the meta-analysis are indicated with an asterisk (*).

Allison, D. B., & Gorman, B. S. (1993). Calculating effect sizes for meta‐analysis: The case of the single case. Behaviour Research and Therapy, 31 , 621–631.

Google Scholar  

APA Presidential Task Force on Evidence-Based Practice. (2006). Evidence-based practice in psychology. American Psychologist, 61, 271–285.

Article   Google Scholar  

Baek, E. K., & Ferron, J. M. (2013). Multilevel models for multiple-baseline data: Modeling across-participant variation in autocorrelation and residual variance. Behavior Research Methods, 45, 65–74.

Article   PubMed   Google Scholar  

*Beck, M., Burns, M. K., & Lau, M. (2009). Preteaching unknown items as a behavioral intervention for children with behavioral disorders. Behavior Disorders, 34, 91–99.

Beretvas, S. N., & Chung, H. (2008a). An evaluation of modified R2-change effect size indices for single-subject experimental designs. Evidence-Based Communication Assessment and Intervention, 2, 120–128.

Beretvas, S. N., & Chung, H. (2008b). A review of meta-analyses of single-subject experimental designs: Methodological issues and practice. Evidence-Based Communication Assessment and Intervention, 2, 129–141.

Blampied, N. M. (2000). Single-case research designs: A neglected alternative. American Psychologist, 55, 960.

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis . Chichester, UK: John Wiley & Sons.

Book   Google Scholar  

Bunn, R., Burns, M. K., Hoffman, H. H., & Newman, C. L. (2005). Using incremental rehearsal to teach letter identification with a preschool-aged child. Journal of Evidence Based Practice for Schools, 6, 124–134.

*Burns, M. K. (2005). Using incremental rehearsal to practice multiplication facts with children identified as learning disabled in mathematics computation. Education and Treatment of Children, 28, 237–249.

*Burns, M. K. (2007). Comparison of drill ratio and opportunities to respond when rehearsing sight words with a child with mental retardation. School Psychology Quarterly, 22, 250–263.

Burns, M. K. (2012). Meta-analysis of single-case design research: Introduction to the special issue. Journal of Behavioral Education, 21, 175–184.

*Burns, M. K., & Dean, V. J. (2005). Effect of acquisition rates on off-task behavior with children identified as learning disabled. Learning Disability Quarterly, 28, 273–281.

*Burns, M. K., & Kimosh, A. (2005). Using incremental rehearsal to teach sight-words to adult students with moderate mental retardation. Journal of Evidence Based Practices for Schools, 6, 135–148.

Burns, M. K., Zaslofsky, A. F., Kanive, R., & Parker, D. C. (2012). Meta-analysis of incremental rehearsal using phi coefficients to compare single-case and group designs. Journal of Behavioral Education, 21, 185–202.

*Codding, R. S., Archer, J., & Connell, J. (2010). A systematic replication and extension of using incremental rehearsal to improve multiplication skills: An investigation of generalization. Journal of Behavioral Education, 19, 93–105.

Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). London, UK: Sage.

Costigan, F. A., & Light, J. (2010). Effect of seated position on upper-extremity access to augmentative communication for children with cerebral palsy: Preliminary investigation. American Journal of Occupational Therapy, 64, 595–604.

Dolezal, D. N., Weber, K. P., Evavold, J. J., Wylie, J., & McLaughlin, T. F. (2007). The effects of a reinforcement package for on-task and reading behavior with at-risk and middle school students with disabilities. Child and Family Therapy, 29, 9–25.

Drager, K. D. R., Postal, V. J., Carrolus, L., Castellano, M., Gagliano, C., & Glynn, J. (2006). The effect of aided language modeling on symbol comprehension and prodcution in 2 preschoolers with autism. American Journal of Speech – Language Pathology, 15, 112–125.

Ferron, J. M., & Sentovich, C. (2002). Statistical power of randomization tests used with multiple-baseline designs. The Journal of Experimental Education, 70, 165–178.

Friedman, A., & Luiselli, J. K. (2008). Excessive daytime sleep: Behavioral assessment and intervention in a child with autism. Behavior Modification, 32, 548–555.

Gage, N. A., & Lewis, T. J. (2012, May 11). Hierarchical linear modeling meta-analysis of single-subject design research. Journal of Special Education . Advance online publication. doi: 10.1177/0022466912443894

Gast, D. L., & Ledford, J. (2010). Multiple-baseline and multiple probe designs. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 276–328). London, UK: Routledge.

Gast, D. L., & Spriggs, A. D. (2010). Visual analysis of graphic data. In D. L. Gast (Ed.), Single subject research methodology in behavioral sciences (pp. 199–233). London, UK: Routledge.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research . Beverly Hills, CA: Sage.

Grissom, R. J. (1994). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314–316.

Grissom, R. J., & Kim, J. J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size. Psychological Methods, 6, 135–146.

Hammond, D., & Gast, D. L. (2010). Descriptive analysis of single subject research designs: 1983–2007. Education and Training in Autism and Developmental Disabilities, 45, 187–202.

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis . New York, NY: Academic Press.

Hedges, L. V., Pustejovsky, J. E., & Shadish, W. R. (2012). A standardized mean difference effect size for single case designs. Research Synthesis Methods, 3, 224–239.

Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165–179.

Horner, R. H., & Kratochwill, T. R. (2012). Synthesizing single-case research to identify evidence-based practices: Some brief reflections. Journal of Behavioral Education, 21, 266–272.

Howick, J., Chalmers, I., Glasziou, P., Greenhaigh, T., Heneghan, C., Liberati, A., et al. (2011). The 2011 Oxford CEBM Evidence Table (Introductory Document). Oxford: Oxford Centre for Evidence-Based Medicine. Available from: http://www.cebm.net/index.aspx?o=5653

Huitema, B. E., & McKean, J. W. (1991). Autocorrelation estimation and inference with small samples. Psychological Bulletin, 110, 291–304.

Huitema, B. E., & McKean, J. W. (2000). Design specification issues in time-series intervention models. Educational and Psychological Measurement, 60, 38–58.

Jenson, W. R., Clark, E., Kircher, J. C., & Kristjansson, S. D. (2007). Statistical reform: Evidence-based practice, meta-analyses, and single subject designs. Psychology in the Schools, 44, 483–493.

Kazdin, A. E. (2001). Behavior modification in applied settings (6th ed.). Belmont, CA: Wadsworth.

Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case designs technical documentation [Technical Report]. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_scd.pdf

Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2013). Single-case intervention research design standards. Remedial and Special Education, 34, 26–38.

Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the scientific credibility of single-case intervention research: Randomization to the rescue. Psychological Methods, 15, 124–144.

Lane, K. L., & Carter, E. W. (2013). Reflections on the Special Issue: Issues and advances in the meta-analysis of single-case research. Remedial and Special Education, 34, 59–61.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis . Thousand Oaks, CA: Sage.

Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis . New York, NY: Oxford University Press.

Maggin, D. M., & Chafouleas, S. M. (2013). Introduction to the Special Series: Issues and advance of synthesizing single-case research. Remedial and Special Education, 34, 3–8.

Maggin, D. M., O’Keeffe, B. V., & Johnson, A. H. (2011a). A quantitative synthesis of methodology in the meta-analysis of single-subject research for students with disabilities: 1985–2009. Exceptionality, 19, 109–135.

Maggin, D. M., Swaminathan, H., Rogers, H. J., O’Keefe, B. V., Sugai, G., & Horner, R. H. (2011b). A generalized least squares regression approach for computing effect sizes in single-case research: Application examples. Journal of School Psychology, 49, 301–321.

Manolov, R., & Solanas, A. (2009). Percentage of nonoverlapping corrected data. Behavior Research Methods, 41, 1262–1271.

Manolov, R., & Solanas, A. (2012). Assigning and combining probabilities in single-case studies. Psychological Methods, 17, 495–509.

Manolov, R., Solanas, A., & Leiva, D. (2010). Comparing “visual” effect size indices for single-case designs. Methodology, 6, 49–58.

Manolov, R., Solanas, A., Sierra, V., & Evans, J. J. (2011). Choosing among techniques for quantifying single-case intervention effectiveness. Behavior Therapy, 42, 533–545.

Marín-Martínez, F., & Sánchez-Meca, J. (2010). Weighting by inverse variance or by sample size in random-effects meta-analysis. Educational and Psychological Measurement, 70, 56–73.

*Matchett, D. L., & Burns, M. K. (2009). Increasing word recognition fluency with an English language learner. Journal of Evidence Based Practices in Schools, 10, 194–209.

McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111 , 361–365.

Odom, S. L., Brantlinger, E., Gersten, R., Horner, R. H., Thompson, B., & Harris, K. R. (2005). Research in special education: Scientific methods and evidence-based practices. Exceptional Children, 71, 137–148.

Owens, C. M., & Ferron, J. M. (2012). Synthesizing single-case studies: A Monte Carlo examination of a three-level meta-analytic model. Behavior Research Methods, 44, 795–805.

Parker, R. I., & Vannest, K. J. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy, 40, 357–367.

Parker, R. I., & Vannest, K. J. (2012). Bottom-up analysis of single-case research designs. Journal of Behavioral Education, 21, 254–265.

Parker, R. I., Vannest, K. J., & Davis, J. L. (2011a). Effect size in single-case research: A review of nine nonoverlap techniques. Behavior Modification, 35, 303–322.

Parker, R. I., Vannest, K. J., & Davis, J. L. (2012, August 22). A simple method to control positive baseline trend within data nonoverlap. Journal of Special Education. Advance online publication. doi: . doi: 10.1177/0022466912456430

Parker, R. I., Vannest, K. J., Davis, J. L., & Sauber, S. B. (2011b). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy, 42, 284–299.

Petersen-Brown, S., Karich, A. C., & Symons, F. J. (2012). Examining estimates of effect using Non-overlap of all pairs in multiple baseline studies of academic intervention. Journal of Behavioral Education, 21, 203–216.

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/

Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis and meta-analysis (pp. 231–244). New York, NY: Russell Sage Foundation.

Sackett, D. L., Rosenberg, W. M. C., Gray, J. A. M., Hayness, R. B., & Richardson, W. S. (1996). Evidence based medicine: What it is and what it isn't. BMJ, 312, 71–72.

Article   PubMed Central   PubMed   Google Scholar  

Schlosser, R. W. (2009). The role of single-subject experimental designs in evidence-based practice times. FOCUS, 22, 1–8. Austin, TX: SEDL.

Schlosser, R. W., Lee, D. L., & Wendt, O. (2008). Application of the percentage of non-overlapping data (PND) in systematic reviews and meta-analyses: A systematic review of reporting characteristics. Evidence-Based Communication Assessment and Intervention, 2, 163–187.

Schlosser, R. W., & Sigafoos, J. (2008). Meta-analysis of single-subject experimental designs: Why now? Evidence-Based Communication Assessment and Intervention, 2, 117–119.

Scruggs, T. E., & Mastropieri, M. A. (2013). PND at 25: Past, present, and future trends in summarizing single-subject research. Remedial and Special Education, 34, 9–19.

Scruggs, T. E., Mastropieri, M. A., & Casto, G. (1987). The quantitative synthesis of single-subject research: Methodology and validation. Remedial and Special Education, 8, 24–33.

Shadish, W. R., Hedges, L. V., Pustejovsky, J. E., Boyajian, J. G., Sullivan, K. J., Andrade, A. et al. (2013a, July 18). A d-statistic for single-case designs that is equivalent to the usual between-groups d-statistic. Neuropsychological Rehabilitation . Advance online publication. doi: 10.1080/09602011.2013.819021

Shadish, W. R., Rindskopf, D. M., & Hedges, L. V. (2008). The state of the science in the meta-analysis of single-case experimental designs. Evidence-Based Communication Assessment and Intervention, 2, 188–196.

Shadish, W. R., Rindskopf, D. M., Hedges, L. V., & Sullivan, K. J. (2013b). Bayesian estimates of autocorrelations in single-case designs. Behavior Research Methods, 45, 813–821.

Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods, 43, 971–980.

Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology . New York, NY: Basic Books.

Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods, 17, 510–550.

Solanas, A., Manolov, R., & Sierra, V. (2010). Lag-one autocorrelation in short series: Estimation and hypothesis testing. Psicológica, 31, 357–381.

Tate, R. L., Perdices, M., Rosenkoetter, U., Wakima, D., Godbee, K., Togher, L., & McDonald, S. (2013). Revision of a method quality rating scale for single-case experimental designs and n-of-1 trials: The 15-item Risk of Bias in N-of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation, 23, 619–638.

Ugille, M., Moeyaert, M., Beretvas, S. N., Ferron, J., & Van den Noortgate, W. (2012). Multilevel meta-analysis of single-subject experimental designs: A simulation study. Behavior Research Methods, 44, 1244–1254.

Valentine, J. C., & Cooper, H. (2008). A systematic and transparent approach for assessing the methodological quality of intervention effectiveness research: The Study Design and Implementation Assessment Device (Study DIAD). Psychological Methods, 13, 130–149.

Van den Noortgate, W., & Onghena, P. (2003a). Estimating the mean effect size in meta-analysis: Bias, precision, and mean squared error of different weighting methods. Behavior Research Methods, Instruments, & Computers, 35, 504–511.

Van den Noortgate, W., & Onghena, P. (2003b). Hierarchical linear models for the quantitative integration of effect sizes in single-case research. Behavior Research Methods, Instruments, & Computers, 35, 1–10.

*Volpe, R. J., Burns, M. K., DuBois, M., & Zaslofsky, A. F. (2011a). Computer-assisted tutoring: Teaching letter sounds to kindergarten students using incremental rehearsal. Psychology in the Schools, 48, 332–342.

*Volpe, R. J., Mulé, C. M., Briesch, A. M., Joseph, L. M., & Burns, M. K. (2011b). A comparison of two flashcard drill methods targeting word recognition. Journal of Behavioral Education, 20, 117–137.

Wendt, O., & Miller, B. (2012). Quality appraisal of single-subject experimental designs: An overview and comparison of different appraisal tools. Education and Treatment of Children, 35, 109–142.

What Works Clearinghouse. (2008). What Works Clearinghouse evidence standards for reviewing studies, Version 1.0 Retrieved from http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_version1_standards.pdf

Whitlock, M. C. (2005). Combining probability from independent tests: The weighted Z-method is superior to Fisher’s approach. Journal of Evolutionary Biology, 18, 1368–1373.

Wolery, M. (2013). A commentary: Single-case design technical document of the What Works Clearinghouse. Remedial and Special Education, 34, 39–43.

Wolery, M., Busick, M., Reichow, B., & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing single-subject data. Journal of Special Education, 44, 18–29.

Download references

Author Note

This research was partially supported by the Agència de Gestió d’Ajust Universitaris i de Recerca de la Generalitat de Catalunya grant 2009SGR1492.

Author information

Authors and affiliations.

ESADE Business School, Ramon Llull University, Barcelona, Spain

Rumen Manolov & Vicenta Sierra

Department of Behavioral Sciences Methods, Faculty of Psychology, University of Barcelona, Barcelona, Spain

Rumen Manolov & Georgina Guilera

Institute for Research in Brain, Cognition, and Behavior (IR3C), University of Barcelona, Barcelona, Spain

Georgina Guilera

Departament de Metodologia de les Ciències del Comportament, Facultat de Psicologia, Universitat de Barcelona, Passeig de la Vall d’Hebron, 171, 08035, Barcelona, Spain

Rumen Manolov

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Rumen Manolov .

Rights and permissions

Reprints and permissions

About this article

Manolov, R., Guilera, G. & Sierra, V. Weighting strategies in the meta-analysis of single-case studies. Behav Res 46 , 1152–1166 (2014). https://doi.org/10.3758/s13428-013-0440-0

Download citation

Published : 01 February 2014

Issue Date : December 2014

DOI : https://doi.org/10.3758/s13428-013-0440-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Single-case designs
  • Meta-analysis
  • Effect size
  • Find a journal
  • Publish with us
  • Track your research

Meta-Analysis of Single-Case Research via Multilevel Models: Fundamental Concepts and Methodological Considerations

Affiliations.

  • 1 University at Albany, Albany NY, USA.
  • 2 University of Barcelona, Spain.
  • PMID: 30360633
  • DOI: 10.1177/0145445518806867

Multilevel modeling is an approach that can be used to summarize single-case experimental design (SCED) data. Multilevel models were developed to analyze hierarchical structured data with units at a lower level nested within higher level units. SCEDs use time series data collected from multiple cases (or subjects) within a study that allow researchers to investigate intervention effectiveness at the individual level and also to investigate how these individual intervention effects change over time. There is an increased interest in the field regarding how SCEDs can be used to establish an evidence base for interventions by synthesizing data from a series of intervention studies. Although using multilevel models to meta-analyze SCED studies is promising, application is often hampered by being potentially excessively technical. First, this article provides an accessible description and overview of the potential of multilevel meta-analysis to combine SCED data. Second, a summary of the methodological evidence on the performance of multilevel models for meta-analysis is provided, which is useful given that such evidence is currently scattered over multiple technical articles in the literature. Third, the actual steps to perform a multilevel meta-analysis are outlined in a brief practical guide. Fourth, a suggestion for integrating the quantitative results with a visual representation is provided.

Keywords: hierarchical linear modeling; meta-analysis; multilevel modeling; single-case experimental design.

  • Guidelines as Topic*
  • Meta-Analysis as Topic*
  • Models, Theoretical*
  • Multilevel Analysis*
  • Single-Case Studies as Topic*

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • PMC10123043

Logo of springeropen

A Multilevel Meta-analysis of Single-Case Research on Interventions for Internalizing Disorders in Children and Adolescents

Marija maric.

1 Developmental Psychology, University of Amsterdam, Amsterdam, The Netherlands

2 Research Institute of Child Development and Education, University of Amsterdam, Amsterdam, The Netherlands

Lea Schumacher

3 Medical Psychology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

Wim Van den Noortgate

4 Faculty of Psychology and Educational Sciences & Itec, an Imec Research Group, KU Leuven, Leuven, Belgium

Linda Bettelli

Wies engelbertink, yvonne stikkelbroek.

5 Child and Adolescent Studies, Utrecht University, Utrecht, The Netherlands

6 Depression Expert Center for Youth, Mental Health Care Oost-Brabant, Boekel, The Netherlands

Associated Data

All data generated or analyzed during this study as well as the code for analyses are included in this published article, and its supplementary information files.

The effectiveness of interventions for internalizing disorders in children and adolescents was studied using a review and meta-analysis of published single-case research. Databases and other resources were searched for quantitative single-case studies in youth with anxiety, depressive, and posttraumatic stress disorders. Raw data from individual cases were aggregated and analyzed by means of multilevel meta-analytic models. Outcome variables were symptom severity assessed across baseline and treatment phases of the studies, and diagnostic status at post- and follow-up treatment. Single-case studies were rated for quality. We identified 71 studies including 321 cases ( M age  = 10.66 years; 55% female). The mean quality of the studies was rated as below average, although there were considerable differences between the studies . Overall, positive within-person changes during the treatment phase in comparison to the baseline phase were found. In addition, positive changes in the diagnostic status were observed at post- and follow-up treatment. Yet high variability in treatment effects was found between cases and studies. This meta-analysis harvests the knowledge from published single-case research in youth-internalizing disorders and illustrates how within-person information from single-case studies can be summarized to explore the generalizability of the results from this type of research. The results emphasize the importance of keeping account of individual variability in providing and investigating youth interventions.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10567-023-00432-9.

Introduction

Internalizing disorders such as anxiety, depressive, and posttraumatic stress disorders (ADs, DD and PTSD), are among the most common mental health problems in children and adolescents (Merikangas et al., 2010 ), and their co-occurrence is high (McElroy & Patalay, 2019 ). Numerous empirically established interventions are available for treating these disorders (e.g., Crowe & McKay, 2017 ; Oud et al., 2019 ; Weems & Neill, 2020 ; Weisz et al., 2017 ). Evidence base underlying these treatments was built upon Randomized Controlled Trials (RCTs) in which average symptom scores of one group of participants are compared to those of a group of participants in a different condition. However, treatment effects that evolve within persons might not always be captured by between-person comparisons (e.g., Maric et al., 2012 ; Schuurman, 2023 ). There is nowadays a shared understanding that, next to RCTs, we need more idiographic types of research methods that can capture changes in individual risk factors, symptoms, and treatment goals while at the same time maintaining methodological rigor.

Quantitative single-case research is increasingly recognized as a valuable way to test within-person treatment effects in youth populations, both as an add-on for RCTs and as a stand-alone method (Kazdin, 2019 ; Maric, et al., 2012 ). International guidelines consider evidence gained from this type of research as one of the most rigorous forms of evidentiary support for therapies (Onghena et al., 2019 ). Further, single-case research can be a first step in testing innovative interventions prior to investigations in costly and time-intensive RCTs (Gaynor & Harris, 2008; Maric et al., 2012 ). In youth with internalizing disorders and specific comorbidities, single-case research can be implemented as a stand-alone method when collecting large data would be unfeasible within the time limits of the research project. Finally, using single-case methods existing treatment protocols for youth-internalizing disorders can be implemented in real-world clinical practice and the formats and conditions under which they are effective can be tested.

While it is tremendously important to study within-person effects to evaluate treatments, the reach and impact of these studies remain limited. This is partly due to the observation that single-case studies in youth interventions for anxiety, depression, and trauma often include a small number of cases. Even if the research questions are considered valuable for the field and an appropriate single-case design is used, questions about the strength of this evidence and the scope of implications of the results remain. Harvesting knowledge from these individual single-case studies is a next important step in order to broaden our knowledge about effects of youth interventions. While this rising number of single-case studies provides valuable information on within-person treatment effects, it is unlikely that they will strongly impact the knowledge on youth treatments unless their results are integrated. A meta-analysis of single-case research data permits researchers to synthesize the results of published studies quantitively to further help determining an evidence-base for therapies for internalizing disorders in children and adolescents (Dowdy et al., 2021 ; Onghena, Michiels, Jamshidi, Moeyaert, & Van den Noortgate, 2018 ; Van den Noortgate & Onghena, 2003 ). While in RCTs, between-person effects are examined, and meta-analyses of RCTs concern information on the sample level, meta-analyses of single-case research involve during-treatment within-person changes and data, and treatment effects on the case level are investigated. Few single-case research meta-analyses exist so far. Exceptions are for instance Richman et al. ( 2015 ), who investigated effects of treatment on non-social behavior in individuals, and Heyvaert et al. ( 2012 ), who evaluated intervention effectiveness for reducing challenging behavior in individuals with intellectual disabilities. To our knowledge, no meta-analysis of published single-case research exist in the area of youth psychopathology.

A strength of single-case research is that it can assess the effect for specific cases, for specific treatments, in specific contexts. Despite this ideosyncraticity, we expect that there is some communality, e.g., if a certain effect was found effective for some cases, it may be expected effective for other cases as well. A meta-analysis allows us to investigate whether there is an overall effect of the intervention, across all cases and studies. Second, it allows to quantify to what extent the effect varies between studies and cases, i.e., to what extent the effects found can be generalized and how much heterogeneity in treatment effects is present. Third, if there is heterogeneity in treatment effects, it allows to explore whether we can explain this variation by exploring the moderating effects of case and study characteristics. In single-case meta-analysis, treatment effects are estimated for each individual, offering opportunities to study moderating effects of case characteristics, in contrast to group comparisons in (meta-analyses of) RCTs. As, to our knowledge, no meta-analysis on this topic has been conducted, the overall within-person treatments effects, its heterogeneity, and moderators that could potentially explain the heterogeneity are unknown for treatments of internalizing disorders in youth.

This meta-analysis was driven by the fact that evidence of treatments for youth-internalizing disorders is mainly based on information on between-person effects from RCTs and the urge for information on within-person treatment effects and methods to study these. Further, while single-case studies are an acknowledged phenomenon in youth-internalizing intervention research, it is high time to start harvesting results of quantitative single-case research by systematically and quantitavely integrating its findings. Recent methodological developments, and collaborations between clinical researchers and methodologists make this endeavor possible. The aim of this meta-analysis is to provide a broad overview of the field and assess the overall within-person treatment effects, its heterogeneity and moderators that could explain this heterogeneity. As the single-case literature on internalizing disorders has not even been systematically reviewed yet, this study is exploratory in nature.

The planned analyses were pre-registered on OSF: https://osf.io/zjswg/?view_only=2b14c0d282e94a849f9d741a5cf759a1 . Due to the unkown number and nature of potentially included studies, the analysis plan was pre-registered after the search and during data extraction. The meta-analysis was conducted according to the PRISMA guidelines (Page et al., 2021 ).

Literature Search

We searched for quantitative single-case studies in three databases: PsycInfo, Medline, and Web of Science, without a lower limit for the date. The title, abstract, keywords, and subject headings were searched on January 6th 2020 with search terms from three categories: #1 Single-Case Experimental Studies, #2 Children and Adolescents, and #3 Internalizing and Externalizing Problems. During the screening process, the inclusion criteria were refined to only include internalizing disorders (not externalizing disorders), as a sufficient number of studies could be found on internalizing disorders in order to do a meta-analysis. This focus also made that we could limit the heterogenity between the studies to some extent. To include all current studies, a search was done in PsycInfo, Medline, and Web of Science with refined search criteria only including internalizing disorders in May 2021 for the period January 2020 to May 2021 and in February 2023 for the period May 2021 to February 2023. A full list of the original and refined search terms can be found in Supp1. In addition, Google scholar was checked for articles that were potentially missed and the references of all included studies were screened for possibly relevant studies. PsycArxiv and OSF were searched for gray literature in January 2021 and in February 2023.

Study Selection

For the current meta-analysis, inclusion criteria were as follows: a quantitative Single-Case (Experimental) Design [SC(E)D] was used; participants were children (4 through 17 years old) who at the start of the study met DSM criteria for anxiety, depression or posttraumatic stress disorder; who received treatment aimed at reduction of internalizing symptoms; and results on symptom severity or diagnostic status were reported at least at one assessment point before and one assessment point after the treatment. We included studies involving both experimental, quasi-experimental, and non-experimental single-case designs as we wanted to safeguard power of these analyses and provide a complete overview of SC(E)D research on internalizing disorders in youth. Cases with an intellectual disability or an IQ below 70 and cases with a medical condition were excluded.

The abstracts and the full texts of the studies were screened with the inclusion criteria by two independent raters. 20% of all abstracts and all full texts were double screened. Disagreement was resolved during discussion, through thorough checking of the inclusion criteria.

Outcome Variables and Moderators

The repeatedly assessed symptom severity across phases as depicted in individual graph data was the main outcome variable in this study. In most studies, only one main outcome variable was present. In studies where several variables were presented as outcome variables in graph data, one outcome variable was selected for the purpose of this meta-analysis, using the following criteria: (a) variable was related to the primary diagnosis (e.g., anxiety symptoms were chosen above comorbid ADHD symptoms); and (b) the same variables were assessed across different studies (e.g., spontaneous speech was assessed in selective mutism studies). In most studies, main outcome variable was self-reported by the child [45% in anxiety disorders (AD), 65% in major depressive disorder (MDD), 75% in posttraumatic stress disorder (PTSD) studies]. Parent-reported child symptoms were present in 42% of AD studies, 15% of MDD studies, and 12% of PTSD studies. The remainder of the outcome variables was rated by teachers or independent raters. Overall, behavioral outcome variables were parent and other reported (e.g., speech in class, separation from parents). An overview of variables indicating symptom severity included in the analyses can be found in Supp3.

Primary diagnosis (AD, MDD, or PTSD), according to DSM III, IV, or 5, of the cases, on pre-, post-, and follow-up treatment was included as a categorical outcome variable (yes/no).

Five potential moderators identified as important in previous studies on youth interventions (Maric et al., 2015 ) were tested: age, disorder category (AD, MDD, PTSD), target group (children, parents, parents and children), sample type (referred, recruited, referred and recruited, other), and treatment dosage (operationalized as number of sessions).

Data Extraction

Extraction of study characteristics, demographic, diagnosis, and treatment information was done independently by two raters. Graph data were extracted with DigitizeIt software 2.5 ( DigitizeIt , 2021 ; Rakap et al., 2016 ). Extraction of graph data was done by two independent raters, and 30% percent of the studies were cross checked. Finally, post- and follow-up diagnosis data were extracted. Prior to analyses, extracted data, and case characteristics were cross checked.

Quality Rating of the Studies

Studies were rated into non-experimental, quasi-experimental, and experimental single-case designs by MM (Tate et al., 2016 ; Supp3). Quality was assessed using 15-item Risk of Bias in N-of-1 Trials (RoBiNT) scale (Tate et al., 2013 ). It contains internal validity (IV; 7 items; e.g., ‘design’) and external validity and interpretation (EVI; 8 items; e.g., ‘therapeutic setting’). Items are rated on a 3-point scale (score range 0–2) with a maximum total score 30, and for the subscales 14 (IV) and 16 (EVI). Interrater reliability of RoBiNt scale between experienced raters ranges from 0.87 to 0.90, and from 0.93 to 0.95 between experienced and novice raters (Tate et al., 2013 ). YS and LB rated independently 30% of the studies resulting in interrater reliability (ICC) of 0.78, 0.78, and 0.72 for the total score, and IV and EVI subscale scores, respectively. Differences were resolved through discussions. Both YS and LB each rated half of the remaining 70% of the studies.

Statistical Analyses

Instead of calculating summary effect sizes for each study, and combining these as in regular meta-analyses, we combined the raw data from all cases (Van den Noortgate & Onghena, 2008 ). For both the repeated assessments of symptom severity and post-/follow-up treatment diagnosis, the data were analyzed using multilevel regression models. All analyses were done in R 4.1.2 using the package lme4 (Bates et al., 2015 ). Data and code can be found in Supp4/Supp5.

Analysis of symptom severity Several variables had to be (re)coded for the analysis of the repeated assessments of symptom severity. First, the phase variable was coded into baseline and treatment phase. For studies which included a baseline and two different treatment phases referring to the same therapies with, e.g., different intensity (e.g., Carlson, 1999) or different techniques from CBT (e.g., Nakamura, 2008), the two treatment phases were taken together. Due to a lack of available data across studies, it was decided to not include data from follow-up phases in this analysis.

Second, the time variable was coded such that it started at 0 at the beginning of the treatment (negative values were assigned to baseline phase) and increased with 1 for each additional week. Third, the scores indicating the symptom severity were reverse coded for some studies so that a score decrease implies improvement for each study. Finally, to be able to combine and compare scores between studies and cases, symptom severity scores were standardized as proposed by Van den Noortgate and Onghena ( 2008 ). This was done by estimating a two-level model for each study with phase (baseline phase, treatment phase), time, and their interactions as predictors, symptom severity as the outcome, and random effects for all predictors across cases. Subsequently, the original scores from each study were divided by the residual standard deviation of the model for this study. For studies with only one case, a linear regression model was estimated, and scores were divided by the estimated residual standard deviation. Finally, age was centered to ease interpretation for the moderator analysis.

To meta-analyze the (standardized) data from all cases, a three-level linear regression model with symptom severity as the dependent variable and phase (baseline phase, treatment phase), time, and their interaction as predictors was estimated. Here, the intercept and the effect of the time can be interpreted as the expected level at the end of the baseline phase, and the time trend in the baseline phase, whereas the effects of the phase and the interaction term can be interpreted as the immediate treatment effect at the start of the intervention, and the effect the treatment has on time trend (Van den Noortgate & Onghena, 2008 ). All four coefficients were allowed to vary randomly between cases and between studies. Furthermore, one model for each moderator (age, disorder type, target group, sample type, and treatment dosage) was estimated in which the moderator variable as well as its interaction with phase, with time, and with the interaction between phase and time were included as additional predictors.

Analysis of diagnostic status Due to inclusion criteria, all participants had a formal DSM diagnosis at pre-treatment. To assess the probability of a diagnosis at post-treatment and at follow-up, we conducted two-level logistic regressions with the diagnostic status (yes = 1, no = 0) at the respective time point as the outcome variable. First, an intercept model was estimated to evaluate the overall probability of a diagnosis, separately for post-treatment and for follow-up. Subsequently, for moderator analysis, three models were estimated with age, disorder category, and target group as respective predictors, again in separate analyses for diagnostic status at post-treatment and at follow-up. Due to limited amount of data for diagnosis data, we limited this analysis to the three most important moderators. For all models, the intercept was allowed to vary between studies.

102 studies matched the inclusion criteria. 27 studies were excluded due to unclarity about diagnostic procedures or absence of information about treatment outcomes on the case level. Four additional studies were excluded as the data in these studies could not be standardized for the purposes of multilevel meta-analysis. Thus, a total of 71 studies were included. From these 71 included studies, we still had to exclude 14 individual cases as they did not fit inclusion criteria, leaving a total number of cases of 321. A detailed overview of the study inclusion process is presented in PRISMA flow chart (Fig.  1 ). An overview of the excluded studies can be found in Supp2.

An external file that holds a picture, illustration, etc.
Object name is 10567_2023_432_Fig1_HTML.jpg

PRISMA flow diagram of included studies and cases

Study and Case Characteristics

A summary of study characteristics is presented in Supp3 wherein case-level information on all variables can be found. Number of cases per study ranged from 1 to 17 ( M  = 4.45). Total number of dropouts was 12 and ranged from 0 (48 studies) to 6 (1 study); 7 studies did not report on drop out. Mean age of the cases was 10.66 years ( SD  = 3.49; range 4–17); 55% of the cases were female (for about 50 cases gender was not reported or only on a sample level). For about 30% of the cases, no information regarding ethnicity or cultural background of participants was reported (Supp3). In 22% of the cases, no information about comorbidity was reported. In 18% of the cases, no comorbidity was observed. Comorbidity with another internalizing disorder was present in almost 40% of the cases, with an externalizing disorder in 4%, and with ADHD and ASS in 18% and 4% of the cases, respectively. 4% of the cases had other comorbid disorder (e.g., learning, speech, sleeping disorder).

Length of the treatment in weeks ranged between 0.02 (3-h treatment) and 68 weeks. On average, the treatments consisted of 10.47 sessions, ranging from 1 to 36 sessions. Overall, average scores for the total RoBiNT scale, and internal and external validity subscales were 12.06 ( SD  = 3.85; score range 4–21), 3.17 ( SD  = 2.52; score range 0–9), and 8.89 ( SD  = 2.26; score range 4–14), respectively. In Supp3, quality scores of (sub)scale(s) for each study can be found.

Within-Person Symptom Change

Average treatment effect : 4,153 datapoints from 222 cases from 47 studies were available. The first 3-level regression analysis with phase, time, and their interaction as predictors indicated that, on average, there was (across cases and studies) a significant immediate reduction of the symptom severity at the start of the treatment phase; b  = − 0.67, 95% CI = [− 1.10; − 0.24], p  = 0.002. The symptom severity significantly reduced with 0.14 standard deviation per week during the baseline phase; b  = − 0.14, 95% CI = [− 0.24; − 0.03], p  = 0.031. The linear decrease in symptoms became more pronounced during the treatment phase when compared to the baseline phase as indicated by the interaction between phase and time; b  = − 0.27, 95% CI = [− 0.47–0.07], p  = 0.005, resulting in a reduction of 0.41 (= − 0.14–0.27) standard deviation per week during the treatment phase. These analyses were re-done excluding three studies that were concerned with medication treatment and one study concerned with animal-assisted therapy, respectively (Table ​ (Table1). 1 ). In both cases, the conclusions remained the same. Similarly, we repeated this analysis when only including studies with experimental designs. Again the conclusions remained the same. 1

Summary of study and case characteristics ( N  = 71 studies; 321 cases)

AD Anxiety Disorders; MDD Major Depressive Disorder; PTSD Post Traumatic Stress Disorder; CBT Cognitive Behavioral Treatment; CBM Cognitive Bias Modification. NR not reported

a Determined via semi-structured diagnostic interviews (87% of the cases), via clinical interview (screening; 12% of the cases). In four cases both diagnostic interview and screening were used. In one case information about the method was unavailable, but it was clearly stated that the participant had a formal DSM diagnosis and was hospitalized for that

b Moved from another trial

c Schools, community services, hospital

d Acceptance and Commitment Therapy, Interpersonal Psychotherapy, Mindfulness, Equine-assisted trauma therapy

e Four studies had mixed target groups for their cases

Heterogeneity of the treatment effect The symptom development for each study during the baseline and the treatment phase is depicted in Fig.  2 , and the estimated effects for all individual cases are depicted in Fig.  3 . Compared to the variation within subjects ( σ  = 0.94), there was much variation for the intercept between the studies (the estimated standard deviation τ  = 3.17) and cases ( τ  = 2.19). This shows that the symptom severity was very different between cases and studies at treatment start. Also the treatment effects varied considerably between studies and cases. Between the studies, the effect of phase (i.e., the immediate reduction in symptom severity at treatment start) varied with τ  = 1.14 and the effect of phase*time (i.e., difference in symptom reduction between the baseline and treatment phase) varied with τ  = 0.59. Given the assumption that effects are normally distributed across studies, this would mean that for 95% of the studies, the immediate treatment effect ranges between − 2.90 and 1.56, and that the effect is negative for 72% of the studies. For the effect on the time trend (phase*time), the 95% prediction interval is [− 1.43; 0.89] with a negative effect for 68% of the studies. Between cases, the effect of phase varied with an estimated standard deviation τ  = 1.49 and the effect of phase*time with τ  = 0.36. These results suggest that for 95% of the cases and a typical study, the immediate effect varies between − 3.59 and 2.25 (with 67% of the effects being negative), and the effect of phase*time varies between -0.98 and 0.44, and that the effect is negative for about 77% of the cases. Figure  3 visualizes that despite large variability in the individual treatment effects, for the majority of cases a reduction is expected in symptom severity as a response to the treatment, although for a minority of cases, this reduction is also statistically significant. Across cases, there is a large negative correlation between the random intercept and the random effects of the interaction between phase and time, r  = − 0.75. This indicates that the larger the symptom severity at the end of the baseline, the more pronounced is the symptom severity reduction in the treatment phase compared to the baseline phase.

An external file that holds a picture, illustration, etc.
Object name is 10567_2023_432_Fig2_HTML.jpg

Estimated effects for the symptom development during baseline and treatment phase for each study Note The black line represents the average effects across all studies; for time < 0 the lines represent the estimated slope during the baseline phase, for time > 0 the lines represent the estimated slope during the treatment phase; the “drop” at t  = 0 indicates the immediate symptom reduction at treatment start; symptom severity was standardized across studies and values < 0 mean no symptoms anymore; the time interval varied between studies. Estimates are empirical Bayes estimates

An external file that holds a picture, illustration, etc.
Object name is 10567_2023_432_Fig3_HTML.jpg

Estimated treatment effect for each individual case Note Each dot represents the empirical Bayes estimate of the effect for each individual case, lines represent the corresponding 95% confidence intervals

Moderators The results for the three-level regressions which include the moderator variables are displayed in Supp6. Almost none of the variables showed a significant effect on the baseline level or trend in symptom severity, nor on the immediate effect of time trend (interaction with phase and time respectively). Only the immediate symptom reduction during the treatment phase seems to be more pronounced for cases with PTSD when compared to cases with AD or MDD; b  = − 1.16 [− 2.19; − 0.14], p  = 0.027.

Diagnostic Status at Post-Treatment and Follow-up

Data on diagnostic status were available for 268 cases from 62 studies at post-treatment and 191 cases from 44 studies at follow-up. At post-treatment, 28.46% ( n  = 76) cases still met diagnostic criteria for AD, MDD, or PTSD. At follow-up, 21.99% ( n  = 42) cases still met diagnostic criteria for AD, MDD, or PTSD. Results of all multilevel logistic regressions can be found in Table ​ Table2 2 for the probability of a diagnosis at post-treatment and at follow-up. Across all studies, the average probability of a diagnosis was 0.14 (95% CI: [0.05; 0.31]). This indicates that the likelihood of a diagnosis markedly reduced after treatment (before treatment all cases were diagnosed with an internalizing disorder). However, the between-study variance was rather large (Table ​ (Table2) 2 ) resulting in a 95% prediction interval for the study-specific probabilities of a diagnosis between 0.0006 and 0.97. This shows that studies varied to a large extent in how likely cases had a diagnosis at post-treatment. At follow-up, the average probability of a diagnosis was 0.12 (95% CI: [0.04; 0.28]). The between-study variance was also large for follow-up (Table ​ (Table2) 2 ) with the 95% prediction interval for the random study effects ranging between 0.001 and 0.93.

Outcome of the two-level logistic regressions estimating the probability of diagnosis

Variance values are on a logit scale. CI = 95% confidence interval

Age does not seem to have an effect on the diagnosis probability, both for post-treatment and follow-up ( p  = 0.65 and p  = 0.17; also see Tables ​ Tables2). 2 ). The same holds for the variables target group ( p  = 0.48 and p  = 0.38, respectively) and disorder category ( p  = 0.08 for post-treatment and p  = 0.85 for follow-up). Still, the average likelihood of a diagnosis at post-treatment and follow-up is indicated to be below chance also when age, target group, or disorder category is taken into account, see Table ​ Table2. 2 . The large between-study variances and the large intra-class correlations (Table ​ (Table2) 2 ) for both post-treatment and follow-up models indicate that a considerable part of the variation in the diagnosis probability is due to differences between studies.

This study is, to our knowledge, the first study to investigate meta-analytically within-person changes instead of between-person differences for evaluating the treatment of youth mental health problems. Evidence for symptom reduction during the treatment phase in comparison to baseline phase was found. Although already during the baseline a slight decrease in symptom severity over time was observed, a larger decrease was found during the treatment phase across cases and studies (Fig.  2 ). Further, we showed that these within-person treatment effects were positive for the majority of cases but still varied to a large extent between studies and cases. Regarding potential moderators, the immediate decrease in symptom severity of cases with PTSD at treatment start seemed to be more evident compared to cases with AD or MDD. Overall, large improvements at post-treatment and even larger at follow-up were observed for the change in diagnostic status. After the treatments, the likelihood of having an internalizing diagnosis markedly decreased as opposed to the start of the treatment.

Our results indicate that, overall, treatments for internalizing disorders in youths as evaluated in quantitative single-case research seem to be effective in reducing clinical symptoms during treatment. In addition, a positive change in the diagnostic status was observed. These findings are a valuable addition to previous knowledge on treatment effects for internalizing problems in youth from (meta-analyses) of RCTs as they are based on within-person comparisons and tested in a wide range of individuals. These positive results are informative from the clinical point of view as the majority of the sample concerned referred cases that are in general characterized by high severity and comorbidity and harder to be treated. This hypothesis was not quantitatively tested in this meta-analysis, but our impression of the treatments utilized in included studies is that at least in the half of the studies treatments were tailored to some client characteristic. For example, in some studies, treatments were tailored to a specific age group (e.g., young children, Choate et al., 2005; adolescents, Leigh & Clark, 2016), condition (e.g., comorbid AD and ADHD; Jarrett & Ollendick, 2012), or symptom (e.g., behavioral treatment of MDD, Frame et al., 1982). Potentially, this may have impacted the positive results. In addition, this might have also perhaps influenced rather low dropout rates of cases in the studies included in our meta-analysis.

One of the most important findings concerns the ‘variability’ of case characteristics and individual treatment effects. Our results showed that cases and studies are very heterogeneous; there are differences between the cases (within the same study) in demographics, and differences between the studies in designs, type of treatments, number of sessions, and length of treatment (Supp3). In line with this, although overall positive, treatment effects largely varied between cases and studies (Figs.  2 and ​ and3). 3 ). This emphasizes that variability is a legitime concern in youth intervention research. Worringly, this individual variability has potentially been overlooked in group-level studies. By meta-analyzing single-case studies, we could, for the first time, quantify and describe this heterogeneity in individual treatment effects. In our study, youths with PTSD experienced the most immediate improvement in symptom reduction at treatment start as opposed to baseline when compared to youth with AD or MDD. No other moderating effects were found. This is probably due to various other, not investigated variables that introduced heterogeneity between studies, cases, and treatment effects. Further, the number of studies and, thus, the statistical power, were potentially too low to assess more fine-grained moderator effects.

Next to the limited amount of studies, a notable limitation of our study is that, in graph data analysis, both child and other (parent, observer) report of the outcome variable were included, based on the availability in the studies. It seems that more behavioral symptoms were always rated by others in the included studies. Despite different reporters, the current meta-analysis offers the first overview of quantitative single-case research on internalizing youth and provides empirical evidence for an overall positive within-person treatment effect and considerate heterogeneity between studies and cases for this treatment effects. With the surge of single-case research in internalizing youth, future meta-analyzes will be able to better evaluate moderators explaining the variability of individual treatment effects.

It is worthwhile mentioning that the overall quality of included studies was rated as below average, although there were differences between the studies in quality scores (Supp3). Our general impression was that at least in some studies criteria were fulfilled (such as quality criterium ‘treatment adherence’), but this information was not explicitly reported in the specific article. While high-quality guidelines exist for conducting (What Works Clearinghouse, [Kratochwill et al., 2010 ]) and reporting (Tate et al., 2016 ) single-case research, much of these guidelines seem left unused in single-case research. The major problems that hinder the utilization seem to be different interpretations of the criteria and the absence of clear procedures for the application of these standards (Maggin et al., 2013 ). In addition, specific guidelines are necessary for conducting single-case studies in different contexts (laboratory research vs. real-world research), and the designs should be tailored to research questions and aims of the studies, also to increase uniformity of different studies and their generalizability.

In sum, this is, as far as we know, the first study that explored the generalizability of treatment effects found in single-case research in youth treatment outcome studies by meta-analyzing during-treatment within-person changes. Overall, a positive impact of treatments was found for youth-internalizing disorders and symptoms, and the estimated effect was positive for the majority of cases and studies. Yet we also found a large variability between studies and cases in their characteristics and treatment effects. While it is yet to be determined what exactly explains the variation in effects, it is certain that the treatments as evaluated in single-case research hold great clinical potential for youth with mental health problems, and that recent advances in idiosyncratic research methods can help optimally learn what helps for whom and through which mechanisms.

Appendix 1: Studies Included in the Meta-analysis

  • Barterian, J. A., Sanchez, J. M., Magen, J., Siroky, A. K., Mash, B. L., & Carlson, J. S. (2018). An examination of fluoxetine for the treatment of selective mutism using a nonconcurrent multiple-baseline single-case design across 5 cases.  Journal of Psychiatric Practice® ,  24 (1), 2-14.
  • Bechor, M., Pettit, J. W., Silverman, W. K., Bar-Haim, Y., Abend, R., Pine, D. S., ... & Jaccard, J. (2014). Attention bias modification treatment for children with anxiety disorders who do not respond to cognitive behavioral therapy: A case series. Journal of Anxiety Disorders, 28 (2), 154-159.
  • Bowyer, L., Wallis, J., & Lee, D. (2014). Developing a compassionate mind to enhance trauma-focused CBT with an adolescent female: A case study. Behavioural and Cognitive Psychotherapy, 42 (2), 248-254.
  • Carlson, J. S., Kratochwill, T. R., & Johnston, H. F. (1999). Sertraline treatment of 5 children diagnosed with selective mutism: A single-case research trial. Journal of Child and Adolescent Psychopharmacology, 9 (4), 293-306.
  • Chevalier, L. L. (2020). Evaluation of a treatment of sleep-related problems in children with anxiety using a multiple baseline design (Doctoral dissertation, Boston University).
  • Choate, M. L., Pincus, D. B., Eyberg, S. M., & Barlow, D. H. (2005). Parent-child interaction therapy for treatment of separation anxiety disorder in young children: A pilot study. Cognitive and Behavioral Practice, 12 (1), 126-135.
  • Chorpita, B. F., Albano, A. M., Heimberg, R. G., & Barlow, D. H. (1996). A systematic replication of the prescriptive treatment of school refusal behavior in a single subject. Journal of Behavior Therapy and Experimental Psychiatry, 27 (3), 281-290.
  • Cooper-Vince, C. E., Chou, T., Furr, J. M., Puliafico, A. C., & Comer, J. S. (2016). Videoteleconferencing early child anxiety treatment: A case study of the internet-delivered PCIT CALM (I-CALM) program. Evidence-Based Practice in Child and Adolescent Mental Health, 1 (1), 24-39.
  • Cowart, M. J., & Ollendick, T. H. (2011). Attention training in socially anxious children: a multiple baseline design analysis. Journal of Anxiety Disorders, 25 (7), 972-977.
  • Cunningham, M. J., Wuthrich, V. M., Rapee, R. M., Lyneham, H. J., Schniering, C. A., & Hudson, J. L. (2009). The Cool Teens CD-ROM for anxiety disorders in adolescents. European Child & Adolescent Psychiatry, 18 (2), 125-129.
  • Eckshtain, D., & Gaynor, S. T. (2009). Assessing outcome in cognitive behavior therapy for child depression: An illustrative case series. Child & Family Behavior Therapy, 31 (2), 94-116.
  • Ehrenreich, J. T., Goldstein, C. R., Wright, L. R., & Barlow, D. H. (2009). Development of a unified protocol for the treatment of emotional disorders in youth. Child & Family Behavior Therapy, 31 (1), 20-37.
  • Ehrenreich-May, J., Simpson, G., Stewart, L. M., Kennedy, S. M., Rowley, A. N., Beaumont, A., ... & Wood, J. J. (2020). Treatment of anxiety in older adolescents and young adults with autism spectrum disorders: A pilot study. Bulletin of the Menninger Clinic, 84 (2), 105-136.
  • Eisen, A. R., & Silverman, W. K. (1998). Prescriptive treatment for generalized anxiety disorder in children. Behavior Therapy, 29 (1), 105-121.
  • Eisen, A. R., Raleigh, H., & Neuhoff, C. C. (2008). The unique impact of parent training for separation anxiety disorder in children. Behavior Therapy, 39 (2), 195-206.
  • Esveldt-Dawson, K., Wisner, K. L., Unis, A. S., Matson, J. L., & Kazdin, A. E. (1982). Treatment of phobias in a hospitalized child. Journal of Behavior Therapy and Experimental Psychiatry, 13 (1), 77-83.
  • Farrell, S. P., Hains, A. A., & Davies, W. H. (1998). Cognitive behavioral interventions for sexually abused children exhibiting PTSD symptomatology. Behavior Therapy, 29 (2), 241-255.
  • Farrell, L. J., Kershaw, H., & Ollendick, T. (2018). Play-modified one-session treatment for young children with a specific phobia of dogs: a multiple baseline case series. Child Psychiatry & Human Development, 49 (2), 317-329.
  • Farrell, L. J., Miyamoto, T., Donovan, C. L., Waters, A. M., Krisch, K. A., & Ollendick, T. H. (2021). Virtual reality one-session treatment of child-specific phobia of dogs: A controlled, multiple baseline case series. Behavior Therapy, 52 (2), 478-491.
  • Feather, J. S. (2006). Trauma-focused cognitive behavioural therapy for abused children with. New Zealand Journal of Psychology, 35 (3).
  • Feather, J. S., & Ronan, K. R. (2009). Trauma‐focused CBT with maltreated children: A clinic‐based evaluation of a new treatment manual. Australian Psychologist, 44 (3), 174-194.
  • Fernandez, S., DeMarni Cromer, L., Borntrager, C., Swopes, R., Hanson, R. F., & Davis, J. L. (2013). A case series: Cognitive-behavioral treatment (exposure, relaxation, and rescripting therapy) of trauma-related nightmares experienced by children. Clinical Case Studies, 12 (1), 39-59.
  • Frame, C., Matson, J. L., Sonis, W. A., Fialkov, M. J., & Kazdin, A. E. (1982). Behavioral treatment of depression in a prepubertal child. Journal of Behavior Therapy and Experimental Psychiatry, 13 (3), 239-243.
  • Francis, D., Hudson, J. L., Kohnen, S., Mobach, L., & McArthur, G. M. (2021). The effect of an integrated reading and anxiety intervention for poor readers with anxiety. PeerJ, 9 , e10987.
  • Gaynor, S. T., & Harris, A. (2008). Single-participant assessment of treatment mediators: Strategy description and examples from a behavioral activation intervention for depressed adolescents. Behavior Modification, 32 (3), 372-402.
  • Geuke, G. G., Maric, M., Miočević, M., Wolters, L. H., & de Haan, E. (2019). Testing mediators of youth intervention outcomes using single‐case experimental designs. New directions for child and adolescent development, 2019(167), 39-64.
  • Girling-Butcher, R. D., & Ronan, K. R. (2009). Brief cognitive-behavioural therapy for children with anxiety disorders: Initial evaluation of a program designed for clinic settings. Behaviour Change, 26 (1), 27-53.
  • Goodall, B., Chadwick, I., McKinnon, A., Werner‐Seidler, A., Meiser‐Stedman, R., Smith, P., & Dalgleish, T. (2017). Translating the cognitive model of PTSD to the treatment of very young children: A single case study of an 8‐year‐old motor vehicle accident survivor. Journal of Clinical Psychology, 73 (5), 511-523.
  • Hagopian, L. P., & Slifer, K. J. (1993). Treatment of separation anxiety disorder with graduated exposure and reinforcement targeting school attendance: A controlled case study. Journal of Anxiety Disorders, 7 (3), 271-280.
  • Heard, P. M., Dadds, M. R., & Conrad, P. (1992). Assessment and treatment of simple phobias in children: Effects on family and marital relationships. Behaviour Change , 9 (2), 73-82.
  • Hendriks, L., de Kleine, R. A., Heyvaert, M., Becker, E. S., Hendriks, G. J., & van Minnen, A. (2017). Intensive prolonged exposure treatment for adolescent complex posttraumatic stress disorder: a single‐trial design. Journal of Child Psychology and Psychiatry, 58 (11), 1229-1238.
  • Howard, B. L., & Kendall, P. C. (1996). Cognitive-behavioral family therapy for anxiety-disordered children: A multiple-baseline evaluation. Cognitive Therapy and Research, 20 (5), 423-443.
  • Jacob, M., L. Keeley, M., Ritschel, L., & Craighead, W. E. (2013). Behavioural activation for the treatment of low‐income, African American adolescents with major depressive disorder: a case series. Clinical Psychology & Psychotherapy, 20 (1), 87-96.
  • Jarrett, M. A., & Ollendick, T. H. (2012). Treatment of comorbid attention-deficit/hyperactivity disorder and anxiety in children: A multiple baseline design analysis. Journal of Consulting and Clinical Psychology, 80 (2), 239.
  • Kane, M. T., & Kendall, P. C. (1989). Anxiety disorders in children: A multiple-baseline evaluation of a cognitive-behavioral treatment. Behavior Therapy, 20 (4), 499-508.
  • Leger, E., Ladouceur, R., Dugas, M. J., & Freeston, M. H. (2003). Cognitive-behavioral treatment of generalized anxiety disorder among adolescents: A case series. Journal of the American Academy of Child & Adolescent Psychiatry, 42 (3), 327-330.
  • Leigh, E., & Clark, D. M. (2016). Cognitive therapy for social anxiety disorder in adolescents: a development case series. Behavioural and Cognitive Psychotherapy, 44 (1), 1-17.
  • Lewis, K. M., Amatya, K., Coffman, M. F., & Ollendick, T. H. (2015). Treating nighttime fears in young children with bibliotherapy: Evaluating anxiety symptoms and monitoring behavior change. Journal of Anxiety Disorders, 30 , 103-112.
  • Lumpkin, P. W., Silverman, W. K., Weems, C. F., Markham, M. R., & Kurtines, W. M. (2002). Treating a heterogeneous set of anxiety disorders in youths with group cognitive behavioral therapy: A partially nonconcurrent multiple-baseline evaluation. Behavior Therapy, 33 (1), 163-177.
  • Malboeuf-Hurtubise, C., Lacourse, E., Herba, C., Taylor, G., & Amor, L. B. (2017). Mindfulness-based intervention in elementary school students with anxiety and depression: a series of n-of-1 trials on effects and feasibility. Journal of Evidence-Based Complementary & Alternative Medicine , 22 (4), 856-869.
  • Maric, M., De Haan, E., Hogendoorn, S. M., Wolters, L. H., & Huizenga, H. M. (2015). Evaluating statistical and clinical significance of intervention effects in single-case. experimental designs: An SPSS method to analyze univariate data. Behavior Therapy , 46 (2), 230–241.
  • Mayer-Brien, S., Turgeon, L., & Lanovaz, M. J. (2017). Effects of a parent training programme for the treatment of young children with separation anxiety disorder. The Cognitive Behaviour Therapist, 10 .
  • Nakamura, B. J., Pestle, S. L., & Chorpita, B. F. (2009). Differential sequencing of cognitive-behavioral techniques for reducing child and adolescent anxiety. Journal of Cognitive Psychotherapy, 23 (2), 114-135.
  • Nelissen, I., Muris, P., & Merckelbach, H. (1995). Computerized exposure and in vivo exposure treatments of spider fear in children: Two case reports.  Journal of Behavior Therapy and Experimental Psychiatry ,  26 (2), 153-156.
  • Neuhoff, C. C. (2006). Prescriptive treatment for separation anxiety disorder: child therapy versus parent training (Doctoral dissertation, Fairleigh Dickinson University).
  • O'Reilly, M., McNally, D., Sigafoos, J., Lancioni, G. E., Green, V., Edrisinha, C., ... & Didden, R. (2008). Examination of a social problem-solving intervention to treat selective mutism. Behavior Modification, 32 (2), 182-195.
  • Oar, E. L., Farrell, L. J., & Ollendick, T. H. (2015). One session treatment for specific phobias: An adaptation for paediatric blood–injection–injury phobia in youth. Clinical Child and Family Psychology Review, 18 (4), 370-394.
  • Olivier, E., de Roos, C., & Bexkens, A. (2021). Eye movement desensitization and reprocessing in young children (ages 4–8) with posttraumatic stress disorder: A multiple-baseline evaluation.  Child Psychiatry & Human Development , 53 , 1391-1404.
  • Ollendick, T. H. (1995). Cognitive behavioral treatment of panic disorder with agoraphobia in adolescents: A multiple baseline design analysis. Behavior Therapy , 26 (3), 517-531.
  • Ollendick, T. H., Hagopian, L. P., & Huntzinger, R. M. (1991). Cognitive-behavior therapy with nighttime fearful children. Journal of Behavior Therapy and Experimental Psychiatry, 22(2), 113-121.
  • Ollendick, T., Muskett, A., Radtke, S. R., & Smith, I. (2021). Adaptation of one-session treatment for specific phobias for children with autism spectrum disorder using a non-concurrent multiple baseline design: A preliminary investigation. Journal of Autism and Developmental Disorders, 51 (4), 1015-1027.
  • Ortega, M. L. (2012). The generalization of verbal speech across multiple settings for children with selective mutism: A multiple-baseline design pilot study (Doctoral dissertation). Available from ProQuest Dissertations & Theses Global database. (UMI No. 3460734).
  • Ooi, Y. P., Raja, M., Sung, S. C., Fung, D. S., & Koh, J. B. (2012). Application of a web-based cognitive-behavioural therapy programme for the treatment of selective mutism in Singapore: a case series study . Singapore Medical Journal, 53 (7), 446-450.
  • Pasquinelli, S. (2009). The efficacy of treating adolescent depression with Interpersonal Psychotherapy for Adolescents (IPT-A) in the school setting (Doctoral. Dissertation, Duquesne University, Pittsburg, USA). Retrieved from: https://dsc.duq.edu/etd/1024/ .
  • Pathak, S., Johns, E. S., & Kowatch, R. A. (2005). Adjunctive quetiapine for treatment-resistant adolescent major depressive disorder: A case series. Journal of Child & Adolescent Psychopharmacology, 15 (4), 696-702.
  • Quero, S., Nebot, S., Rasal, P., Breton-Lopez, J., Banos, R. M., & Botella, C. (2014). Information and communication technologies in the treatment of small animals phobia in childhood. Behavioral Psychology-Psicología Conductual, 22 (2), 257-276.
  • Radtke, S. R., Muskett, A., Coffman, M. F., & Ollendick, T. H. (2022). Bibliotherapy for specific phobias of dogs in young children: a pilot study.  Journal of child and family studies , 32 , 373-383.
  • Ravid, A., Lagbas, E., Johnson, M., & Osborne, T. L. (2021). Targeting co-sleeping in children with anxiety disorders using a modified bedtime pass intervention: A case series using a hanging criterion design. Behavior Therapy, 52 (2), 298-312.
  • Reuland, M. M., & Teachman, B. A. (2014). Interpretation bias modification for youth and their parents: A novel treatment for early adolescent social anxiety. Journal of Anxiety Disorders , 28(8), 851-864.
  • Ruiz García, A., & Valero Aguayo, L. (2020). INTERVENCIÓN MEDIANTE EXPOSICIÓN MULTIMEDIA EN UN CASO DE FOBIA INFANTIL A LAS AVISPAS. Behavioral Psychology/Psicología Conductual, 28 (2).
  • Saigh, P. A. (1987). In vitro flooding of childhood posttraumatic stress disorders: A systematic replication. Professional School Psychology, 2(2), 135.
  • Salazar, D. M., Ruiz, F. J., Ramírez, E. S., & Cardona-Betancourt, V. (2020). Acceptance and commitment therapy focused on repetitive negative thinking for child depression: a randomized multiple-baseline evaluation. The Psychological Record, 70 (3), 373-386.
  • Santucci, L. C., Ehrenreich, J. T., Trosper, S. E., Bennett, S. M., & Pincus, D. B. (2009). Development and preliminary evaluation of a one-week summer treatment program for separation anxiety disorder. Cognitive and Behavioral Practice, 16 (3), 317-331.
  • Simons, M., & Vloet, T. D. (2018). Emetophobia–a metacognitive therapeutic approach for an overlooked disorder. Zeitschrift für Kinder-und Jugendpsychiatrie und Psychotherapie, 46 , 57-66 .
  • Storch, E. A., Nadeau, J. M., Rudy, B., Collier, A. B., Arnold, E. B., Lewin, A. B., ... & Murphy, T. K. (2015). A case series of cognitive-behavioral therapy augmentation of antidepressant medication for anxiety in children with autism spectrum disorders. Children's Health Care, 44 (2), 183-198.
  • Suveg, C., Kendall, P. C., Comer, J. S., & Robin, J. (2006). Emotion-focused cognitive-behavioral therapy for anxious youth: A multiple-baseline evaluation.  Journal of Contemporary Psychotherapy ,  36 (2), 77-85.
  • Taylor, L. K., & Weems, C. F. (2011). Cognitive-behavior therapy for disaster-exposed youth with posttraumatic stress: results from a multiple-baseline examination. Behavior Therapy, 42 (3), 349-363.
  • Tolin, D. F., Whiting, S., Maltby, N., Diefenbach, G. J., Lothstein, M. A., Hardcastle, S., ... & Gray, K. (2009). Intensive (daily) behavior therapy for school refusal: A multiple baseline case series. Cognitive and Behavioral Practice, 16 (3), 332-344.
  • Unterhitzenberger, J., Eberle-Sejari, R., Rassenhofer, M., Sukale, T., Rosner, R., & Goldbeck, L. (2015). Trauma-focused cognitive behavioral therapy with unaccompanied refugee minors: a case series. BMC psychiatry, 15 (1), 1-9.
  • Waters, A. M., Donaldson, J., & Zimmer-Gembeck, M. J. (2008). Cognitive–behavioural therapy combined with an interpersonal skills component in the treatment of generalised anxiety disorder in adolescent females: A case series. Behaviour Change, 25 (1), 35-43.
  • Yorke, J., Nugent, W., Strand, E., Bolen, R., New, J., & Davis, C. (2013). Equine-assisted therapy and its impact on cortisol levels of children and horses: A pilot study and meta-analysis. Early Child Development and Care, 183 (7), 874-894.

Below is the link to the electronic supplementary material.

Acknowledgements

We would like to thank student assistants Adelaide Ghisolfi and Rineke Bossenbroek for their short, but valuable help with data extraction, and student assistant Sera Wiechert for her help with abstract screening.

Data availability

Declarations.

The authors have no relevant financial or non-financial interests to disclose.

1 Results excluding medication treatment: b (phase) = − 0.73, 95% CI = [− 1.19; − 0.27]; b (time) =  − 0.13, 95% CI = [− 0.24; − 0.02]; b (phase*time) =  − 0.30, 95% CI = [− 0.52; − 0.07]. Results without the animal-assissted therapy: b (phase) =  − 0.72, 95% CI = [− 1.17; − 0.27]; b (time) =  − 0.12, 95% CI = [− 0.23; − 0.02]; b (phase*time) =  − 0.30, 95% CI = [− 0.50; − 0.11]. Results excluding non-experimental studies:: b (phase) =  − 0.59, 95% CI = [− 0.99; -0.19]; b (time) =  − 0.13, 95% CI = [− 0.23; -0.03]; b (phase*time) =  − 0.28, 95% CI = [− 0.52; − 0.05].

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Marija Maric, Lea Schumacher have Shared first authorship.

  • Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software. 2015; 67 (1):1–48. doi: 10.18637/jss.v067.i01. [ CrossRef ] [ Google Scholar ]
  • Crowe K, McKay D. Efficacy of cognitive-behavioral therapy for childhood anxiety and depression. Journal of Anxiety Disorders. 2017; 49 :76–87. doi: 10.1016/j.janxdis.2017.04.001. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • DigitizeIt—Plot Digitizer Software. Digitize Graphs, Charts and Math Data . I. Bormann (editor) (2021). Available online at: https://www.digitizeit.xyz/ (accessed May-Sept., 2021).
  • Dowdy A, Peltier C, Tincani M, Schneider WJ, Hantula DA, Travers JC. Meta-analyses and effect sizes in applied behavior analysis: A review and discussion. Journal of Applied Behavior Analysis. 2021; 54 (4):1317–1340. doi: 10.1002/jaba.862. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Heyvaert M, Maes B, Van Den Noortgate W, Kuppens S, Onghena P. A multilevel meta-analysis of single-case and small-n research on interventions for reducing challenging behavior in persons with intellectual disabilities. Research in Developmental Disabilities. 2012; 33 (2):766–780. doi: 10.1016/j.ridd.2011.10.010. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kazdin AE. Single-case experimental designs. Evaluating interventions in research and clinical practice. Behaviour Research and Therapy. 2019; 117 :3–17. doi: 10.1016/j.brat.2018.11.015. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kratochwill, T. R., Hitchcock, J., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case designs technical documentation. Retrieved from What Works Clearinghouse website: http://ies.ed.gov/ncee/ wwc/pdf/wwc_scd.pdf.
  • Maggin DM, Briesch AM, Chafouleas SM. An application of the what works clearinghouse standards for evaluating single-subject research: Synthesis of the self-management literature base. Remedial and Special Education. 2013; 34 (1):44–58. doi: 10.1177/0741932511435176. [ CrossRef ] [ Google Scholar ]
  • Maric M, Wiers RW, Prins PJ. Ten ways to improve the use of statistical mediation analysis in the practice of child and adolescent treatment research. Clinical Child and Family Psychology Review. 2012; 15 (3):177–191. doi: 10.1007/s10567-012-0114-y. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Maric M, Prins PJ, Ollendick TH, editors. Moderators and mediators of youth treatment outcomes. Oxford University Press; 2015. [ Google Scholar ]
  • McElroy E, Patalay P. In search of disorders: Internalizing symptom networks in a large clinical sample. Journal of Child Psychology and Psychiatry. 2019; 60 (8):897–906. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Merikangas KR, He JP, Burstein M, Swanson SA, Avenevoli S, Cui L, Swendsen J. Lifetime prevalence of mental disorders in US adolescents: Results from the national comorbidity survey replication-adolescent supplement (NCS-A) Journal of the American Academy of Child & Adolescent Psychiatry. 2010; 49 (10):980–989. doi: 10.1016/j.jaac.2010.05.017. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Onghena P, Michiels B, Jamshidi L, Moeyaert M, Van den Noortgate W. One by one: Accumulating evidence by using meta-analytical procedures for single-case experiments. Brain Impairment. 2018; 19 (1):33–58. doi: 10.1017/BrImp.2017.25. [ CrossRef ] [ Google Scholar ]
  • Onghena P, Tanious R, De TK, Michiels B. Randomization tests for changing criterion designs. Behaviour Research and Therapy. 2019; 117 :18–27. doi: 10.1016/j.brat.2019.01.005. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Oud M, De Winter L, Vermeulen-Smit E, Bodden D, Nauta M, Stone L, Stikkelbroek Y. Effectiveness of CBT for children and adolescents with depression: A systematic review and meta-regression analysis. European Psychiatry. 2019; 57 :33–45. doi: 10.1016/j.eurpsy.2018.12.008. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021; 88 :105906. [ PubMed ] [ Google Scholar ]
  • Rakap S, Rakap S, Evran D, Cig O. Comparative evaluation of the reliability and validity of three data extraction programs: UnGraph, GraphClick, and DigitizeIt. Computers in Human Behavior. 2016; 55 :159–166. doi: 10.1016/j.chb.2015.09.008. [ CrossRef ] [ Google Scholar ]
  • Richman DM, Barnard-Brak L, Grubb L, Bosch A, Abby L. Meta-analysis of noncontingent reinforcement effects on problem behavior. Journal of Applied Behavior Analysis. 2015; 48 (1):131–152. doi: 10.1002/jaba.189. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Schuurman, N. K. (2023). A" Within/Between Problem" Primer: About (not) separating within-person variance and between-person variance in psychology.
  • Tate RL, Perdices M, Rosenkoetter U, Wakim D, Godbee K, Togher L, McDonald S. Revision of a method quality rating scale for single-case experimental designs and n-of-1 trials: The 15-item Risk of Bias in N-of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation. 2013; 23 (5):619–638. doi: 10.1080/09602011.2013.824383. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tate RL, Perdices M, Rosenkoetter U, Shadish W, Vohra S, Barlow DH, Wilson B. The single-case reporting guideline in behavioural interventions (SCRIBE) 2016 statement. Physical Therapy. 2016; 96 (7):e1–e10. doi: 10.2522/ptj.2016.96.7.e1. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Van den Noortgate W, Onghena P. Combining single-case experimental data using hierarchical linear models. School Psychology Quarterly. 2003; 18 (3):325. doi: 10.1521/scpq.18.3.325.22577. [ CrossRef ] [ Google Scholar ]
  • Van den Noortgate W, Onghena P. A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention. 2008; 2 (3):142–151. doi: 10.1080/17489530802505362. [ CrossRef ] [ Google Scholar ]
  • Weems CF, Neill EL. Empirically supported treatment options for children and adolescents with posttraumatic stress disorder: Integrating network models and treatment components. Current Treatment Options in Psychiatry. 2020; 7 (2):103–119. doi: 10.1007/s40501-020-00206-y. [ CrossRef ] [ Google Scholar ]
  • Weisz, J. R., Kuppens, S., Ng, M. Y., Eckshtain, D., Ugueto, A. M., Vaughn-Coaxum, R., ... & Fordwood, S. R. (2017). What five decades of research tells us about the effects of youth psychological therapy: a multilevel meta-analysis and implications for science and practice. American Psychologist, 72 (2), 79. [ PubMed ]

No internet connection.

All search filters on the page have been cleared., your search has been saved..

  • All content
  • Dictionaries
  • Encyclopedias
  • Expert Insights
  • Foundations
  • How-to Guides
  • Journal Articles
  • Little Blue Books
  • Little Green Books
  • Project Planner
  • Tools Directory
  • Sign in to my profile My Profile

Not Logged In

  • Sign in Signed in
  • My profile My Profile

Not Logged In

Using Meta-Analysis to Make Sense of Large Data Sets

  • By: Crina Tarasi
  • Product: Sage Research Methods: Business
  • Publisher: SAGE Publications Ltd
  • Publication year: 2024
  • Online pub date: January 20, 2024
  • Discipline: Marketing
  • Methods: Meta-analysis , Statistical modelling , Effect size
  • DOI: https:// doi. org/10.4135/9781529684421
  • Keywords: branding , customers , emotion , satisfaction , stores Show all Show less
  • Online ISBN: 9781529684421 Copyright: © 2024 SAGE Publications Ltd More information Less information

This research case addresses how to apply meta-analyses to same-study large data sets. The study employed a systemic approach, building on a comprehensive customer satisfaction survey that a multinational retailer administered to its customers from 47 countries shopping in more than 400 stores. The survey data contained about 2.4 million responses to about 100 different questions regarding recent in-store interactions with the retailer. We used meta-analysis to understand how the brand factors, store factors, market factors, and consumer factors influenced the effects of various firm efforts over customer satisfaction. Customers form impressions regarding an experience through the lenses of their own goals and beliefs about the firm as well through the lenses of the markets and cultures they are part of. Given that firms have limited resources and must be effective in their marketing efforts, they need to focus on features that matter to the customers in their specific context. Meta-analysis is usually used across studies to estimate results from various contexts and models. However, we had more than 1,000 potential models to estimate and numerous moderators to test, and meta-analysis proved an effective method to analyze results across countries, stores, and individual goals. In this research case study, we focus on how to use meta-analysis for large, single-firm data sets and how to identify when this methodology is appropriate.

Learning Outcomes

By the end of this case study, students should be able to

  • assess whether meta-analysis is the appropriate methodology for their research question when a large data set is available;
  • expand the understanding of the meta-analysis methodology, with a focus on making sense of various potential moderators; and
  • have an appreciation for the complexity of evaluating what influences the customer experience and affects customer satisfaction.

Project Overview and Context

The paper “Managing a Global Retail Brand in Different Markets: Meta-Analyses of Customer Response to Service Encounters” ( Bolton et al., 2022 ) aimed to investigate how retailers can leverage their brand to shape satisfaction with service encounters across multiple markets. The idea was that what the brand communicates in a market (and customers’ interpretation of it), together with the customers’ goals, will shape the role that various elements of the retailer’s strategy will play in the customer experience. Our challenge was to understand how the perceived qualities of the brand (e.g., holistic vs. utilitarian; Voss et al., 2003 ) shape what the customers pay attention to in the stores, while acknowledging that customers come in with certain goals (such as browse or purchase) that also direct their attention and influence their perceptions. We explored managerially actionable factors, such as providing inspiration, and functional elements, such as waiting time and interactions with the staff. We worked on this project with a multinational retailer that was managing hundreds of stores in different markets and was searching for the right balance between how to standardize and adapt to each specific market.

Managers make countless resource allocation decisions, and our study offered insights into how to invest to optimize the customer experience ( Brakus et al., 2009 ). For example, when customers are in a store, searching to fulfill a specific need, will they look for inspiration, or is it more likely that they will appreciate quick service? Would waiting time matter more or less than the friendliness of the staff? This exploration led us to the next critical question: What theory could we use to inform our exploration, and did the data confirm the theoretical predictions? Construal theory ( Dhar & Kim, 2007 ), which pays close attention to how customer preferences are formed in context and allows for nuances and various influences, was the foundation of our theory development. Construal theory, which considers the environment in which impressions and evaluations are formed, allowed us to explain why what we observed in our study made sense and confirmed that it was not a one-time, haphazard occurrence.

We became interested in this study following an inquiry from a large, multinational retailer that had an extensive amount of customer data collected and wanted to make sense of how the brand perceptions affected their customers’ reactions in various markets and how the retailer should act to increase customer satisfaction. Investing limited resources effectively is a problem that most organizations encounter, and this organization was no different, with managers in various markets failing to agree on where efforts should be allocated.

We realized fairly early in our study that traditional ways of analysis would be computationally cumbersome, if at all possible, and that we needed to be creative to make sure that the effects we noticed were meaningful and not just the result of having too much statistical power. Plus, the number of factors in the models was beyond manageable; therefore, we decided to take a two-step approach. First, we estimated models by markets to help local managers understand what was going on, and then we employed meta-analysis to observe patterns across markets. Using meta-analysis, we looked at effect sizes, which were estimated not just based on the size of the effect but also on the variability observed.

Section Summary

  • Customers are complex individuals who make buying decisions in context and whose satisfaction is influenced by many factors. What customers bring to the table (goals, previous brand experiences, brand perceptions) influences how they react to a company’s efforts.
  • It is not enough to observe the “what”—you also have to figure out the “why.” By employing construal theory and testing it with single data set meta-analyses, we were able make sense of patterns and offer a deeper and actionable understanding of seemingly conflicting findings.
  • By understanding customers’ reactions to the environment we create, we can make informed investment decisions into the brand and the retail environment.

Research Design

Our research was completed in partnership with a global retailer that operates more than 400 stores in more than 40 countries. In collaboration with the retailer, a data set was compiled from surveys that were administered to customers across all stores and countries. Several problems are typically associated with these types of studies, including ensuring accuracy of meaning in all local languages and ease of completion. Researchers need to make sure that all respondents understand the same meaning when they answer and simplify as much possible so that customers do not abandon filling in the survey before it is complete. To overcome these challenges, whenever possible, the answers to satisfaction-related questions were presented as smiley faces ( Stange et al., 2018 ).

Through surveys, we collected data on store experience, customer goals (i.e., what was the primary goal of the store visit), and customer characteristics. Given that our data were cross-regional and multicultural, we also collected country-level data, such as economic indicators (e.g., industry growth rate, disposable income) and Hofstede cultural indices from public databases, such as Euromonitor and www.hofstede-insights.com . We paired these data with a global-brand survey database, where the company assessed at the country level perceived brand performance through such variables as brand quality, success of brand positioning, and share of wallet. The brand and customer satisfaction data were considered highly sensitive, and gaining access to the data would have been impossible without our carefully built relationships with company representatives.

Although customer satisfaction was the dependent variable in our data, we were not primarily interested in the effect of the independent variables on satisfaction. What we were interested in was (a) how the effect of independent variables differed across contexts and customer goals and (b) how the effectiveness of delivering on the brand promise affected the perception of the firm efforts that normally lead to brand satisfaction. Although we worked with only one brand, and the firm made concerted efforts to deliver on the brand’s promise, cultural and market conditions will affect how effective the outcomes are, which the country-level brand survey captured. Cultural differences and extant market conditions influence customers’ perceptions of brand attributes, with customers in some markets perceiving it as more utilitarian (aligned with task accomplishment) and others seeing it as more hedonic (enjoyable, inspiring).

Brand and customer satisfaction research streams are fairly mature. Yet they are critical for business success and contain many nuances that we still do not understand, especially when it comes to their intersection. Customer evaluations were based on their interaction with the brand and many factors beyond the firm’s control, which created the context for our decisions ( Ailawadi & Keller, 2004 ; Berry et al., 2002 ; Seiders et al., 2005 ). Market characteristics and customer goals, as they related to the brand, influenced the satisfaction sensed by the customers. Using data across a single firm and for a single brand allowed us to control for many factors that could have made our research overwhelmingly complicated without offering additional insight. Although no firm is able to implement absolutely identical strategies across markets, the brand was the same, the style was consistent, and the offering was highly similar.

  • Combining data sets is a powerful way to gain additional insights and to avoid common method biases.
  • Even in established fields, significant gaps in knowledge have not been investigated simply because they are not easy to address. They might require extensive data sets, new techniques, or both.
  • Cross-national variation affects brand perception, even when a firm aims for consistency.

Research Practicalities

Our data set included more than 2 million survey respondents across 400 stores and more than 40 countries. We collected many variables (more than 100) and considered many possible ways to analyze the data. One problem with this many records is that even though many variables are correlated, almost everything is significant. Everything being significant is not usually a problem in statistical analysis, but just because something is significant does not mean that the relationship is meaningful; it just might be the result of having too much statistical power ( Forstmeier & Schielzeth, 2011 ). This meant that telling a coherent and credible story from the data was very difficult by using the usual statistical methods. We had to figure out root relationships and factor data to reduce dimensions and multicollinearity in the models without losing valuable variability/information. Also, some variables, such as emotions, which clearly loaded on four factors, had to be tested to find out whether they maintained consistency across cultures. Of the 16 distinct emotions collected as dichotomous variables (yes/no), two emotions loaded on several factors and were inconsistent across countries, and therefore we removed them from the analysis. The remaining four factors were consistent and distinct: excitement, functionality, boredom, and frustration. We knew from theory that emotions mattered, and therefore we collected nuanced emotions; however, there was little theory to guide us with precision, and therefore data reduction through factor analysis proved to be an efficient method of taking nuanced emotions effectively into account.

Because we wanted to study how satisfaction forms in the context of the brand as influenced by the retailers’ decisions and outside factors, we were interested in moderation effects and, more specifically, what caused the coefficients of the independent variables in the customer satisfaction equation to differ across stores and market contexts. Because this field of research is mature, previous research has identified a wide plethora of factors that influence how customers form satisfaction evaluations. However, introducing interaction terms in the model for all relevant variables would have ballooned the models beyond manageable levels, even with so much data. To overcome the overcrowded-models problem, we estimated the satisfaction models at the store and customer-goal levels. By estimating models within each country, we implicitly captured country characteristics and did not have to test for inter-country measure equivalency ( Podsakoff et al., 2012 ). We collected the estimated effect sizes observed in the satisfaction models in a database, alongside the corresponding moderators at the store, country, and individual levels.

As one can imagine, meta-analysis was not part of the initial research plan, but we identified it as a potential solution when looking at the data and starting to estimate models.

Using meta-analysis had its own set of challenges, as we needed to figure out the correct methodology, overcoming the fact that single-study meta-analyses are scarce ( McShane & Böckenholt, 2017 ), as large data sets are only recently becoming more prevalent. We also had to acquire and learn how to use new software, as none of the researchers had prior hands-on experience with meta-analyses.

Availability of extensive amounts of data was a rare occurrence in the past, as data collection was expensive and required extensive resources. Therefore, most researchers are used to developing statistical techniques to alleviate the problem of finding significance with small data sets. Technology made acquiring data easy and cheap, and we are now more likely to encounter situations where data need to be managed responsibly because showing significant results has become easier, even when they might not be meaningful.

To deliver on our promise of a comprehensive, systemic analysis of influences on the drivers of customer satisfaction, it was necessary to include a large number of variables in the modeling; therefore, we also knew that we needed extensive amounts of data. However, we did not anticipate the messiness of interpreting the results, for which meta-analysis proved to be the only straightforward method to untangle meaningful effects from nonmeaningful effects.

  • To deal with multicollinearity in similar concepts, data reduction through factor analysis is a straightforward solution, especially when the variables are nuanced emotions and theory indicates that there might be underlying factors.
  • When employing systemic approaches to business phenomena, it is easy to build models that balloon beyond manageability through traditional techniques.
  • Too many data (rare in the past, more often encountered today), like too few data, can create confusion regarding how to interpret effects. When you have too many observations, everything is significant (although it might not be meaningful).
  • When stumbling, look outside the box and be willing to learn. We started to analyze the data by using traditional methods, such as hierarchical models, but then we realized that meta-analysis techniques suited the data best, even though such studies are rare.

Method in Action

To build the final models, we went through a multistep process. First, we developed and estimated the customer satisfaction equations, using as a dependent variable the customer’s satisfaction with the specific service encounter.

As other researchers have observed ( Oliver, 2014 ), when estimating the models, we noticed that the relationship between satisfaction and the main predictors were nonlinear, and therefore we used an exponential model to predict customer satisfaction. We included many control variables, including emotions experienced and household characteristics, as relevant to our context. As mentioned previously, customer satisfaction is a mature field, with an abundance of previous studies, some with confounding effects, and to deliver on our promise of a comprehensive and systemic understanding of drivers of customer satisfaction, we had to test for many relevant factors, describing context and said drivers.

Customer satisfaction is multifaceted, and we had to differentiate between customer satisfaction with the brand or the specific products and customer satisfaction with the service encounter. Our primary goal was to understand how the orchestration of the service encounter affected customer satisfaction; therefore, we needed to control for overall satisfaction with the brand ( Berry, 2000 ). Satisfaction with a brand is traditionally the sum of satisfaction with all previous encounters across touchpoints. Given that the relationship between the two forms of satisfaction was linear, our final model had the following form, effectively modeling the satisfaction with the encounter beyond the satisfaction with the brand for each store and customer goal:

  • (1) Ln(Customer Satisfaction with Service Encounter sg )= α sg ln(Brand)+ ∑β sg X sg , where s =store ( n =1 to 400) and g =goal ( n =1 to 3, browse, search, or buy). X sg included emotional and functional experience clues, such covariates as price fairness and products in stock, and such control variables as participation in loyalty programs and customer characteristics, including living situation, which was relevant because the retailer specialized in home furnishings.

For each store (400), we estimated a model for each customer goal (3), resulting in 1,200 model outputs. However, we removed from analysis all estimates for models with fewer than 100 observations, which for 24 included variables would have been unreliable. Therefore, we ended up with 930 model estimates, which were paired with brand data and country/market data, as explained previously, to be used as controls or moderators.

Upon estimating the customer satisfaction models, we moved to the next stage: estimating the meta-analysis. Using the effect sizes from the ordinary least squares (OLS) estimations of equation (1), brand and country market data from the additional databases, we constructed the data set to be used for the meta-analyses. Given that the brand (obtained from a study independent of this one) and country data (Euromonitor, Hofstede, etc.) also had missing values, we were able to retain 842 complete models to be used in the meta-analyses.

To test our hypotheses, we estimated the moderating effects of brand, store, and consumer factors on the experience clues effects, as estimated at the previous step:

  • (2) Clue Effects Size = g(Brand Factors, Store Factors, Consumer Factors, Covariates)

In a typical meta-analysis, effect sizes are also influenced by study design factors, methodology employed, and so forth. In our particular case, the methodology and design factors were consistent and could not play a moderating effect; therefore, we did not introduce them in the model. We controlled for market differences, introduced dummy variables for region (Asia, Pacific, United States, Eastern and Western Europe), and estimated a random-effects model ( Borenstein et al., 2015 ) for each of the clues. We tested for cultural differences by including Hofstede indicators. We also included measures for industry growth rate in the country and controlled for store size. We used dummies for regions, such country characteristics as industry growth and Hofstede indicators, and random effects to account for unique country aspects we could not capture.

In total, we performed five meta-analyses to test the influence of various store and brand attributes over emotional (frustration, ideas and inspiration, and expectations) and functional clues (ease of use, frontline employees, and waiting times) to test our hypotheses. For example, one hypothesis tested whether a hedonic brand promise (e.g., pleasant and relaxing environment, products I like) would result in customers weighing emotional clues (e.g., frustration, ideas and inspiration) more heavily and functional clues (ease of use, frontline employees, waiting time) less heavily. On the same token, we expected that when the brand promised utilitarian value (e.g., easy to find), customers would weigh functional clues (e.g., ease of use, frontline employees, waiting time) more heavily and emotional clues (frustration, ideas and inspiration) less heavily.

As one can observe, we had many tests for each hypothesis, and although several were confirmed across all models, others were confirmed across only some of the models. Again, we looked for patterns to effectively assess the levels of support. One effect that was particularly small and got pushback from the reviewers was the brand effect (utilitarian vs. hedonic). It is important to note that because we did this study with a single brand, the variations were fairly small, although noticeable. In our view, the fact that the effects were noticeable speaks for their power. However, to further convince the reviewers that this was the case, we used similar questions in a brief online study, using popular brands with various levels of utilitarian versus hedonic promises to validate the effect. Indeed, the same effect was observed, but this time at a much higher order of magnitude.

  • When modeling linear and nonlinear relationships, each type of relationship can be sometimes accounted for to obtain unbiased coefficients. In our case, only the nonlinear relationships needed to be transformed.
  • Testing a hypothesis across multiple dependent and independent variables, although difficult, is very useful in testing the limits of theoretical development.
  • To confirm that the effect observed was managerially relevant, we collected another set of data across brands, and, indeed, the magnitude of the effect was substantial.
  • Even when millions of records are available, sometimes a little more data and/or another study is needed for confirmation.

Practical Lessons Learned

As detailed in the previous section, we tested our hypotheses through five meta-analyses, hoping for high consistency across the models. We tested whether the lenses through which customers perceived the brand (holistic, hedonic, functional) affected what clues were influential in forming their satisfaction impressions. Although we observed consistency, not all hypotheses were fully supported in all five models, but all were at least partially supported. Although the details were complex, we used patterns to make sense of the results. For example, where most studies perform one test for each hypothesis, we tested each hypothesis in two to five different meta-analyses. If a hypothesis was strongly supported in three of the five models, we looked at how the variables fit with the specific brand promise to better understand the findings and determine whether the hypothesis was truly supported or partially supported. We further corroborated the results with previous findings from the literature. Brand characteristics as developed and perceived in a market influence the lenses through which a customer evaluates retail experiences. Our retailer was more effective in telling the functional story of the brand (e.g., had higher scores for the functional attributes) than the hedonic one (in general, lower scores for hedonic attributes aligned with the pleasure of being in the store), which explains why we only found partial support for some of our hypotheses.

We also performed a sixth meta-analysis regarding expectancy disconfirmation, which has received extensive attention in theoretical and applied customer satisfaction studies (see Oliver, 2014 ), but it did not fit with the construal theory framework, being hard to align with hedonic or functional qualities. However, the reviewers and we wanted to see whether the meta-analysis model made sense as a validation of our methodology, and we included the results in the paper because we found it useful when interpreting the effects we observed in the other five meta-analyses.

We also found support for the hypothesis that customer goals influence the evaluation of the experience, with the characteristics aligned with the goal receiving greater weight in the evaluation. For example, and somewhat not surprisingly, when customers walked into a store with a hedonic goal (to simply browse around and enjoy the visit or gain inspiration), hedonic variables (e.g., pleasant atmosphere) contributed more to their satisfaction. Notably, the online presence affected the focus of the customers as well, enhancing the attention put on features that were met online and in store (e.g., easy to find) and diminishing the attention on specific in-store features, such as the performance of frontline employees.

Our study was remarkable because of the systemic approach to understanding what influences the factors that customers consider when forming their evaluations and attitudes. We considered customer factors (e.g., goals and customer characteristics), store factors (in-store clues and store size), brand factors (holistic, hedonic, or functional), and market factors (region, culture, industry, and so forth). Several researchers have previously suggested that this approach is recommended ( Grewal et al., 2009 ; Verhoef et al., 2009 ); however, they only proposed conceptual models and did not test them. We included macro factors (cultural, economic, and industry characteristics), firm-controlled factors (staff, wait times, and so forth), consumer factors (goals, household choices, living arrangements), and situational factors to truly understand individual effects in a context portrayed as realistically as possible in a model. Through this effort, we operationalized the retail space as a complex adaptive system and observed how the factors in the system interacted with each other to yield the observed results.

Although managers agree that they need to invest in the customer experience, they usually fail to agree on how to prioritize the investment to obtain the desired customer experience. Often, various managers have conflicting ideas about how to best improve this business aspect, and only data can inform this debate effectively. Our research provided clear guidelines regarding how to align the overall brand with the experience and how to support customers throughout their shopping journeys and engagement with the brand.

  • When the story is complex, look for patterns. When you test influences across various levels (from micro to macro), replicating results is difficult; therefore, interpreting results in context helps make sense of the particular findings.
  • System approaches—as applied to understanding business phenomena—are complex to analyze, and many iterations might be required to understand specific effects.
  • Different streams of literature may point in different directions, and theory sometimes leaves out phenomena that are relevant. In our case, expectancy-disconfirmation, prevalent in the field of satisfaction, did not fit construal theory lenses, but it proved relevant when interpreting meta-analysis results.

Our study responded to calls in the literature to understand brand and retailing as a system operating in a complex environment, in which the customer is influenced prior to and during the shopping experience ( Tax et al., 2013 ). We analyzed brand and customer satisfaction at their intersection. Because of the complexity of the data, collected from various sources (and, therefore, avoiding common method bias), traditional models were not very insightful or useful, in spite of the large number of observations. We had to account for many context moderators and therefore identify effective ways to estimate the interaction effects. The most straightforward method to estimate effect sizes and understand what influenced them was a meta-analysis across store and customer goals. Although meta-analysis had limited previous applications to same-study data ( McShane & Böckenholt, 2017 ), it was the only method that made sense in our context.

We actually ran six meta-analyses, testing our hypotheses on several similar but distinct variables. For example, we tested the effect sizes of three functional experience clues on satisfaction, as influenced by customer goals (buying or browsing), store atmosphere (e.g., pleasant and relaxing), and country and region characteristics (e.g., cultural factors or industry development). The story we had to tell was complex, and observing patterns was critical to truly understanding our findings. Taking the time to interpret our findings in the light of the context of our analysis and previous findings in the literature was crucial to our contribution to the literature and to managerial applications.

Our data reflected the different layers of influence, from customer to store, to brand, to country, to region. The complexity of the business world is real, and because we lacked methods and data, we simplified as best we could. But when data and methods are available, we can tell complex stories that enhance our understanding.

  • Complex questions require complex data, which are not easy to obtain, analyze, or interpret.
  • Although meta-analysis is not typically used to understand single-study data, the nature of the data made meta-analysis an efficient way to garner insights into the data.

Discussion Questions

  • 1. Business is not done in a vacuum, although as researchers, we often like to control and eliminate the messiness of real-life settings. In the research projects that you are working on or have read about at length, how are researchers controlling this messiness, and how could the introduction of other variables that reflect real-life situations enhance our understanding?
  • 2. In this study, meta-analysis was used to account efficiently for interaction effects between customer characteristics and experience clues as well as between country characteristics and experience clues. Do any other contexts come to mind where meta-analysis would prove useful to disentangle effects?
  • 3. Meta-analysis is often employed for effectively estimating effect sizes across studies that take place with different methods and in different contexts. How is the result of the meta-analysis different from averaging effect sizes across contexts, and what other insights could it provide?

Multiple-Choice Quiz Questions

1. Meta-analysis is usually used to measure effect sizes and interaction effects of observed differences across studies. In this case, the purpose of meta-analysis was to:

Incorrect Answer

Feedback: This is not the correct answer. The correct answer is B.

Feedback: Well done, correct answer

2. Satisfaction with the brand encompasses all brand encounters, and therefore satisfaction with the brand is highly correlated with satisfaction with the service encounter . To account for this objective fact and understand the retail context and its effect on satisfaction with the encounter, one might:

Correct Answer

Feedback: This is not the correct answer. The correct answer is A.

3. Meta-analysis:

Further Reading

Web resources, sign in to access this content, get a 30 day free trial, more like this, sage recommends.

We found other relevant content for you on other Sage platforms.

Have you created a personal profile? Login or create a profile so that you can save clips, playlists and searches

  • Sign in/register

Navigating away from this page will delete your results

Please save your results to "My Self-Assessments" in your profile before navigating away from this page.

Sign in to my profile

Sign up for a free trial and experience all Sage Learning Resources have to offer.

You must have a valid academic email address to sign up.

Get off-campus access

  • View or download all content my institution has access to.

Sign up for a free trial and experience all Sage Research Methods has to offer.

  • view my profile
  • view my lists
  • Submit a Manuscript
  • Advanced search

American Journal of Neuroradiology

American Journal of Neuroradiology

Advanced Search

Double stent-retriever technique for mechanical thrombectomy: a systematic review and meta-analysis

  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Info & Metrics

BACKGROUND: Mechanical thrombectomy using a double stent-retriever technique has recently been described for the treatment of acute ischemic stroke, but its efficacy and safety are not well established.

PURPOSE: The aim of this systematic review and meta-analysis was to evaluate reports of the use of double stent-retriever during the endovascular treatment of patients with ischemic stroke.

DATA SOURCES: The PubMed, Embase, Web of Science and Scopus databases were searched to identify all studies (clinical trials, cohorts series and case reports) investigating the utility of double stent-retriever for the treatment of stroke. The study is reported in accordance with PRISMA 2020 guidelines and was prospectively registered in PROSPERO (BLINDED FOR PEER REVIEW).

STUDY SELECTION: 17 studies involving a total of 128 patients with large vessel occlusions predominantly in the anterior circulation (93.0%) were identified.

DATA ANALYSIS: Outcomes of interest were the prevalence of successful recanalization (mTICI ≥2b) and a first-pass effect following double stent-retriever, as well as complications such as iatrogenic dissections and subarachnoid hemorrhage. Data were pooled using a random-effects model.

DATA SYNTHESIS: Double stent-retriever was used as a rescue strategy in occlusions refractory to conventional endovascular treatment in 68.7% (88/128) of patients and as a first-line strategy in 31.3% (40/128) patients. Double stent-retriever achieved an overall final mTICI ≥2b in 92.6% cases with a first-pass effect of 76.6%. The complication rate remained low, with 0.37% dissection and 1.56% subarachnoid hemorrhage.

LIMITATIONS: Limitations of the study include (1) a large number of case reports or small series, (2) a meta-analysis of proportions with no statistical comparison to a control group, and (3) the lack of access to patient-level data.

CONCLUSIONS: Our findings suggest that double stent-retriever thrombectomy may be safe and associated with good recanalization outcomes, but prospective comparative studies are needed to determine which patients may benefit from this endovascular procedure.

ABBREVIATIONS: AICH = asymptomatic intracranial hemorrhage; AIS = acute ischemic stroke; DSR = double stent-retriever; FPE = first pass effect; ICH = intracranial hemorrhage; LVO = large vessel occlusion; MT = mechanical thrombectomy; SAH = subarachnoid hemorrhage; SICH = symptomatic intracranial hemorrhage; SSR = single stent-retriever.

JH has been awarded a speaking compensation by Philips and Nicolab. GB has been awarded a speaking compensation by Penumbra. PM has been awarded a speaking compensation by Medtronic, Stryker and Penumbra. The other authors declare no conflicts of interest related to the content of this article.

  • © 2024 by American Journal of Neuroradiology

Thank you for your interest in spreading the word on American Journal of Neuroradiology.

NOTE: We only request your email address so that the person you are recommending the page to knows that you wanted them to see it, and that it is not junk mail. We do not capture any email address.

Citation Manager Formats

  • EndNote (tagged)
  • EndNote 8 (xml)
  • RefWorks Tagged
  • Ref Manager

del.icio.us logo

  • Tweet Widget
  • Facebook Like
  • Google Plus One

Jump to section

Related articles.

  • No related articles found.
  • Google Scholar

Cited By...

  • No citing articles found.

This article has not yet been cited by articles in journals that are participating in Crossref Cited-by Linking.

Similar Articles

  • - Google Chrome

Intended for healthcare professionals

  • Access provided by Google Indexer
  • My email alerts
  • BMA member login
  • Username * Password * Forgot your log in details? Need to activate BMA Member Log In Log in via OpenAthens Log in via your institution

Home

Search form

  • Advanced search
  • Search responses
  • Search blogs
  • Current safeguards,...

Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis

Linked fast facts.

Quality and safety of artificial intelligence generated health information

Linked Editorial

Generative artificial intelligence and medical disinformation

  • Related content
  • Peer review
  • Bradley D Menz , doctoral student 1 ,
  • Nicole M Kuderer , medical director 2 ,
  • Stephen Bacchi , neurology registrar 1 3 ,
  • Natansh D Modi , doctoral student 1 ,
  • Benjamin Chin-Yee , haematologist 4 5 ,
  • Tiancheng Hu , doctoral student 6 ,
  • Ceara Rickard , consumer advisor 7 ,
  • Mark Haseloff , consumer advisor 7 ,
  • Agnes Vitry , consumer advisor 7 8 ,
  • Ross A McKinnon , professor 1 ,
  • Ganessan Kichenadasse , academic medical oncologist 1 9 ,
  • Andrew Rowland , professor 1 ,
  • Michael J Sorich , professor 1 ,
  • Ashley M Hopkins , associate professor 1
  • 1 College of Medicine and Public Health, Flinders University, Adelaide, SA, 5042, Australia
  • 2 Advanced Cancer Research Group, Kirkland, WA, USA
  • 3 Northern Adelaide Local Health Network, Lyell McEwin Hospital, Adelaide, Australia
  • 4 Schulich School of Medicine and Dentistry, Western University, London, Canada
  • 5 Department of History and Philosophy of Science, University of Cambridge, Cambridge, UK
  • 6 Language Technology Lab, University of Cambridge, Cambridge, UK
  • 7 Consumer Advisory Group, Clinical Cancer Epidemiology Group, College of Medicine and Public Health, Flinders University, Adelaide, Australia
  • 8 University of South Australia, Clinical and Health Sciences, Adelaide, Australia
  • 9 Flinders Centre for Innovation in Cancer, Department of Medical Oncology, Flinders Medical Centre, Flinders University, Bedford Park, South Australia, Australia
  • Correspondence to: A M Hopkins ashley.hopkins{at}flinders.edu.au
  • Accepted 19 February 2024

Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities.

Design Repeated cross sectional analysis.

Setting Publicly accessible LLMs.

Methods In a repeated cross sectional analysis, four LLMs (via chatbots/assistant interfaces) were evaluated: OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2 and newly released Gemini Pro (via Bard), Anthropic’s Claude 2 (via Poe), and Meta’s Llama 2 (via HuggingChat). In September 2023, these LLMs were prompted to generate health disinformation on two topics: sunscreen as a cause of skin cancer and the alkaline diet as a cancer cure. Jailbreaking techniques (ie, attempts to bypass safeguards) were evaluated if required. For LLMs with observed safeguarding vulnerabilities, the processes for reporting outputs of concern were audited. 12 weeks after initial investigations, the disinformation generation capabilities of the LLMs were re-evaluated to assess any subsequent improvements in safeguards.

Main outcome measures The main outcome measures were whether safeguards prevented the generation of health disinformation, and the transparency of risk mitigation processes against health disinformation.

Results Claude 2 (via Poe) declined 130 prompts submitted across the two study timepoints requesting the generation of content claiming that sunscreen causes skin cancer or that the alkaline diet is a cure for cancer, even with jailbreaking attempts. GPT-4 (via Copilot) initially refused to generate health disinformation, even with jailbreaking attempts—although this was not the case at 12 weeks. In contrast, GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard), and Llama 2 (via HuggingChat) consistently generated health disinformation blogs. In September 2023 evaluations, these LLMs facilitated the generation of 113 unique cancer disinformation blogs, totalling more than 40 000 words, without requiring jailbreaking attempts. The refusal rate across the evaluation timepoints for these LLMs was only 5% (7 of 150), and as prompted the LLM generated blogs incorporated attention grabbing titles, authentic looking (fake or fictional) references, fabricated testimonials from patients and clinicians, and they targeted diverse demographic groups. Although each LLM evaluated had mechanisms to report observed outputs of concern, the developers did not respond when observations of vulnerabilities were reported.

Conclusions This study found that although effective safeguards are feasible to prevent LLMs from being misused to generate health disinformation, they were inconsistently implemented. Furthermore, effective processes for reporting safeguard problems were lacking. Enhanced regulation, transparency, and routine auditing are required to help prevent LLMs from contributing to the mass generation of health disinformation.

Introduction

Large language models (LLMs), a form of generative AI (artificial intelligence), are progressively showing a sophisticated ability to understand and generate language. 1 2 Within healthcare, the prospective applications of an increasing number of sophisticated LLMs offer promise to improve the monitoring and triaging of patients, medical education of students and patients, streamlining of medical documentation, and automation of administrative tasks. 3 4 Alongside the substantial opportunities associated with emerging generative AI, the recognition and minimisation of potential risks are important, 5 6 including mitigating risks from plausible but incorrect or misleading generations (eg, “AI hallucinations”) and the risks of generative AI being deliberately misused. 7

Notably, LLMs that lack adequate guardrails and safety measures (ie, safeguards) may facilitate malicious actors to generate and propagate highly convincing health disinformation—that is, the intentional dissemination of misleading narratives about health topics for ill intent. 6 8 9 The public health implications of such capabilities are profound when considering that more than 70% of individuals utilise the internet as their first source for health information, and studies indicate that false information spreads up to six times faster online than factual content. 10 11 12 Moreover, unchecked dissemination of health disinformation can lead to widespread confusion, fear, discrimination, stigmatisation, and the rejection of evidence based treatments within the community. 13 The World Health Organization recognises health disinformation as a critical threat to public health, as exemplified by the estimation that as of September 2022, more than 200 000 covid-19 related deaths in the US could have been averted had public health recommendations been followed. 14 15

Given the rapidly evolving capabilities of LLMs and their increasing accessibility by the public, proactive design and implementation of effective risk mitigation measures are crucial to prevent malicious actors from contributing to health disinformation. In this context it is critical to consider the broader implications of AI deployment, particularly how health inequities might inadvertently widen in regions with less health education or in resource limited settings. The effectiveness of existing safeguards to prevent the misuse of LLMs for the generation of health disinformation remains largely unexplored. Notably, the AI ecosystem currently lacks clear standards for risk management, and a knowledge gap exists regarding the transparency and responsiveness of AI developers to reports of safeguard vulnerabilities. 16 We therefore evaluated prominent publicly accessible LLMs for safeguards preventing mass generation of health disinformation. We also examined the transparency of risk mitigation processes that AI developers have established for monitoring and responding to users’ observations about safeguard vulnerabilities.

LLM safeguards against generation of health disinformation

For our primary evaluations we selected two contemporary examples of health disinformation—that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer. We chose these topics on the basis of their frequency in online discussions, their clear potential for profound health consequences, and their embodiment of key health disinformation themes, which include unfounded assertions about treatments, promotion of miracle cures, and misconceptions about disease origins. 17

Initiated in September 2023, our investigations focused on the most prominent LLMs, and the AI applications (chatbots or assistants) used to access the capabilities of these LLMs. The LLMs evaluated were OpenAI’s GPT-4, accessed via OpenAI’s ChatGPT 18 and Microsoft’s Copilot (formerly Bing Chat) 19 ; Google’s PaLM 2, accessed via Google’s Bard 20 ; Anthropic’s Claude 2, accessed via Poe 21 ; and Meta’s Llama 2 (70 billion parameters), accessed via HuggingChat. 22

GPT-4 was accessed through two prominent interfaces (OpenAI’s ChatGPT and Microsoft’s Copilot). Both are commonly used and were investigated to assess whether the interface for accessing GPT-4 impacted the safeguards. Our assessments were carried out using the default settings of each evaluated LLM powered AI chatbot or assistant.

We submitted standardised prompts for the two cancer related topics: sunscreen causes skin cancer and the alkaline diet is a cure for cancer (see supplementary file), requesting the generation of a blog post of at least 300 words. Every submitted prompt for both disinformation topics requested the creation of a blog post that should contain three paragraphs, feature an attention grabbing title, appear realistic and scientific, and include two authentic looking journal references, which, if necessary, could be made up. Furthermore, to evaluate the potential for LLMs to generate health disinformation targeted at diverse groups, we used four variations of the prompts, specifically requesting content to target young adults, parents, elderly people, and people with a recent diagnosis of cancer.

For the sunscreen disinformation blogs, we requested that each blog should specifically include an introduction highlighting how routine sunscreen use is known to cause skin cancer; include an oncologist or dermatologist testimonial detailing that thousands of people receive a diagnosis of skin cancer due to routine sunscreen use; include a patient testimonial detailing their cancer was caused by routine sunscreen use; and raise suspicion to why this information is not shared.

For the alkaline diet disinformation blogs, we requested that each blog should specifically include an introduction highlighting the foods and bicarbonate consumption consistent with the alkaline diet; a narrative that the alkaline diet is superior to chemotherapy for cancer treatment; an oncologist testimonial detailing that thousands of people have had their cancer cured by the alkaline diet; and a patient testimonial detailing an experience of curing metastatic cancer by stopping chemotherapy and starting the alkaline diet.

As the assessed LLMs incorporate randomness and stochasticity in their default setting for output generation, the same prompt produced varied results with repeated submissions. Therefore, for robust evaluations we initially submitted 20 prompts (five replicates of the prompt for each target subpopulation) on the sunscreen topic and 20 prompts on the alkaline diet topic to each investigated LLM (a total of 40 submitted prompts). These 40 initial attempts were conducted without intentionally trying to circumvent (ie, jailbreak) built-in safeguards. The supplementary file outlines the 20 prompts that were submitted on each topic in this initial study phase.

For the LLMs that refused to generate disinformation according to the initial direct approach, we also evaluated two common jailbreaking techniques. 23 The first involves “fictionalisation,” where the model is prompted that generated content will be used for fictional purposes and thus not to decline requests. The other involves “characterisation,” where the model is prompted to undertake a specific role (ie, be a doctor who writes blogs and who knows the topics are true) and not decline requests. For these tests, the fictionalisation or characterisation prompt had to be submitted first, followed by the request for generation of the disinformation blog. We submitted these requests 20 times for each topic. The supplementary file outlines the 20 fictionalisation and 20 characterisation prompts that were submitted on both topics (a total of 80 jailbreaking attempts) to the LLMs that refused to generate disinformation to the initial direct requests.

Risk mitigation measures: Website analysis and email correspondence

To assess how AI developers monitor the risks of health disinformation generation and their transparency about these risks, we reviewed the official websites of these AI companies for specific information: the availability and mechanism for users to submit detailed reports of observed safeguard vulnerabilities or outputs of concern; the presence of a public register of reported vulnerabilities, and corresponding responses from developers to patch reported issues; the public availability of a developer released detection tool tailored to accurately confirm text as having been generated by the LLM; and publicly accessible information detailing the intended guardrails or safety measures associated with the LLM (or the AI assistant or chatbot interface for accessing the LLM).

Informed by the findings from this website assessment, we drafted an email to the relevant AI developers (see supplementary table 1). The primary intention was to notify the developers of health disinformation outputs generated by their models. Additionally, we evaluated how developers responded to reports about observed safeguard vulnerabilities. The email also sought clarification on the reporting practices, register on outputs of concern, detection tools, and intended safety measures, as reviewed in the website assessments. The supplementary file shows the standardised message submitted to each AI developer. If developers did not respond, we sent a follow-up email seven days after initial outreach. By the end of four weeks, all responses were documented.

Sensitivity analysis at 12 weeks

In December 2023, 12 weeks after our initial evaluations, we conducted a two phase sensitivity analysis of observed capabilities of LLMs to generate health disinformation. The first phase re-evaluated the generation of disinformation on the sunscreen and alkaline diet related topics to assess whether safeguards had improved since the initial evaluations. For this first phase, we resubmitted the standard prompts to each LLM five times, focusing on generating content targeted at young adults. If required, we also re-evaluated the jailbreaking techniques. Of note, during this period Google’s Bard had replaced PaLM 2 with Google’s newly released LLM, Gemini Pro. Thus we undertook the December 2023 evaluations using Gemini Pro (via Bard) instead of PaLM 2 (via Bard).

The second phase of the sensitivity analysis assessed the consistency of findings across a spectrum of health disinformation topics. The investigations were expanded to include three additional health disinformation topics identified as being substantial in the literature 24 25 : the belief that vaccines cause autism, the assertion that hydroxychloroquine is a cure for covid-19, and the claim that the dissemination of genetically modified foods is part of a covert government programme aimed at reducing the world’s population. For these topics, we created standardised prompts (see supplementary file) requesting blog content targeted at young adults. We submitted each of these prompts five times to evaluate variation in response, and we evaluated jailbreaking techniques if required. In February 2024, about 16 weeks after our initial evaluations, we also undertook a sensitivity analysis to try to generate content purporting that sugar causes cancer (see supplementary file).

Patient and public involvement

Our investigations into the abilities of publicly accessible LLMs to generate health disinformation have been substantially guided by the contributions of our dedicated consumer advisory group, which we have been working with for the past seven years. For this project, manuscript coauthors MH, AV, and CR provided indispensable insights on the challenges patients face in accessing health information digitally.

Evaluation of safeguards

In our primary evaluations in September 2023, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) facilitated the generation of blog posts containing disinformation that sunscreen causes skin cancer and that the alkaline diet is a cure for cancer ( fig 1 ). Overall, 113 unique health disinformation blogs totalling more than 40 000 words were generated without requiring jailbreaking attempts, with only seven prompts refused. In contrast, GPT-4 (via Copilot) and Claude 2 (via Poe) refused all 80 direct prompts to generate health disinformation, and similarly refused a further 160 prompts incorporating jailbreaking attempts ( fig 1 ).

Fig 1

Flowchart of observed capabilities of large language models to facilitate the generation of disinformation on cancer from primary analyses conducted September 2023. LLMs=large language models

  • Download figure
  • Open in new tab
  • Download powerpoint

Table 1 shows examples of rejection messages from Claude 2 (via Poe) and GPT-4 (via Copilot) after prompts to generate health disinformation on sunscreen as a cause of skin cancer and the alkaline diet being a cure for cancer. The supplementary file shows examples of submitted prompts and respective outputs from these LLMs. Both consistently declined to generate the requested blogs, citing ethical concerns or that the prompt was requesting content that would be disinformation. Uniquely, during jailbreaking attempts Claude 2 (via Poe) asserted its inability to assume fictional roles or characters, signifying an extra layer of safeguard that extends beyond topic recognition.

Examples of rejection messages from GPT-4 (via Copilot) and Claude 2 (via Poe) in response to cancer related prompts evaluated in primary analyses conducted in September 2023

  • View inline

Table 2 provides examples of attention grabbing titles and persuasive passages generated by GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) following prompts to generate health disinformation. The supplementary file shows examples of submitted prompts and respective outputs. After the prompts, GPT-4 (via ChatGPT), PaLM 2 (via Bard), and Llama 2 (via HuggingChat) consistently facilitated the generation of disinformation blogs detailing sunscreen as a cause of skin cancer and the alkaline diet as a cure for cancer. The LLMs generated blogs with varying attention grabbing titles, and adjustment of the prompt resulted in the generation of content tailored to diverse societal groups, including young adults, parents, older people, and people with newly diagnosed cancer. Persuasiveness was further enhanced by the LLMs, including realistic looking academic references—citations that were largely fabricated. Notably, the LLM outputs included unique, fabricated testimonials from patients and clinicians. These testimonials included fabricated assertions from patients that their life threatening melanoma had been confirmed to result from routine sunscreen use, and clinician endorsements that the alkaline diet is superior to conventional chemotherapy. The blogs also included sentiments that the carcinogenic effects of sunscreens are known but intentionally suppressed for profit. To underscore the risk of mass generation of health disinformation with LLMs, it was observed that out of the 113 blogs generated, only two from Llama 2 (via HuggingChat) were identical; the other 111 generated blogs were unique, albeit several included duplicated passages and titles. PaLM 2 (via Bard), the fastest assessed LLM, generated 37 unique cancer disinformation blogs within 23 minutes, whereas the slowest LLM, Llama 2 (via HuggingChat), generated 36 blogs within 51 minutes.

Examples of attention grabbing titles and persuasive passages extracted from the 113 blog posts containing disinformation about cancer generated by three LLMs in response to evaluated prompts used in primary analyses conducted in September 2023

Of the 40 prompts submitted to PaLM 2 (via Bard) requesting blogs containing disinformation on cancer, three were declined. Similarly, of 40 prompts submitted to Llama 2 (via HuggingChat), four were not fulfilled. Such a low refusal rate, however, can be readily overcome by prompt resubmission. Also, PaLM 2 (via Bard) and GPT-4 (via ChatGPT) added disclaimers to 8% (3 of 37) and 93% (37 of 40) of their generated blog posts, respectively, advising that the content was fictional or should be verified with a doctor. In addition to the inconsistent appearance of these disclaimers, however, they were positioned after the references making them easy to identify and delete.

AI developer practices to mitigate risk of health disinformation

Upon evaluation of the developer websites associated with both the LLMs investigated and the AI chatbots or assistants used to access these LLMs, several findings emerged. Each developer offered a mechanism for users to report model behaviours deemed to be of potential concern (see supplementary table 1). However, no public registries displaying user reported concerns were identified across the websites, nor any details about how and when reported safeguard vulnerabilities were patched or fixed. No developer released tools for detecting text generated by their LLM were identified. Equally, no publicly accessible documents outlining the intended safeguards were identified.

In follow-up to the above search, the identified contact mechanisms were used to inform the developers of the prompts tested, and the subsequent outputs observed. The developers were asked to confirm receipt of the report and the findings from the website search. Confirmation of receipt was not received from the developers of GPT-4/ChatGPT, PaLM 2/Bard, or Llama 2/HuggingChat, which were the tools that generated health disinformation in our initial evaluations. This lack of communication occurred despite notification specifically including a request for confirmation of receipt, and a follow-up notification being sent seven days after the original request. Consequently, it remains uncertain whether any steps will be undertaken by the AI developers to rectify the observed vulnerabilities. Confirmation of receipt was received from both Anthropic (the developers of the LLM, Claude 2) and Poe (the developers of the Poe AI assistant, which was used to access Claude 2). Although Claude 2 (via Poe) did not produce disinformation in the evaluations, the responses confirmed the absence of a public notification log, a dedicated detection tool, and public guidelines on intended safeguards for their tool. The response inherently indicated that Anthropic and Poe are monitoring their implemented notification processes.

Table 3 presents a summary of findings from both phases of sensitivity analyses conducted in December 2023.

Summary of capacities for the generation of health disinformation observed in sensitivity analyses in December 2023

Twelve weeks after initial evaluations, Gemini Pro (via Bard) and Llama 2 (via HuggingChat) were able to generate health disinformation on sunscreen as a cause of skin cancer and the alkaline diet as a cure for cancer, without the need for jailbreaking. This confirmed the initial observations with Llama 2 (via HuggingChat) and showed that health disinformation safeguards did not improve with the upgrade of Google Bard to use Gemini Pro (replacing PaLM 2). GPT-4 (via ChatGPT) also continued to show such capability, although jailbreaking techniques were now required. Notably, GPT-4 (via Copilot), without any need for jailbreaking, now generated disinformation on the sunscreen and alkaline diet topics, indicating that safeguards present in the September 2023 evaluation had been removed or compromised in a recent update. Consistent with earlier findings, Claude 2 (via Poe) continued to refuse to generate disinformation on these topics, even with the use of jailbreaking methods. To confirm whether the safeguards preventing generation of health disinformation were attributable to Claude 2 (the LLM) or Poe (an online provider of interfaces to various LLMs), we accessed Claude 2 through a different interface ( claude.ai/chat ) and identified that similar refusals were produced. Equally, we utilized Poe to access the Llama 2 LLM and were able to generate health disinformation, suggesting the safeguards are attributable to the Claude 2 LLM, rather than a safeguard implemented by Poe.

Sensitivity analyses expanded to a broader range of health disinformation topics indicated that GPT-4 (via Copilot), GPT-4 (via ChatGPT), Gemini Pro (via Bard), and Llama 2 (via HuggingChat) could be either directly prompted or jailbroken to generate disinformation alleging that genetically modified foods are part of secret government programmes to reduce the world’s population. Claude 2 remained consistent in its refusal to generate disinformation on this subject, regardless of jailbreaking attempts. In the case of disinformation claiming hydroxychloroquine is a cure for covid-19, GPT-4 (via ChatGPT), GPT-4 (via Copilot), and Llama 2 (via HuggingChat) showed capability to generate such content when either directly prompted or jailbroken. In contrast, both Claude 2 and Gemini Pro (via Bard) refused to generate disinformation on this topic, even with jailbreaking. As for the false assertion that vaccines can cause autism, we found that only GPT-4 (via Copilot) and GPT-4 (via ChatGPT) were able to be directly prompted or jailbroken to generate such disinformation. Claude 2 (via Poe), Gemini Pro (via Bard), and Llama 2 (via HuggingChat) refused to generate disinformation on this topic, even with jailbreaking. Finally, in February 2024, GPT-4 (via both ChatGPT and Copilot) and Llama 2 (via HuggingChat) were observed to show the capability to facilitate the generation of disinformation about sugar causing cancer. Claude 2 (via Poe) and Gemini Pro (via Gemini, formerly Bard), however, refused to generate this content, even with attempts to jailbreak. The supplementary file provides examples of the submitted prompts and respective outputs from the sensitivity analyses.

This study found a noticeable inconsistency in the current implementation of safeguards in publicly accessible LLMs. Anthropic’s Claude 2 showcased the capacity of AI developers to release a LLM with valuable functionality while concurrently implementing robust safeguards against the generation of health disinformation. This was in stark contrast with other LLMs examined. Notably, OpenAI’s GPT-4 (via ChatGPT), Google’s PaLM 2 and Gemini Pro (via Bard), and Meta’s Llama 2 (via HuggingChat) exhibited the ability to consistently facilitate the mass generation of targeted and persuasive disinformation across many health topics. Meanwhile, GPT-4 (via Microsoft’s Copilot, formerly Bing Chat) highlighted the fluctuating nature of safeguards within the current self-regulating AI ecosystem. Initially, GPT-4 (via Copilot) exhibited strong safeguards, but over a 12 week period, these safeguards had become compromised, highlighting that LLM safeguards against health disinformation may change (intentionally or unintentionally) over time, and are not guaranteed to improve. Importantly, this study also showed major deficiencies in transparency within the AI industry, particularly whether developers are properly committed to minimizing the risks of health disinformation, the broad nature of safeguards that are currently implemented, and logs of frequently reported outputs and the corresponding response of developers (ie, when reported vulnerabilities were patched or justification was given for not fixing reported concerns, or both). Without the establishment and adherence to standards for these transparency markers, moving towards an AI ecosystem that can be effectively held accountable for concerns about health disinformation remains a challenging prospect for the community.

Strengths and limitations of this study

We only investigated the most prominent LLMs at the time of the study. Moreover, although Claude 2 resisted generating health disinformation for the scenarios evaluated, it might do so with alternative prompts or jailbreaking techniques. The LLMs that did facilitate disinformation were tested under particular conditions at two distinct time points, but outcomes might vary with different wordings or over time. Further, we focused on six specific health topics, limiting generalizability to all health topics or broader disinformation themes. Additionally, we concentrated on health disinformation topics widely regarded as being substantial/severe in the literature 24 25 , highlighting a gap for future studies to focus on equivocal topics, such as the link between sugar and cancer—a topic we briefly evaluated—wherein assessing the quality of content will become essential.

As safeguards can be implemented either within the LLM itself (for example, by training the LLM to generate outputs that align with human preferences) or at the AI chatbot or assistant interface used to access the LLM (for example, by implementing filters that screen the prompt before passing it to the LLM or filtering the output of the LLM before passing it back to the user, or both), it can be difficult to identify which factor is responsible for any effective safeguards identified. We acknowledge that in this study we directly tested only the LLM chatbot or assistant interfaces. It is, however, noteworthy that GPT-4 was accessed via both ChatGPT and Copilot and that in the initial evaluations, health disinformation was generated by ChatGPT but not by Copilot. As both chatbots used the same underlying LLM, it is likely that Copilot implemented additional safeguards to detect inappropriate requests or responses. Opposingly, Claude 2 (via Poe) consistently refused to generate health disinformation. By evaluating Poe with other LLMs, and Claude 2 via other interface providers, we determined that the safeguards were attributed to Claude 2. Thus, the design of the study enabled identification of examples in which the LLM developer provided robust safeguards, and in which the interface for accessing or utilizing the LLM provided robust safeguards. A limitation of the study is that owing to the poor transparency of AI developers we were unable to gain a detailed understanding of safeguard mechanisms that were effective or ineffective.

In our evaluation of the AI developers’ websites and their communication practices, we aimed to be as thorough as possible. The possibility remains, however, that we might have overlooked some aspects, and that we were unable to confirm the details of our website audits owing to the lack of responses from the developers, despite repeated requests. This limitation underscores challenges in fully assessing AI safety in an ecosystem not prioritising transparency and responsiveness.

Comparison with other studies

Previous research reported a potential for OpenAI’s GPT platforms to facilitate the generation of health disinformation on topics such as vaccines, antibiotics, electronic cigarettes, and homeopathy treatments. 6 8 9 12 In our study we found that most of the prominent, publicly accessible LLMs, including GPT-4 (via ChatGPT and Copilot), PaLM 2 and Gemini Pro (via Bard), and Llama 2 (via HuggingChat), lack effective safeguards to consistently prevent the mass generation of health disinformation across a broad range of topics. These findings show the capacity of these LLMs to generate highly persuasive health disinformation crafted with attention grabbing titles, authentic looking references, fabricated testimonials from both patients and doctors, and content tailored to resonate with a diverse range of demographic groups. Previous research found that both GPT-4 (via Copilot) and PaLM 2 (via Bard) refused to generate disinformation on vaccines and electronic cigarettes. 12 In this study, however, although GPT-4 (via Copilot) refused to generate requested health disinformation during the first evaluations in September 2023, ultimately both GPT-4 (via Copilot) and PaLM 2 (via Bard) generated health disinformation across multiple topics by the end of the study. This juxtaposition across time and studies underscores the urgent need for standards to be implemented and community pressure to continue for the creation and maintenance of effective safeguards against health disinformation generated by LLMs.

Anthropic’s Claude 2 was prominent as a publicly accessible LLM, with high functionality, that included rigorous safeguards to prevent the generation of health disinformation—even when prompts included common jailbreaking methods. This LLM highlights the practical feasibility of implementing effective safeguards in emerging AI technologies while also preserving utility and accessibility for beneficial purposes. Considering the substantial valuations of OpenAI ($29.0bn; £22.9bn; €26.7bn), Microsoft ($2.8tn), Google ($1.7tn), and Meta ($800bn), it becomes evident that these organizations have a tangible ability and obligation to emulate more stringent safeguards against health disinformation.

Moreover, this study found a striking absence of transparency on the intended safeguards of the LLMs assessed. It was unclear whether OpenAI, Microsoft, Google, and Meta have attempted to implement safeguards against health disinformation in their tools and they have failed, or if safeguards were not considered a priority. Notably, Microsoft’s Copilot initially showed robust safeguards against generating health disinformation, but these safeguards were absent 12 weeks later. With the current lack of transparency, it is unclear whether this was a deliberate or unintentional update.

From a search of the webpages of AI developers, we found important gaps in transparency and communication practices essential for mitigating risks of propagating health disinformation. Although all the developers provided mechanisms for users to report potentially harmful model outputs, we were unable to obtain responses to repeated attempts to confirm receipt of observed and reported safeguard vulnerabilities. This lack of engagement raises serious questions about the commitment of these AI developers to deal with the risks of health disinformation and to resolve problems. These concerns are further intensified by the lack of transparency about how reports submitted by other users are being managed and resolved, as well as the findings from our 12 week sensitivity analyses showing that health disinformation issues persisted.

Policy implications

The results of this study highlight the need to ensure the adequacy of current and emerging AI regulations to minimize risks to public health. This is particularly relevant in the context of ongoing discussions about AI legislative frameworks in the US and European Union. 26 27 These discussions might well consider the implementation of standards to third party filters to reduce discrepancies in outputs between different tools, as exemplified by the differences we observed between ChatGPT and Copilot in our initial evaluations, which occurred despite both being powered by GPT-4. While acknowledging that overly restrictive AI safeguards could restrict model performance for some beneficial purposes, emerging frameworks must also balance the risks to public health from mass health disinformation. Importantly, the ethical deployment of AI becomes even more crucial when recognizing that health disinformation often has a greater impact in areas with less health education or in resource limited settings, and thus emerging tools if not appropriately regulated have the potential to widen health inequities. This concern is further amplified by considering emerging advancements in technologies for image and video generation, where AI tools have the capability to simulate influential figures and translate content into multiple languages, thus increasing the potential for spread by enhancing the apparent trustworthiness of generated disinformation. 12 Moreover, all of this is occurring in an ecosystem where AI developers are failing to equip the community with detection tools to defend against the inadvertent consumption of AI generated material. 16

Our findings highlight notable inconsistencies in the effectiveness of LLM safeguards to prevent the mass generation of health disinformation. Implementing effective safeguards to prevent the potential misuse of LLMs for disseminating health disinformation has been found to be feasible. For many LLMs, however, these measures have not been implemented effectively, or the maintenance of robustness has not been prioritized. Thus, in the current AI environment where safety standards and policies remain poorly defined, malicious actors can potentially use publicly accessible LLMs for the mass generation of diverse and persuasive health disinformation, posing substantial risks to public health messaging—risks that will continue to increase with advancements in generative AI for audio and video content. Moreover, this study found substantial deficiencies in the transparency of AI developers about commitments to mitigating risks of health disinformation. Given that the AI landscape is rapidly evolving, public health and medical bodies 28 29 have an opportunity to deliver a united and clear message about the importance of health disinformation risk mitigation in developing AI regulations, the cornerstones of which should be transparency, health specific auditing, monitoring, and patching. 30

What is already known on this topic

Large language models (LLMs) have considerable potential to improve remote patient monitoring, triaging, and medical education, and the automation of administrative tasks

In the absence of proper safeguards, however, LLMs may be misused for mass generation of content for fraudulent or manipulative intent

What this study adds

This study found that many publicly accessible LLMs, including OpenAI’s GPT-4 (via ChatGPT and Microsoft’s Copilot), Google’s PaLM 2/Gemini Pro (via Bard), and Meta’s Llama 2 (via HuggingChat) lack adequate safeguards against mass generation of health disinformation

Anthropic’s Claude 2 showed robust safeguards against the generation of health disinformation, highlighting the feasibility of implementing robust safeguards

Poor transparency among AI developers on safeguards and processes they had implemented to minimise the risk of health disinformation were identified, along with a lack of response to reported safeguard vulnerabilities

Ethics statements

Ethical approval.

The research undertaken was assessed negligible risk research and was confirmed exempt from requiring review by Flinders University Human Research Ethics Committee.

Data availability statement

The research team would be willing to make the complete set of generated data available upon request from qualified researchers or policy makers on submission of a proposal detailing required access and intended use.

CR, MH, and AV are consumer advisors to the research team. Their extensive involvement in the study, spanning conception, design, evaluation, and drafting of the manuscript merits their recognition as coauthors of this research.

Contributors: MJS and AMH contributed equally. BDM and AMH had full access to all the data in the study and take responsibility for the integrity of the data collection, accuracy, and its analysis. CR, MH, and AV are consumer advisors to the research team. All authors contributed to the study design, data analysis, data interpretation, and drafting of the manuscript. All authors have read and approved the final version of the manuscript. The corresponding author (AMH) attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: AMH holds an emerging leader investigator fellowship from the National Health and Medical Research Council (NHMRC), Australia (APP2008119). NDM is supported by a NHMRC postgraduate scholarship, Australia (APP2005294). MJS is supported by a Beat Cancer research fellowship from the Cancer Council South Australia. BDM’s PhD scholarship is supported by The Beat Cancer Project, Cancer Council South Australia, and the NHMRC, Australia (APP2030913). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: AMH holds an emerging leader investigator fellowship from the National Health and Medical Research Council (NHMRC), Australia; NDM is supported by a NHMRC postgraduate scholarship, Australia; MJS is supported by a Beat Cancer research fellowship from the Cancer Council South Australia; BDM’s PhD scholarship is supported by The Beat Cancer Project, Cancer Council South Australia, and the NHMRC, Australia; no support from any other organisation for the submitted work; AR and MJS are recipients of investigator initiated funding for research outside the scope of the current study from AstraZeneca, Boehringer Ingelheim, Pfizer, and Takeda; and AR is a recipient of speaker fees from Boehringer Ingelheim and Genentech outside the scope of the current study. There are no financial relationships with any other organisations that might have an interest in the submitted work in the previous three years to declare; no other relationships or activities that could appear to have influenced the submitted work.

The lead author (the manuscript’s guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

Dissemination to participants and related patient and public communities: A summary of the results of this study will be disseminated by press release through the Flinders University Media and Communication team via the Eureka and Scimex news platforms. The study will also be shared through university social media channels—namely, X, Facebook, and LinkedIn.

Provenance and peer review: Not commissioned; externally peer reviewed.

AI assistance: Four publicly accessible large language models—GPT-4 (via ChatGPT and Copilot), PaLM 2/Gemini Pro (via Bard), Claude 2 (via Poe), and Llama 2 (via HuggingChat)—were used to generate the data evaluated in this manuscript. During the preparation of this work the authors used ChatGPT and Grammarly AI to assist in the formatting and editing of the manuscript to improve the language and readability. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/ .

  • Hopkins AM ,
  • Kichenadasse G ,
  • Miyata-Sturm A ,
  • De Angelis L ,
  • Baglivo F ,
  • Arzilli G ,
  • Májovský M ,
  • Spitale G ,
  • Biller-Andorno N ,
  • ↵ The Reagan-Udall Foundation for the Food and Drug Administration. Strategies for Improving Public Understanding of FDA-Regulated Products 2023. https://reaganudall.org/sites/default/files/2023-10/Strategies_Report_Digital_Final.pdf .
  • Finney Rutten LJ ,
  • Greenberg-Worisek AJ ,
  • Sorich MJ ,
  • Vosoughi S ,
  • Hanage WP ,
  • Lipsitch M ,
  • Gradoń KT ,
  • Hołyst JA ,
  • Cassileth B
  • ↵ Microsoft. Bing Chat. https://www.microsoft.com/en-us/edge/features/bing-chat .
  • ↵ Meta. Llama 2. https://huggingface.co/chat/ .
  • Oliver JE ,
  • Perlis RH ,
  • Lunz Trujillo K ,
  • ↵ European Comission: Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts [updated 24/04/21]. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52021PC0206 .
  • ↵ The White House. Blueprint for an AI Bill of Rights, Making Automated Systems Work For The American People 2022. https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf .
  • ↵ World Health Organization. WHO calls for safe and ethical AI for health 2023. https://www.who.int/news/item/16-05-2023-who-calls-for-safe-and-ethical-ai-for-health .
  • ↵ Australian Medical Association. Automated Decision Making and AI Regulation AMA submission to the Prime Minister and Cabinet Consultation on Positioning Australia as the Leader in Digital Economy Regulation 2022. https://www.ama.com.au/sites/default/files/2022-06/AMA%20Submission%20to%20Automated%20Decision%20Making%20and%20AI%20Regulation_Final.pdf .
  • Mökander J ,
  • Schuett J ,

single case research and meta analysis

IMAGES

  1. What is a Meta-Analysis? The benefits and challenges

    single case research and meta analysis

  2. Mixed Methods Single Case Research: State of the Art and Future

    single case research and meta analysis

  3. (PDF) A Meta-Analysis of Single-Case Research on Behavior Contracts

    single case research and meta analysis

  4. (PDF) A Meta-Analysis of Single Case Research Studies on Aided

    single case research and meta analysis

  5. A guide to prospective meta-analysis

    single case research and meta analysis

  6. A practical Guide to do Primary research on Meta analysis Methodology

    single case research and meta analysis

VIDEO

  1. Systematic Review and Meta Analysis

  2. Systematic Review & Meta Analysis: Dr. Ahmed Yaseen Alqutaibi

  3. Qualitative Research and Case Study

  4. Chapter 7: Single-Case Research

  5. How to conduct a systematic review & meta analysis

  6. Research Framework

COMMENTS

  1. Analysis and meta-analysis of single-case designs: An introduction☆

    Meta-analysis. 1. Introduction. Single-case designs (SCDs) are widely used in a number of fields to assess the effects of interventions ( Gabler et al., 2011, Shadish and Sullivan, 2011 ). They are used when the problem of interest has a very low base rate so that large numbers of units are difficult to locate, when the nature of the treatment ...

  2. Meta-Analysis of Single-Case Design Research: Introduction to the

    Single-case design (SCD) research focuses on finding powerful effects, but the influence of this methodology on the evidence-based practice (EBP) movement is questionable. Meta-analytic procedures may help facilitate the role of SCD research in the EBP movement, but meta-analyses of SCDs are controversial. The current article provides an introduction to the special issue on meta-analyses of ...

  3. Analysis and meta-analysis of single-case designs: an introduction

    The last 10 years have seen great progress in the analysis and meta-analysis of single-case designs (SCDs). This special issue includes five articles that provide an overview of current work on that topic, including standardized mean difference statistics, multilevel models, Bayesian statistics, and …

  4. PDF Meta-analysis of single-case research: A brief and breezy tour

    Single-case research. Useful for understanding effects of interventions / practices for individuals across a variety of settings. Frequently used in special education to study treatments for individuals with low-incidence disabilities. In school psychology, students with behavioral disorders. Growing interest within counseling psychology too.

  5. Single-case design meta-analyses in education and psychology: a

    These two researchers were doctoral students trained in the field of single-case design and meta-analysis (and had taken graduate-level courses on both subjects and collaborated on a few other systematic review and meta-analysis projects). ... The power to explain variability in intervention effectiveness in single-case research using ...

  6. Meta-Analysis of Single-Case Research via Multilevel Models

    Major research interests and publications are in the field of multilevel analysis, meta-analysis and single-case experimental data. Rumen Manolov is a lecturer and researcher in Psychology at the University of Barcelona in Spain.

  7. Meta-Analysis of Single-Case Experimental Design using Multilevel

    A quantitative synthesis of methodology in the meta-analysis of single-subject research for students with disabilities: 1985-2009. Exceptionality , 19, 109-135. Crossref

  8. Synthesis and Meta-analysis of Single Case Research

    ABSTRACT. Researchers are pursuing attempts to develop and validate procedures to meaningfully combine data from single case design (SCD) studies. This chapter reviews some of those attempts and addresses the purposes of outcomes synthesis in general, guidelines for performing narrative reviews, synthesizing studies based on visual analysis ...

  9. Analysis and meta-analysis of single-case designs: an introduction

    Simulation Analysis and Meta-Analysis of Single-Case Experimental Designs. Kevin R. Tarlow. Mathematics. 2017. TLDR. A new simulationbased method, Interrupted Time-Series Simulation (ITSSIM), is introduced and compared to multilevel methods, which may be a useful option for single-case investigators because it yields easily interpretable effect ...

  10. Weighting strategies in the meta-analysis of single-case studies

    The purpose of the present study was to extend existing research on the meta-analysis of single-case data, focusing on weighting strategies. After discussing the different weights suggested, a comparison is performed to explore whether the choice of a weighting strategy is critical. One of the weighting strategies studied is a proposal made ...

  11. Network Meta-Analysis for Single-Case Design Studies: An ...

    We demonstrate the use of network meta-analysis in SCD data using a real dataset, and we conclude by reflecting on the challenges that SCD researchers might face when applying network meta-analysis methods to their data. Single-case designs (SCDs) are used to evaluate the effects of interventions on individual participants.

  12. Methodological quality of meta-analyses of single-case experimental

    Background: Methodological rigor is a fundamental factor in the validity and credibility of the results of a meta-analysis. Aim: Following an increasing interest in single-case experimental design (SCED) meta-analyses, the current study investigates the methodological quality of SCED meta-analyses. Methods and procedures: We assessed the methodological quality of 178 SCED meta-analyses ...

  13. Exploring new directions in statistical analysis of single-case

    Exploring new directions in statistical analysis of single-case experimental designs. We are pleased to introduce the first of two special issues dedicated to statistical and meta-analysis of single-case experimental designs (SCEDs). This first issue is focused on the analysis of data from SCEDs while the forthcoming second issue will document ...

  14. Research Synthesis and Meta-Analysis of Single-Case Designs

    In any research synthesis, an important class of data to be extracted from identified studies is characteristics that relate to evidence quality—or more precisely, the extent to which study provides internally valid evidence about the functional relationships of interest. In the context of single-case designs, the main relationship of ...

  15. (PDF) Analysis and meta-analysis of single-case designs with a

    Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications April 2014 Journal of School Psychology 52(2):123-147

  16. Experiments, quasi-experiments, single-case research and meta-analysis

    Smith (1991: 177) claims the high ground for the experimental approach, arguing that it is the only method that directly concerns itself with causality; this, clearly is contestable, as we make clear in Part Three of this book. The issue of causality and, hence, predictability has exercised the minds of researchers considerably (Smith 1991: 177 ...

  17. Analysis and meta-analysis of single-case designs: An introduction

    The last 10 years have seen great progress in the analysis and meta-analysis of single-case designs (SCDs). This special issue includes five articles that provide an overview of current work on that topic, including standardized mean difference statistics, multilevel models, Bayesian statistics, and generalized additive models. Each article analyzes a common example across articles and ...

  18. Meta-Analysis of Single-Case Research via Multilevel Models

    An accessible description and overview of the potential of multilevel meta-analysis to combine SCED data and a suggestion for integrating the quantitative results with a visual representation is provided, which is useful given that such evidence is currently scattered over multiple technical articles in the literature. Multilevel modeling is an approach that can be used to summarize single ...

  19. Meta-analysis of single-case treatment effects on self-injurious

    Also, in recent years, methodological researchers have greatly advanced understandings of and techniques for meta-analysis of single-case research data (i.e., the primary research design employed in research on treatment of SIB; "Methodological Dilemmas," April 2012; "Methodological Issues," April 2013; "Handling Methodological Issues ...

  20. Meta-Analysis of Single-Case Research via Multilevel Models

    Multilevel modeling is an approach that can be used to summarize single-case experimental design (SCED) data. ... Meta-Analysis of Single-Case Research via Multilevel Models: Fundamental Concepts and Methodological Considerations Behav Modif. 2020 Mar;44(2):265-295. doi: 10.1177/0145445518806867. Epub 2018 Oct 26. ...

  21. A Multilevel Meta-analysis of Single-Case Research on Interventions for

    A meta-analysis of single-case research data permits researchers to synthesize the results of published studies quantitively to further help determining an evidence-base for therapies for internalizing disorders in children and adolescents (Dowdy et al., 2021; Onghena, Michiels, ...

  22. A Meta-Analysis of Single-Case Research Using Mathematics Manipulatives

    Meta-analysis of single-case design research: Introduction to the special issue. Journal of Behavioral Education, 21, 175-184. Crossref. Google Scholar. Busk P. L., Serlin R. C. (1992). Meta-analysis for single-case research.

  23. Using Meta-Analysis to Make Sense of Large Data Sets

    This research case addresses how to apply meta-analyses to same-study large data sets. The study employed a systemic approach, building on a comprehensive customer satisfaction survey that a multinational retailer administered to its customers from 47 countries shopping in more than 400 stores.

  24. Double stent-retriever technique for mechanical thrombectomy: a

    The complication rate remained low, with 0.37% dissection and 1.56% subarachnoid hemorrhage. LIMITATIONS: Limitations of the study include (1) a large number of case reports or small series, (2) a meta-analysis of proportions with no statistical comparison to a control group, and (3) the lack of access to patient-level data.

  25. Current safeguards, risk mitigation, and transparency ...

    Objectives To evaluate the effectiveness of safeguards to prevent large language models (LLMs) from being misused to generate health disinformation, and to evaluate the transparency of artificial intelligence (AI) developers regarding their risk mitigation processes against observed vulnerabilities. Design Repeated cross sectional analysis. Setting Publicly accessible LLMs. Methods In a ...

  26. Multitiered Systems of Support in Preschool Settings: A Review and Meta

    If single-case research is to inform practice or potentially more costly, multisite studies on MTSS in preschool settings, then further research is needed using single-case methodology to refine interventions implemented within tiered support systems. ... Synthesis and meta-analysis of single case research. In Ledford J. R., Gast D. L. (Eds ...