
Say that we have means and SDs at pre-test and post-test for participants in each of two conditions, 0 and 1, and furthermore that these summary statistics are reported separately for two different sub-groups, $A$ and $B$. Let $n_{gj}$ be the sample size for sub-group $g = A, B$ in condition $j = 0, 1$. Let $\bar{y}_{gjt}$ be the sample mean of the outcome at time $t = 0, 1$ (where $t = 0$ is the pre-test and $t = 1$ is the post-test), and let $s_{gjt}$ be the sample standard deviation at time $t = 0, 1$.

To recover the summary statistics for the full sample (pooled across sub-groups), we can do the following:

  • The total sample size in condition $j$ is $$n_{\bullet j} = n_{Aj} + n_{Bj}.$$
  • The average outcome in condition $j$ at time $t$ is $$\bar{y}_{\bullet jt} = \frac{n_{Aj} \bar{y}_{Ajt} + n_{Bj} \bar{y}_{Bjt}}{n_{\bullet j}}.$$
  • The full-sample variance in condition $j$ at time $t$ is $$s_{\bullet jt}^2 = \frac{1}{n_{\bullet j} - 1} \left[(n_{Aj} - 1) s_{Ajt}^2 + (n_{Bj} - 1) s_{Bjt}^2 + \frac{n_{Aj} n_{Bj}}{n_{\bullet j}} (\bar{y}_{Ajt} - \bar{y}_{Bjt})^2 \right].$$
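The three formulas above can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function name `pool_two_subgroups` and the raw scores are hypothetical, chosen only so that the pooled result can be compared with the SD computed directly from the combined data.

```python
import math
import statistics


def pool_two_subgroups(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Combine (n, mean, SD) from sub-groups A and B into full-sample values."""
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    # Within-sub-group sums of squares plus the between-sub-group term.
    sum_sq = (
        (n_a - 1) * sd_a**2
        + (n_b - 1) * sd_b**2
        + (n_a * n_b / n) * (mean_a - mean_b) ** 2
    )
    return n, mean, math.sqrt(sum_sq / (n - 1))


# Hypothetical raw scores for one condition at one time point.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 6.0, 8.0, 10.0]

n, mean, sd = pool_two_subgroups(
    len(group_a), statistics.mean(group_a), statistics.stdev(group_a),
    len(group_b), statistics.mean(group_b), statistics.stdev(group_b),
)

# SD computed directly from the combined raw scores, for comparison.
direct_sd = statistics.stdev(group_a + group_b)
print(n, mean, sd, direct_sd)
```

The pooled SD agrees with the directly computed one because the between-sub-group term restores the variation that the within-sub-group SDs leave out.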

From these “rehydrated” summary statistics, one could calculate a standardized mean difference at post-test, adjusting for pre-test differences, by taking

$$d_p = \frac{\left(\bar{y}_{\bullet 11} - \bar{y}_{\bullet 01}\right) - \left(\bar{y}_{\bullet 10} - \bar{y}_{\bullet 00}\right)}{s_{\bullet \bullet 1}},$$

where

$$s_{\bullet \bullet 1} = \sqrt{\frac{1}{n_{\bullet 0} + n_{\bullet 1} - 2} \left[ (n_{\bullet 0} - 1) s_{\bullet 01}^2 + (n_{\bullet 1} - 1) s_{\bullet 11}^2\right]},$$

i.e., the pooled sample standard deviation at post-test. The sampling variance of $d_p$ can be approximated as

$$\text{Var}(d_p) \approx 2\left(1 - \rho\right)\left(\frac{1}{n_{\bullet 0}} + \frac{1}{n_{\bullet 1}}\right) + \frac{d_p^2}{2\left(n_{\bullet 0} + n_{\bullet 1} - 2\right)},$$

where $\rho$ is the correlation between the pre-test and the post-test within each condition and each sub-group.
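As an illustration of this calculation, here is a short Python sketch. The function name `smd_diff_in_diff`, its argument names, and the input values are all hypothetical. It assumes the conventional pooled SD with $n_{\bullet 0} + n_{\bullet 1} - 2$ degrees of freedom and a user-supplied value of $\rho$, which usually has to be imputed because it is rarely reported.

```python
import math


def smd_diff_in_diff(n0, n1, pre0, post0, pre1, post1, sd_post0, sd_post1, rho):
    """Pre-test-adjusted SMD and its approximate sampling variance.

    Inputs are full-sample ("rehydrated") statistics for conditions 0 and 1.
    """
    df = n0 + n1 - 2
    # Pooled post-test SD (assumption: conventional n0 + n1 - 2 denominator).
    s_pooled = math.sqrt(((n0 - 1) * sd_post0**2 + (n1 - 1) * sd_post1**2) / df)
    # Post-test difference minus pre-test difference, in SD units.
    d_p = ((post1 - post0) - (pre1 - pre0)) / s_pooled
    var_d = 2 * (1 - rho) * (1 / n0 + 1 / n1) + d_p**2 / (2 * df)
    return d_p, var_d


# Hypothetical full-sample statistics, with an assumed correlation of 0.5.
d_p, var_d = smd_diff_in_diff(
    n0=50, n1=50,
    pre0=10.0, post0=12.0,
    pre1=10.5, post1=15.5,
    sd_post0=5.0, sd_post1=5.0,
    rho=0.5,
)
print(d_p, var_d)
```

Because the first variance term scales with $1 - \rho$, it is worth rerunning the calculation over a range of plausible correlations when $\rho$ is not reported.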

Alternatively, one could take a slightly different approach to calculating the numerator of the SMD: calculate the adjusted mean difference within each sub-group, then take a weighted average with weights equal to the total sample size of each sub-group. This amounts to using a mean difference that adjusts for sub-group differences. Denote the difference-in-differences within each sub-group as

$$DD_{g} = \left(\bar{y}_{g11} - \bar{y}_{g01}\right) - \left(\bar{y}_{g10} - \bar{y}_{g00}\right).$$

Then the average difference-in-differences is

$$DD_{\bullet} = \frac{1}{n_{\bullet \bullet}}\left[n_{A\bullet} DD_{A} + n_{B\bullet} DD_{B} \right],$$

where $n_{g\bullet} = n_{g 0} + n_{g 1}$ and $n_{\bullet \bullet} = n_{A\bullet} + n_{B\bullet} = n_{\bullet 0} + n_{\bullet 1}$. This average difference-in-differences could then be used in the numerator of the SMD, as

$$d_{sg} = \frac{DD_\bullet}{s_{\bullet \bullet 1}}.$$

The sampling variance of $d_{sg}$ can be approximated as

$$\text{Var}(d_{sg}) \approx \frac{2\left(1 - \rho\right)}{n_{\bullet \bullet}^2}\left[\frac{n_{A\bullet}^3}{n_{A0} n_{A1}} + \frac{n_{B\bullet}^3}{n_{B0} n_{B1}}\right] + \frac{d_{sg}^2}{2\left(n_{\bullet \bullet} - 2\right)}.$$
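The first term of this variance can be motivated by a short derivation. The sketch below rests on assumptions that are not stated in the text: the outcome has a common variance $\sigma^2$ at pre-test and post-test in every cell, the pre-post correlation is $\rho$ in every cell, and the cells are independent of one another.

$$
\begin{aligned}
\text{Var}\left(\bar{y}_{gj1} - \bar{y}_{gj0}\right) &= \frac{2(1 - \rho)\sigma^2}{n_{gj}}, \\
\text{Var}\left(DD_g\right) &= 2(1 - \rho)\sigma^2\left(\frac{1}{n_{g0}} + \frac{1}{n_{g1}}\right) = 2(1 - \rho)\sigma^2 \frac{n_{g\bullet}}{n_{g0} n_{g1}}, \\
\text{Var}\left(DD_\bullet\right) &= \sum_{g \in \{A, B\}} \frac{n_{g\bullet}^2}{n_{\bullet\bullet}^2} \text{Var}\left(DD_g\right) = \frac{2(1 - \rho)\sigma^2}{n_{\bullet\bullet}^2} \sum_{g \in \{A, B\}} \frac{n_{g\bullet}^3}{n_{g0} n_{g1}}.
\end{aligned}
$$

Dividing $\text{Var}(DD_\bullet)$ by $\sigma^2$ puts it on the standardized scale and gives the first term. The second term reflects the additional uncertainty from estimating the standard deviation in the denominator.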

Multiple sub-groups

Now suppose that we have the same data as above, but reported separately for $G$ different sub-groups, indexed by $g = 1,...,G$. Let $n_{gj}$ be the sample size for sub-group $g = 1,...,G$ in condition $j = 0, 1$. Let $\bar{y}_{gjt}$ be the sample mean of the outcome at time $t = 0, 1$ (where $t = 0$ is the pre-test and $t = 1$ is the post-test), and let $s_{gjt}$ be the sample standard deviation at time $t = 0, 1$.

To recover the summary statistics for the full sample (pooled across sub-groups), we can do the following:

  • The total sample size in condition $j$ is $$n_{\bullet j} = \sum_{g = 1}^G n_{gj}.$$
  • The average outcome in condition $j$ at time $t$ is $$\bar{y}_{\bullet jt} = \frac{1}{n_{\bullet j}} \sum_{g = 1}^G n_{gj} \bar{y}_{gjt}.$$
  • The full-sample variance in condition $j$ at time $t$ is $$s_{\bullet jt}^2 = \frac{1}{n_{\bullet j} - 1} \sum_{g = 1}^G \left[\left(n_{gj} - 1 \right) s_{gjt}^2 + n_{gj}\left(\bar{y}_{gjt} - \bar{y}_{\bullet jt}\right)^2\right].$$
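The general formulas can be sketched in Python as follows. The function name `pool_subgroups` and the raw scores are hypothetical, used only to confirm that pooling the summary statistics reproduces the SD of the combined data.

```python
import math
import statistics


def pool_subgroups(ns, means, sds):
    """Combine per-sub-group (n, mean, SD) lists into full-sample values."""
    n_total = sum(ns)
    mean_total = sum(n * m for n, m in zip(ns, means)) / n_total
    # Within-sub-group sums of squares plus squared deviations of each
    # sub-group mean from the full-sample mean.
    sum_sq = sum(
        (n - 1) * s**2 + n * (m - mean_total) ** 2
        for n, m, s in zip(ns, means, sds)
    )
    return n_total, mean_total, math.sqrt(sum_sq / (n_total - 1))


# Hypothetical raw scores for three sub-groups in one condition at one time.
groups = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0, 10.0], [5.0, 5.0, 7.0]]

n_total, mean_total, sd_total = pool_subgroups(
    [len(g) for g in groups],
    [statistics.mean(g) for g in groups],
    [statistics.stdev(g) for g in groups],
)

# SD computed directly from all raw scores, for comparison.
all_scores = [x for g in groups for x in g]
direct_sd = statistics.stdev(all_scores)
print(n_total, mean_total, sd_total, direct_sd)
```

In practice the function would be applied once for each combination of condition and time point, giving the four sets of full-sample statistics needed for the SMD.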

From these “rehydrated” summary statistics, one could calculate a standardized mean difference at post-test, adjusting for pre-test differences, as described above.

Alternatively, one could calculate the numerator of the SMD as the adjusted mean difference, pooled across sub-groups. The average difference-in-differences is

$$DD_{\bullet} = \frac{1}{n_{\bullet \bullet}} \sum_{g=1}^G n_{g \bullet} \ DD_{g},$$

where $n_{g\bullet} = n_{g 0} + n_{g 1}$ and $n_{\bullet \bullet} = \sum_{g = 1}^G n_{g\bullet}$. This average difference-in-differences could then be used in the numerator of the SMD, as

$$d_{sg} = \frac{DD_\bullet}{s_{\bullet \bullet 1}}.$$

The sampling variance of $d_{sg}$ can be approximated as

$$\text{Var}(d_{sg}) \approx 2\left(1 - \rho\right)\left[\sum_{g=1}^G \frac{n_{g\bullet}^2}{n_{\bullet \bullet}^2} \left(\frac{1}{n_{g0}} + \frac{1}{n_{g1}}\right)\right] + \frac{d_{sg}^2}{2\left(n_{\bullet \bullet} - 2\right)}.$$
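A minimal Python sketch of the sub-group-adjusted SMD follows. The function name `smd_subgroup_adjusted`, the dictionary keys, and the numbers are hypothetical. The pooled post-test SD is passed in as an argument, on the assumption that it has already been computed from the rehydrated full-sample statistics, and $\rho$ is an assumed value.

```python
import math


def smd_subgroup_adjusted(cells, s_pooled_post, rho):
    """Sub-group-adjusted SMD and its approximate sampling variance.

    `cells` holds one dict per sub-group with sample sizes (n0, n1) and
    pre/post means for condition 0 (pre0, post0) and condition 1 (pre1, post1).
    """
    n_total = sum(c["n0"] + c["n1"] for c in cells)
    dd_avg = 0.0
    var_weight = 0.0
    for c in cells:
        n_g = c["n0"] + c["n1"]
        # Difference-in-differences within this sub-group.
        dd_g = (c["post1"] - c["post0"]) - (c["pre1"] - c["pre0"])
        dd_avg += n_g * dd_g / n_total
        var_weight += (n_g / n_total) ** 2 * (1 / c["n0"] + 1 / c["n1"])
    d_sg = dd_avg / s_pooled_post
    var_d = 2 * (1 - rho) * var_weight + d_sg**2 / (2 * (n_total - 2))
    return d_sg, var_d


# Hypothetical summary statistics for two sub-groups.
cells = [
    {"n0": 20, "n1": 20, "pre0": 10.0, "post0": 11.0, "pre1": 10.0, "post1": 13.0},
    {"n0": 30, "n1": 30, "pre0": 12.0, "post0": 13.0, "pre1": 12.0, "post1": 16.0},
]
d_sg, var_d = smd_subgroup_adjusted(cells, s_pooled_post=5.0, rho=0.5)
print(d_sg, var_d)
```

Here the sub-group differences-in-differences are 2 and 3, weighted 0.4 and 0.6, so the numerator is 2.6 before dividing by the assumed pooled SD of 5.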

INSERT EMPIRICAL EXAMPLE