Exercise for depression: the evidence does not support this treatment

I’ve written about this before (in 2014, and in 2017), but due to popular demand, we cover it again. The topic is important because depression is perhaps the most common mental illness, or at least the most common unpleasant state of mind if you don’t want to medicalize things unnecessarily. Exercising is almost free, so if it works even moderately well, it is a great treatment option.

Most of us have probably experienced being sad after a breakup or the loss of a loved one. Whatever the reason someone is depressed, we can generally expect them to get better over time. Why? Because of the natural history of the disease, or what we might call regression towards the long-term average (mainly the genetic set point). Dalliard has a long post on regression towards the mean phenomena. The idea is that everybody’s level of sadness fluctuates over time. Sometimes events trigger a large upward spike (e.g. death of a loved one, job loss). Some of these spikes cross the threshold to clinical depression, defined arbitrarily as some number of checked items on some list. The effect of the event fades over time, and the person returns towards their long-term average. At some point, they are then no longer clinically depressed. The point of this introduction is to make sure the reader understands that one cannot conclude much of anything about treatments from studies that lack a control group.
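To make the regression towards the mean argument concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the threshold of 2.0, the equal split between a stable set point and transient fluctuations, and the normal distributions. Nobody in the simulation receives any treatment.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 100_000
threshold = 2.0  # hypothetical cutoff for "clinically depressed"

# Each person has a stable long-term set point plus independent
# transient fluctuations at each of two time points.
set_point = rng.normal(0.0, 1.0, n_people)
sadness_t1 = set_point + rng.normal(0.0, 1.0, n_people)
sadness_t2 = set_point + rng.normal(0.0, 1.0, n_people)  # no treatment given

# Select the people who cross the threshold at time 1,
# then look at the same people at time 2.
depressed_t1 = sadness_t1 > threshold
mean_t1 = sadness_t1[depressed_t1].mean()
mean_t2 = sadness_t2[depressed_t1].mean()
still_depressed = (sadness_t2[depressed_t1] > threshold).mean()

print(f"selected at time 1, mean sadness: {mean_t1:.2f}")
print(f"same people at time 2, mean sadness: {mean_t2:.2f}")
print(f"share still above threshold at time 2: {still_depressed:.2f}")
```

Under these assumptions, the selected group improves a lot on average with no intervention at all, which is exactly the improvement an uncontrolled study would credit to the treatment.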

Generally speaking then, we have two main ways to approach this issue. First, we look for meta-analyses of randomized controlled trials of exercise for depression. There are a couple of these, so we tend to pick the newer ones that cover more studies. As I mentioned before, we begin with the Cochrane Library since this is generally the best source of meta-analyses. Second, we cover twin studies, where the ability to control for genetic and stable environmental confounding sheds light on the causality of depression and exercise.

Randomized controlled trial meta-analyses

I see a Cochrane analysis from 2008, and one from 2013. I don’t see anything newer (except for ‘dance movement therapy’), so the latest one is about 8 years out of date. Quoting the latter:

Thirty‐nine trials (2326 participants) fulfilled our inclusion criteria, of which 37 provided data for meta‐analyses. There were multiple sources of bias in many of the trials; randomisation was adequately concealed in 14 studies, 15 used intention‐to‐treat analyses and 12 used blinded outcome assessors.

For the 35 trials (1356 participants) comparing exercise with no treatment or a control intervention, the pooled SMD for the primary outcome of depression at the end of treatment was ‐0.62 (95% confidence interval (CI) ‐0.81 to ‐0.42), indicating a moderate clinical effect. There was moderate heterogeneity (I² = 63%).

When we included only the six trials (464 participants) with adequate allocation concealment, intention‐to‐treat analysis and blinded outcome assessment, the pooled SMD for this outcome was not statistically significant (‐0.18, 95% CI ‐0.47 to 0.11). Pooled data from the eight trials (377 participants) providing long‐term follow‐up data on mood found a small effect in favour of exercise (SMD ‐0.33, 95% CI ‐0.63 to ‐0.03).

Authors’ conclusions

Exercise is moderately more effective than a control intervention for reducing symptoms of depression, but analysis of methodologically robust trials only shows a smaller effect in favour of exercise. When compared to psychological or pharmacological therapies, exercise appears to be no more effective, though this conclusion is based on a few small trials.

There are a lot of trials, and the authors think it works. The authors do talk about publication bias, but they don’t make a big deal out of it. Their plot:

They write:

Visually our funnel plot appeared to be asymmetrical. There is evidence of bias (Begg P value = 0.02, Egger P value = 0.002) (funnel plot for Analysis 1.1, exercise versus control, Figure 4) that might be due to publication bias, to outcome reporting bias or to heterogeneity.
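For readers unfamiliar with the Egger test mentioned in the quote, here is a minimal sketch of the idea: regress each study's standardized effect (effect divided by its standard error) on its precision (1 divided by the standard error) and check whether the intercept is far from 0. The data below are synthetic and are not taken from the Cochrane review. The selection rule and the `egger_test` helper are assumptions made up for illustration, and positive effects favor the treatment here, which is the opposite of the sign convention in the quoted SMDs.

```python
import numpy as np

def egger_test(effects, std_errors):
    """Egger's regression: OLS of effect/SE on 1/SE.

    Returns (intercept, t_statistic). An intercept far from 0
    suggests funnel plot asymmetry (small-study effects).
    """
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    y = effects / se  # standardized effect of each study
    X = np.column_stack([np.ones_like(se), 1.0 / se])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0], beta[0] / np.sqrt(cov[0, 0])

# Synthetic candidate studies: the true effect is exactly 0.
rng = np.random.default_rng(42)
se_all = rng.uniform(0.05, 0.5, 2000)
d_all = rng.normal(0.0, se_all)

# Assumed selection rule: large studies are always published,
# small ones only when significantly positive.
published = (se_all < 0.15) | (d_all / se_all > 1.96)

b_all, t_all = egger_test(d_all, se_all)
b_pub, t_pub = egger_test(d_all[published], se_all[published])
print(f"all studies:       intercept {b_all:.2f}, t = {t_all:.2f}")
print(f"published studies: intercept {b_pub:.2f}, t = {t_pub:.2f}")
```

With selective publication of small studies, the intercept moves clearly away from 0 even though the true effect is 0, which is the pattern the test is designed to flag. As the quoted authors note, heterogeneity can produce asymmetry too, so the test alone does not identify the cause.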

Looking at the plot, we see that the largest studies are all clustered around 0, and many smaller studies provide extreme effect sizes, some even larger than 2 standard deviations! These are not credible and trying to average across these is foolish. OK, so let’s look at a newer larger study. Let’s try Gordon et al 2018:

Objectives  To estimate the association of efficacy of RET with depressive symptoms and determine the extent to which logical, theoretical, and/or prior empirical variables are associated with depressive symptoms and whether the association of efficacy of RET with depressive symptoms accounts for variability in the overall effect size.

Data Sources  Articles published before August 2017, located using Google Scholar, MEDLINE, PsycINFO, PubMed, and Web of Science.

Study Selection  Randomized clinical trials included randomization to RET (n = 947) or a nonactive control condition (n = 930).

Data Extraction and Synthesis  Hedges d effect sizes were computed and random-effects models were used for all analyses. Meta-regression was conducted to quantify the potential moderating influence of participant and trial characteristics.

Main Outcomes and Measures  Randomized clinical trials used validated measures of depressive symptoms assessed at baseline and midintervention and/or postintervention. Four primary moderators were selected a priori to provide focused research hypotheses about variation in effect size: total volume of prescribed RET, whether participants were healthy or physically or mentally ill, whether or not allocation and/or assessment were blinded, and whether or not the RET intervention resulted in a significant improvement in strength.

Results  Fifty-four effects were derived from 33 randomized clinical trials involving 1877 participants. Resistance exercise training was associated with a significant reduction in depressive symptoms with a moderate-sized mean effect ∆ of 0.66 (95% CI, 0.48-0.83; z = 7.35; P < .001). Significant heterogeneity was indicated (total Q = 216.92, df = 53; P < .001; I² = 76.0% [95% CI, 72.7%-79.0%]), and sampling error accounted for 32.9% of observed variance. The number needed to treat was 4. Total volume of prescribed RET, participant health status, and strength improvements were not significantly associated with the antidepressant effect of RET. However, smaller reductions in depressive symptoms were derived from randomized clinical trials with blinded allocation and/or assessment.

Conclusions and Relevance  Resistance exercise training significantly reduced depressive symptoms among adults regardless of health status, total prescribed volume of RET, or significant improvements in strength. Better-quality randomized clinical trials blinding both allocation and assessment and comparing RET with other empirically supported treatments for depressive symptoms are needed.

A limited scope, due to the focus on only a specific kind of exercise, but let’s look at their forest plot:

The plot is a bit confusing since the studies on the right side are actually supposed to be under the ones on the left. The plot was designed like this so as to not be so long. In this case we see they find a mean effect size of 0.66 d, similar to Cochrane in 2013 which found 0.62 d. But we also see the same pattern: the big studies — those with large boxes — are all near 0, and all but one of them overlap with 0 in the 95% confidence interval (the last is Goldfield; 95% CI 0.02 to 0.66 so p is close to .05). The smaller studies are largely on the right tail with implausibly large effect sizes, one even above 4 standard deviations! They provide the funnel plot in their supplement as well:

Clearly, this is NOT how we expect things to look if this is solid evidence in favor of this treatment. It looks like researchers are fooling themselves in the same way they fool themselves about stereotype threat, social priming and all the other stuff.
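Why does it matter so much that the big studies sit near 0 while the small ones sit far out? Gordon et al used random-effects models, and when heterogeneity is high, such models weight small studies almost as heavily as large ones. The sketch below compares a fixed-effect (inverse-variance) pooled estimate with a DerSimonian-Laird random-effects estimate. The ten effect sizes and standard errors are invented to mimic the pattern in the plots; they are not extracted from any of the meta-analyses, and `pool` is a hypothetical helper name.

```python
import numpy as np

def pool(effects, std_errors):
    """Return (fixed_effect, random_effects, tau2) pooled estimates.

    Random-effects uses the DerSimonian-Laird estimator of the
    between-study variance tau2.
    """
    d = np.asarray(effects, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2
    w = 1.0 / v                                # inverse-variance weights
    fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - fixed) ** 2)           # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)    # DerSimonian-Laird
    w_re = 1.0 / (v + tau2)                    # flatter weights
    random = np.sum(w_re * d) / np.sum(w_re)
    return fixed, random, tau2

# Invented example: four large studies near 0, six small studies
# with large effects (positive = favors treatment).
effects = [0.05, 0.10, -0.02, 0.15, 0.9, 1.2, 1.5, 0.8, 2.1, 1.1]
ses = [0.08, 0.10, 0.09, 0.12, 0.40, 0.45, 0.50, 0.35, 0.60, 0.42]

fixed, random, tau2 = pool(effects, ses)
print(f"fixed-effect estimate:   {fixed:.2f}")
print(f"random-effects estimate: {random:.2f}")
print(f"tau^2:                   {tau2:.3f}")
```

In this invented example, the random-effects estimate comes out several times larger than the fixed-effect one, purely because the small extreme studies get more relative weight. A moderate pooled effect is therefore compatible with the precise studies finding close to nothing.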

Anything newer still? I found a multi-outcome meta-analysis in Dauwan et al 2021:

We performed a meta-analysis to synthesize evidence on the efficacy and safety of physical exercise as an add-on therapeutic intervention for quality of life (QoL), depressive symptoms and cognition across six chronic brain disorders: Alzheimer’s disease, Huntington’s disease, multiple sclerosis, Parkinson’s disease, schizophrenia and unipolar depression. 122 studies ( = k) (n = 7231) were included. Exercise was superior to treatment as usual in improving QoL (k = 64, n = 4334, ES = 0.40, p < 0.0001), depressive symptoms (k = 60, n = 2909, ES = 0.78, p < 0.0001), the cognitive domains attention and working memory (k = 21, n = 1313, ES = 0.24, p < 0.009), executive functioning (k = 14, n = 977, ES = 0.15, p = 0.013), memory (k = 12, n = 994, ES = 0.12, p = 0.038) and psychomotor speed (k = 16, n = 896, ES = 0.23, p = 0.003). Meta-regression showed a dose–response effect for exercise time (min/week) on depressive symptoms (β = 0.007, p = 0.012). 69% of the studies that reported on safety, found no complications. Exercise is an efficacious and safe add-on therapeutic intervention showing a medium-sized effect on QoL and a large effect on mood in patients with chronic brain disorders, with a positive dose–response correlation. Exercise also improved several cognitive domains with small but significant effects.

OK, looks nice as usual. Let’s dive in again. For simplicity, we use just the studies on depression symptoms (k = 60), where they find an effect size of 0.78 d. They don’t supply any funnel plots, but they conduct simple publication bias tests, all of which give p < .05. For memory and cognition generally, our prior for the effect size is 0 with a narrow band (one cannot boost intelligence even with great effort). Anyway, they provide the summary statistics in the supplementary files (supplementary figure 3 to be exact, which contains no figures). Since we don’t believe in laziness on this blog, I extracted the tabular data using Tabula, and then loaded it into R. You can find the code here. And the data here. The forest plot, sorted by standard error, looks like this:

Oh dear! Looks terrible. The largest studies are mostly around 0. Let’s look at the funnel plot too:

It looks extremely bad too. We cannot take this meta-analysis as indicating much of interest. Furthermore, while we could try to adjust for publication bias, research shows this approach does not work very well. So let’s turn to the next kind of evidence.

Family studies

There are at least two twin control (monozygotic control) studies I know of:


We investigated the association between leisure time exercise participation and well-being (i.e., life satisfaction and happiness) and examined the causality underlying this association.


The association between exercise participation and well-being was assessed in around 8000 subjects, (age range 18–65 years) from The Netherlands Twin Registry (NTR). Causality was tested with the co-twin control method in 162 monozygotic (MZ) twin pairs, 174 dizygotic (DZ) twin and sibling pairs, and 2842 unrelated individuals.


Exercisers were more satisfied with their life and happier than non-exercisers at all ages. The odds ratio for life satisfaction given exercise participation was significantly higher than unity in unrelated pairs, and a trend was visible in DZ pairs. In MZ pairs, the odds ratio was close to unity. The pattern of odds ratios for happiness given exercise participation was similar.


Exercise participation is associated with higher levels of life satisfaction and happiness. This association is non-causal and appears to be mediated by genetic factors that influence both exercise behavior and well-being.

The sample size is not impressive for an MZ control. They don’t find anything, which may be a power issue.

Context In the population at large, regular exercise is associated with reduced anxious and depressive symptoms. Results of experimental studies in clinical populations suggest a causal effect of exercise on anxiety and depression, but it is unclear whether such a causal effect also drives the population association. We cannot exclude the major contribution of a third underlying factor influencing exercise behavior and symptoms of anxiety and depression.

Objective To test causal effects of exercise on anxious and depressive symptoms in a population-based sample.

Design Population-based longitudinal study (1991-2002) in a genetically informative sample of twin families.

Setting Causal effects of exercise were tested by bivariate genetic modeling of the association between exercise and symptoms of anxiety and depression, correlation of intrapair differences in these traits among genetically identical twins, and longitudinal modeling of changes in exercise behavior and anxious and depressive symptoms.

Participants A total of 5952 twins from the Netherlands Twin Register, 1357 additional siblings, and 1249 parents. All participants were aged 18 to 50 years.

Main Outcome Measurements Survey data about leisure-time exercise (metabolic equivalent task hours per week based on type, frequency, and duration of exercise) and 4 scales of anxious and depressive symptoms (depression, anxiety, somatic anxiety, and neuroticism, plus a composite score).

Results Cross-sectional and longitudinal associations were small and were best explained by common genetic factors with opposite effects on exercise behavior and symptoms of anxiety and depression. In genetically identical twin pairs, the twin who exercised more did not display fewer anxious and depressive symptoms than the co-twin who exercised less. Longitudinal analyses showed that increases in exercise participation did not predict decreases in anxious and depressive symptoms.

Conclusion Regular exercise is associated with reduced anxious and depressive symptoms in the population at large, but the association is not because of causal effects of exercise.

This one is for anxiety, but whatever. The sample size of twins is very large here. They don’t seem to report their confidence intervals or standard errors, so I cannot tell exactly how well powered this study was, but the power looks fairly high. The higher the power, the more we can trust a near-0 finding (because the posterior from the study is narrowly concentrated around 0 only when power, i.e. precision, is very high). They find evidence of genetic correlations between exercise and symptoms of anxiety.
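The logic of the co-twin control design can be sketched with a small simulation. All numbers below are assumptions for illustration (effect sizes, noise levels, number of pairs) and `simulate_mz_pairs` is a hypothetical helper, not anything from the twin papers. A shared genetic factor raises exercise and lowers symptoms; the `causal_effect` parameter controls whether exercise itself does anything.

```python
import numpy as np

def simulate_mz_pairs(causal_effect, n_pairs=50_000, seed=1):
    """Return (population_r, within_pair_r) for simulated MZ twin pairs."""
    rng = np.random.default_rng(seed)
    # One genetic factor per pair, identical for both co-twins.
    genes = rng.normal(0.0, 1.0, n_pairs)[:, None]
    # Genes push exercise up and symptoms down (opposite effects).
    exercise = 0.6 * genes + rng.normal(0.0, 0.8, (n_pairs, 2))
    symptoms = (-0.6 * genes + causal_effect * exercise
                + rng.normal(0.0, 0.8, (n_pairs, 2)))
    # Population association: one twin per pair, treated as unrelated.
    population_r = np.corrcoef(exercise[:, 0], symptoms[:, 0])[0, 1]
    # Within-pair association: differencing removes the shared genes.
    d_exercise = exercise[:, 0] - exercise[:, 1]
    d_symptoms = symptoms[:, 0] - symptoms[:, 1]
    within_r = np.corrcoef(d_exercise, d_symptoms)[0, 1]
    return population_r, within_r

pop_null, within_null = simulate_mz_pairs(causal_effect=0.0)
pop_causal, within_causal = simulate_mz_pairs(causal_effect=-0.3)
print(f"no causal effect: population r {pop_null:.2f}, within-pair r {within_null:.2f}")
print(f"causal effect:    population r {pop_causal:.2f}, within-pair r {within_causal:.2f}")
```

With no causal effect, exercisers still have fewer symptoms in the population, but the twin who exercises more is no better off than the co-twin. With a real causal effect, the within-pair association survives. That contrast is what the MZ comparison is testing.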

2008 is a long time ago too, so I looked for more studies like the above. Looking through 19 pages of citing papers on Google Scholar, I found one more study:


To study whether persistent leisure-time physical activity (PA) during adulthood predicts use of antidepressants later in life.


The Finnish Twin Cohort comprises same-sex twin pairs born before 1958, of whom 11 325 individuals answered PA questions in 1975, 1981 and 1990 at a mean age of 44 years (range 33–60). PA volume over 15-years was used as the predictor of subsequent use of antidepressants. Antidepressant use (measured as number of purchases) for 1995–2004 were collected from the Finnish Social Insurance Institution (KELA) prescription register. Conditional logistic regression was conducted to calculate odds ratios (OR) with 95% confidence intervals (CI) for the use of antidepressants in pairs discordant for PA (642, including 164 monozygotic (MZ) pairs).


Altogether 229 persons had used at least one prescribed antidepressant during the study period. Active co-twins had a lower risk (unadjusted OR 0.80, 95%CI 0.67–0.95) for using any amount of antidepressants than their inactive co-twins; trends being similar for DZ (0.80, 0.67–0.97) and MZ pairs (0.78, 0.51–1.17). The lowest odds ratio (0.51, 0.26–0.98) was seen among MZ pairs after adjusting for BMI, smoking and binge drinking. The point estimates were similar but non-significant for long-term antidepressant use (4+purchases equivalent to 12 months use).


Self-reported physical activity and low number of discordant MZ pairs.


Use of antidepressants was less common among physically active co-twins even when shared childhood experiences and genetic background were controlled for. Physical activity in midlife may therefore be important in preventing mild depression later in life.

The power is very low because of their use of prescriptions as the outcome. The upper bounds of the confidence intervals are very close to 1, so the p values are near .05. This is probably yet another p-hacked / publication biased study. Shrug.
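The abstract reports no p values, but they can be roughly recovered from the odds ratios and confidence intervals. The sketch below assumes the intervals are ordinary Wald intervals, symmetric on the log scale, and it is limited by the rounding of the reported numbers. The helper name `p_from_ratio_ci` is made up for this example.

```python
import math

def p_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate (z, two-sided p) from a ratio estimate and its 95% CI.

    Assumes a Wald interval that is symmetric on the log scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return z, p

# Odds ratios and 95% CIs as reported in the abstract above.
reported = {
    "all discordant pairs": (0.80, 0.67, 0.95),
    "DZ pairs": (0.80, 0.67, 0.97),
    "MZ pairs": (0.78, 0.51, 1.17),
    "MZ pairs, adjusted": (0.51, 0.26, 0.98),
}

results = {}
for label, (odds_ratio, lower, upper) in reported.items():
    z, p = p_from_ratio_ci(odds_ratio, lower, upper)
    results[label] = p
    print(f"{label}: z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions, the significant estimates land roughly between .01 and .05, and the adjusted MZ estimate, the one that best controls for genetics, only just clears .05, while the unadjusted MZ estimate does not.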

All in all, the publication bias in the randomized controlled trials is rather extreme. These meta-analyses are not in the least trustworthy. Turning to twin studies, the first is too small to be useful. The second is very large but about anxiety, not depression. Based on the idea that if exercise treats depression, it probably also treats anxiety, this constitutes evidence that the effect size of this treatment is near 0. If you don’t want to grant that assumption, the evidence against exercise for depression is weaker. The third study is nice in theory but too imprecise and has dubious p values too.

Altogether, again, I find that exercise for depression is not a plausible treatment. It may still be worth trying because it costs very little (mainly time).