*I have been hearing a lot lately about propensity scores. What are they, and how can I use them?* — Signed, Lost Cause

Dear Lost,

Propensity scores are useful when trying to draw causal conclusions from observational studies where the “treatment” (i.e., the “independent variable” or alleged cause) was not randomly assigned. For simplicity, let’s suppose the treatment variable has two levels: treated (T=1) and untreated (T=0). The propensity score for a subject is the probability that the subject was treated, P(T=1), given his or her baseline (pre-treatment) characteristics. In a randomized study, the propensity score is known; for example, if the treatment was assigned to each subject by the toss of a coin, then the propensity score for every subject is 0.5. In a typical observational study, the propensity score is not known, because the treatments were not assigned by the researcher. In that case, the propensity scores are often estimated by the fitted values (the “p-hats”) from a logistic regression of T on the subjects’ baseline characteristics.
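To make this concrete, here is a small sketch in Python (using only NumPy; the covariate, coefficients, and sample size are invented for illustration) of how those p-hats might be obtained. It fits the logistic regression of T on a single baseline covariate x by Newton-Raphson rather than calling a packaged routine, just so every step is visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data: one baseline covariate x, and
# treatment assignment that depends on x, so the treated and
# untreated groups differ systematically at baseline.
n = 2000
x = rng.normal(size=n)
true_e = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))   # true P(T=1 | x)
t = (rng.random(n) < true_e).astype(float)

# Logistic regression of T on x, fit by Newton-Raphson:
# the model is P(T=1 | x) = expit(b0 + b1 * x).
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))            # current fitted probabilities
    grad = X.T @ (t - p)                       # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])  # observed information
    beta += np.linalg.solve(hess, grad)

# The fitted values are the estimated propensity scores (the "p-hats").
e_hat = 1 / (1 + np.exp(-X @ beta))
```

In practice one would use a standard logistic-regression routine and include all of the relevant baseline variables in X, but the idea is the same: the estimated propensity score for each subject is simply his or her fitted probability of treatment.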

In an observational study, the treated and untreated groups are not directly comparable, because they may systematically differ at baseline. The propensity score plays an important role in balancing the study groups to make them comparable. Rosenbaum and Rubin (1983) showed that treated and untreated subjects with the same propensity score have the same distribution for all of the baseline variables that went into the score. This “balancing property” means that, if we control for the propensity score when comparing the groups, we have effectively turned the observational study into a randomized block experiment, where the “blocks” are groups of subjects with the same propensities.
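You can see the balancing property at work in a small simulation (again a hypothetical sketch; the propensity score is taken as known here so that the property can be checked directly). Overall, the treated subjects have much higher values of the baseline covariate x, but within quintiles of the propensity score the treated-untreated gap in x shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: treatment depends on the baseline covariate x
# through a known propensity score e.
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                 # propensity score P(T=1 | x)
t = (rng.random(n) < e).astype(float)

# Unadjusted, the treated have systematically higher x at baseline.
raw_gap = x[t == 1].mean() - x[t == 0].mean()

# Within propensity-score quintiles ("blocks"), the gap in x
# between treated and untreated subjects shrinks toward zero.
edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(e, edges)          # quintile labels 0..4
gaps = [x[(stratum == s) & (t == 1)].mean()
        - x[(stratum == s) & (t == 0)].mean() for s in range(5)]
```

The raw gap is large, while the within-quintile gaps are close to zero; controlling for the propensity score has made the groups comparable on x.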

You may be wondering: why do we need to control for the propensity score, rather than controlling for the baseline variables directly? When we regress the outcome on T and the other baseline characteristics, the coefficient for T is an average causal effect only under two very restrictive conditions: the relations between the response and the baseline variables must be linear, and the slopes must be the same whether T=0 or T=1. More elaborate analysis of covariance (ANCOVA) models can give better results (Schafer & Kang, under review), but they make other assumptions. Propensity scores provide alternative ways to estimate the average causal effect of T without strong assumptions about how the response is related to the baseline variables.

So, how do we use the propensity scores to estimate the average causal effect of T? Because the propensity score has the balancing property, we can divide the sample into subgroups (e.g., quintiles) based on the propensity scores. Then we can estimate the effect of T within each subgroup by an ordinary t-test, and pool the results across subgroups (Rosenbaum & Rubin, 1984). Alternatives to subclassification include matching and weighting. In matching, we find a subset of untreated individuals whose propensity scores are similar to those of the treated persons, or vice versa (Rosenbaum, 2002). In weighting, we compare weighted averages of the response for treated and untreated persons, weighting the treated ones by 1/P(T=1) and the untreated ones by 1/P(T=0) (Lunceford & Davidian, 2004). However, subjects with propensity scores near 0 or 1 receive very large weights, which can make the weighted estimates unstable.
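The subclassification and weighting estimators can be sketched as follows (another hypothetical simulation; the true effect of T is set to 2.0, and the propensity score is taken as known here, whereas in practice it would be the estimated p-hat). The naive difference in means is biased, while both propensity-based estimates land near the truth:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulation: the true average causal effect of T is 2.0,
# but treatment assignment depends on the baseline covariate x.
n = 10000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                 # propensity score P(T=1 | x)
t = (rng.random(n) < e).astype(float)
y = x + 2.0 * t + rng.normal(size=n)     # outcome; effect of T is 2.0

# Naive difference in means is biased upward, because the treated
# group has higher x at baseline.
naive = y[t == 1].mean() - y[t == 0].mean()

# Subclassification: average the within-quintile differences in means
# (as in Rosenbaum & Rubin, 1984).
edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(e, edges)
strat_est = np.mean([y[(stratum == s) & (t == 1)].mean()
                     - y[(stratum == s) & (t == 0)].mean()
                     for s in range(5)])

# Weighting: treated weighted by 1/e, untreated by 1/(1 - e)
# (as in Lunceford & Davidian, 2004), with weights normalized.
w1 = t / e
w0 = (1 - t) / (1 - e)
ipw_est = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```

Normalizing the weights to sum to one within each group, as above, is one common way to tame the instability caused by very large weights; trimming or truncating extreme propensity scores is another.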

Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. *Statistics in Medicine, 23,* 2937-2960.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. *Biometrika, 70,* 41-55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. *Journal of the American Statistical Association, 79,* 516-524.

Rosenbaum, P. R. (2002). *Observational Studies* (2nd ed.). New York: Springer-Verlag.

Schafer, J. L., & Kang, J. D. Y. (under review). Average causal effects: A practical guide and simulated case study.