Kate Darwent and I are just back from the annual conference of the American Evaluation Association (AEA), which was held in Washington, DC this year. We attended talks on a wide variety of topics and sat in on business meetings for two interest groups (Needs Assessment and Independent Consulting). Below, I discuss some of the current thinking around two very different aspects of measuring outcomes in evaluation.

Selecting Indicators from a Multitude of Possibilities

One session that I found particularly interesting focused on how to select indicators for an evaluation – specifically, what criteria should be used to decide which indicators to include. (This is a recurring topic of interest for me; I mentioned the problem of too many indicators in a long-ago blog post, here.) In evaluation work, indicators are the measures of desired outcomes. Defining an indicator involves operationalizing variables, or finding a way to identify a specific piece of data that will indicate whether an outcome has been achieved. For example, if we want to measure whether a program increases empathy, we have to choose a specific survey question, or scale, or behavior that we will use to measure empathy at baseline and again after the intervention to see if scores go up over that period. For any given outcome there are many possible indicators, and as a result, it is easy to get into a situation known as “indicator proliferation”. At AEA, a team from Kimetrica gave a talk proposing a set of criteria for selecting indicators. They proposed eight criteria that, if used, would result in indicators that would serve each of the five common evaluation paradigms. Their criteria feel intuitively reasonable to me; if you like these, here’s the reference so you can give them full credit for their thinking (Watkins, B. & van den Heever, N. J. (2017, November). Identifying indicators for assessment and evaluation: A burdened system in need of a simplified approach. Paper presented at the meeting of the American Evaluation Association, Washington, DC.). Their proposed criteria are:

  1. Comparability over time
  2. Comparability over space, culture, projects
  3. Standardized data collection/computation
  4. Intelligibility
  5. Sensitivity to context
  6. Replicability/objectivity
  7. Scalability and granularity
  8. Cost for one data point

Propensity Score Matching

In a very different vein is the issue of how best to design an evaluation and analysis plan so that outcomes can be causally attributed to the program.  The gold standard for this is a randomized controlled trial, but in many situations that’s impractical, if not impossible, to execute.  As a result, there is much thinking in the evaluation world about how to statistically compensate for a lack of random assignment of participants to treatment or control groups.

This year, there were a number of sessions on propensity score matching, which is a statistical technique used to select a control group that best matches a treatment group that was not randomly assigned.  For example, suppose we are evaluating a program that was offered to students who were already assigned to a particular health class.  To compare the students who got the program (i.e., “the treatment group”) to similar students who didn’t (i.e., “the controls”), we want to find other students in their grade who match them at baseline on important demographic and academic variables.  Propensity score matching can be used to find that best-matched set of students from among the other kids in the grade who weren’t in the class with the program.
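To make the mechanics concrete, here is a minimal pure-Python sketch of the two basic steps: estimating each student's propensity score (the predicted probability of being in the treatment group, given baseline covariates) with a simple logistic regression, then greedily pairing each treated unit with the unmatched control whose score is closest. This is an illustrative toy, not the presenters' code; real analyses would use an established statistics package, and the function names and data here are invented for the example.

```python
import math

def fit_propensity(X, treated, lr=0.1, iters=2000):
    """Fit logistic regression P(treated | covariates) by gradient descent;
    return each unit's propensity score."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * d, 0.0
        for xi, ti in zip(X, treated):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - ti  # gradient of log-loss w.r.t. z
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return [1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, xi)))))
            for xi in X]

def nearest_neighbor_match(scores, treated):
    """Greedy 1:1 nearest-neighbor matching on propensity score,
    without replacement. Returns (treated_index, control_index) pairs."""
    controls = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t == 1 and controls:
            j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
            pairs.append((i, j))
            controls.remove(j)
    return pairs

# Toy usage: covariates could be, say, baseline test score (standardized).
X = [[1.0], [1.2], [0.1], [0.0], [0.9]]
treated = [1, 1, 0, 0, 0]
scores = fit_propensity(X, treated)
pairs = nearest_neighbor_match(scores, treated)
```

Greedy matching without replacement, as above, is only one of several variants (caliper matching, matching with replacement, optimal matching); which one is appropriate depends on the data, which is part of why the method takes expertise to use well.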

Propensity score matching is not a particularly new idea, but there are a variety of ways to execute it, and, like all statistical techniques, it requires some expertise to implement appropriately.  A number of sessions at the conference provided tutorials and best practices for using this analysis method.

In our work, one of the biggest challenges to using this method is simply the need to get data on demographic and outcome measures for non-participants, let alone getting all of the variables that are relevant to the probability of being a program participant.  Even assuming the necessary data can be obtained, it is still important to be aware that there are many options for how to develop and use propensity scores in an outcome analysis, and there is some controversy about the effectiveness and appropriateness of the various methods.  On top of that, the process of searching for a balanced model can feel a lot like p-hacking, as can trying scores from multiple propensity score models in the prediction model.  So, although it’s a popular method, users need to understand its assumptions and limitations and do their due diligence to ensure they’re using it appropriately.
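The "balance" being checked in that model-fitting loop is usually quantified with a standardized mean difference (SMD) on each baseline covariate: after matching, treated and control groups should have an SMD near zero (a common rule of thumb is below 0.1). A minimal sketch of that diagnostic, again as illustrative pure Python rather than any particular package's implementation:

```python
def standardized_mean_diff(treated_vals, control_vals):
    """SMD for one covariate: (mean_t - mean_c) / pooled SD,
    using sample variances. Near zero indicates good balance."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):  # sample variance (n - 1 denominator)
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled_sd = ((var(treated_vals) + var(control_vals)) / 2) ** 0.5
    if pooled_sd == 0:  # degenerate case: no variation in either group
        return 0.0
    return (mean(treated_vals) - mean(control_vals)) / pooled_sd
```

The p-hacking worry in the paragraph above is precisely that an analyst can keep re-specifying the propensity model until statistics like this one look good, so balance checks should be part of a pre-specified analysis plan rather than an open-ended search.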


All in all, we had an interesting learning experience at AEA this year and brought back some new ideas to apply to our work. Attending professional conferences is a great way to stay on top of developments in the field and get energized about our work.