Category: Evaluations

How do you measure the value of an experience?

When I think about the professional development I did last week, I would summarize it thusly: an unexpected, profound experience.

I was given the opportunity to attend RIVA moderator training and I walked away with more than I ever could have dreamed I would get. Do you know that experience where you think back to your original expectations and you realize just how much you truly didn’t understand what you would get out of something? That was me, as I sat on a mostly-empty Southwest plane (156 seats and yet only 15 passengers) flying home. While you can expect a RIVA blog to follow, I was struck by the following thought:

What does it mean to understand the impact your company, product, or service has on your customers?

I feel like I was born and raised to think quantitatively. I approach what I do with as much logic as I can (sometimes this isn’t saying much…). When I think about measuring the impact a company, product, or service has on its customers, my mind immediately jumps to numbers – e.g., who the customers are (demographically) and how satisfied they are with it. But am I really measuring impact? I think yes and no. I’m measuring an impersonal impact; one that turns people into consumers and percentages. The other kind of impact, largely missed in quantitative research, is the impact on the person.

If I were to fill out a satisfaction or brand loyalty survey for RIVA, I would almost be unhappy that I couldn’t convey my thoughts and feelings about the experience. I don’t want them to know just that I was satisfied. I want them to understand how profound this experience was for me. When they talk to potential customers about this RIVA moderator class, I want them to be equipped with my personal story. If they listen and understand what I say to them, I believe they would be better equipped to sell their product.

This is one of the undeniable and extremely powerful strengths of qualitative research. Interviews, focus groups, anything that allows a researcher to sit down and talk to people creates some of the most valuable data there is. We can all think of a time when a friend or family member had such a positive experience with some company, product, or service that they just couldn’t help but gush about it. Qualitative research ensures that the value of that feedback is captured and preserved. If you want to truly understand who is buying your product or using your service, I cannot stress the importance of qualitative research enough.

Measurement Ideas in Evaluation

Kate Darwent and I are just back from the annual conference of the American Evaluation Association (AEA), which was held in Washington, DC this year.  We attended talks on a wide variety of topics, and attended business meetings for two interest groups (Needs Assessment and Independent Consulting).  Below, I discuss some of the current thinking around two very different aspects of measuring outcomes in evaluation.

Selecting Indicators from a Multitude of Possibilities

One session that I found particularly interesting focused on how to select indicators for an evaluation – specifically, what criteria should be used to decide which indicators to include. (This is a recurring topic of interest for me; I mentioned the problem of too many indicators in a long-ago blog, here.) In evaluation work, indicators are the measures of desired outcomes. Defining an indicator involves operationalizing variables, or finding a way to identify a specific piece of data that will indicate whether an outcome has been achieved. For example, if we want to measure whether a program increases empathy, we have to choose a specific survey question, or scale, or behavior that we will use to measure empathy at baseline and again after the intervention to see if scores go up over that period. For any given outcome there are many possible indicators, and as a result, it is easy to get into a situation known as “indicator proliferation”. At AEA, a team from Kimetrica gave a talk proposing a set of criteria for selecting indicators. They proposed eight criteria that, if used, would result in indicators that would serve each of the five common evaluation paradigms. Their criteria feel intuitively reasonable to me; if you like these, here’s the reference so you can give them full credit for their thinking (Watkins, B. & van den Heever, N. J. (2017, November). Identifying indicators for assessment and evaluation: A burdened system in need of a simplified approach. Paper presented at the meeting of the American Evaluation Association, Washington, DC.). Their proposed criteria are:

  1. Comparability over time
  2. Comparability over space, culture, projects
  3. Standardized data collection/computation
  4. Intelligibility
  5. Sensitivity to context
  6. Replicability/objectivity
  7. Scalability and granularity
  8. Cost for one data point

Propensity Score Matching

In a very different vein is the issue of how best to design an evaluation and analysis plan so that outcomes can be causally attributed to the program. The gold standard for this is a randomized controlled trial, but in many situations that’s impractical, if not impossible, to execute. As a result, there is much thinking in the evaluation world about how to statistically compensate for a lack of random assignment of participants to treatment or control groups.

This year, there were a number of sessions on propensity score matching, which is a statistical technique used to select a control group that best matches a treatment group that was not randomly assigned. For example, suppose we are evaluating a program that was offered to students who were already assigned to a particular health class. We want to compare the students who got the program (i.e., “the treatment group”) to other students in their grade who match them at baseline on important demographic and academic variables (i.e., “the controls”). Propensity score matching can be used to find that set of best-matched students from the other kids in the grade who weren’t in the class with the program.

Propensity score matching is not a particularly new idea, but there are a variety of ways to execute it, and, like all statistical techniques, it requires some expertise to implement appropriately. A number of sessions at the conference provided tutorials and best practices for using this analysis method.

In our work, one of the biggest challenges to using this method is simply the need to get data on demographic and outcome measures for non-participants, let alone getting all of the variables that are relevant to the probability of being a program participant.  But, assuming the necessary data can be obtained, it is still important to be aware that there are many options for how to develop and use propensity scores in an outcome analysis, there is some controversy about the effectiveness and appropriateness of various methods, and on top of it all, the process of finding a balanced model feels a lot like p-hacking, as does the potential for trying scores from multiple propensity score models in the prediction model.  So, although it’s a popular method, users need to understand the assumptions and limitations of the method, and do their due diligence to ensure they’re using it appropriately.
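To make the matching idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the approach presented in any of the conference sessions: the student data are simulated, the two covariates (a standardized baseline GPA and a binary demographic flag) are hypothetical, the propensity model is a simple logistic regression fit by gradient descent, and the matching is greedy 1:1 nearest-neighbor matching with an assumed caliper of 0.05. Real analyses would normally use a vetted statistics package and check covariate balance after matching.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic regression by batch gradient descent.
    Returns weights as [intercept, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def propensity(w, x):
    """Estimated probability of being in the program, given covariates x."""
    return sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x)))

def greedy_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the score, without replacement.
    treated and controls map a student id to a propensity score."""
    available = dict(controls)
    pairs = []
    for t_id, t_score in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs

# Simulated (hypothetical) grade of 200 students; not real data.
random.seed(0)
X, in_class = [], []
for _ in range(200):
    gpa_z = random.gauss(0, 1)                      # standardized baseline GPA
    flag = 1.0 if random.random() < 0.4 else 0.0    # hypothetical demographic flag
    p_enroll = sigmoid(-1.5 + 0.8 * gpa_z + 0.5 * flag)  # assumed selection process
    X.append([gpa_z, flag])
    in_class.append(1 if random.random() < p_enroll else 0)

w = fit_logistic(X, in_class)
scores = [propensity(w, x) for x in X]
treated = {i: s for i, s in enumerate(scores) if in_class[i] == 1}
controls = {i: s for i, s in enumerate(scores) if in_class[i] == 0}
pairs = greedy_match(treated, controls)
print(f"{len(treated)} treated students, {len(pairs)} matched to a control")
```

Treated students with no control inside the caliper are left unmatched, which is one of the judgment calls (along with the choice of covariates and matching rule) that makes the method require care.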


All in all, we had an interesting learning experience at AEA this year and brought back some new ideas to apply to our work. Attending professional conferences is a great way to stay on top of developments in the field and get energized about our work.

Keeping it constant: 3 things to keep in mind with your trackers

When conducting a program evaluation or customer tracker (e.g., brand, satisfaction, etc.), we are often collecting input at two different points in time and then measuring the difference. While the concept is straightforward, the challenge is keeping everything as consistent as possible so we can say that the actual change is NOT a result of how we conducted the survey.

Because we can be math nerds sometimes, take the following equation:

A change to any part of the equation to the left of the equal sign will result in changes to your results. Our goal then is to keep all the survey components consistent so any change can be attributed to the thing you want to measure.

These include:

  1. Asking the same questions
  2. Asking them the same way (i.e. research mode)
  3. And asking them to a comparable group

Let’s look at each of these in more detail.

Asking the same questions

This may sound obvious, but it’s too easy to have slight (or major) edits creep into your survey. The problem is, we then cannot say if the change we observed between survey periods is a result of actual change that occurred in the market, or if the change was a result of the changing question (i.e., people interpreted the question slightly differently).

Should you never add or change a question? Not necessarily. If the underlying goal of that question has changed, then it may need to be updated to get you the best information going forward. Sure, you may not be able to compare it looking back, but getting the best information today may outweigh the goal of measuring change on the previous question.

If you are going to change or add questions to the survey, try to keep them at the end of the survey so the experience of the first part of the survey is similar.

Asking them the same way

Just as changing the actual question can cause issues in your tracker, changing how you’re asking them can also make an impact. Moving from telephone to online, from in-person to self-administered, and so on can cause changes due to how respondents understand the question and other social factors. For instance, respondents may give more socially desirable answers when talking to a live interviewer than they will online. Reading a question yourself can lead to a different understanding of the question than when it is read to you.


Similarly, training your data collectors with consistent instructions and expectations makes a difference for research via live interviewers as well. Just because the mode is the same (e.g., intercept surveys, in-class student surveys, etc.) doesn’t mean it’s being implemented the same way.

Asking a comparable group

Again, this may seem obvious, but small changes in who you are asking can impact your results. For instance, if you’re researching your customers, and on one survey you only get feedback from customers who have contacted your help line, and on another survey you surveyed a random sample of all customers, the two groups, despite both being customers, are not in fact the same. The ones who have contacted your help line likely had different experiences – good or bad – that the broader customer base may not have.


So, that’s all great in theory, but we recognize that real life sometimes gets in the way.

For example, one of the key issues we’ve seen is with changing survey modes (i.e., Asking them the same way) and who we are reaching (i.e., Asking a comparable group). Years ago, many of our public surveys were done via telephone. It was quick and reached the majority of the population at a reasonable budget. As cell phones became more dominant and landlines started to disappear, while we could have held the mode constant, the group we were reaching would change as a result. Our first adjustment was to include cell phones along with landlines. This increased costs significantly, but brought us back closer to reaching the same group as before while also benefiting from keeping the overall mode the same (i.e., interviews via telephone).

Today, depending on the exact audience we’re trying to reach, we’re commonly combining modes, meaning we may do phone (landline + cell), mail, and/or online all for one survey. This increases our coverage (http://www.coronainsights.com/2016/05/there-is-more-to-a-quality-survey-than-margin-of-error/), though it does introduce other challenges, as we may have to ask questions a little differently between survey modes. But in the end, we feel it is a worthy tradeoff to have a quality sample of respondents. When we have to change modes midway through a tracker, we work to diminish the possible downsides while drawing on the strengths to improve our sampling accuracy overall.

Defining Best Practices and Evidence-Based Programs

The field of evaluation, like any field, has a lot of jargon.  Jargon provides a short-hand for people in the field to talk about complex things without having to use a lot of words or background explanation, but for the same reason, it’s confusing to people outside the field. A couple of phrases that we get frequent questions about are “best practices” and “evidence-based programs”.

“Evidence-based programs” are those that have been found by a rigorous evaluation to result in statistically significant outcomes for the participants. Similarly, “best practices” are evidence-based programs or aspects of evidence-based programs that have been demonstrated through rigorous evaluation to result in the best outcomes for participants. Sometimes, however, “best practices” is used as an umbrella term to refer to a continuum of practices with varying degrees of support, where the label “best practices” anchors the high end of the continuum. For example, the continuum may include the subcategory of “promising practices,” which typically refers to program components that have some initial support, such as a weakly significant statistical finding, suggesting that those practices may help to achieve meaningful outcomes. Those practices may or may not hold up to further study, but they may be seen as good candidates for additional study.

Does following “best practices” mean your program is guaranteed to have an impact on your participants?  No, it does not.  Similarly, does using the curriculum and following the program manual for an evidence-based program ensure that your program will have an impact on your participants? Again, no.  Following best practices and using evidence-based programs may improve your chances of achieving measurable results for your participants, but if your participants differ demographically (i.e., are older or younger, higher or lower SES, etc.) from the participants in the original study, or if your implementation fidelity does not match the original study, the program/practices may not have the same impact as they did in the original study.  (Further, the original study may have been a type 1 error, but that’s a topic for another day.)  That is why granting agencies ask you to evaluate your program even when you are using an evidence-based program.

To know whether you are making the difference you think you’re making, you need to evaluate the impact of your efforts on your participants. If you are using an evidence-based program with a different group of people than have been studied previously, you will be contributing to the knowledge base for everyone about whether that program may also work for participants like yours. And if you want your program to be considered evidence-based, a rigorous evaluation must be conducted that meets the criteria established by a certifying organization like the Blueprints program at the University of Colorado Boulder, Institute of Behavioral Science, Center for the Study and Prevention of Violence or the Substance Abuse and Mental Health Services Administration’s (SAMHSA) National Registry of Evidence-based Programs and Practices (NREPP).

So, it is a best practice to use evidence-based programs and practices that have been proven to work through rigorous, empirical study, but doing so doesn’t guarantee success on its own. Continued evaluation is still needed.

When experiences can lead you astray

Many organizations tell me that they hear from their participants all the time telling them how much the program changed their lives.  Understandably, those experiences matter a lot to organizations and they want to capture those experiences in their evaluations.

Recently I heard a podcast that perfectly captured the risks in relying too heavily on those kinds of reports.  There are two related issues here.  The first is that while your program may have changed the lives of a few participants, your evaluation is looking to determine whether you made a difference for the majority of participants.  The second is that you are most likely to hear from participants who feel very strongly about your program, and less likely to hear from those who were less affected by it.  An evaluation will ensure that you are hearing from a representative sample of participants (or all participants) and not just a small group that may be biased in a particular direction.

An evaluation plan can ensure you capture both qualitative and quantitative measures of your impact in a way that accurately reflects the experiences of your participants.

Engagement in evaluation

Engaging program participants in the evaluation is known as participatory evaluation.  (See Matt Bruce’s recent blog on participatory research for more detail about this approach.) The logic of participatory evaluation often resonates with human services providers.  It empowers service recipients to define their needs and goals for the program.

It can be eye opening for program staff to hear participants’ views of what is most important to them, and what they’re hoping to get out of the program.  For example, program aspects that are critical to participants may be only incidental to program staff.  This kind of input can lead to improved understanding of the program logic, as well as changes to desired outcomes.

In what ways could you bring participants into your evaluation process?


Writing an RFP

So you’ve finally reached a point where you feel like you need more information to move forward as an organization, and, even more importantly, you’ve been able to secure some amount of funding to do so. Suddenly you find yourself elbow deep in old requests for proposals (RFPs), both from your organization and others, trying to craft an RFP for your project. Where do you start?

We write a lot of proposals in response to RFPs at Corona, and based on what we’ve seen, here are a few suggestions for what to include in your next RFP:

  • Decision to be made or problem being faced. One of the most important pieces of information that is often difficult to find, if not missing from an RFP, is what decision an organization is trying to make or what problem an organization is trying to overcome. Instead, we often see RFPs asking for a specific methodology, while not describing what an organization is planning to do with the information. While specifying the methodology can sometimes be important (e.g., you want to replicate an online survey of donors, you need to perform an evaluation as part of a grant, etc.), sometimes specifying it might limit what bidders suggest in their proposals.

Part of the reason why you hire a consultant is to have them suggest the best way to gather the information that your organization needs. With that in mind, it might be most useful to describe the decision or problem that your organization is facing in layman’s terms and let bidders propose different ways to address it.

  • Other sources of data/contacts. Do you have data that might be relevant to the proposals? Did your organization conduct similar research in the past that you want to replicate or build upon? Do you have contact information for people who you might want to gather information from for this project? All these might be useful pieces of information to include in an RFP.
  • Important deadlines. If you have key deadlines that will shape this project, be sure to include them in the RFP. Timelines can impact proposals in many ways. For example, if a bidder wants to propose a survey, a timeline can determine whether to do a mail survey, which takes longer, or a phone survey, which is often more expensive but quicker.
  • Include a budget, even a rough one. Questions about the budget are the most common questions I see people ask about an RFP. While a budget might scare off a more expensive firm, it is more likely that including a budget in an RFP helps firms propose tasks that are financially feasible.

Requesting proposals can be a useful way to get a sense of what a project might cost, which might be useful if you are trying to figure out how much funding to secure. If so, it’s often helpful to just state in your RFP that you’re considering different options and would like pricing for each recommended task, along with the arguments for why it might be useful.

  • Stakeholders. Who has a stake in the results of the project and who will be involved in decisions about the project? Do you have a single internal person that the contractor will report to or perhaps a small team? Are there others in the organization who will be using the results of the project? Do you have external funders who have goals or reporting needs that you hope the project will meet? Clarifying who has a stake in the project and what role they will play in the project, whether providing input on goals or approving questionnaire design, is very helpful. It is useful for the consultant to know who will need to be involved so they can plan to make sure everyone’s needs are addressed.

Writing RFPs can be daunting, but they can also be a good opportunity for thinking about and synthesizing an important decision or problem into words. Hopefully these suggestions can help with that process!

Beyond the logic model: Improve program outcomes by mapping causes of success and failure

Logic modeling is common in evaluation work, but did you know there are a variety of other tools that can help visualize important program elements and improve planning to ensure success?

One such tool is success mapping.  A success map can be used to outline the steps needed to implement a successful program.  It can also be used to outline the steps needed to accomplish a particular program improvement.  In a success map the steps are specific activities and events to accomplish, and arrows between steps indicate the sequence of activities, in a flow chart style. Compared to a logic model, a success map puts more emphasis on each step of implementation that must occur to ensure that the program is a success.  This can help the program team ensure that responsibilities, timelines, and other resources are assigned to all of the needed tasks.
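As a rough illustration of the flow-chart idea, a success map can be written down as a set of steps and the steps that must come before each one. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to put the steps in a workable sequence and to flag any step that has no owner yet. The step names and owners are hypothetical and are there only to show the structure.

```python
from graphlib import TopologicalSorter

# Hypothetical success map: each step maps to the steps that must finish first.
success_map = {
    "Secure funding": set(),
    "Hire facilitators": {"Secure funding"},
    "Train facilitators": {"Hire facilitators"},
    "Recruit participants": {"Secure funding"},
    "Deliver sessions": {"Train facilitators", "Recruit participants"},
    "Collect follow-up data": {"Deliver sessions"},
}

# Hypothetical responsibility assignments; one step is deliberately left out.
owners = {
    "Secure funding": "Executive director",
    "Hire facilitators": "Program manager",
    "Train facilitators": "Program manager",
    "Recruit participants": "Outreach coordinator",
    "Deliver sessions": "Facilitators",
}

# A valid sequence in which every step comes after its prerequisites.
order = list(TopologicalSorter(success_map).static_order())

# Steps on the map that nobody has been assigned to yet.
unassigned = [step for step in order if step not in owners]

for position, step in enumerate(order, start=1):
    print(f"{position}. {step} (owner: {owners.get(step, 'UNASSIGNED')})")
print("Steps still needing an owner:", unassigned)
```

The same structure could carry timelines or budgets per step; the point is simply that listing every step explicitly makes gaps in responsibility easy to spot.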

A related tool, called fault tree analysis, takes an inverse approach to the success map.  Fault tree analysis starts with a description of an undesirable event (e.g., the program fails to achieve its intended outcome), and then reverse engineers the causal chains that could lead to that failure.  For example, a program may fail to achieve intended outcomes if any one of several components fails (e.g., failure to recruit participants, failure to implement the program as planned, failure of the program design, etc.).  Step-by-step, a fault tree analysis backs out the reasons that particular lines of failure could occur.  This analysis provides a systematic way for the program team to think about which failures are most likely and then to identify steps they can take to reduce the risk of those things occurring.
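For teams that want to put rough numbers on a fault tree, here is a minimal sketch. It assumes every branch is an "any one of these fails" (OR) relationship, as in the example above, and that the underlying failures are independent. The event names and probabilities are hypothetical placeholders; in practice they would come from the program team's own judgment or data.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    name: str
    probability: float = 0.0  # used only for basic events (no children)
    children: List["Event"] = field(default_factory=list)

def failure_probability(event: Event) -> float:
    """Probability that the event occurs, treating each parent as an OR gate
    over independent child events (an assumption of this sketch)."""
    if not event.children:
        return event.probability
    p_no_child_fails = 1.0
    for child in event.children:
        p_no_child_fails *= 1.0 - failure_probability(child)
    return 1.0 - p_no_child_fails

# Hypothetical tree with made-up probabilities.
top = Event("Program fails to achieve intended outcome", children=[
    Event("Too few participants recruited", children=[
        Event("Outreach does not reach eligible people", probability=0.10),
        Event("Eligible people decline to enroll", probability=0.05),
    ]),
    Event("Program not implemented as planned", probability=0.15),
    Event("Program design is flawed", probability=0.05),
])

top_p = failure_probability(top)

# Rank the main branches from most to least likely, to prioritize risk reduction.
branches = sorted(top.children, key=failure_probability, reverse=True)

print(f"{top.name}: {top_p:.1%}")
for branch in branches:
    print(f"  {branch.name}: {failure_probability(branch):.1%}")
```

Even with rough guesses as inputs, ranking the branches this way gives the team a starting point for deciding where risk-reduction effort is best spent.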

These are just two of many tools that can help program teams ensure success.  Do you have other favorite tools to use?

Fresh Evaluation Ideas for Spring

As an evaluator, it’s really easy to draft a thousand lines of questioning to capture every nuance in every conceivable outcome that might result from a particular program.  I want to know everything, and I want to understand everything deeply, and so do the organizations I work with.

Yet collecting too much data burdens both program participants and the evaluation team (and in some cases can change how participants respond to particular items – see details here). The hard part of evaluation work is distilling the goals down to their essence and choosing the highest impact measures.

This is especially important in situations where frequent measures are needed because services are adapting to meet changing needs.  A recent NPR story illustrates this, also showing how technology makes continuous monitoring possible.  The story points out that organizations providing aid in disasters often decide what to do at the outset, and then don’t have much information about how it went until they evaluate it formally after the program ends.  But in a few recent cases, including the international response to the Ebola epidemic, a group has implemented a short survey administered weekly by text messages to cell phones of residents in affected communities.  The survey measures only around five key objectives of the Ebola response (e.g., impacts on travel, trust in communications, etc.), and because it is implemented weekly, it provides ongoing updates on progress toward the desired objectives.  The data helps steer the program activities to best meet the most pressing needs.

It is both useful and inspiring to learn about thoughtful, creative solutions that other evaluators have developed to help organizations reach their goals.  We’re always looking to learn and grow so that we can best serve organizations who are themselves continuously improving as they work to make the world a better place.


The Power of Ranking

One of the fun tools of our trade is index development.  It’s a way to rank order things on a single dimension that takes into account a number of relevant variables.  Let’s say you want to rank states with respect to their animal welfare conditions, or rank job candidates with regard to their experience and skills, or rank communities with respect to their cost of living.  In each of these cases, you would want to build an index (and indeed, we have, for several of those questions).

Index-based rankings are all the rage. From the U.S. News & World Report ranking of Best Colleges to the Milken Institute’s Best Cities for Successful Aging, one can find rankings on almost any topic of interest these days. But these rankings aren’t all fun and games (as a recent article in The Economist points out), so let’s take a look at the stakeholders in a ranking and the impacts that rankings have.

  1. The Audience/User. Rankings are a perfect input for busy decision makers.  They help decision makers maximize their choices with very little effort.  As such, they influence behavior, driving decisions about where to apply to college, whom to hire, where to go on vacation, where to move in retirement, and so on.  But if the rankings are based on different variables than are important to the users, users can be misled.
  2. The “Ranked”. For the ranked, impacts reflect the collective decisions of the users.  Rankings impact colleges’ applicant pools, cities’ tourism revenues, and local economies.  And on the flip side, rankings influence the behavior of those being ranked who will work to improve their standing on the variables included in the index.  As the old adage goes, “what gets measured gets done.”
  3. The “Ranker”. The developer of the index holds a certain amount of power and responsibility. There are both mathematical and conceptual competencies required (in other words, it’s a bit of a science and an art). The developer has to decide which variables to include and how to weight them, and those decisions are often based on practical concerns as much as or more than on relevance to the goal of the measurement. (There is usually a strong need to use existing data sources and data that is available for all of the entities being ranked.) Selecting certain variables and not others to include in the index can have downstream impacts on where ranked entities focus their efforts for improvement, even when those included variables were chosen for expediency rather than impact.

To illustrate, I built an index to rank “The Best Coffee Shops in My Neighborhood.”  I identified the five coffee shops I visit the most frequently in my neighborhood and compiled a data set of six variables: distance from my home, presence of “latte art,” amount of seating, comfort of seating, music selection, and lighting.

My initial data set is below.  First, take note of the weight assigned to each variable.  Music selection and seating comfort are less important to my ranking than distance from home, latte art, amount of seating, and lighting.  Those weights reflect what is most important to me, but might not be consistent with the preferences of everyone else in my neighborhood.

Index Table

Next, look at the data.  Distance from home is recorded in miles (note that smaller distances are considered “better” to me, so this will require transformation prior to ranking).  Latte art is coded as present (1) or absent (0).  This is an example of a measure that is a proxy for something else.  What is important is the quality of the drink, and the barista’s ability to make latte art is likely correlated with their training overall – since I don’t have access to information about years of experience or completion of training programs, this will stand in instead as a convenience measure.  Amount of seating is pretty straightforward.  Shop #5 is a drive-through.   Seating comfort is coded as hard chairs (1) and padded seats (2).  Music selection is coded as acceptable (1) and no music (0).  Lighting is coded as north-facing windows (1), south-facing windows (2), and east- or west-facing windows (3), again, because that is my preference.

After I transform, scale, aggregate, and rank the results, here is what I get.

Index Table 2

These results correspond approximately with how often I visit each shop, suggesting that these variables have captured something real about my preferences.

Now, let’s say I post these rankings to my neighborhood’s social media site and my neighbors increase their visits to Shop #2 (which ranked first). My neighbors with back problems who prefer hard chairs may be disappointed with their choices based on my ranking. The shop owners might get wind of this ranking and will want to know how to improve their standing. Shops #3 and #5 might decide to teach their employees how to make latte art (without providing any additional training on espresso preparation), which would improve their rankings, but would be inconsistent with my goal for that measure, which is to capture drink quality.

With any ranking, it’s important to think about what isn’t being measured (in this example, I didn’t measure whether the shop uses a local roaster, whether they also serve food, what style of music they play, what variety of drinks they offer, etc.), and what is being measured that isn’t exactly what you care about, but is easy to measure (e.g., latte art).  These choices demonstrate the power of the ranker and have implications for the user and the ranked.
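For readers who want to see the mechanics, the transform, scale, aggregate, and rank steps described above can be sketched in a few lines of Python. This is not the actual calculation behind my tables: the shop data and the weights below are hypothetical stand-ins (chosen only to respect the coding scheme and the relative importance described earlier), and min-max scaling with a weighted average is just one common way to build an index.

```python
# Hypothetical weights: comfort and music count less than the other four variables.
WEIGHTS = {"distance": 2, "latte_art": 2, "seats": 2,
           "comfort": 1, "music": 1, "lighting": 2}
LOWER_IS_BETTER = {"distance"}  # shorter distances should score higher

# Hypothetical data for five shops (not the values from the original tables).
shops = {
    1: {"distance": 0.5, "latte_art": 1, "seats": 30, "comfort": 1, "music": 1, "lighting": 2},
    2: {"distance": 0.3, "latte_art": 1, "seats": 25, "comfort": 2, "music": 1, "lighting": 3},
    3: {"distance": 1.0, "latte_art": 0, "seats": 15, "comfort": 2, "music": 1, "lighting": 1},
    4: {"distance": 0.8, "latte_art": 1, "seats": 10, "comfort": 1, "music": 0, "lighting": 2},
    5: {"distance": 1.5, "latte_art": 0, "seats": 0, "comfort": 1, "music": 0, "lighting": 1},
}

def min_max(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def build_index(data, weights, lower_is_better):
    """Return {shop: weighted average of scaled variables}, each between 0 and 1."""
    ids = list(data)
    totals = {s: 0.0 for s in ids}
    for var, weight in weights.items():
        scaled = min_max([data[s][var] for s in ids])
        if var in lower_is_better:
            scaled = [1.0 - v for v in scaled]  # transform so that higher is better
        for s, v in zip(ids, scaled):
            totals[s] += weight * v
    total_weight = sum(weights.values())
    return {s: t / total_weight for s, t in totals.items()}

index = build_index(shops, WEIGHTS, LOWER_IS_BETTER)
ranking = sorted(index, key=index.get, reverse=True)
print("Ranking:", ranking)

# What happens if Shop #3 adds latte art without improving drink quality?
gamed = {s: dict(v) for s, v in shops.items()}
gamed[3]["latte_art"] = 1
gamed_index = build_index(gamed, WEIGHTS, LOWER_IS_BETTER)
gamed_ranking = sorted(gamed_index, key=gamed_index.get, reverse=True)
print("Ranking after Shop #3 adds latte art:", gamed_ranking)
```

With these made-up numbers, Shop #3 moves ahead of Shop #4 just by changing the proxy measure, which is exactly the kind of side effect a ranker needs to anticipate.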

Perhaps next we’ll go ahead and create an index to rank Dave’s top ski resorts simultaneously on all of his important dimensions.

What do you want to rank?