Garbage In, Garbage Out: Get Good Data for Better Impact

Want to know if what you're doing is working? You need good data for that.

Ahhh, data quality. It’s not everyone’s idea of a good time and possibly a topic boring enough to make people do something crazy, like choosing to finish their taxes instead of reviewing a data quality report. Anyone else relate?

Rest assured, though, that today we’re not talking about data definitions and coding guidelines that often find themselves stuffed in the annexes of papers and how-to PDFs.

Good-quality data starts well before we need an analyst clicking away in R to twist, slice, and code our raw data into a standardized, analyzable data set.

So before I completely lose you, consider this analogy from Naked Statistics:

Most statistics books assume that you are using good data, just as a cookbook assumes that you’re not using rancid meat and rotten vegetables. But even the finest recipe isn’t going to salvage a meal that begins with spoiled ingredients. So it is with statistics; no amount of fancy statistics can make up for fundamentally flawed data (p. 111).

That makes you think a little differently about that beautiful green gazpacho, right?

With that in mind—including images of questionable gazpacho—how do we get data that’s actually “good”? And how do we avoid forcing our analysts to make beautiful dishes (graphs and reports) that disguise rancid ingredients (meaningless data)?

“Good” data has of course been frameworked and workshopped over and over again, but I like these two guiding concepts to kick things off:

  • Validity—the degree to which a tool measures what it is intended to measure.
  • Reliability—the consistency and stability of a measurement across time, items, and/or raters.

You can do a deeper dive on both here, but I want to look at what this means in practical terms for our work. These things sound simple, but a lot of people still skip over them.

Before we go any further, if you need a TL;DR: "good data" is a product of every step of your planning, design, and implementation, and it starts well before you set out to clean and analyze any actual data. (The foundations for good data are arguably more important than how it's analyzed in the end.)

Practical Steps for Better Data
  1. Get an excellent understanding of what should be collected in the first place. It’s wild how many people mentally skip this step. This means having a clear theory of change or hypothesis with well-thought-out measures, remembering that data points can change with iterative processes and feedback loops. Don’t fear a lit review to make sure you’re on the right track, because chances are peers have tried similar things. Learn from them. And don’t forget to engage stakeholders throughout these planning and review stages through things like focus groups and consensus-seeking activities.

  2. Adapt your methods to your setting. You want meaningful data? Think about accessibility across all planes. For example:

    Do paper-based instruments make more sense than internet-connected and/or electronic ones?

    Am I seeking information from a group of people that may have difficulty reading survey questions?

    What language(s) will I need in order to communicate with a representative sample of all stakeholders?

  3. Do your due diligence in ensuring trust exists between your team and the group you’re seeking information from or about. It doesn’t matter what you collect if people weren’t honest or simply didn’t engage because they didn’t feel comfortable or didn’t trust the process.

  4. Don’t skimp on training for the people who will be collecting data, especially if anything is interview-based. Give these members of your team a chance to give feedback on the process and tools, and pilot as much as possible. If they aren’t comfortable with the tools then you’re not going to get great data.

  5. Plan, review, edit, review, and edit some more. Being brutally critical and unafraid to hit the delete button when something seems off is a skill and a gift. The more critical you can be in the planning and process development phase the better all the above actions will turn out.

If you’re able to incorporate the activities above, then other commonly referenced data quality dimensions will fall into place more easily.

Think:

  • timeliness of data
  • consistency with other sources
  • completeness
  • lack of duplication in your data set
  • (and, of course, validity and reliability)
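Dimensions like completeness and duplication are also the easiest to check automatically. A minimal sketch, assuming your data set is a list of records (the field names and values here are hypothetical):

```python
# Minimal automated checks for two data quality dimensions:
# completeness and duplication. Records are hypothetical.
records = [
    {"id": 1, "site": "A", "score": 12},
    {"id": 2, "site": "B", "score": None},  # incomplete record
    {"id": 2, "site": "B", "score": None},  # exact duplicate
]

# Completeness: share of records with no missing values
complete = [r for r in records if all(v is not None for v in r.values())]
completeness = len(complete) / len(records)

# Duplication: number of records beyond the first copy of each
unique = {tuple(sorted(r.items())) for r in records}
n_duplicates = len(records) - len(unique)

print(f"Completeness: {completeness:.0%}, duplicates: {n_duplicates}")
```

Even a crude report like this, run early and often, surfaces rot long before the analysis stage.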

And remember: perfect data doesn’t exist, so don’t sacrifice the good for something unattainable. Consider what you need most to answer your evaluation, QI, or research questions and go for that.

One issue might be that your data set is kind of incomplete, but in my opinion, incomplete > complete garbage. 5 variables with valid and reliable data are better than 40 variables that are junk.

Once these practices have been adopted and you’ve got a solid data set, THEN you can hand it over to your analysts to clean, slice, and graph for that social-media-worthy final product.

How often are you thinking about data quality in this way?