The Imperative to Increase Data Literacy

Cleveland Data Days

Thank you for joining me at Cleveland Data Days to learn about the importance of data literacy. Here are a few resources to help you grow your own data literacy and your team’s. You can download the full slides from the presentation here:

Interested in carrying out a data audit? A data audit will help you document the data you own or have access to. At a minimum, you should be able to describe each dataset, where it comes from or who in the organization manages it, and the actual fields it contains. Here is a simple Excel template you can download and fill out for each dataset in your organization:
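If you’d rather build your own starting point, here is a minimal sketch that writes a starter audit sheet as a CSV you can open in Excel. The column names are illustrative assumptions on my part, not the fields from the downloadable template:

```python
import csv

# Illustrative audit columns -- adapt these to your organization.
AUDIT_COLUMNS = [
    "Dataset name",
    "Description",
    "Source / owner in the organization",
    "Update frequency",
    "Fields contained (name, type, meaning)",
]

with open("data_audit_template.csv", "w", newline="") as f:
    csv.writer(f).writerow(AUDIT_COLUMNS)

print("Wrote data_audit_template.csv -- fill out one row per dataset.")
```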

How much do you know already about key data literacy topics? Challenge yourself with the following questions, or pull your team together for a great data socialization activity as you review the questions and discuss the answers!

What’s the difference between a mean and a median?

Both are types of averages, or measures of the center of a set of values. But they are calculated differently, and so the distribution, or spread, of the values affects them differently. A mean adds up all the values and divides by the number of values, so one extra-big or extra-small value will have a big impact on the final mean. A median, on the other hand, lines up all the values in order of size and then finds the middle value. It doesn’t matter at all how far apart the values are, or how big or small the values toward either end get.

Medians are good for things with a few extreme values, like household income. Means are good when every value should matter, like the average weight of goods in a container.
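Here’s a quick sketch of that difference, using made-up household incomes with one extreme value:

```python
from statistics import mean, median

# Hypothetical household incomes -- note the single extreme value.
incomes = [42_000, 48_000, 55_000, 61_000, 67_000, 2_500_000]

print(f"Mean:   ${mean(incomes):,.0f}")    # pulled far upward by the outlier
print(f"Median: ${median(incomes):,.0f}")  # just the middle of the ordered values
```

The mean comes out around $462,000, which describes almost nobody on the list; the median of $58,000 is much closer to a typical household.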

If eating bacon gives you a two-fold increase in your risk of colon cancer, should you stop eating it?

This seemingly simple question includes several key data literacy concepts. First is the technical distinction between “relative risk increase” and “absolute risk increase.” A two-fold increase can seem like a lot – but it’s being measured as a comparison to whatever the risk was before. Knowing what the absolute risk is now matters far more in deciding whether it’s truly a danger you care about: if your original risk of colon cancer was 1 in 4,000, doubling it sends it to 1 in 2,000. Is that worth skipping bacon?
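Here is a minimal sketch of that arithmetic, using the illustrative 1-in-4,000 baseline from above (these are not real colon cancer statistics):

```python
baseline_risk = 1 / 4_000    # absolute risk before
relative_increase = 2.0      # "two-fold increase" = doubling

new_risk = baseline_risk * relative_increase

print(f"Old risk:          {baseline_risk:.4%}")             # 0.0250%
print(f"New risk:          {new_risk:.4%}")                  # 0.0500%
print(f"Absolute increase: {new_risk - baseline_risk:.4%}")  # 0.0250%
```

The same “two-fold increase” would feel very different if the baseline risk were 1 in 10 instead.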

The second major data literacy topic is present in that word, “worth”. There is a value judgment here. A big part of being data literate is understanding how to use data in assessing things that don’t have a single ‘right’ answer. To know if avoiding bacon is worth eliminating a two-fold increase in colon cancer risk, you don’t just need to understand the technical parts of that risk assessment. You also need to know how much it matters to you not to get cancer, how much you like bacon, and what other risks you want to take in your life.

How much should you trust the results of a survey of your donors?

Ah, surveys! Any time your information is based on a SAMPLE of the full population you care about (as in, you haven’t talked to every single one of your donors, but rather a sampling of them), you need to consider a number of factors that influence how much to trust the result.

First and foremost, you want to know how the sample was selected. After all, one goal of a survey is to use it to understand what EVERYONE would say without having to talk to everyone (because no one has the time or money for that). But in order to extrapolate from a survey – to apply the findings to the whole group – you need to know that everyone in the group is fairly represented. You can do this by deliberately picking people so that the important characteristics of the whole group are reflected in the sample (a stratified sample). Or you can do it by making sure that everyone has the same chance of being included (a random sample). If neither of these things is true, then your sample is biased, and you can’t really use it to know what the whole group thinks.
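Here is a small sketch of the random option, assuming a hypothetical donor list (in practice this would come from your donor database):

```python
import random

# Hypothetical donor list -- in practice, pulled from your CRM.
donors = [f"Donor {i}" for i in range(1, 1_001)]

random.seed(42)  # fixed seed so the example is reproducible

# A random sample: every donor has the same chance of being chosen.
sample = random.sample(donors, k=50)

print(sample[:5])  # the first few donors selected for the survey
```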

Next, you want to know how MANY people were in the sample. Were only 10 donors contacted? Or 100? All else being equal, more people means better-quality data – but with diminishing returns, because the margin of error shrinks roughly with the square root of the sample size.
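You can see those diminishing returns with the standard margin-of-error formula for a proportion at 95% confidence, sketched below (this assumes a simple random sample and the worst-case 50/50 split):

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion, simple random sample."""
    return z * sqrt(p * (1 - p) / n)

for n in [10, 100, 400, 1_000, 10_000]:
    print(f"n = {n:>6,}: ±{margin_of_error(n):.1%}")
```

Going from 10 to 100 respondents shrinks the margin of error from about ±31% to about ±10%, but going from 1,000 to 10,000 only moves it from about ±3% to ±1% – that’s the “up to a point.”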

Lastly, you want to know: are these the people whose opinions matter most? Or are there other opinions that should also be included?

You’ve started a new morning routine of breakfast with coffee, and now you’re getting headaches. How do you know the coffee is to blame?

Most of us understand that we would first need to make sure that we do in fact get headaches when we drink coffee. In this case, that appears true, because the headaches arrive after a breakfast that includes coffee. A persnickety scientist would point out that a breakfast of ONLY coffee would strengthen the evidence against coffee, since it removes the food itself as a suspect.

But there’s a second half to this issue that most people miss, because we usually look only for things that CONFIRM what we suspect or believe. What if your new headaches are actually being caused by your new habit of eating first thing in the morning? In order to pin it on the coffee, you need to look for evidence that could DISPROVE your theory. In other words, you would need to have breakfast without coffee and watch for a headache. If one shows up even though you skipped your morning cup of joe, you have strong evidence that the coffee isn’t the culprit.
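If you wanted to run that little experiment on yourself, a sketch like this (with made-up diary entries) shows the comparison that matters – headache rates on coffee mornings versus no-coffee mornings:

```python
# Hypothetical breakfast diary: (had_coffee, got_headache) for each morning.
mornings = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, True), (False, False), (False, True),
]

def headache_rate(entries, with_coffee):
    days = [headache for coffee, headache in entries if coffee == with_coffee]
    return sum(days) / len(days)

print(f"Headache rate WITH coffee:    {headache_rate(mornings, True):.0%}")
print(f"Headache rate WITHOUT coffee: {headache_rate(mornings, False):.0%}")
```

In this made-up diary the rates come out identical (75% either way) – exactly the kind of disconfirming evidence that points away from coffee, and exactly the kind we tend not to collect.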

Our avoidance of data that might disprove our theories is hugely problematic. This tendency is called confirmation bias, and it’s why stereotypes persist and why newscasters get away with cherry-picking stats.