“Be the Change: Removing Data Biases to Achieve Better Health Care Outcomes” is one of our panels at the RampUp: Worldwide Virtual Summit. Scheduled for 12:00-12:45 pm PDT on September 30, 2020, this session explores data bias and why it’s especially harmful in health care research.
We connected with Ellen Houston, a speaker on our panel and Managing Director at Civis Analytics, and her colleague Henry Hinnefeld, Lead Data Scientist, to dive deeper into the topic for marketers. Read the Q&A to learn about the different forms of data bias and how you can actively mitigate potential negative impacts on your campaigns.
What are different types of bias?
Before diving into the sources, it’s important to understand two distinct forms of bias:
- Non-representativeness: the data is skewed in a way that doesn’t match the real world. For example, a customer data set might include only loyalty members, who shop more frequently and spend more than the broader customer base.
- Unfairness: the data is skewed by inequities or biases in the real world, such as the gender bias of the English language.
What are the common sources of data bias?
There are several common sources of bias. Often, non-representativeness stems from how the data is collected, while unfairness stems from existing inequalities in our society.
A few more specific examples:
- Uneven Data: When data is collected or models are trained on skewed sources, the results of any model or analysis will carry that bias forward. In a recent example, a program created to de-pixelate images produced some very skewed results.
- Unintended Correlations: Geographic variables, for example, tend to correlate strongly with many socio-economic variables. This can lead to models that rely heavily (although perhaps not intentionally) on race or income.
- Biased Collection: How data is collected and how a population is sampled can also create bias. For example, say a company wanted to measure the impact of its in-store promotions and, as part of that research, conducted in-store interviews. If the interviews were completed only on weekdays, large groups of consumers who work full-time would have been excluded from the sample.
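To make the weekday-only interview example concrete, here is a minimal sketch in Python. The population sizes and the visit probabilities are invented for illustration; the point is only that the sampling schedule, not the population, drives the skew:

```python
import random

random.seed(0)

# Hypothetical population: 60% of shoppers work full-time.
# Assume (for illustration) full-time workers are much less likely
# to be in the store on a weekday.
population = (
    [{"full_time": True, "weekday_visitor": random.random() < 0.2} for _ in range(6000)]
    + [{"full_time": False, "weekday_visitor": random.random() < 0.7} for _ in range(4000)]
)

# A weekday-only interview schedule can only reach weekday visitors.
weekday_sample = [p for p in population if p["weekday_visitor"]]

pop_share = sum(p["full_time"] for p in population) / len(population)
sample_share = sum(p["full_time"] for p in weekday_sample) / len(weekday_sample)

print(f"Full-time share in population: {pop_share:.0%}")
print(f"Full-time share in weekday-only sample: {sample_share:.0%}")
```

The interview sample badly under-represents full-time workers even though no one set out to exclude them; the collection schedule did it on its own.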
How does data bias affect marketing campaigns?
Marketing activities are subject to the same data and societal biases as data science. It’s important to remember that even today’s advanced algorithms and targeting methods are not perfect. As an illustrative exercise, go to Google images and enter ‘teacher.’ Now, go to Google images and enter ‘professor.’ Do your results look the same?
Now, imagine you are focusing on a lookalike target for your upcoming campaign. In theory, this makes sense as an efficient tactic. You’ve found success with a group of people and now want more like them. But with the wrong data you may end up in a vicious cycle where you continue to exclude people who were not in your original conversion set. What opportunities could you be missing by not expanding your approach? Who was never given a chance based on your data?
As a marketer, it’s important to not only understand the data powering your partners’ solutions, but to take the time to ask questions. Is the data you used for a seed audience representative? Do you know what information your partners are using to target? Do your conversions over-represent a specific geography or demographic? With this information, you can better account for and prevent bias.
Is an unrepresentative data set automatically unusable?
All hope is not lost. There are steps you can take to mitigate bias within a dataset. You can adjust for bias by duplicating training data for underrepresented groups (upsampling) or dropping data for overrepresented groups (downsampling). Additionally, you can apply weights so that underrepresented groups ‘count’ more in your analysis.
While these approaches may help improve various model metrics and overall data balance, they do not alleviate the underlying issues that generated the biased data in the first place. It’s important to remember that even a representative dataset may still be unfair.
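As a rough illustration, the three mitigation approaches above can be sketched in plain Python. The data, group labels, and target sizes here are invented for the example; this is a sketch of the idea, not a production resampling pipeline:

```python
import random

random.seed(42)

# Hypothetical training set: 90 rows from group A, only 10 from group B.
data = [{"group": "A"} for _ in range(90)] + [{"group": "B"} for _ in range(10)]

def upsample(rows, group, target):
    """Duplicate rows from the underrepresented group (with replacement)
    until it reaches `target` rows."""
    minority = [r for r in rows if r["group"] == group]
    extra = [random.choice(minority) for _ in range(target - len(minority))]
    return rows + extra

def downsample(rows, group, target):
    """Randomly keep only `target` rows from the overrepresented group."""
    majority = [r for r in rows if r["group"] == group]
    rest = [r for r in rows if r["group"] != group]
    return rest + random.sample(majority, target)

def weights(rows):
    """Weight each row inversely to its group's frequency,
    so every group carries equal total weight in the analysis."""
    counts = {}
    for r in rows:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    return [len(rows) / (len(counts) * counts[r["group"]]) for r in rows]

balanced_up = upsample(data, "B", 90)      # now 90 A rows and 90 B rows
balanced_down = downsample(data, "A", 10)  # now 10 A rows and 10 B rows
w = weights(data)                          # each B row counts 9x an A row
```

Note that all three techniques only rebalance the rows you already have; none of them adds information about the people your collection process missed.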
How has the problem of data bias been addressed historically?
Often, data bias has been addressed using sampling methods like those described above. More recently, however, attention has shifted to the question of fairness itself.
Unfortunately, there is no silver bullet measurement that is guaranteed to detect unfairness. Choosing an appropriate definition of model fairness is task-specific and requires human judgement. As a part of your review, make sure you look at the model’s accuracy and predictions, both overall and by key subgroup.
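Here is a minimal sketch of that kind of subgroup review, using invented predictions and a hypothetical urban/rural split. The numbers are made up; the takeaway is that an acceptable overall metric can hide a large gap between subgroups:

```python
# Hypothetical model results: (subgroup, actual outcome, predicted outcome).
records = [
    ("urban", 1, 1), ("urban", 0, 0), ("urban", 1, 1), ("urban", 0, 1),
    ("rural", 1, 0), ("rural", 0, 0), ("rural", 1, 0), ("rural", 0, 0),
]

def accuracy(rows):
    """Fraction of rows where the prediction matched the actual outcome."""
    return sum(actual == pred for _, actual, pred in rows) / len(rows)

overall = accuracy(records)
by_group = {
    g: accuracy([r for r in records if r[0] == g])
    for g in {r[0] for r in records}
}

print(f"overall accuracy: {overall:.2f}")
for g, acc in sorted(by_group.items()):
    print(f"  {g}: {acc:.2f}")
```

In this toy example the overall number looks passable, but the model never predicts a positive outcome for the rural group, which is exactly the kind of pattern an overall metric alone will not surface.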
What are three things marketers can do to prevent or account for bias in data?
- Have a diverse group of people look at the results of your analysis. There’s no substitute for having actual people look at and evaluate your results and conclusions, but you also need different perspectives. If everyone checking your work has very similar backgrounds and life experiences, there will be blind spots.
- Remember to ask, where is your data coming from and who would be excluded? For example, is your consumer data sourced from credit bureaus, voter files, or survey data? Who would not be included?
- Assume there is bias in your data. To paraphrase a line from famed statistician George Box, ‘All datasets are biased; some are useful.’ It is usually impossible to know exactly what biases are present in a dataset, so the next best option is to think carefully about the process that generated the data.
Are you joining us at RampUp: Worldwide Virtual Summit? If you haven’t yet, register now—it’s free!