Recently, J.J. Kardwell, EverString Co-founder and CEO, hosted a cohort of data thought-leaders in our August 2020 Chief Data Officer Roundtable: Generating Business Value Through Data Quality. The virtual event was facilitated by DoGood, an organization that has donated over $750,000 to charitable organizations worldwide, including the American Breast Cancer Foundation, UNICEF, and the YMCA.
J.J.’s recent roundtable discussion included colleagues:
- Rachel Levanger, Director of Data Science at Fidelity National Financial
- Joe DosSantos, Chief Data Officer of Qlik
- Juan Miguel Lavista, General Manager & Lab Director for Microsoft
- Rod Bates, VP of Decision Science and Data Strategy for Coca-Cola
The hour-long deep dive covered the following B2B data-centric topics:
- Main data challenges for organizations today
- Steps teams take to elevate data quality
- Ways to balance accessibility and risk
- How to ensure the highest quality business data
Below is a recap of the discussion, including a link to watch the recorded discussion on-demand.
First, let’s acknowledge the 5 charities we’re raising money for today
The mission of St. Jude Children’s Research Hospital is to advance cures, and means of prevention, for pediatric catastrophic diseases through research and treatment. Consistent with the vision of our founder Danny Thomas, no child is denied treatment based on race, religion, or a family’s ability to pay. Learn more about St. Jude
Global Idea School is an independent, non-profit, bilingual (English-Spanish) elementary school with a strong focus on socio-emotional growth and STE(A)M©(S). It provides child-centered and hands-on education through a cross-disciplinary curriculum and a learning-teaching approach based on HighScope© and Responsive Classroom©, and Círculo Mágico© (Magic Circle). Learn more about Global Idea School
Direct Relief is a humanitarian aid organization, active in all 50 states and more than 80 countries, with a mission to improve the health and lives of people affected by poverty or emergencies – without regard to politics, religion, or ability to pay. Learn more about Direct Relief
Today, 1 in 5 climbing areas in the United States are threatened—whether it’s private land lost to development, public land managers over-regulating climbing, or climber impacts degrading the environment, the list of threats is long and constantly evolving. Access Fund is on a mission to keep climbing areas open and conserve the climbing environment. Learn more about Access Fund
The American Breast Cancer Foundation (ABCF) mission is to provide education, access, and financial assistance to aid in the early detection, treatment, and survival of breast cancer for underserved and uninsured individuals, regardless of age or gender. This is achieved, in part, by the Breast Cancer Assistance Program (BCAP), the Community Partnership Program, and the newly designed Community Advocacy Program. The BCAP program and the Community Partnership Program are time-honored and tried programs linking patients with facilities and assistance in their own areas. Learn more about the American Breast Cancer Foundation (ABCF)
How are you finding work-life balance amidst the pandemic?
To kick things off, the group shared brief introductions including one thing they’ve accomplished or are focused on during the pandemic.
- For J.J., the pandemic has helped him step outside of his exercise routine and enjoy the great outdoors, running local trails, and bicycling with his family.
- Joe recalls being creative with dinner. Once restaurants were cut off, he and his family explored a lot of fun new internet recipes.
- Rod realized his step count really changed so he focused on being more aware of his distance walked each day.
- Rachel has explored her love of culinary arts and gardening, getting back to the basics in a fun new way.
- Juan learned the joys of Washington State, exploring outdoors with his family instead of running from event to event on the weekends.
What are your biggest data challenges and the downstream effects?
Overall, some of the biggest challenges from a data standpoint centered around 3 themes:
- Balancing accessibility vs. risk
- Measuring data asset values and investments
- Keeping data up-to-date
BALANCING RISK VS. ACCESSIBILITY
In tech, security and accessibility will always be at odds. With data management, there exists the same trade-off between availability and risk. As Joe DosSantos mentioned, “to be world-class in data, you must be world-class in data availability.” Yet, many organizations struggle with how much, when, and how to achieve this elusive balance.
Tremendous business value comes from combining siloed data within the organization. However, it’s difficult to innovate while the ground is changing underneath you. In some cases, teams are waiting for the tools to catch up as well. Other industries are considering whether they should give citizen data scientists access to data.
Within the healthcare space, there exists incredible potential value in encouraging more open data sharing to further advance research, learning, and medical insights. But organizations need the right incentives to share. In academia, you may work hard to develop a massive data set but might question if there’s any benefit to sharing it, since you not only miss out on credit for the work but also potentially open yourself up to privacy concerns.
MEASURING DATA VALUE & INVESTMENT
Within an organization, teams will be at varying stages of sophistication in data collection, and each tends to think only about its own specific use cases when capturing and managing data. The challenge becomes considering which groups will be impacted by data acquisition and how the data will be put to work, then evaluating the needs from there and deciding how much to invest in which areas.
For mature industries, so much consideration goes into valuing each component or asset. However, with data, the value scale and the measurement of value are still a fairly new frontier.
Often, users will ask for a data set and then won’t need it again for a while, only adding to its staleness. With first, second, and third-party data, it’s important to establish values that help you consider how to invest in each of those data sets and what level of quality to expect.
KEEPING DATA CLEAN
As Rachel pointed out, “When a data set is copied, that’s where we risk data becoming stale.” Depending on how critical that data set is, her team decides what level of automation is needed to keep data fresh and up-to-date.
Originally, IT controlled everything, and that was only data. The second wave was more of a free-for-all. The third wave will be about governing raw data to provide the nourishment for these machine learning models.
When machine learning technologies converge without the proper oversight, real problems arise, especially at the foundational data layer. This is even more pronounced in regulated industries, such as healthcare and fair lending areas, where data is much more sensitive.
How do you approach data quality?
So much work goes into building data lakes, but what about the bad data swimming around in them? Especially during COVID-19, data is changing constantly as businesses close and employees change jobs, so how do teams keep data clean? What are some ways to optimize data quality in an organization?
For Rod and his team, validating data is a daily challenge. Sometimes data sets are requested but then never used again, so they become stale. To optimize data quality, Rod’s team leans on two solutions:
1. Medal Charts: Gold, Silver, Bronze
- With each new data asset, we decide to what standard it needs to be managed.
- For all data in our data lake, this system helps team members know how much processing and clean up has been done for that data.
- Bronze = raw load of data with no processing (someone requested it, and we dump it in)
- Gold = dedicated IT resources, data checks, load checks, transformations well defined & documented in a wiki
- Helps understand what level of caution to exercise with each data set
2. Data Captains
- Single POC for the business aspect of how that data is used
- Responsible for each data asset & its wiki
- Voice of customer from business user side
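As a rough illustration of how the medal tiers and data captains described above could be recorded in a catalog, here is a minimal Python sketch. It is not Rod's team's actual system: the class names, field names, sample assets, and wiki paths are all hypothetical, and the Silver description is an assumption, since the discussion only spelled out Bronze and Gold.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Bronze and Gold paraphrase the roundtable; the Silver wording is an assumption.
    BRONZE = "raw load, no processing: use with high caution"
    SILVER = "partially cleaned (assumed intermediate level): verify before relying on it"
    GOLD = "checked, transformed, and documented in a wiki"


@dataclass
class DataAsset:
    name: str
    tier: Tier
    captain: str   # single business point of contact for this asset
    wiki_url: str  # hypothetical location of the asset's documentation


def caution_note(asset: DataAsset) -> str:
    """Summarize how much care a user should take with this asset."""
    return f"{asset.name} [{asset.tier.name}] - {asset.tier.value}; contact {asset.captain}"


# Made-up example entries.
catalog = [
    DataAsset("store_visits_raw", Tier.BRONZE, "A. Example", "wiki/store_visits_raw"),
    DataAsset("weekly_sales", Tier.GOLD, "B. Example", "wiki/weekly_sales"),
]

for asset in catalog:
    print(caution_note(asset))
```

Keeping the tier and the captain on the same record means anyone who finds a data set also sees how far to trust it and whom to ask.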
Joe DosSantos also shared that some companies are getting creative with crowd-sourcing. “If you fear people have moved,” Joe said, “ask them. If you are clever, you can get people to volunteer data.”
Juan also shared the important concept of “data on data” (or “data about the data”). If you run stats on how frequently data is accessed, 80-90% of the data in an organization is never accessed. The majority of signals come from a very small set of features, and the focus on those elements needs to be very high. Teams should have clarity on what data matters and then make sure they treat it with the right level of attention. Too often, people treat all data as if it were equally valuable.
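To make the “data on data” idea concrete, here is a minimal Python sketch that computes the share of data sets never accessed and ranks the rest by access count. It assumes you have an inventory of data set names and a simple access log; the function name, inventory, and log entries are made up for illustration.

```python
from collections import Counter


def access_summary(all_datasets, access_log):
    """Return (share never accessed, data sets ranked by access count)."""
    counts = Counter(name for name in access_log if name in all_datasets)
    never = [name for name in all_datasets if counts[name] == 0]
    share_never = len(never) / len(all_datasets) if all_datasets else 0.0
    return share_never, counts.most_common()


# Illustrative, made-up inventory and access log.
datasets = {"orders", "customers", "clickstream", "legacy_2009", "tmp_export"}
log = ["orders", "orders", "customers", "orders"]

share, ranked = access_summary(datasets, log)
print(f"{share:.0%} of data sets were never accessed")
print("Most accessed:", ranked)
```

The ranked list is a starting point for deciding which data sets deserve the most attention.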
How do you manage primary and secondary data sources to deliver on the attributes that matter most?
Rachel and her team rely primarily on a large, third-party data set of public records that flows centrally into their system, offloading the screen scraping and keying of data inputs from the many counties. Unfortunately, the data set is prone to errors, so the challenge becomes how much to push the third party to provide clean data vs. how much Rachel’s team should invest in adding a layer of value on top. The solution: everyone has access, but Rachel also deployed an internal cleansing layer that helps her team offer both data accessibility and data quality at high standards.
Within today’s modern data organization, there is a fundamental tension: how much data to buy in order to a) meet the general, broad sense of our users’ needs, or b) meet very specific needs of specific user groups. How much is realistic to push to someone else? Where is the business advantage, whether the cleansing is performed internally or externally? In analytics, people don’t always understand the difference between big data and fast data. With big data, you are trying to figure out if there is a trend in the big piles. This type of work doesn’t require low latency.
The U.S. Government has quietly been doing a great job making data available for years. Data.gov has become a go-to source for data and will become even more so. Juan also commended the Centers for Disease Control and Prevention (CDC) for the open data sharing that has led to their important work on public data sets, such as information on every child born in the U.S. in the last 25 years, being used to study causes and prevention measures for S.I.D.S. (Sudden Infant Death Syndrome).
Whatever the data use case, teams must assume each data set has a bias, and you need to find that bias before you can rely on that data for strong predictive power. When you don’t know how the data was collected, you don’t know about biases that were introduced in the data collection phase. This is a reminder to focus on the bias that may be in third-party data.
Rod’s team at Coca-Cola focuses on staying informed about innovators and disruptors in the space, nurturing those relationships, and sometimes taking big bets in the space. For example, Nielsen and others have strong market positioning. We work to give them space to innovate and do something new. Making sure we are current on disruptors, including Slice, Numerator, and others, outside of the legacy providers is an important success factor.
EverString is seeking to displace the legacy vendor approach with a product built on core capabilities in natural language processing (NLP), machine learning (ML), and crawling, interlacing technologies that some might perceive as more fragile to produce a better-quality data asset.
The perception has changed over the past 5 years from people thinking that what they’ve bought for 20+ years is as good as it can get. But now, teams are realizing there is a better option and we’re seeing a more open-minded audience. There’s more possible than ever before with machine learning and deep cleaning of data.
What can (and should) be done to improve data quality at the foundation?
With a subset of data being accessed frequently while the rest grows stale, the question becomes: how do you improve the quality of your data at the foundational level?
EverString seeks to help leaders on two fronts: 1) Change what you think is possible with data. 2) Prove it through a comprehensive data test.
Joe pointed out that as data lovers, we can talk for hours about it, but most other business stakeholders don’t really care to talk about data at length. Instead, they focus on quality data that matters. They look for something that measures against key KPIs or other priority metrics that really move the business needle. You must pick your battles.
For example, duplicate customers can be a common problem, especially if incentive plans reward new account creation, instead of growing existing ones. When you’re always chasing the clean-up, you’ll never catch up. This is where the concept of data captains and data stewardship can help.
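As a small illustration of the duplicate-customer problem, the Python sketch below groups customer records whose names collide after basic normalization. It is only one simple approach under stated assumptions: the suffix list, function names, and sample records are hypothetical, and real matching usually needs more signals (address, domain, fuzzy matching) than an exact match on a cleaned name.

```python
import re
from collections import defaultdict

# Assumed list of company suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "co", "corp", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [w for w in cleaned.split() if w not in SUFFIXES]
    return " ".join(words)


def find_duplicates(customers):
    """Group customer IDs whose normalized names are identical."""
    groups = defaultdict(list)
    for customer_id, name in customers:
        groups[normalize(name)].append(customer_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample records: (customer ID, account name).
customers = [
    (1, "Acme, Inc."),
    (2, "ACME Inc"),
    (3, "Globex Corp."),
    (4, "acme"),
]
print(find_duplicates(customers))
```

A report like this gives a data captain a concrete list to review, rather than an open-ended clean-up effort.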
Data sets will always have some amount of problems, but distributing the data is actually one of the best ways to bolster data quality. When you make data available, people start asking questions. Juan encouraged teams to share because “Once data is public, it’s up for scrutiny. People start asking questions. Data quality and reliability improves in the process. If no one is asking questions, then no one cares and that’s a bigger concern. You want people to discuss and debate.”
What are some of the biggest changes on the horizon for data?
In wrapping up the roundtable, the group of data leaders shared their predictions of what’s to come for the data industry:
- Joe: AI and blockchain are separate but in the future, they’ll converge.
- Rod: Data sharing will be huge, as businesses work together and consider the shared value of bringing data sets together. Discovering, measuring, and implementing those partnerships will be a great opportunity.
- Rachel: Technical competency will increase throughout the organization, so departments have the tools to get value out of their data. Prediction: more distributed data, and teams able to make use of it.
- Juan: COVID has helped accessibility, as silos of data are changing. This is tremendous news for the potential of future medical advancements.
In closing, Joe reminded us of a 2013 MIT Journal article stating that the world’s volume of data is doubling every two years and that we use less than 0.5% of it. We should ask ourselves, ‘Why?’ Each of us should think about how those data sets can be employed in the context of our own organization and industry. It takes critical thinking to look at each company generating torrents upon torrents of meaningful information and to introduce that information into other aspects of society for the greater good.
Now, you can watch this Roundtable Discussion on-demand
Register to watch this CDO Roundtable. Hear these thought-leaders directly. Thank you to our sponsor, DoGood, for bringing this event together!