By Christian Prokopp on 2022-06-29
I have worked with data for decades. There are two key lessons I share with every customer, stakeholder and beginner in the field. Firstly, follow the data, not (only) your gut or experience. Secondly, never trust the data. Why, and what does this have to do with Luddites?
The first point is well-examined and self-evident to anyone who has worked with data, so I will not dwell on it. In short, experience and gut are helpful to spark ideas, explore new avenues or unearth causalities, but metrics and hard data validate and steer us to valuable, measurable outcomes. So why should you not trust the data?
The problem with data is that it is rarely correct, especially if it comes from data mining, external parties, user input, or legacy and complex systems. Such data suffers from broken processes, late delivery, fat-finger input, broken Unicode characters, localisation issues and sometimes outright lies.
For example, if you mine a marketplace like Amazon, you can expect both innocent and malicious inputs from users. Which reviews are genuine, and which are bought? Are the GTINs (product identifiers) correct, merely copied from another (original?) product, or faked to fill a required field?
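One cheap technical check for the GTIN case above is the standard mod-10 check digit that GTINs carry. It cannot tell you whether a code was copied from another product, but it rejects typos and made-up filler values. A minimal sketch (the function name is mine, not from any particular library):

```python
def gtin_is_valid(gtin: str) -> bool:
    """Validate a GTIN (8/12/13/14 digits) via its mod-10 check digit.

    A passing check digit does not prove the code is genuine -- a value
    copied from another product still validates -- but it cheaply
    rejects typos and invented filler values.
    """
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    # Right-align to 14 digits so the 3,1,3,1,... weight pattern lines up.
    digits = [int(c) for c in gtin.zfill(14)]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(digits))
    return total % 10 == 0
```

Running every scraped identifier through a check like this before it enters your pipeline is the kind of low-effort validation that separates "we ingested the field" from "we trust the field".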
I have countless stories of data assumed to be correct that turned out to be wrong. Big data from legacy or complex systems is a common source. I remember building a new analysis with a team on top of a data source, only to discover that its transaction data was partially missing. The cause was a small mistake in how the nightly batch processes synchronised, overlooked by inexperienced staff: a silent error.
The true source of the issue was that the people in charge of the data movement were not the same as those preparing the analysis, who in turn were not the same as those using the data. It is also an excellent argument for cross-functional agile product development, but that is another topic. Only when we checked the data and the results in detail, diving in 'manually', slicing it every possible way and walking up and down the processing chain, did the issue and its pattern become obvious. A small number of transactions got lost and underreported sales; every day!
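A silent loss like this is exactly what a simple reconciliation check catches: compare per-day record counts between the source system and the analytics store, and alert on any divergence. A hypothetical sketch of the idea (function and variable names are mine):

```python
from datetime import date


def reconcile_daily_counts(
    source_counts: dict[date, int],
    analytics_counts: dict[date, int],
    tolerance: int = 0,
) -> list[tuple[date, int, int]]:
    """Return (day, source_count, analytics_count) for days that diverge.

    A persistent negative delta on the analytics side is the signature
    of transactions silently dropped during nightly synchronisation.
    """
    mismatches = []
    for day in sorted(set(source_counts) | set(analytics_counts)):
        src = source_counts.get(day, 0)
        dst = analytics_counts.get(day, 0)
        if abs(src - dst) > tolerance:
            mismatches.append((day, src, dst))
    return mismatches
```

The point is not the ten lines of code; it is that the check runs automatically, compares independent ends of the pipeline, and turns a silent error into a loud one.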
Another example is a product feed from a retailer, passed to a service company to create an advertising feed for a search platform. As part of my work at Bold Data, I dug through the final feed 'manually' and found issues immediately. The retailer was paying another company to ingest and prepare its data, yet the result was broken in various ways, likely hurting the final advertising performance: a potentially costly error from a paid-for service.
But before judging these companies, think about the number of processes, data sources, feeds, stores and humans involved in your data processes. Can you confidently say that they all have automated metrics and robust checks to validate them technically and logically? Do you have competent staff with the time and incentives to check these data flows to harden them or find issues?
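The "automated metrics and robust checks" the questions above point at need not be elaborate. A minimal sketch of technical validation (required fields, plausible values) for a product feed; all field names here are invented for illustration:

```python
def check_feed(rows: list[dict]) -> list[str]:
    """Return human-readable error strings for a product feed.

    Technical checks only; logical checks (e.g. totals matching the
    source system) would sit alongside these.
    """
    errors = []
    if not rows:
        errors.append("feed is empty")
    for i, row in enumerate(rows):
        if not row.get("product_id"):
            errors.append(f"row {i}: missing product_id")
        price = row.get("price")
        if not isinstance(price, (int, float)) or price <= 0:
            errors.append(f"row {i}: implausible price {price!r}")
    return errors
```

Wiring such a function into every hand-off between teams and vendors, and failing the pipeline when it returns errors, is far cheaper than discovering a broken feed in the advertising bill.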
Here are a few tips to improve your situation. Nowadays, with the plethora of tools and services offering shiny UIs and automation, it is tempting to plug and play, but it becomes plug and pray if you cannot pop the hood and check for yourself. Pay extra to get skilled staff at all levels: people who can drop onto a console or a query engine, dig through logs, databases and feeds, or slice and dice the data in varying ways.
Have clear communication and collaboration between technical and business staff (again, cross-functional teams are fantastic) to ensure data is also logically correct. Most importantly, do not trust data that a data-driven Luddite has not validated: someone who does not just watch automated tools and high-level metrics, which are essential for smooth operations, but also gets in there and occasionally digs around.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.