By Christian Prokopp on 2022-05-03
Many Amazon marketplace customers know that its huge product catalogue has data quality issues. However, they might expect its top sellers, which they frequently see and buy, to be accurate. Bold Data, which processes hundreds of millions of products daily, has a unique ability to find hidden insights and issues. One example is active Amazon bestsellers whose names are the result of data processing errors.
Amazon serves data on hundreds of millions of products on its websites. Much of it is provided by marketplace sellers, and many of these products are rarely seen or sold. Over the last decade, Bold Data's founder has seen a lot of poor data in this long tail. This included gems like Amazon test data that somehow made it onto the public website.
Last week, while processing the bestsellers of amazon.co.uk, amazon.de, and amazon.com, something peculiar surfaced. Bestsellers are products ranked in the top 100 in at least one product category. We found that four bestsellers had no names. Based on how the Amazon bestseller website presents its data, that should not be possible. What happened?
For people familiar with data or software engineering, a look at the Amazon web pages for the nameless products quickly makes it apparent what happened.
But even if you are not an engineer, you can see that the names in the images sound strange. Computer systems use NULL, NaN, NA, and similar outputs to indicate to a human user that a field or attribute has no data. In simple terms, and making some inferences, the upload into Amazon contained a message of no data, e.g. NULL. Instead of failing, the message was converted into text and wrongly stored as the product name in Amazon's product catalogue. When Bold Data analysed the data, the error reappeared.
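The failure mode inferred above can be illustrated with a minimal Python sketch. It is not Amazon's or Bold Data's actual code. The function names and the list of placeholder strings are assumptions made for illustration, and a real pipeline would need to check that a value such as "NA" is not a legitimate product name.

```python
from typing import Optional

# Hypothetical placeholder strings that usually mean "no data" (an assumption,
# not an exhaustive or authoritative list).
PLACEHOLDER_NAMES = {"", "null", "none", "nan", "na", "n/a"}


def naive_name(raw: Optional[str]) -> str:
    # Bug pattern: str(None) silently yields the text "None" instead of failing,
    # so the "no data" marker ends up stored as if it were a real name.
    return str(raw)


def clean_name(raw: Optional[str]) -> Optional[str]:
    # Defensive variant: treat missing values and placeholder text as "no name".
    if raw is None:
        return None
    name = raw.strip()
    if name.lower() in PLACEHOLDER_NAMES:
        return None
    return name


print(repr(naive_name(None)))   # the missing value has become ordinary text
print(repr(clean_name("NULL")))  # the placeholder is recognised as missing
```

A pipeline that uses something like `clean_name` can then decide explicitly whether to reject, flag, or repair a record without a name, rather than passing the placeholder text downstream.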
You may know the saying "garbage in, garbage out", which computer scientists use. Data engineers and data scientists in particular use it to highlight that if your foundation, the data, is flawed, then so is the outcome. Therefore, experienced data professionals prioritise sourcing and processing accurate data over applying increasingly complex analytics or machine learning algorithms.
While it is unfortunate and surprising that these items made it into the bestsellers without names, it is not as bad as it seems. The dataset analysed comprised 4.88 million products from three Amazon websites: the United States, Great Britain, and Germany. So close to one in a million was wrong, a small number. However, it demonstrates that our systems have to expect and accommodate data errors. Mined data is only as good as its source system.
While the error rate is low, the product's name is a prominent attribute, and other attributes receive less scrutiny. We will publish datasets, analyses and more findings in the future. Be sure to subscribe to our email list so as not to miss these updates.
The challenge described in this post is one of many that data mining and data engineering face daily. The Internet has an abundance of valuable data. Mining and processing data at scale, at low cost and with high confidence and quality, is complex and requires decades of experience. This is precisely what Bold Data has focused on for our customers: creating affordable, reliable datasets and decision support analytics so you can make better decisions daily.
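As a quick check of the "close to one in a million" figure, the arithmetic can be written out in a few lines of Python. The counts come from this post (four nameless bestsellers among 4.88 million products), and the variable names are only illustrative.

```python
nameless_bestsellers = 4        # nameless products found, as reported above
products_analysed = 4_880_000   # size of the analysed dataset

# Scale the raw proportion to "per million products" for readability.
rate_per_million = nameless_bestsellers / products_analysed * 1_000_000

print(f"{rate_per_million:.2f} nameless products per million")
```

The result is roughly 0.82 per million, which matches the rounded statement in the text.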
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.