By Christian Prokopp on 2022-04-25
Public data has an enormous commercial and social impact. For example, in Ukraine, it affects war and peace, and with the Coronavirus, it involves life and death. We must keep public data accessible for the public good.
As TechCrunch reported, the recent challenge by LinkedIn against web scraping of its public content received a blow in the courts. While this is a commercial battle between two companies, the impact of the ruling reaches much further. Besides commercial interests, academia, journalism, research and archiving depend on the ability to scrape public data without limitations.
The impact of open-source intelligence (OSINT) is an example of how essential access to information is. The war in Ukraine demonstrates how amateurs can collect and assemble impressive intelligence that informs reporting and even the defence of the country. Notably, the legal accessibility of data is the foundation of OSINT. It may not have prevented the war, but it certainly helps journalists report on it and investigators analyse possible war crimes. Data has become so essential for reporting that it gave rise to the term data journalism. Who could imagine the COVID reporting of recent years without detailed analytics? Such analytics require vast datasets obtained through data mining and web scraping. For example, "[t]he New York Times has made more than 9.98 million programmatic requests for Covid-19 data from websites around the world".
At Bold Data, we believe public data also plays a vital role in the functioning of markets. Public data can improve decision-making dramatically by helping companies, for example, with competitive pricing, reduced stock and capital costs, better forecasting models predicting demand and much more. Customers and the environment benefit from more efficient markets providing better products and services with less waste.
However, retrieving public data at scale and making it easy to process, interpret, and analyse is complex. Public data is hard to retrieve, and its quality varies widely. For example, source systems' availability and data formats change, and the data contains errors, is rarely complete, may change over time and often includes duplicates. Consequently, data mining and wrangling are out of reach for most companies because of technological or financial hurdles. Moreover, combining data mining efforts across companies and industries has enormous scale and cost benefits for all participants. That is the idea behind Bold Data: to share expertise, technology and economies of scale with our customers for affordable, accessible, data-driven decision-making.
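To make the quality problems described above more concrete, here is a minimal sketch of the kind of cleaning step that scraped records typically need. It is an illustration only, not Bold Data's actual pipeline: the `Product` record, the field names (`sku`, `name`, `price`) and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    # Hypothetical record layout, chosen only for this illustration.
    sku: str
    name: str
    price: Optional[float]  # None when the price is missing or malformed


def parse_price(raw: str) -> Optional[float]:
    """Return the price as a float, or None if the field is empty or malformed."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean(rows: list[dict[str, str]]) -> list[Product]:
    """Normalise raw rows, skip rows without a SKU, and keep one record per SKU."""
    by_sku: dict[str, Product] = {}
    for row in rows:
        sku = row.get("sku", "").strip().upper()
        if not sku:
            continue  # incomplete record: no usable key
        product = Product(
            sku,
            row.get("name", "").strip(),
            parse_price(row.get("price", "")),
        )
        existing = by_sku.get(sku)
        # Keep the first record per SKU, unless a later duplicate adds a missing price.
        if existing is None or (existing.price is None and product.price is not None):
            by_sku[sku] = product
    return list(by_sku.values())


# Made-up sample rows showing duplication, a malformed value and a missing key.
raw_rows = [
    {"sku": "a-1", "name": "Kettle ", "price": "$24.99"},
    {"sku": "A-1", "name": "Kettle", "price": "24.99"},  # duplicate of the first row
    {"sku": "b-2", "name": "Toaster", "price": "n/a"},   # malformed price
    {"sku": "", "name": "Unknown", "price": "5.00"},     # missing key
]
cleaned = clean(raw_rows)
```

Even this toy example has to decide what counts as a duplicate, which copy to keep and how to represent missing values. Multiply those decisions across many sources whose formats change over time, and the effort quickly adds up.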
The most challenging hurdle is the data monopolies held by platforms like Google, Amazon, and Facebook, which lock their users' and customers' data away. Bold Data does not provide copyrighted or restricted non-public data. However, wherever public data is legally available, we collect, clean and analyse it to give our customers the best possible insight to compete on a more level playing field. Over time, we aim to reduce the data monopolies. That is why the continued lawfulness of scraping public data is important for you, our customers and their customers, because it helps create a competitive landscape.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.