By Christian Prokopp on 2024-04-12
For ChatGPT 3.5 and 4, 128k tokens correspond to roughly 96k English words, based on an estimated ratio of 0.75 words per token. For other languages the answer is less straightforward, but we can approximate it via English. Confused? I was. Let me explain.
Large Language Models (LLMs) train on and predict tokens, which are frequent character sequences. You can try it on OpenAI's website. For example, fish is one token, marriage is two tokens, the Portuguese word for fish, peixe, is two tokens, and the Japanese word 魚 is three tokens. The reasons are complex: a combination of how the encodings were trained, with a bias towards English, and how the data are encoded, which works efficiently for English and other Latin-alphabet languages but less well for others. I can recommend this excellent blog post by Anthony Shaw or this research paper for a deeper dive into the topic.
One approximation for how many tokens you need for other languages is comparing information density, i.e. how many more tokens it takes to say the same thing in German or Japanese versus English, which is, for the stated reasons, usually the most efficient. Of course, the translations themselves may introduce biases or verbosity.
Our baseline is 96k words for 128k tokens in English. As mentioned above, we get fewer words and less meaning per token in other languages for technical and historical reasons, so a one-to-one comparison is hard. Let us invent a metric, the English Word Equivalence (EWE): how many English words' worth of meaning can we express in 128k tokens?
Using the above-cited resources, we can make the following English Word Equivalence approximations:
This means that what we can express in 128k tokens in Spanish would take roughly 73k words in English, a factor of 1.32. At the bottom of the list, 128k tokens in Korean express as much as 41k English words, a factor of 2.36. Take this with a grain of salt; it is not a judgment or a measure of a language. Imagine if everything in the early days of electronic computing had been based on Korean, and Korean were the predominant language for LLM encoders. Things would look very different.
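The arithmetic behind these figures is simple. A sketch using the two factors quoted above (1.32 for Spanish, 2.36 for Korean) and the 0.75 words-per-token estimate for English:

```python
# English Word Equivalence (EWE): how many English words' worth of
# meaning fit into a given token budget for a language that needs
# `inflation_factor` times as many tokens as English.

CONTEXT_TOKENS = 128_000
WORDS_PER_TOKEN = 0.75  # estimated English ratio from the article

def ewe(inflation_factor: float, tokens: int = CONTEXT_TOKENS) -> int:
    english_words = tokens * WORDS_PER_TOKEN  # 96,000 for 128k tokens
    return round(english_words / inflation_factor)

print(ewe(1.0))   # English baseline: 96,000 words
print(ewe(1.32))  # Spanish: ~73k words
print(ewe(2.36))  # Korean: ~41k words
```

The function names and structure here are illustrative; the article only supplies the ratio and the two factors.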
Machine Learning and LLMs are rightfully under scrutiny for bias. This highlights how underlying technologies going back many decades, combined with more recent biases, e.g. English-speaking companies training models on English-centric datasets, affect cost profiles for everyone, on top of performance and outcomes.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.