Deep Dive: Why Can't LLMs Count the r's in Strawberry?
Hello, Dear Nerds,
Imagine this: OpenAI releases a new, as they put it, "state-of-the-art" model - GPT-X. They announce that it can beat any PhD in their own field, solve math, and write code better than any engineer. The whole internet fills up with posts that "a new era of AI" is coming and we will all be replaced.
Within 24 hours, you can open LinkedIn or Twitter and see millions of posts about LLMs' miserable failure to count letters in words. It can be counting the `r`s in `strawberry`, or just counting letters in any other word.
But why? Why can't such an amazing technology handle really simple things? I would like to spend today's deep dive finding the answer: why can LLMs do so many amazing things, yet can't count letters? That way, at the next LLM release, you won't look like a caveman trying to hammer nails with a microscope.
Note: This deep dive will not touch on different LLM architectures; it is only meant to explain how models work at a big-picture level. Links for deeper understanding are provided at the bottom of this deep dive.
Tokens
So, everyone is talking about tokens. Any LLM you check will have something like a max token window, a price for input tokens, and a price for output tokens. But what is a token, actually?
A token is a chunk of text that the model understands. And yes - to the model, it is not text. Computers work with numbers. As always, there are different algorithms for generating tokens. Models like OpenAI's use Byte-Pair Encoding, and you can find more details about it here.
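To make the core idea of Byte-Pair Encoding concrete, here is a minimal sketch of a single merge step: find the most frequent adjacent pair of tokens and replace it with a new token. This is my own toy illustration (the function names and the word `banana` are not from any real tokenizer), not OpenAI's actual implementation, which repeats this step many thousands of times over a huge corpus.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Byte-level BPE starts from raw bytes, so our initial "tokens" are 0-255.
tokens = list("banana".encode("utf-8"))   # [98, 97, 110, 97, 110, 97]
pair = most_frequent_pair(tokens)         # (97, 110) - the bytes of "an"
merged = merge(tokens, pair, 256)         # [98, 256, 256, 97]
print(merged)
```

After one merge, the six bytes of `banana` became four tokens; a real tokenizer keeps merging until it hits a fixed vocabulary size.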
But first, let's dig into tokenization with a simple word.
Let's take the word `nerd`. For a computer to understand it, we need to convert it into binary code - only 0s and 1s.
`nerd` = 01101110 01100101 01110010 01100100
Nothing surprising, right? We just represent every character in binary format. Each group of 8 bits is a byte, which (converted from binary) represents a number from 0 to 255.
So now you can replace this string with a list of numbers (which, in this case, are just the decimal codes from the ASCII table):
01101110 01100101 01110010 01100100 = 110 101 114 100
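You can reproduce both representations above in a few lines of Python; `str.encode` gives us the raw bytes, and from there the binary and decimal views are just formatting:

```python
word = "nerd"
data = word.encode("ascii")   # one byte per character

binary = " ".join(f"{b:08b}" for b in data)   # each byte as 8 bits
decimal = list(data)                          # each byte as a 0-255 number

print(binary)    # 01101110 01100101 01110010 01100100
print(decimal)   # [110, 101, 114, 100]
```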
But this is not a token. Yet.