A quick overview of Byte Pair Encoding tokenizers!
Tokenizers are algorithms that convert text into numbers that computers can understand, and today we will be covering one such tokenizer: the Byte Pair Encoding (BPE) tokenizer. These are the tokenizers used in LLMs like GPT-2, Llama 3, etc. Let us talk about the preliminaries first - “Bytes”!
A bit is the smallest unit of information, representing either 0 or 1, and 8 bits make up one byte. Now, what is the most naive way to encode text? Just swap every character with a number (or, at the lowest level, with a byte), or go one step further and swap every word with a number. Even if you were careful not to assign different encodings to repeated words in a text, you would still end up with a mammoth numerical encoding of the given text input (consider the fact that the English language alone has over a million unique words, so having a vocabulary of over a million words is not the most optimal thing to do).
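To make this concrete, here is a minimal, purely illustrative sketch of the two naive approaches (not part of any real tokenizer): one number per character, and one ID per unique word.

# Character-level: one number (here, the Unicode code point) per character.
text = "fishing at five"
char_ids = [ord(c) for c in text]
print(char_ids)   # one token for every single character, spaces included

# Word-level: one ID per unique word - on a real corpus this vocabulary explodes.
vocab = {}
word_ids = [vocab.setdefault(w, len(vocab)) for w in text.split()]
print(word_ids, vocab)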
But okay, you’d argue that with increasing compute power this really shouldn’t be an issue. If you look carefully under the hood, though, the biggest downside of this approach shows up not with large chunks of text but with small ones: if a paragraph with, say, 50 words needs 50 tokens to be encoded, that is excessive, and ideally we would like to minimize this overhead for smaller chunks of text. So even though BPE tokenizers do involve replacing text with numbers, they do so in a clever way:
Consider the following chunk of text: fishing at five to return by the fire. If we were to tokenize it with one token for every character, we would end up with 38 tokens, which is obscenely large for such a small piece of text. So instead of having a token for every possible character or every possible word, BPE has a token for every possible subword, and these subwords in turn combine to form complete words. Consider, for instance, the subword ‘th’, which alone forms the basis of many words such as ‘this’, ‘that’, ‘there’, ‘thought’, ‘theme’, etc. Your vast word space thus gets compressed into the subwords that appear most frequently. Let’s see how a BPE tokenizer handles the above sentence fishing at five to return by the fire:
- Identify Frequent Pairs
- In the given text, ‘fi’ appears thrice (in ‘fishing’, ‘five’ and ‘fire’)
- Replace and Record
- Replace ‘fi’ with a new token ID that is not already in use, e.g. 256. Hence the new text is:
<256>shing at <256>ve to return by the <256>re
And simultaneously we also update the vocabulary: {0: ..., 1: ..., ..., 255: ..., 256: ‘fi’}
We do several iterations of the above two steps and build up the vocabulary. So, given a large enough corpus, we can build a vocabulary capable of tokenizing any arbitrary text. Detokenization, or inverse tokenization, is also pretty simple: we just apply the merges in reverse until we obtain the actual text.
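Here is a minimal sketch of that merge loop, working on plain Python characters just to illustrate the idea (the helper names are my own; a real BPE tokenizer, as we will see below, operates on bytes and is trained on a large corpus):

from collections import Counter

def most_frequent_pair(tokens):
    """Count every adjacent pair and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("fishing at five to return by the fire")
vocab = {}            # new_id -> the pair it replaced
new_id = 256          # first ID not used by single bytes/characters
for _ in range(3):    # a few merge iterations for illustration
    pair = most_frequent_pair(tokens)
    vocab[new_id] = pair
    tokens = merge(tokens, pair, new_id)
    new_id += 1

print(tokens)
print(vocab)          # 256 merges ('f', 'i') first, matching the walkthrough above

Running this, the very first merge indeed picks up (‘f’, ‘i’), exactly the pair we merged by hand above.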
Okay, but nothing is complete today unless we can see it working in code for ourselves. So we are going to write a basic tokenizer that algorithmically lies along the lines of the BPE tokenizer used in GPT-2. Let’s build the intuition for that step by step:
Identifying Frequent Pairs
The first step, as discussed, is to identify the most frequent pairs that can be merged together. If you have worked with text before, you will realize that doing this directly on strings is cumbersome; it is a lot easier if we have the strings in numeric form, so let’s work with numbers, shall we? (For the sake of simplicity, outputs are preceded by >>>.) Given a text, say:
text = "A bunch of text"
We will obtain the corresponding byte value for each character, because of course it is byte pair encoding, but the reason for choosing bytes is not arbitrary: bytes are the lowest rung of the representation ladder, which means they can encode anything and can be reconstructed perfectly. This is unlike Unicode characters, which can sometimes be inconsistent, with one or multiple code points representing the same character (e.g. é). Bytes, being a raw storage unit, are language agnostic, which means the tokenizer can effortlessly handle any language or any symbols. Alright, on with the implementation:
text_byte = bytes(text,encoding="utf-8")
#or
text_byte = text.encode("utf-8")
print(text_byte)
>>> b'A bunch of text'
In terms of bytes:
print(list(text_byte))
>>> [65, 32, 98, 117, 110, 99, 104, 32, 111, 102, 32, 116, 101, 120, 116]
So each number here is the byte value of the corresponding character in the text, i.e. 65 for ‘A’, 32 for ‘ ‘, etc. Now, in order to carry out the replacement of pairs, we first need to figure out which pairs occur together most frequently.
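A minimal sketch of that counting step (the helper name get_pair_counts is my own, not from GPT-2’s codebase) could look like this:

from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))

ids = list(text_byte)            # the byte values printed above
counts = get_pair_counts(ids)
print(counts.most_common(3))     # the top candidates for the next merge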
For a deep dive, please visit - BPEs