00:06 Last time we implemented a basic language model using simple n-gram
matching. We operated directly on words — essentially the raw strings from our
corpus, or input text. One of the obvious next steps when working with language
models is to tokenize the input, and today we want to explore this important
preprocessing step.
00:39 Tokenization converts words into numbers, because we can operate
more efficiently on numerical data: we can build matrices from these numbers,
which is far faster than working with raw strings. However, tokenization
is more subtle than simply assigning each word a fixed number. While we could
give every English word a unique number, what language models typically do is
break words into parts, where each part becomes a token. This results in more
tokens than words, and we can think of these tokens as common combinations of
letters, ranked by how frequently they appear in text.
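For example, a subword vocabulary might encode a word like this (the splits and IDs below are invented purely for illustration):
let vocabulary = ["token": 512, "ization": 1287] // hypothetical token IDs
let pieces = ["token", "ization"] // "tokenization", split into two subwords
let ids = pieces.compactMap { vocabulary[$0] } // [512, 1287]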
01:32 This leads us to byte pair encoding, a clever algorithm that starts
with every character in our corpus as individual tokens. We then iterate over
these tokens a number of times. For each iteration, we find the two most
commonly occurring adjacent tokens and we merge them into a single new token.
For example, if we have "cat" and "bat," the pair "a" and "t" would be most
common, so our vocabulary would evolve to include "c", "at", "b", and "at" as
our tokens. We repeat this process, and with each step, we replace occurrences
in our vocabulary. If we push this to the extreme with many iterations, entire
common words end up as single tokens in our dictionary.
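To make the merging concrete, here's a trace of the first two steps for the two words above (ties between equally frequent pairs are broken arbitrarily):
// start:   "cat" = ["c", "a", "t"], "bat" = ["b", "a", "t"]
//          vocabulary: "c", "a", "t", "b"
// merge 1: ("a", "t") occurs twice, so we add the token "at"
//          "cat" = ["c", "at"], "bat" = ["b", "at"]
// merge 2: ("c", "at") and ("b", "at") each occur once; we pick one,
//          say ("b", "at"), and add "bat" to the vocabulary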
Implementing Token Vocabulary Generation
02:42 The basic approach is to take our training sample, build up a
token vocabulary, and systematically find the most common pairs. We can test
this with a simple example using words like "cat," "bat," "rat," and maybe "bat"
again. Let's create a function to build our token vocabulary. For a given input
of words, we want to return a list of tokens. Each token is itself a String:
typealias Token = String
03:50 Our first step is to build an array of the unique characters that
appear in our input, and store them as tokens (i.e. strings). We do this by
joining the input words into one big string, then creating a Set to get unique
characters, and then mapping each character to a String:
func tokens(inputWords: [String]) -> [Token] {
    Set(inputWords.joined()).map { String($0) }
}
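As a quick sanity check, we could also call this directly; note that the result order varies between runs, because Set has no stable ordering:
let initial = tokens(inputWords: ["cat", "bat"])
print(initial) // e.g. ["t", "b", "c", "a"], in some order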
04:36 We can quickly display the results of calling this function to see
if it works:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = "cat bat rat bat"
    var body: some View {
        ScrollView {
            Text(tokens(inputWords: text.words()).joined(separator: "\n"))
                .monospaced()
        }
    }
}
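The words() helper on String comes from the previous episode. A minimal sketch of it, assuming it simply splits on whitespace (the real helper might also deal with punctuation or casing):
extension String {
    // Split on any whitespace and drop empty fragments.
    func words() -> [String] {
        split(whereSeparator: \.isWhitespace).map(String.init)
    }
}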
05:17 When we run this, we see the letters "b," "r," "t," "c," and "a."
These strings of individual characters form our starting set of tokens. Now we
need to iterate and start building up pairs, then triples, and so on.
Finding the Most Common Token Pairs
05:51 We need to find the most common pair of adjacent tokens, and we need a
place to store the newly discovered tokens as we build them up. We also pass in
an iteration count, which determines how many new tokens we'll generate:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
    }
    return tokens
}
07:11 In each iteration, we want to find the frequencies of all adjacent
token pairs. Before we can do this, we need to convert the input words into
arrays of tokens. We do so using a double map — the outer map processes each
word, and the inner map converts each character into a single-character token
string:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
    }
    return tokens
}
09:25 The type of the words variable is [[String]] — each element of
words is an array of tokens, and each token is a string. We loop over these
words and use the zip operation to create pairs of adjacent tokens. This gives
us an array of pairs of adjacent tokens. Initially these are pairs of
characters, but as we iterate, they become pairs of more complex tokens:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        for word in words {
            for pair in zip(word, word.dropFirst()) {
            }
        }
    }
    return tokens
}
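To see what the zip trick produces for a single tokenized word, here's a standalone illustration:
let word = ["c", "a", "t"]
// zip pairs each token with its successor and stops at the shorter
// sequence, so a word of n tokens yields n - 1 adjacent pairs.
for pair in zip(word, word.dropFirst()) {
    print(pair) // ("c", "a"), then ("a", "t")
}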
10:28 We want to keep track of the number of times we encounter each
token pair. We loop over the pairs and we increment the count of each pair by
one, starting with a default count of zero for new pairs. Since tuples aren't
directly Hashable in Swift, we have to convert the pairs to arrays to use them
as keys of our frequencies dictionary:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
    }
    return tokens
}
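As an aside, instead of using two-element arrays as dictionary keys, we could define a small Hashable struct; this Pair type is a hypothetical alternative, not what we use here:
struct Pair: Hashable {
    let first: Token
    let second: Token
}
// frequencies would then have the type [Pair: Int], and the
// increment would become:
// frequencies[Pair(first: pair.0, second: pair.1), default: 0] += 1
Sticking with arrays keeps the code shorter, at the cost of losing the two-element guarantee the struct would give us.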
Processing Frequencies and Replacing Tokens
11:53 We now have a frequency count for each combination of adjacent
tokens in our word list. Let's print the results to see if it makes sense. We
should see "at" as the token with the highest frequency:
var frequencies: [[Token]: Int] = [:]
for word in words {
    for pair in zip(word, word.dropFirst()) {
        frequencies[[pair.0, pair.1], default: 0] += 1
    }
}
print(frequencies)
fatalError()
12:29 The console shows [["a", "t"]: 4, ["c", "a"]: 1, ["b", "a"]: 2, ["r", "a"]: 1], which looks correct for our sample data "cat bat rat bat".
12:37 The next step is finding the highest frequency pair and replacing
every occurrence of it with a combined token. To do so, we first find that pair
using max:
let pair = frequencies.max(by: { $0.value < $1.value })
print(pair)
fatalError()
13:40 This returns an optional token pair (along with the pair's count,
which we can ignore). After unwrapping it with if-let, we combine the pair's two
tokens into a single new token. We add the combined token to the tokens array,
and — to prepare for the next iteration — we replace all occurrences of the
token pair in words with the combined token:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
        if let (pair, _) = frequencies.max(by: { $0.value < $1.value }) {
            assert(pair.count == 2)
            let newToken = pair[0] + pair[1]
            tokens.append(newToken)
            words = words.map {
                $0.replacing(pair, with: [newToken])
            }
        }
    }
    return tokens
}
15:43 After running 10 iterations, we see not only the individual
characters and the entire input words, but also the intermediate token "at" in
our vocabulary, since "a" followed by "t" is the most frequent pair in our
sample input.
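Calling the finished function on the sample input confirms this (the exact order varies, since the initial character tokens come from a Set):
let vocabulary = tokens(inputWords: ["cat", "bat", "rat", "bat"])
print(vocabulary)
// e.g. ["b", "r", "a", "c", "t", "at", "bat", "cat", "rat"]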
Switching to Larger Input
15:57 Let's see what happens when we run this algorithm on the entire
text of Crime and Punishment:
var input: String {
    try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
}
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = "cat bat rat bat"
    var body: some View {
        ScrollView {
            Text(tokens(inputWords: input.words()).joined(separator: "\n"))
                .monospaced()
        }
        VStack {
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
16:39 We can add a print statement for each iteration to monitor
progress, since processing the full text takes considerable time:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        print("\(i)")
        var frequencies: [[Token]: Int] = [:]
        // ... the rest of the loop body stays the same ...
    }
    return tokens
}
17:18 The algorithm runs correctly, even though it executes twice
because it's called from the view body during UI updates. When the processing
finishes, we see the original alphabet plus newly discovered token combinations
like "he", "in", "an", "the", "and" — the algorithm seems to be working.
Optimizing Performance with Word Frequency Counting
17:42 While we'd like to see results with 100 iterations, the current
implementation would take too long. Currently, we have a list of 20,000-30,000
words, and we loop over all of them when generating each token. By counting word
frequencies upfront, we can create a smaller collection of unique words to
iterate over.
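A sketch of that frequency-weighted idea, as a variant of our function (this is our own sketch; below, we jump straight to an even simpler shortcut):
// Hypothetical weighted variant: tokenize each unique word once and
// remember how often it occurs in the corpus.
func weightedTokens(inputWords: [String], iterations: Int = 100) -> [Token] {
    var wordCounts: [[Token]: Int] = [:]
    for word in inputWords {
        wordCounts[word.map { String($0) }, default: 0] += 1
    }
    var tokens = Set(inputWords.joined()).map { String($0) }
    for _ in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for (word, count) in wordCounts {
            for pair in zip(word, word.dropFirst()) {
                // Weight each pair by its word's corpus frequency.
                frequencies[[pair.0, pair.1], default: 0] += count
            }
        }
        guard let (pair, _) = frequencies.max(by: { $0.value < $1.value }) else { break }
        let newToken = pair[0] + pair[1]
        tokens.append(newToken)
        wordCounts = Dictionary(wordCounts.map {
            ($0.key.replacing(pair, with: [newToken]), $0.value)
        }, uniquingKeysWith: +)
    }
    return tokens
}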
18:39 We can optimize even further and just do a test with unique
words, ignoring the word counts entirely. This doesn't produce a correct
vocabulary of tokens, but we can get some results much faster:
func tokens(inputWords: [String], iterations: Int = 100) -> [Token] {
    var words = Set(inputWords.map { $0.map { String($0) }})
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        print("\(i)")
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
        if let (pair, _) = frequencies.max(by: { $0.value < $1.value }) {
            assert(pair.count == 2)
            let newToken = pair[0] + pair[1]
            tokens.append(newToken)
            words = Set(words.map {
                $0.replacing(pair, with: [newToken])
            })
        }
    }
    return tokens
}
19:02 We can now increase the iteration count to 100 and see what
happens. The optimized version processes much more quickly, though some common
words might not appear as prominently since they only count as single
occurrences rather than being weighted by their actual frequency in the corpus.
Still, we can see many sensible token combinations emerging from the larger
corpus with more iterations.
Next Steps and Applications
19:55 Increasing the corpus size is a relatively cheap modification. It
wouldn't give us many more unique words, because English has a limited
vocabulary; we'd essentially be applying higher counts to the patterns we
already see. The iteration process will take a long time, but this is typically
done only once as a preprocessing step.
20:22 For production language models, the next step would be to replace
string-based storage in our n-gram model with integer indices, making the model
much more memory-efficient.
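A minimal sketch of that step, assuming we simply number the vocabulary we built above (the tokenID name is ours, for illustration):
let vocabulary = tokens(inputWords: input.words())
// Map each token string to a stable integer index, keeping the
// first index in case a token ever appears twice.
let tokenID = Dictionary(vocabulary.enumerated().map { ($0.element, $0.offset) },
    uniquingKeysWith: { first, _ in first })
// The n-gram model could then store [Int] instead of [String].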
20:45 Choosing the right parameters — the number of tokens, the type of
corpus, and the iteration count — is probably very important for tuning
large-scale models. Our current approach works well for English and might even
extend to similar languages like Dutch and German, but for languages like
Chinese, we'd probably need different tokenization strategies.
21:21 Next time, we can integrate these tokens into our language model
and see how that benefits the quality of our word predictions.