00:06 Last time we implemented a basic language model using simple n-gram
matching. We operated directly on words — essentially the raw strings from our
corpus, or input text. One of the obvious next steps when working with language
models is to tokenize the input, and today we want to explore this important
preprocessing step.
00:39 Tokenization converts words into numbers, because we can operate
more efficiently on numerical data: we can build matrices from these numbers,
which is far faster than working with raw strings. However, tokenization
is more subtle than simply assigning each word a fixed number. While we could
give every English word a unique number, what language models typically do is
break words into parts, where each part becomes a token. This results in more
tokens than words, and we can think of these tokens as common combinations of
letters, ranked by how frequently they appear in text.
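For example, a subword vocabulary might encode a word like this (the splits and IDs below are invented purely for illustration):
let vocabulary = ["token": 512, "ization": 1287] // hypothetical token IDs
let pieces = ["token", "ization"] // "tokenization", split into two subwords
let ids = pieces.compactMap { vocabulary[$0] } // [512, 1287]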
01:32 This leads us to byte pair encoding, a clever algorithm that starts
with every character in our corpus as individual tokens. We then iterate over
these tokens a number of times. For each iteration, we find the two most
commonly occurring adjacent tokens and we merge them into a single new token.
For example, if we have "cat" and "bat," the pair "a" and "t" would be most
common, so our vocabulary would evolve to include "c", "at", "b", and "at" as
our tokens. We repeat this process, and with each step, we replace occurrences
in our vocabulary. If we push this to the extreme with many iterations, entire
common words end up as single tokens in our dictionary.
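To make the merging concrete, here's a trace of the first two steps for the two words above (ties between equally frequent pairs are broken arbitrarily):
// start:   "cat" = ["c", "a", "t"], "bat" = ["b", "a", "t"]
//          vocabulary: "c", "a", "t", "b"
// merge 1: ("a", "t") occurs twice, so we add the token "at"
//          "cat" = ["c", "at"], "bat" = ["b", "at"]
// merge 2: ("c", "at") and ("b", "at") each occur once; we pick one,
//          say ("b", "at"), and add "bat" to the vocabulary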
Implementing Token Vocabulary Generation
02:42 The basic approach is to take our training sample, build up a
token vocabulary, and systematically find the most common pairs. We can test
this with a simple example using words like "cat," "bat," "rat," and maybe "bat"
again. Let's create a function to build our token vocabulary. For a given input
of words, we want to return a list of tokens. Each token is itself a String:
typealias Token = String
03:50 Our first step is to build an array of the unique characters that
appear in our input, and store them as tokens (i.e. strings). We do this by
joining the input words into one big string, then creating a Set to get unique
characters, and then mapping each character to a String:
func tokens(inputWords: [String]) -> [Token] {
    Set(inputWords.joined()).map { String($0) }
}
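As a quick sanity check, we could also call this directly; note that the result order varies between runs, because Set has no stable ordering:
let initial = tokens(inputWords: ["cat", "bat"])
print(initial) // e.g. ["t", "b", "c", "a"], in some order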
04:36 We can quickly display the results of calling this function to see
if it works:
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = "cat bat rat bat"
    var body: some View {
        ScrollView {
            Text(tokens(inputWords: text.words()).joined(separator: "\n"))
                .monospaced()
        }
    }
}
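The words() helper on String comes from the previous episode. A minimal sketch of it, assuming it simply splits on whitespace (the real helper might also deal with punctuation or casing):
extension String {
    // Split on any whitespace and drop empty fragments.
    func words() -> [String] {
        split(whereSeparator: \.isWhitespace).map(String.init)
    }
}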
05:17 When we run this, we see the letters "b," "r," "t," "c," and "a."
These strings of individual characters form our starting set of tokens. Now we
need to iterate and start building up pairs, then triples, and so on.
Finding the Most Common Token Pairs
05:51 We need to find the most common pair of adjacent tokens, and we need a
place to store the newly discovered tokens as we build them up. We also pass in
an iteration count, which determines how many new tokens we'll generate:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
    }
    return tokens
}
07:11 In each iteration, we want to find the frequencies of all adjacent
token pairs. Before we can do this, we need to convert the input words into
arrays of tokens. We do so using a double map — the outer map processes each
word, and the inner map converts each character into a single-character token
string:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
    }
    return tokens
}
09:25 The type of the words variable is [[String]] — each element of
words is an array of tokens, and each token is a string. We loop over these
words and use the zip operation to create pairs of adjacent tokens. This gives
us an array of pairs of adjacent tokens. Initially these are pairs of
characters, but as we iterate, they become pairs of more complex tokens:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        for word in words {
            for pair in zip(word, word.dropFirst()) {
            }
        }
    }
    return tokens
}
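To see what the zip trick produces for a single tokenized word, here's a standalone illustration:
let word = ["c", "a", "t"]
// zip pairs each token with its successor and stops at the shorter
// sequence, so a word of n tokens yields n - 1 adjacent pairs.
for pair in zip(word, word.dropFirst()) {
    print(pair) // ("c", "a"), then ("a", "t")
}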
10:28 We want to keep track of the number of times we encounter each
token pair. We loop over the pairs and we increment the count of each pair by
one, starting with a default count of zero for new pairs. Since tuples aren't
directly Hashable in Swift, we have to convert the pairs to arrays to use them
as keys of our frequencies dictionary:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
    }
    return tokens
}
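As an aside, instead of using two-element arrays as dictionary keys, we could define a small Hashable struct; this Pair type is a hypothetical alternative, not what we use here:
struct Pair: Hashable {
    let first: Token
    let second: Token
}
// frequencies would then have the type [Pair: Int], and the
// increment would become:
// frequencies[Pair(first: pair.0, second: pair.1), default: 0] += 1
Sticking with arrays keeps the code shorter, at the cost of losing the two-element guarantee the struct would give us.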
Processing Frequencies and Replacing Tokens
11:53 We now have a frequency count for each combination of adjacent
tokens in our word list. Let's print the results to see if it makes sense. We
should see "at" as the token with the highest frequency:
var frequencies: [[Token]: Int] = [:]
for word in words {
    for pair in zip(word, word.dropFirst()) {
        frequencies[[pair.0, pair.1], default: 0] += 1
    }
}
print(frequencies)
fatalError()
12:29 The console shows [["a", "t"]: 4, ["c", "a"]: 1, ["b", "a"]: 2, ["r", "a"]: 1], which looks correct for our sample data "cat bat rat bat".
12:37 The next step is finding the highest frequency pair and replacing
every occurrence of it with a combined token. To do so, we first find that pair
using max:
let pair = frequencies.max(by: { $0.value < $1.value })
print(pair)
fatalError()
13:40 This returns an optional token pair (along with the pair's count,
which we can ignore). After unwrapping it with if-let, we combine the pair's two
tokens into a single new token. We add the combined token to the tokens array,
and — to prepare for the next iteration — we replace all occurrences of the
token pair in words with the combined token:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
        if let (pair, _) = frequencies.max(by: { $0.value < $1.value }) {
            assert(pair.count == 2)
            let newToken = pair[0] + pair[1]
            tokens.append(newToken)
            words = words.map {
                $0.replacing(pair, with: [newToken])
            }
        }
    }
    return tokens
}
15:43 After running 10 iterations, we see not only the individual
characters and the entire input words, but also the intermediate token "at" in
our vocabulary, since "a" followed by "t" is the most frequent pair in our
sample input.
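Calling the finished function on the sample input confirms this (the exact order varies, since the initial character tokens come from a Set):
let vocabulary = tokens(inputWords: ["cat", "bat", "rat", "bat"])
print(vocabulary)
// e.g. ["b", "r", "a", "c", "t", "at", "bat", "cat", "rat"]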
Switching to Larger Input
15:57 Let's see what happens when we run this algorithm on the entire
text of Crime and Punishment:
var input: String {
    try! String(contentsOf: Bundle.main.url(forResource: "corpus", withExtension: "txt")!, encoding: .utf8)
}
struct ContentView: View {
    @State var ngram = NGram()
    @State var text = "cat bat rat bat"
    var body: some View {
        ScrollView {
            Text(tokens(inputWords: input.words()).joined(separator: "\n"))
                .monospaced()
        }
        VStack {
        }
        .padding()
        .onAppear {
            ngram = run()
        }
    }
}
16:39 We can add a print statement for each iteration to monitor
progress, since processing the full text takes considerable time:
func tokens(inputWords: [String], iterations: Int = 10) -> [Token] {
    var words = inputWords.map { $0.map { String($0) }}
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        print("\(i)")
        var frequencies: [[Token]: Int] = [:]
        // ... the rest of the loop body stays the same ...
    }
    return tokens
}
17:18 The algorithm runs correctly, even though it executes twice
because it's called from the view body during UI updates. When the processing
finishes, we see the original alphabet plus newly discovered token combinations
like "he", "in", "an", "the", "and" — the algorithm seems to be working.
Optimizing Performance with Word Frequency Counting
17:42 While we'd like to see results with 100 iterations, the current
implementation would take too long. Currently, we have a list of 20,000-30,000
words, and we loop over all of them when generating each token. By counting word
frequencies upfront, we can create a smaller collection of unique words to
iterate over.
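A sketch of that frequency-weighted idea, as a variant of our function (this is our own sketch; below, we jump straight to an even simpler shortcut):
// Hypothetical weighted variant: tokenize each unique word once and
// remember how often it occurs in the corpus.
func weightedTokens(inputWords: [String], iterations: Int = 100) -> [Token] {
    var wordCounts: [[Token]: Int] = [:]
    for word in inputWords {
        wordCounts[word.map { String($0) }, default: 0] += 1
    }
    var tokens = Set(inputWords.joined()).map { String($0) }
    for _ in 0..<iterations {
        var frequencies: [[Token]: Int] = [:]
        for (word, count) in wordCounts {
            for pair in zip(word, word.dropFirst()) {
                // Weight each pair by its word's corpus frequency.
                frequencies[[pair.0, pair.1], default: 0] += count
            }
        }
        guard let (pair, _) = frequencies.max(by: { $0.value < $1.value }) else { break }
        let newToken = pair[0] + pair[1]
        tokens.append(newToken)
        wordCounts = Dictionary(wordCounts.map {
            ($0.key.replacing(pair, with: [newToken]), $0.value)
        }, uniquingKeysWith: +)
    }
    return tokens
}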
18:39 We can optimize even further and just do a test with unique
words, ignoring the word counts entirely. This doesn't produce a correct
vocabulary of tokens, but we can get some results much faster:
func tokens(inputWords: [String], iterations: Int = 100) -> [Token] {
    var words = Set(inputWords.map { $0.map { String($0) }})
    var tokens = Set(inputWords.joined()).map { String($0) }
    for i in 0..<iterations {
        print("\(i)")
        var frequencies: [[Token]: Int] = [:]
        for word in words {
            for pair in zip(word, word.dropFirst()) {
                frequencies[[pair.0, pair.1], default: 0] += 1
            }
        }
        if let (pair, _) = frequencies.max(by: { $0.value < $1.value }) {
            assert(pair.count == 2)
            let newToken = pair[0] + pair[1]
            tokens.append(newToken)
            words = Set(words.map {
                $0.replacing(pair, with: [newToken])
            })
        }
    }
    return tokens
}
19:02 We can now increase the iteration count to 100 and see what
happens. The optimized version processes much more quickly, though some common
words might not appear as prominently since they only count as single
occurrences rather than being weighted by their actual frequency in the corpus.
Still, we can see many sensible token combinations emerging from the larger
corpus with more iterations.
Next Steps and Applications
19:55 Increasing the corpus size is a relatively cheap modification. It
wouldn't give us many more unique words, because English has a limited
vocabulary; we'd essentially be applying higher counts to the patterns we
already see. The iteration process will take a long time, but this is typically
done only once as a preprocessing step.
20:22 For production language models, the next step would be to replace
string-based storage in our n-gram model with integer indices, making the model
much more memory-efficient.
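A minimal sketch of that step, assuming we simply number the vocabulary we built above (the tokenID name is ours, for illustration):
let vocabulary = tokens(inputWords: input.words())
// Map each token string to a stable integer index, keeping the
// first index in case a token ever appears twice.
let tokenID = Dictionary(vocabulary.enumerated().map { ($0.element, $0.offset) },
    uniquingKeysWith: { first, _ in first })
// The n-gram model could then store [Int] instead of [String].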
20:45 Choosing the right parameters — the number of tokens, the type of
corpus, and the iteration count — is probably very important for tuning
large-scale models. Our current approach works well for English and might even
extend to similar languages like Dutch and German, but for languages like
Chinese, we'd probably need different tokenization strategies.
21:21 Next time, we can integrate these tokens into our language model
and see how that benefits the quality of our word predictions.