How Does Machine Translation Work? Part 2: The Matrix

Peter Founder

If you haven’t read Part 1, start there.

All ML models are mathematical models. They don’t work directly on words, or images, or audio, they work only on numbers. You can’t plot a word or a sentence, so we need to convert to numbers somehow.

The naive approach would be to assign a number to each word, or come up with a scheme like using the positions of letters in the alphabet. So cat would become 030120, because c, a, and t are the 3rd, 1st, and 20th letters of the alphabet. This approach doesn’t work well because we’re essentially giving the model random numbers. Yes, ABC has some meaning to us, but when you think about it, it’s all totally arbitrary. There’s no reason the English language would function much differently if our alphabet started with XJQ instead of ABC. So using numbers generated this way would be like saying to the model, “Tell me if these photos are of dogs or cats, but instead of the photo, I’m going to give you a random number I made up.” How is the model supposed to extract meaningful information from random numbers? We need a better approach.
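To make that naive scheme concrete, here’s a tiny sketch of the alphabet-position encoding (the function name and details are made up for illustration; the article’s whole point is that these numbers are arbitrary):

```python
def naive_encode(word):
    # 'a' -> 01, 'b' -> 02, ..., 'z' -> 26, concatenated as digits
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in word.lower())

print(naive_encode("cat"))  # -> 030120
print(naive_encode("car"))  # -> 030118 (numerically close to "cat", but that closeness means nothing)
```

Notice that "cat" and "car" come out numerically close while "cat" and "kitten" don’t, which is exactly backwards from what a model would need.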

Numbers have relationships to each other. 3 is less than 9. 3 is also one third of 9. 3 is the square root of 9. So what number represents the word dog? What number captures all the various relationships dog has with other words? We might say that dog is less than wolf, but is dog more or less than tomato? Words have more complex relationships than simple numbers do, so we need something more complex than numbers like 1, 2, 3, … We need multi-dimensional numbers, but that’s not as scary as it sounds.

Let’s quickly review dimensions, as if we were drawing on a piece of paper. We start with a dot. It’s a dot. There’s not much more to say about it. You can say whether one dot you drew is in exactly the same place as another dot, or in a different place. That’s dots.

Next up is a line. Now there’s a lot more to say. Lines can be long or short. They can go in a certain direction. We can say that two lines are touching, intersecting, or not touching. We can say whether they’re parallel or not. Way better than dots.

One level up from lines is 2-D shapes, like squares and circles. Again, there’s more to them than lines. How many corners does a shape have? How many sides? How big is it? Or even more abstract things: is it lopsided or even? We’ve come a long way since dots, but don’t worry, one more to go.

Lastly we have 3-D shapes, like cubes and spheres. We can say how much surface area they have, or put another way, how much wallpaper we’d need to cover one with wallpaper. Or volume, which is to say, if we filled one with water, how much water would fit inside. We can say whether one shape will fit inside another shape (which is not the same as one being larger or smaller than another).

So going back to our words that need numbers: if a plain number (like 4 or 5) is more like a dot, then what we want is something more like a cube. This brings us to: the matrix. Well, technically they’re called tensors, of which a matrix is just one kind, but this isn’t math class, and you’ve probably seen or at least heard of the movie The Matrix more than you’ve heard of a “tensor”, so we’ll go with “MATRIX”. A matrix is just a multi-dimensional number. It’s basically a box of many numbers, which can hold more complicated data and relationships than a single number like 9 can.
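If you want to see the dot-line-box progression in code, NumPy arrays make it concrete: a single number, a row of numbers, and a grid of numbers are just arrays with more and more dimensions (this is purely an illustration, not part of any training process):

```python
import numpy as np

# A single number: zero-dimensional, like the dot.
scalar = np.array(7)

# A row of numbers: one-dimensional, like the line.
vector = np.array([0.2, -1.5, 3.0])

# A grid of numbers: two-dimensional, our "box of numbers" (the matrix).
matrix = np.array([[1.0, 0.5],
                   [0.5, 2.0]])

print(scalar.ndim, vector.ndim, matrix.ndim)  # -> 0 1 2
```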

Dimensions illustration (dot, line, square, cube)

Since we can’t compare dog, tomato, wolf, and cat as easily as we can compare 4, 5, 6, and 7, maybe if we had a matrix for each word, we could. So, glossing over a lot of complexities: we start out with each word’s matrix set to all zeros, then we feed a bunch of text into a program, as much as we can, text of any kind. It looks at each word and the words that often surround it, and each time it does some magic math “tweaks” to that word’s matrix. When it’s done, we have a matrix that represents each word. Each word has its own box of numbers. But how do we know if the program did a good job? It’s easy to write a program that looks at a word and tweaks some numbers randomly; it’s much harder to capture all the intricacies of word relationships. Luckily, we can test it without having to know exactly what all the numbers in the box mean. For example, we take the matrix for the word king and the matrix for man, and subtract them. Then we add the matrix for woman. So we have king - man + woman, and then we ask the program which of all the matrices (plural of matrix) is most similar to this one we just created, and it comes back with queen! And then maybe we take monkey, banana, and human, and ask which matrices are similar to human in the same way banana is similar/related to monkey, and it comes back with bread, rice, pizza, burger, etc.
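Here’s a toy sketch of the king - man + woman test. Real word matrices are learned from text and have hundreds of dimensions; the tiny vectors and “meanings” below are invented purely to show the mechanics, using cosine similarity as the “which is most similar?” measure:

```python
import numpy as np

# Made-up toy word vectors; dimensions loosely mean [royalty, maleness, femaleness].
words = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "tomato": np.array([0.05, 0.05, 0.05]),
}

def cosine(a, b):
    # Similarity of direction: 1.0 means pointing the same way.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman, then find the most similar remaining word.
target = words["king"] - words["man"] + words["woman"]
best = max((w for w in words if w not in ("king", "man", "woman")),
           key=lambda w: cosine(words[w], target))
print(best)  # -> queen
```

With real learned embeddings the idea is the same, just with far more dimensions and a vocabulary of many thousands of words.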

Hopefully it’s clear why this method of turning language into numbers is better than our original, ABC-based method. We can never perfectly represent language with numbers, because numbers live in a purely math realm, and language doesn’t. However, with these multi-dimensional, matrix, word numbers, we can do our best to give the model some data that has mathematical relationships that more closely mimic the relationships that words have with each other in the non-mathematical, language realm. We’re basically translating the words from a linguistic language that we speak, to a mathematical language that the model speaks.

Part 3 is next.
