If you haven’t read Part 1, start there.

All ML models are mathematical models. They don’t work directly on words, images, or audio; they work only on numbers. You can’t plot a word or a sentence, so we need to convert to numbers somehow.

The naive approach would be to assign a number to each word, or come up with a scheme like using letters of the alphabet. So `cat` would become `030120`, because c, a, and t are the 3rd, 1st, and 20th letters of the alphabet. The reason this approach doesn’t work well is that we’re essentially giving the model random numbers. Yes, ABC has some meaning to us, but when you think about it, it’s all totally arbitrary. There’s no reason the English language would function much differently if our alphabet started with XJQ instead of ABC. So using numbers generated this way would be like saying to the model, “Tell me if these photos are of dogs or cats, but instead of the photo, I’m going to give you a random number I made up”. How is the model supposed to extract meaningful information from random numbers? We need a better approach.
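Just to make the naive scheme concrete, here’s a tiny sketch of it. The function name is made up for illustration:

```python
def naive_encode(word: str) -> str:
    """Encode each letter as its two-digit position in the alphabet."""
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in word.lower())

print(naive_encode("cat"))  # -> "030120"
```

Note that nothing about `030120` tells you that a cat is an animal, or that it’s related to `dog`. The number is just the alphabet positions glued together.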

Numbers have a relation to each other. 3 is less than 9. 3 is also one third of 9. 3 is the square root of 9. So what number represents the word `dog`? What number represents all the various relationships `dog` has with other words? We might say that `dog` is less than `wolf`, but is `dog` more or less than `tomato`? Words have more complex relationships than simple numbers do. So we need something more complex than numbers like 1, 2, 3, … We need multi-dimensional numbers, but that’s not as math-scary as it sounds.

Let’s quickly review dimensions, as if we were drawing on a piece of paper. We start with a dot. It’s a dot. There’s not much more to say about it. You can say if one dot you drew is in exactly the same place as another dot, or if it’s in a different place. That’s dots.

Next up is a line. Now there’s a lot more to say. Lines can be long or short. They can go in a certain direction. We can say that two lines are touching, intersecting, or not touching. We can say if they’re parallel or not. Way better than dots.

One level up from lines is 2-D shapes, like squares and circles. Again, there’s more to them than lines. We can ask: how many corners does it have? How many sides? How big is it? Or even more abstract things like, is it lopsided or even? We’ve come a long way since dots, but don’t worry, one more to go.

Lastly we have 3-D shapes, like cubes and spheres. We can say how much surface area they have, or put another way, how much wallpaper we would need to cover one with wallpaper. Or volume, which is to say, if we filled one with water, how much water would fit inside. We can say if one shape will fit inside another shape (which is not the same as one being larger/smaller than another).

So going back to our words that we need numbers for: if a single number (like 4 or 5) is more similar to a dot, we want something more like a cube. This brings us to: the matrix. Well, technically they’re called tensors, of which a matrix is just one kind, but this isn’t math class, and you’ve probably seen or at least heard of the movie The Matrix more than you’ve heard of a “tensor”, so we’ll go with “MATRIX”. A matrix is just a multi-dimensional number. It’s basically a box of many numbers which can be used to hold more complicated data/relationships than a single number like 9 can.

Since we can’t compare `dog`, `tomato`, `wolf`, and `cat` as easily as we can compare 4, 5, 6, and 7, maybe if we had a matrix for each word we could. So, glossing over a lot of complexities: we start out with each word’s matrix set to all zeros, then we feed a bunch of text into a program, as much as we can, text of any kind. It looks at each word, and what words it is often surrounded by, and each time it does some magic math “tweaks” to the matrix for that word. When it’s done, we have a matrix that represents each word. Each word has its own box of numbers.

But how do we know if the program did a good job? It’s easy to write a program that looks at a word and tweaks some numbers randomly; it’s much harder to capture all the intricacies of word relationships. Luckily we can test it out without having to know exactly what all the numbers in the box mean. For example, we take the matrix for the word `king`, and for the word `man`, and subtract them. Then we take the matrix for `woman` and add it. So we have `king - man + woman`, and then we ask the program to find, out of all the matrices (plural for matrix), which one is most similar to the one we just created, and eventually it comes back with `queen`! And then maybe we take `monkey` and `banana`, and `human`, and ask which matrices are similar to `human` in the same way `banana` is similar/related to `monkey`, and it comes back with `bread`, `rice`, `pizza`, `burger`, etc.
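The `king - man + woman` test can be sketched in a few lines. The vectors below are hand-made so the example works on its own; in a real system they’d be learned from mountains of text:

```python
import numpy as np

# Hand-made toy vectors: (royalty-ish, male-ish, female-ish).
# Invented for illustration; real embeddings are learned, not written by hand.
words = {
    "king":   np.array([0.9, 0.9, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.9]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "tomato": np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    """How similar two vectors are in direction (1.0 = pointing the same way)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = words["king"] - words["man"] + words["woman"]

# Find the word whose vector is most similar to the target,
# skipping the three words we combined.
best = max(
    (w for w in words if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(words[w], target),
)
print(best)  # -> queen
```

The subtraction removes the “male-ish” part of `king`, the addition puts a “female-ish” part in, and the nearest remaining vector is `queen`, not `tomato`.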

Hopefully it’s clear why this method of turning language into numbers is better than our original, ABC-based method. We can never *perfectly* represent language with numbers, because numbers live in a purely mathematical realm, and language doesn’t. However, with these multi-dimensional, matrix-based word numbers, we can do our best to give the model data whose mathematical relationships more closely mimic the relationships that words have with each other in the non-mathematical, language realm. We’re basically translating the words from a linguistic language that we speak, to a mathematical language that the model speaks.

Part 3 is next.