After days of work, I finally figured out how Factorization Machines (FM) work, and I'm writing down this note to explain them with a top-down approach.

Factorization Machines

FM was introduced by Rendle; its core is the following formula:

\[\hat{y} \left( \boldsymbol{x} \right) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij} x_ix_j\]

It looks complex, so let's break down the formula with an example.

Assume we are interested in knowing how likely our friend Bob is to go shopping in the next few days. Intuitively, the closer Bob lives to a supermarket, the more often and the more likely he will go shopping. And if Thanksgiving is coming and he lives in America, he will likely go shopping.

In the above example, we use location and time to predict Bob's behavior. When we measure the distance between Bob's home and the supermarket, location is used alone. When we notice that Thanksgiving is coming and that he lives in America, time and location are used as a combination.

Time and location are features. They are the $x_i$ and $x_j$ in the formula, and $x_ix_j$ represents the combination of two features. The $w$ is the weight of each feature or each combination of features; it decides how strongly a feature contributes to the predicted result $y$. As time and location are natural features, a transformation is needed to make them machine friendly. One-hot encoding is one such transformation.
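
To make this concrete, here is a tiny Python sketch of one-hot encoding for the Bob example; the feature names and categories are made up just for illustration.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

# Made-up categorical features for the Bob example
locations = ["near supermarket", "far from supermarket"]
days = ["Thanksgiving", "ordinary day"]

# Bob lives near a supermarket and Thanksgiving is coming
x = one_hot("near supermarket", locations) + one_hot("Thanksgiving", days)
print(x)  # [1, 0, 1, 0] -- the feature vector fed into the FM formula
```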

The name of FM comes from the $w_{ij}$ in the formula.

The $w_{ij}$ for all $i$ and $j$ form a matrix $W$, where $w_{ij}$ sits at the $i$-th row and $j$-th column. In FM, $W$ is factorized into two other matrices, just like integer factorization, for example $24 = 6 \times 4$.

Matrix Factorization

Why do we factor the matrix? Netflix gives us an example.

Suppose we have 4 users: Alice, Bob, Carlos, Dana, and 5 movies: Mall Cop, Twister, Jaws, Observe and Report, Sharknado, abbreviated M1~M5. Their ratings of the movies are shown below:

   M1  M2  M3  M4  M5
A   3   1   1   3   1
B   1   2   4   1   3
C   3   1   1   3   1
D   4   3   5   4   4

Given such a rating table, how can we predict the rating of an upcoming movie?

Well, we can think of users and movies as having features; if the features of a user and a movie are close to each other, then the rating will be high.

I'll pick two features for the users and movies. For the users, the two features are whether they like comedy or action movies. For the movies, the two features are how much they belong to the comedy or action genre.

Now, the rating table can be factorized into two smaller matrices, one for users and another for movies.

The user matrix

   comedy  action
A       1       0
B       0       1
C       1       0
D       1       1

The movie matrix

    comedy  action
M1       3       1
M2       1       2
M3       1       4
M4       3       1
M5       1       3

Each row of the factored matrices can be treated as a vector. To measure how close two vectors are, we can use cosine similarity or the dot product.

Let's verify that the user matrix and movie matrix are indeed a factorization of the original matrix. For Alice and M3, the rating is 1, obtained by picking the first row of the user matrix and the third row of the movie matrix: $\langle 1, 0 \rangle \cdot \langle 1, 4 \rangle = 1$. Done.

So, no matter how many new movies we encounter, we can take the dot product of their feature vectors with the users' and get the predicted ratings. Something important is happening: via matrix factorization, knowing only a handful of ratings is enough to predict the rest. Users and movies get connected by their features via matrix factorization.
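
To double-check this, here is a small numpy sketch that multiplies the user matrix by the transpose of the movie matrix and compares the result with the original rating table.

```python
import numpy as np

# Rows: Alice, Bob, Carlos, Dana; columns: (comedy, action)
users = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]])

# Rows: M1..M5; columns: (comedy, action)
movies = np.array([[3, 1],
                   [1, 2],
                   [1, 4],
                   [3, 1],
                   [1, 3]])

# The original rating table
ratings = np.array([[3, 1, 1, 3, 1],
                    [1, 2, 4, 1, 3],
                    [3, 1, 1, 3, 1],
                    [4, 3, 5, 4, 4]])

# Each predicted rating is the dot product of a user row and a movie row.
print(np.array_equal(users @ movies.T, ratings))  # True
```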

OK, in the Netflix example, we used matrix factorization directly to produce predictions. In FM, matrix factorization is used to produce the weights of feature combinations, and then the weights and features are used to produce predictions. By using matrix factorization, we extract the latent features underlying the weights of the feature combinations in FM.
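
Here is a minimal numpy sketch of that idea, assuming the FM form where each feature $i$ gets a latent vector $v_i$ and $w_{ij} = \langle v_i, v_j \rangle$; the bias, first-order weights, and latent vectors below are random placeholders rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 4, 2                  # n features, k latent factors (k << n)
w0 = 0.1                     # global bias (placeholder)
w = rng.normal(size=n)       # first-order weights (placeholders)
V = rng.normal(size=(n, k))  # one latent vector v_i per feature

def fm_predict(x):
    """Evaluate the FM formula with w_ij replaced by <v_i, v_j>."""
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]  # w_ij = <v_i, v_j>
    return y

x = np.array([1, 0, 1, 0])   # e.g. the one-hot vector from the Bob example
print(fm_predict(x))
```

In practice the latent vectors are learned from training data; the point here is only that the pairwise weights are never stored as a full $n \times n$ matrix.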

After introducing FM, Rendle compared it with Support Vector Machines (SVM). To understand the comparisons, I had to dig into the details of SVM.

Support Vector Machines

Consider the following figure: we have positive and negative samples, and we want to draw a line to separate them. The blue line and the yellow line both do the job, but the blue one does it better as it has a larger margin between the positive and negative samples. SVM is used to find such an optimal line.

When the optimal line is found, we will use it to classify an unknown point $\overline{u}$ as positive or negative. To do this, $\overline{u}$ is projected onto $\overline{w}$, which is perpendicular to the optimal line. In other words, $\overline{u}$ is positive if the dot product $\overline{w} \cdot \overline{u} \geq c$, where $c$ is some constant.

Now the question is how to find $\overline{w}$ given the training set.

For positive samples, $\overline{w} \cdot \overline{x}_{+} \geq c$. We rewrite it as $\overline{w} \cdot \overline{x}_+ + b \geq 0$, where $b = -c$. As they are in the training set, we are more strict and want the left side of the inequality to be at least 1. The same applies to negative samples. So we get:

\[\overline{w} \cdot \overline{x}_+ + b \geq 1 \\ \overline{w} \cdot \overline{x}_- + b \leq -1\]

For mathematical convenience, another variable $y$ is introduced such that $y = 1$ for positive samples and $y = -1$ for negative samples. Then the above inequality group can be unified as below:

\[y_i \left(\overline{w} \cdot \overline{x}_i + b \right) \geq 1\]

For the samples on the margin lines:

\[y_i \left(\overline{w} \cdot \overline{x}_i + b \right) = 1\]

Now we need to choose some samples to make the margin as wide as possible. To find the width, we pick a positive sample $\overline{x}_+$ and a negative sample $\overline{x}_-$ on the margin, subtract the negative one from the positive one, then project the result onto the unit vector along $\overline{w}$.

The projection length is the width of margin, which can be expressed as:

\[WIDTH = \left( \overline{x}_{+} - \overline{x}_- \right) \cdot \frac{\overline{w}}{\parallel w \parallel}\]

As the samples are on the margin lines, we have

\[\overline{w} \cdot \overline{x}_+ = 1 - b \\ \overline{w} \cdot \overline{x}_- = -1 - b\]

Substituting them into the $WIDTH$ equation, we get:

\[WIDTH = \frac{\left( 1 - b \right) - \left( -1 - b \right)}{\parallel w \parallel} = \frac{2}{\parallel w \parallel}\]

So to find the maximum width, we need to find the minimum of $\parallel w \parallel$, or for mathematical convenience, of $\frac{1}{2} {\parallel w \parallel}^2$, and the minimum is under the constraint $y_i \left(\overline{w} \cdot \overline{x}_i + b \right) = 1$.

This is where the Lagrange multiplier shines. The minimization problem is equivalent to finding the minimum of the following:

\[L = \frac{1}{2} {\parallel w \parallel}^2 - \sum_i \alpha_i \left[ y_i \left(\overline{w} \cdot \overline{x}_i + b \right) - 1 \right]\]

In the equation, $\alpha_i$ is nonzero only if the margin line passes through the sample $\overline{x}_i$, otherwise it is 0. To find the minimum, we take derivatives.

\[\frac{\partial L}{\partial \overline{w}} = \overline{w} - \sum_i \alpha_i y_i \overline{x}_i = 0 \quad \Rightarrow \quad \overline{w} = \sum_i \alpha_i y_i \overline{x}_i \\ \frac{\partial L}{\partial b} = - \sum_i \alpha_i y_i = 0 \quad \Rightarrow \quad \sum_i \alpha_i y_i = 0\]

Choosing the right samples to build the margin lines is a quadratic optimization problem; I won't solve it here and will leave the $\alpha_i$ in the solution. Now we have the $\overline{w}$ that minimizes the objective. To classify the unknown $\overline{u}$, we calculate:

\[\hat{y} = \overline{w} \cdot \overline{u} + b = \sum_i \alpha_i y_i \left( \overline{x}_i \cdot \overline{u} \right) + b\]

If $\hat{y} > 0$, it's positive, otherwise negative. We can see that in the equation classifying unknown vectors, some samples from the training set remain; they are the 'support vectors' of SVM. However, in FM, there are no such 'support vectors'. And unlike FM, the SVM weights are not designed to share features and cannot be factorized.
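
To see the support vectors in action, here is a small sketch using scikit-learn's linear SVC on a made-up 2-D toy set; it reproduces $\hat{y} = \sum_i \alpha_i y_i \left( \overline{x}_i \cdot \overline{u} \right) + b$ from the fitted model (`dual_coef_` stores $\alpha_i y_i$ for the support vectors).

```python
import numpy as np
from sklearn.svm import SVC

# A made-up, linearly separable toy set: y = +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

u = np.array([2.0, 1.0])           # an unknown point to classify

# Decision value rebuilt from the support vectors:
#   sum_i alpha_i y_i (x_i . u) + b
y_hat = clf.dual_coef_ @ (clf.support_vectors_ @ u) + clf.intercept_
print(y_hat, clf.predict([u]))     # y_hat has the same sign as the prediction
```

Only the support vectors appear in `support_vectors_`; the other training samples drop out of the decision function entirely, which is exactly the contrast with FM drawn above.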