A powerful recommendation system: Collaborative Filtering

Recommendation systems provide a list of items that a user might be interested in, based on their past actions. In other words, they help users find the products or content they are most likely to be interested in or purchase. Both users and providers benefit from recommender systems because they make it easier and faster to match the user with the right product or content. The two most popular approaches for building a recommender system are Content-based and Collaborative Filtering. Today we will talk about the latter, but it is always a good idea to support your Collaborative Filtering with Content-based models.

One of the most successful recommender systems is Netflix's, with a reported 93% retention rate. It uses a mixture of many AI and ML algorithms, such as the Personalized Video Ranking (PVR), the Top-N Video Ranker (evaluated with metrics like MAP@K and NDCG), the Trending Now Ranker, RNNs for time-sensitive sequence prediction, the Continue Watching Ranker, Page Generation, the Video-Video Similarity Ranker (very similar to the CF approach), and so on. But in essence, the main idea is that past preferences can inform future preferences, and this signal can be enriched by identifying users with similar tastes and collaboratively leveraging those similar users' past actions to make recommendations.

In this article, we will take a deep dive into one of the traditional recommendation algorithms used at Netflix, known as Collaborative Filtering. This algorithm can be used to suggest articles, products, songs, books, and anything else you can imagine, as long as you have the data!

The data can include, but is not limited to:

  • Which items the user has rated
  • Which items the user has searched for
  • Which items the user has viewed
  • Which items the user has clicked on
  • Which items the user has added to their cart
  • Which items the user has purchased before
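For illustration, explicit feedback of this kind is often arranged into a user-item ratings matrix. The numbers below are a hypothetical toy example, with 0 marking a missing rating:

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items,
# values are ratings on a 1-5 scale, and 0 means "not rated".
ratings = np.array([
    [5, 3, 0, 1],   # user 0
    [4, 0, 0, 1],   # user 1
    [1, 1, 0, 5],   # user 2
    [0, 1, 5, 4],   # user 3
], dtype=float)

n_users, n_items = ratings.shape
known = int((ratings > 0).sum())   # number of observed ratings
```

In practice this matrix is extremely sparse: most users rate only a tiny fraction of the catalogue.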


If our data includes a rating system, which can be a questionnaire, a survey, or simply a thumbs up/down, that means we have explicit data.

However, if we don’t have this information, no worries: we can still recommend relevant content based on customers’ behaviour, such as searching for, clicking on, or viewing an article, or the time spent on it. This kind of data is referred to as implicit data. In these cases we are not 100% sure the customer likes the content, as we were with explicit data; however, we can still get decent results by using CF algorithms adjusted for implicit feedback.
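As a minimal sketch of how implicit signals might be condensed into a single preference score, one could weight event types so that stronger signals count more. The event names and weights below are illustrative assumptions, not values from any real system:

```python
# Hypothetical event weights: a purchase is a much stronger signal of
# interest than a view. These numbers are illustrative assumptions.
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "add_to_cart": 4.0, "purchase": 8.0}

def implicit_score(events):
    """Aggregate a user's events on one item into a confidence score."""
    return sum(EVENT_WEIGHTS.get(event, 0.0) for event in events)

# A user who viewed an item twice, clicked it, and bought it:
score = implicit_score(["view", "view", "click", "purchase"])
```

Such scores can then play the role that explicit ratings play in the algorithms below, interpreted as confidence rather than preference.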

The key trick in CF is to identify a set of users that are similar to the user we are making recommendations for. To do that, we need to define a notion of similarity between users. Two common ways to calculate it are Cosine Similarity and Pearson Correlation (Centered Cosine Similarity).

Mathematically speaking, these metrics measure the similarity between two vectors in an inner product space: the cosine of the angle between the vectors tells us whether they point in roughly the same direction. So, basically, we are treating users as vectors in this space, based on the ratings they have given.
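A minimal sketch of cosine similarity between two users' rating vectors (the vectors are made-up toy data; 0 stands for a missing rating, which is exactly the issue discussed next):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Two users over the same four items; 0 = item not rated.
alice = [5, 3, 0, 1]
bob = [4, 0, 0, 1]
sim = cosine_similarity(alice, bob)   # close to 1: similar taste
```

Note that the missing ratings (the zeros) silently took part in the computation, which motivates the centering trick below.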

The problem with cosine similarity, given user ratings from 1 to 5, is that unknown values are treated as zeros, and on our scale a 0 (a missing rating) behaves as if the item were rated very negatively. Pearson similarity solves this by normalizing each user's ratings: we subtract the user's average rating from each of their known ratings, leaving the missing entries untouched. In other words, we center the ratings around 0. After this, an item rated 1 will most likely turn negative, while missing values stay at 0. Once the ratings are normalized, we can apply cosine similarity to find similar users.

Note that we are trying to find out what rating our user would give a certain product. Therefore, the similar users should be selected from those who have already rated this product, and their ratings should be weighted by their similarity score to our user: within the selected group, we want the most similar users to have more effect on the predicted rating than the least similar ones.

The technique described so far is called user-to-user CF, because given a user we find other users similar to them and use those users' ratings. The other approach is item-to-item CF. The idea is simple: instead of finding similar users, we find similar items, based on the ratings users have given them, and then estimate the rating for the item using exactly the same similarity and weighted-average prediction model. In theory the two approaches should perform similarly, but in practice item-to-item CF significantly outperforms user-to-user CF most of the time. The reason behind this is quite interesting: items are simpler than users.
Items belong to a small set of genres: you can take a movie and classify it as sci-fi, whereas users tend to have varied tastes and might like sci-fi, thriller, and comedy altogether. It is much rarer for a movie to belong to many genres. That is why item-to-item CF works much better than user-to-user collaborative filtering.
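The user-to-user procedure described above, mean-centering each user's ratings, computing cosine similarity on the centered vectors, and taking a similarity-weighted average over the neighbours who rated the item, can be sketched like this (the ratings matrix is hypothetical toy data, and keeping only positively similar neighbours is a simplifying assumption):

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = items, 0 = missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def center(R):
    """Subtract each user's mean rating from their known ratings only;
    missing entries stay at 0 (the Pearson / centered-cosine trick)."""
    C = R.copy()
    for u in range(R.shape[0]):
        rated = R[u] > 0
        C[u, rated] -= R[u, rated].mean()
    return C

def predict_user_based(R, user, item, k=2):
    """Predict a rating as a similarity-weighted average of the ratings
    given to `item` by the k most similar users who rated it."""
    C = center(R)
    sims = []
    for v in range(R.shape[0]):
        if v != user and R[v, item] > 0:
            denom = np.linalg.norm(C[user]) * np.linalg.norm(C[v])
            s = float(C[user] @ C[v] / denom) if denom else 0.0
            if s > 0:                      # keep only positively similar users
                sims.append((s, v))
    sims.sort(reverse=True)
    top = sims[:k]
    num = sum(s * R[v, item] for s, v in top)
    den = sum(s for s, _ in top)
    return num / den if den else 0.0

pred = predict_user_based(R, user=1, item=1)   # estimate user 1's rating of item 1
```

This is a sketch, not a production implementation: real systems precompute similarities and handle cold-start users and items separately.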
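An item-to-item sketch under the same toy-data assumptions: similarities are computed between item columns, restricted to users who rated both items, and the prediction is a weighted average of the target user's own ratings on other items. For brevity this version uses raw ratings on the co-rated users; a fuller version would also mean-center them as the article describes:

```python
import numpy as np

# Same hypothetical toy matrix: rows = users, columns = items, 0 = missing.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def item_similarity(R, i, j):
    """Cosine similarity between item columns i and j, computed only
    over users who rated both items."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    a, b = R[both, i], R[both, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_item_based(R, user, item):
    """Weighted average of the user's own ratings on other items,
    weighted by how similar each item is to the target item."""
    num = den = 0.0
    for j in range(R.shape[1]):
        if j != item and R[user, j] > 0:
            s = item_similarity(R, item, j)
            if s > 0:
                num += s * R[user, j]
                den += s
    return num / den if den else 0.0

pred = predict_item_based(R, user=1, item=1)
```

A practical advantage of this variant is that item-item similarities are far more stable over time than user-user ones, so they can be precomputed offline.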