
Recommender Systems

Recommender Systems (RS) provide a user with recommendations about items as part of a given service. They are a subclass of Information Filtering systems and have a variety of domains of application: movies, travel, financial services (insurance, loans), advertising… Using an RS presents several benefits: increased user loyalty and satisfaction, increased revenue, and greater system efficiency.

Below, we present a general overview of RS, namely an introduction to the different types of RS and their core concepts. In the last section, we will look at dimensionality-reduction CF.

Types of RS

In a Recommender System we distinguish users and items, each described by their attributes. Let us consider the example of movie recommendations (e.g. Netflix). In this specific case, the users are customers and the items are movies. The users’ attributes are, for example, sociodemographic information; the items’ attributes are the movie’s genre (action, comedy…), the producer, the actors…

We distinguish four types of systems for recommendations:

  • Impersonal RS: Uses external sources or statistics about the quality/popularity of the items.
  • Content-based (CB) RS: Aims at recommending to a user new items similar to those he appreciated in the past. To this end, user models based on item features are built.
  • Collaborative Filtering (CF) RS: Makes recommendations to a given user based on items users who are similar to her have appreciated.
  • Other approaches including hybrid (Combination of Content-based, CF and other types of RS), and Interactive RS (critic and dialog-based).

In the remainder we will focus on Content-based and CF RS.

Content-based RS

This RS aims to model the preferences of users by modeling the items using their relevant features. Three main steps can describe this type of RS [1]:

  • Content analyzer: This step allows for extracting useful features (keywords, n-grams, concepts), particularly when the information is raw. This can for example use the common concept of Vector Space Model (VSM) with Term Frequency–Inverse Document Frequency (TF-IDF) weighting. The features of each item are stored in a vector.
  • Profile learner: It builds a user profile by applying a supervised machine learning technique to the feature vector generated by the content analyzer.
  • Filtering component: Given some new items represented by their feature vectors, the objective of this step is to compare these feature vectors to an active user profile to determine items most likely to interest him. Then a feedback updates the user’s profile.
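As an illustration, the three components above can be sketched in a few lines of Python. The catalogue, keyword lists, and movie titles below are invented for the example; TF-IDF weighting stands in for the content analyzer, a mean of liked-item vectors for the profile learner, and cosine scoring for the filtering component.

```python
import math
from collections import Counter

# Hypothetical catalogue: each item described by a bag of keywords.
ITEMS = {
    "Die Fast":    ["action", "explosion", "hero"],
    "Chase Night": ["action", "chase", "hero"],
    "Love Laughs": ["comedy", "romance", "laugh"],
    "Funny Chase": ["comedy", "laugh", "chase"],
}

def tfidf_vectors(docs):
    """Content analyzer: one TF-IDF vector (sparse dict) per keyword list."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_profile(liked_vecs):
    """Profile learner: the mean of the liked items' vectors."""
    profile = Counter()
    for v in liked_vecs:
        for t, w in v.items():
            profile[t] += w / len(liked_vecs)
    return profile

vecs = dict(zip(ITEMS, tfidf_vectors(list(ITEMS.values()))))
# The user liked the two action movies.
user = build_profile([vecs["Die Fast"], vecs["Chase Night"]])
# Filtering component: score the unseen items against the profile.
scores = {n: cosine(user, vecs[n]) for n in ("Love Laughs", "Funny Chase")}
```

Here "Funny Chase" outranks "Love Laughs" because it shares the keyword "chase" with the user's liked items.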

Collaborative Filtering RS

The principle behind CF RS is to predict the rating a particular user will give an item by using the ratings the users most similar to her have given that item (user-user approach).

The usual similarity measures are Cosine Similarity [2] or Pearson Correlation [3].
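A minimal sketch of the two measures on toy rating vectors (the ratings below are made up); Pearson correlation is just cosine similarity after mean-centring each vector.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    return cosine_sim(a - a.mean(), b - b.mean())

# Three users' ratings of the same three items (toy data).
u1 = np.array([5.0, 4.0, 1.0])
u2 = np.array([4.0, 5.0, 2.0])   # tastes similar to u1
u3 = np.array([1.0, 2.0, 5.0])   # tastes opposite to u1
```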

Scalability problems can affect the user-user CF when the number of users increases considerably. Hence one can alternatively use an item-item approach, where the rating a specific user would give a particular item depends on the rating he gave items that are the most similar to the item under consideration.
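A sketch of the item-item approach, under the common convention that 0 marks an unrated cell; the rating matrix and the similarity-weighted-average prediction rule are toy choices for illustration.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 = not rated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 2],
    [1, 2, 5, 4],
    [2, 1, 4, 0],
], dtype=float)

def item_sim(R, i, j):
    """Cosine similarity of items i and j over users who rated both."""
    both = (R[:, i] > 0) & (R[:, j] > 0)
    a, b = R[both, i], R[both, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict(R, u, i):
    """Predict user u's rating of item i as a similarity-weighted average
    of u's ratings of the other items."""
    rated = [j for j in range(R.shape[1]) if j != i and R[u, j] > 0]
    sims = np.array([item_sim(R, i, j) for j in rated])
    if sims.sum() == 0:
        return float(R[R[:, i] > 0, i].mean())   # fall back to item mean
    return float(sims @ R[u, rated] / sims.sum())
```

For instance, `predict(R, 3, 3)` estimates the missing bottom-right rating from user 3's other ratings, weighted by how similar each item is to item 3.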

Another family of CF, named Model-based CF, designs a predictive model from the ratings matrix.

Dimensionality Reduction CF

Although neighborhood-based (memory-based) CF provides good results, it faces problems like sparsity and scalability. Sparsity occurs when users rate few items, while the scalability problem comes from the fact that, in the real world, the user-item matrix can grow very quickly. Hence Matrix Factorization (MF) becomes useful. A straightforward approach for this type of CF uses a truncated Singular Value Decomposition (SVD).

The objective of such a CF is to approximate the initial rating matrix $R$ as follows:

$R \approx U_k \Sigma_k V_k^T$

where $U_k$ and $V_k$ are orthogonal matrices ($U_k^T U_k = V_k^T V_k = I$) and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values.

We can rewrite $R \approx P Q^T$, where $P = U_k \Sigma_k^{1/2}$ and $Q = V_k \Sigma_k^{1/2}$.

The objective is to find the optimal $P$ and $Q$ by minimizing the loss function:

$\min_{P, Q} \sum_{(u, i) \in \mathcal{K}} \left( r_{ui} - p_u^T q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)$

where $\mathcal{K}$ is the set of observed (user, item) pairs, $p_u$ is the $u$-th row of $P$, and $q_i$ is the $i$-th row of $Q$.

The second term of the above equation, with $\lambda$ the regularization parameter, is used to prevent overfitting. An Alternating Least Squares (ALS) technique, which fixes one factor and solves for the other in turn, can find $P$ and $Q$. MF using SVD is also known as Latent Semantic Analysis (LSA). Hofmann introduced a probabilistic variant of LSA, Probabilistic Latent Semantic Analysis (PLSA) [3], to handle polysemy.
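A quick NumPy sketch of the truncated-SVD factorization on a fully observed toy matrix (the ratings are made up). The singular values are split evenly between the two factors, so that the product of the user factors and item factors reproduces the rank-$k$ approximation.

```python
import numpy as np

# Fully observed toy rating matrix (rows: users, columns: items).
R = np.array([
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                                   # number of latent factors kept
P = U[:, :k] * np.sqrt(s[:k])           # user factors, U_k * sqrt(Sigma_k)
Q = Vt[:k].T * np.sqrt(s[:k])           # item factors, V_k * sqrt(Sigma_k)
R_hat = P @ Q.T                         # rank-k approximation of R
```

Each row of `P` is a user's latent-factor vector and each row of `Q` an item's; a predicted rating is simply their dot product.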



PLSA is defined over the following elements:

  • Document: $d \in D = \{d_1, \dots, d_N\}$, where the training database $D$ is a collection of documents
  • Word (Term): $w \in W = \{w_1, \dots, w_M\}$
  • Hidden topic (latent variable): an observation $(d, w)$ is associated with a hidden topic $z \in Z = \{z_1, \dots, z_K\}$.

Hence the data is stored in a term-document (co-occurrence) matrix $N$, where $n(d_i, w_j)$ is the number of times term $w_j$ occurred in document $d_i$.
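Building such a matrix from a toy corpus (the documents below are invented):

```python
import numpy as np

# Toy corpus of tokenized documents.
docs = [
    ["apple", "fruit", "apple"],
    ["fruit", "banana"],
    ["car", "engine", "car"],
]

vocab = sorted({w for d in docs for w in d})
idx = {w: j for j, w in enumerate(vocab)}

# N[i, j] = number of times term i occurs in document j.
N = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for w in doc:
        N[idx[w], j] += 1
```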

Fig.1 Term-document matrix


The PLSA is a generative model based on the following algorithm:

  1. Select a document $d$ with probability $P(d)$
  2. Pick a latent class $z$ with probability $P(z|d)$
  3. Generate the word $w$ with probability $P(w|z)$
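The three steps can be simulated directly. The documents, topics, words, and probability tables below are invented for illustration; each (document, word) pair is drawn exactly as the algorithm describes.

```python
import random

random.seed(0)

# Hypothetical model parameters for 2 documents, 2 topics, 3 words.
p_d = {"d1": 0.5, "d2": 0.5}                                   # P(d)
p_z_d = {"d1": {"sports": 0.9, "politics": 0.1},               # P(z|d)
         "d2": {"sports": 0.2, "politics": 0.8}}
p_w_z = {"sports":   {"ball": 0.7, "vote": 0.1, "game": 0.2},  # P(w|z)
         "politics": {"ball": 0.1, "vote": 0.7, "game": 0.2}}

def draw(dist):
    """Sample one key from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_pair():
    d = draw(p_d)        # 1. select a document d with probability P(d)
    z = draw(p_z_d[d])   # 2. pick a latent class z with probability P(z|d)
    w = draw(p_w_z[z])   # 3. generate the word w with probability P(w|z)
    return d, w

pairs = [generate_pair() for _ in range(5000)]
```

With these parameters, words drawn for "d1" are dominated by the sports vocabulary and words drawn for "d2" by the politics vocabulary, mirroring the mixture $P(w|d) = \sum_z P(w|z) P(z|d)$.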

Given a collection of documents, the objective of the training phase is to maximize the likelihood:

$L = \prod_{d \in D} \prod_{w \in W} P(d, w)^{n(d, w)}$

where $n(d, w)$ is the term frequency of $w$ in $d$. Moreover, considering a pair of word and document $(d, w)$, the joint probability $P(d, w)$ is:

$P(d, w) = P(d)\, P(w|d) = P(d) \sum_{z \in Z} P(w|z)\, P(z|d)$

since $d$ and $w$ are independent conditioned on the state of the associated latent variable $z$.

Then the log-likelihood can be rewritten as follows:

$\log L = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)$

The estimation of the parameters $P(z|d)$ and $P(w|z)$ that maximize the likelihood uses an Expectation Maximization (EM) algorithm. After initializing the parameters, the EM algorithm consists of two steps, which repeat until convergence. The first step, the Expectation (E) phase, computes the posterior probabilities given the parameters’ values resulting from the previous iteration:

$P(z|d, w) = \dfrac{P(z|d)\, P(w|z)}{\sum_{z' \in Z} P(z'|d)\, P(w|z')}$

The Maximization (M) step updates the parameters based on the posterior probabilities computed in the previous step; the method of Lagrange multipliers (enforcing the normalization constraints) yields:

$P(w|z) = \dfrac{\sum_{d \in D} n(d, w)\, P(z|d, w)}{\sum_{d \in D} \sum_{w' \in W} n(d, w')\, P(z|d, w')} \qquad P(z|d) = \dfrac{\sum_{w \in W} n(d, w)\, P(z|d, w)}{n(d)}$

where $n(d) = \sum_{w \in W} n(d, w)$ is the length of document $d$.
Once the model is built, a query document $q$ is represented in the latent topic space using a folding-in process to calculate $P(z|q)$. This process uses the EM algorithm, where the Expectation phase is identical to that of the training phase; the M step only recalculates $P(z|q)$, while $P(w|z)$ remains constant.

Let us note that using a modified version of EM, called Tempered EM (TEM), can prevent overfitting.


The main drawback of PLSA is that there is no direct way of modeling new documents. Moreover, the number of parameters grows linearly with the number of documents. So Blei et al. introduced Latent Dirichlet Allocation (LDA) [4] to overcome these issues. It is indeed similar to PLSA, except that the per-document topic mixtures have a Dirichlet prior distribution.

Cold Start issue

We must note that CF is one of the most common types of RS; however, it faces the “cold start” issue. This problem occurs when there is no information about an item or user (e.g. a new item/user). To resolve this issue, several approaches are feasible:

  • Combining several types of RS (e.g. CF and CB)
  • Context-aware (Climate, Time, Event…)
  • Using social network information…


[1] Ricci, Francesco; Rokach, Lior; Shapira, Bracha; Kantor, Paul B. “Recommender Systems Handbook.” Springer, 2011.

[2] Singhal, Amit. “Modern Information Retrieval: A Brief Overview.” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2001.

[3] Hofmann, Thomas. “Probabilistic Latent Semantic Indexing.” SIGIR, 1999.

[4] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
