What is collaborative filtering?
Collaborative filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one.
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. For example, a collaborative filtering recommendation system for preferences in television programming could make predictions about which television show a user should like given a partial list of that user's tastes.
In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including: sensing and monitoring data, such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; or in electronic commerce and web applications where the focus is on user data, etc.
What are the types of collaborative filtering?
User-Based Collaborative Filtering (UB-CF)
User-based filtering measures the similarity between the target (active) user and other users, then recommends items that the most similar users liked.
Item-Based Collaborative Filtering (IB-CF)
Item-based filtering measures the similarity between the items that the target user has rated or interacted with and other items, then recommends the most similar items.
How does Netflix use collaborative filtering?
Collaborative filtering uses the similarities between users and items to make recommendations. The algorithm continually looks for relationships between users and, in turn, produces recommendations from them. It learns embeddings for users and items without requiring hand-engineered features. The most common technique is matrix factorization, which finds the embeddings (latent features) that make up the interests of a particular user.
Matrix Factorization
Matrix factorization is a simple embedding model. Given a user-movie feedback matrix A of size N × M, where N is the number of users and M is the number of movies (items), the model learns to decompose it into:
- A user embedding matrix U of size N × d, where row i is the embedding for user i.
- An item embedding matrix V of size M × d, where row j is the embedding for item j.
Here d is the embedding dimension. The embeddings are learned such that the product UVᵀ is a good approximation of the feedback matrix A.
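As a minimal sketch of the shapes involved, the NumPy snippet below builds random U and V and forms UVᵀ. The sizes N, M and d, and the use of NumPy, are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 users, 5 items, 2 latent dimensions.
N, M, d = 4, 5, 2

U = rng.normal(size=(N, d))  # row i is the embedding of user i
V = rng.normal(size=(M, d))  # row j is the embedding of item j

A_hat = U @ V.T              # N x M approximation of the feedback matrix

# Each predicted entry is the dot product of one user row and one item row.
entry_matches = bool(np.isclose(A_hat[1, 3], U[1] @ V[3]))
```

Before training, A_hat is just noise; learning adjusts U and V so that A_hat moves toward the observed entries of A.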
Loss Function
To learn an approximation of the feedback matrix, a loss function is needed. One intuitive choice is mean squared error (MSE), which averages the squared differences between the entries of the feedback matrix A and the corresponding entries of the approximation UVᵀ.
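A small sketch of this loss follows. It assumes, beyond what is stated above, that the error is averaged only over observed entries, marked by a hypothetical 0/1 array called mask; the function name observed_mse is likewise illustrative.

```python
import numpy as np

def observed_mse(A, mask, U, V):
    """Mean squared error between A and U @ V.T over observed entries only."""
    errors = (A - U @ V.T) * mask       # unobserved entries contribute zero
    return (errors ** 2).sum() / mask.sum()

# Toy data: zeros in A stand for "not rated" (an assumption of this sketch).
A = np.array([[5.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])
mask = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
U = np.array([[1.0, 2.0],
              [0.0, 2.0]])
V = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [1.0, 1.0]])

loss = observed_mse(A, mask, U, V)
```

Here two of the three observed ratings are reproduced exactly and the third is off by 2, so the loss is 4 / 3.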
Regularization Function
One of the most common problems when training the model is overfitting. Overfitting happens when the model fits its embeddings to noise or outliers in the training data that do not improve accuracy on unseen data. If an embedding takes on a very large magnitude to accommodate such outliers, the model is said to be overfitted to them. Regularization counters this by adding a penalty on the size of the embeddings to the loss.
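One common form of penalty is L2 regularization, sketched below. The choice of L2, the weight lam and the function name regularized_loss are assumptions for illustration; the text above does not commit to a specific penalty.

```python
import numpy as np

def regularized_loss(data_loss, U, V, lam=0.1):
    """Add an L2 penalty on the embeddings to an already computed data loss."""
    penalty = lam * ((U ** 2).sum() + (V ** 2).sum())
    return data_loss + penalty

# Two hypothetical models with the same fit but different embedding sizes.
U_small, V_small = np.full((2, 2), 0.5), np.full((3, 2), 0.5)
U_large, V_large = 10 * U_small, 10 * V_small

same_fit = 1.0
small = regularized_loss(same_fit, U_small, V_small)
large = regularized_loss(same_fit, U_large, V_large)
```

With an identical data loss, the model with larger embeddings is charged a much higher total loss (26.0 versus 1.25 here), which pushes training toward smaller embeddings.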
Making recommendations
In general, the steps (and the TensorFlow 1.x functions involved) are:
- Create a sparse tensor: tf.SparseTensor(), to hold the feedback matrix A; the U and V matrices are initialised randomly
- Create the loss function and optimiser: tf.losses.mean_squared_error(), to estimate the total loss with the regularization penalty, with SGD as the optimiser
- Create the model: tf.Session(), and initialise the hyperparameters, learning rate and embeddings
- Train the model: tf.Session.run(), to learn the embeddings of the feedback matrix and return U and V as the embedding matrices
- Show recommendations: pd.DataFrame(), to show the closest movies with respect to the user queried
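The steps above can be sketched end to end without TensorFlow. The version below swaps in plain NumPy and full-batch gradient descent (not SGD) to keep the example short; the function names train_mf and recommend, the toy ratings and every hyperparameter value are assumptions for illustration only.

```python
import numpy as np

def train_mf(A, mask, d=2, lam=0.01, lr=0.05, steps=500, seed=0):
    """Fit U, V so that U @ V.T approximates A on observed entries."""
    rng = np.random.default_rng(seed)
    N, M = A.shape
    n_obs = mask.sum()
    U = rng.normal(scale=0.1, size=(N, d))   # random initialisation
    V = rng.normal(scale=0.1, size=(M, d))
    history = []
    for _ in range(steps):
        err = (U @ V.T - A) * mask           # errors on observed entries only
        history.append((err ** 2).sum() / n_obs)
        # Gradients of: MSE over observed entries + lam * (|U|^2 + |V|^2)
        grad_U = 2 * err @ V / n_obs + 2 * lam * U
        grad_V = 2 * err.T @ U / n_obs + 2 * lam * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V, history

def recommend(U, V, mask, user, k=2):
    """Return the k highest-scoring items the user has not rated yet."""
    scores = U[user] @ V.T
    scores = np.where(mask[user] > 0, -np.inf, scores)  # hide rated items
    return np.argsort(-scores)[:k]

# Toy feedback matrix; zeros mean "not rated" (an assumption of this sketch).
A = np.array([[5, 4, 0, 1, 0],
              [4, 0, 0, 1, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)
mask = (A > 0).astype(float)

U, V, history = train_mf(A, mask)
top_items = recommend(U, V, mask, user=0, k=2)
```

The history list records the unregularized MSE at each step, so comparing its first and last values is a quick check that training made progress. For user 0, only items 2 and 4 are unrated, so those are the only candidates that can be recommended.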
How do you solve collaborative filtering?
Collaborative filtering systems have many forms, but many common systems can be reduced to two steps:
- Look for users who share the same rating patterns with the active user (the user whom the prediction is for).
- Use the ratings from the like-minded users found in the first step to calculate a prediction for the active user.
This falls under the category of user-based collaborative filtering. A specific application of this is the user-based Nearest Neighbor algorithm.
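The two user-based steps might look like the sketch below. It assumes cosine similarity over raw rating rows, treats zeros as "not rated", and uses a similarity-weighted average for the prediction; these choices and the name predict_user_based are illustrative, not the only way to build a nearest-neighbour recommender.

```python
import numpy as np

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_user_based(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated the item."""
    # Step 1: find like-minded users among those who rated the item.
    candidates = [u for u in range(R.shape[0]) if u != user and R[u, item] > 0]
    sims = np.array([cosine_similarity(R[user], R[u]) for u in candidates])
    top = np.argsort(-sims)[:k]
    # Step 2: combine their ratings, weighted by similarity.
    weights = sims[top]
    ratings = np.array([R[candidates[i], item] for i in top])
    if weights.sum() == 0:
        return 0.0                      # no usable neighbours
    return float(weights @ ratings / weights.sum())

# Toy ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [1, 1, 4, 5]], dtype=float)

nearest_only = predict_user_based(R, user=0, item=2, k=1)
two_neighbours = predict_user_based(R, user=0, item=2, k=2)
```

User 1 has by far the most similar row to user 0 and rated item 2 with a 1, so with k=1 the prediction is exactly 1.0. With k=2 a less similar neighbour who gave a 4 pulls the prediction upward.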
Alternatively, item-based collaborative filtering ("users who bought x also bought y") proceeds in an item-centric manner:
- Build an item-item matrix determining relationships between pairs of items.
- Infer the tastes of the current user by examining the matrix and matching that user's data.
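A minimal sketch of the item-centric steps follows. It assumes cosine similarity between item columns as the "relationship" and scores unseen items by a similarity-weighted sum of the user's ratings; the function names and toy data are hypothetical.

```python
import numpy as np

def item_similarity_matrix(R):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0             # avoid division by zero for unrated items
    normalized = R / norms
    S = normalized.T @ normalized
    np.fill_diagonal(S, 0.0)            # an item should not recommend itself
    return S

def recommend_item_based(R, S, user, k=1):
    """Score items by their similarity to the items the user already rated."""
    scores = S @ R[user]
    scores = np.where(R[user] > 0, -np.inf, scores)  # hide rated items
    return np.argsort(-scores)[:k]

# Toy ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([[5, 4, 0, 0],
              [4, 5, 0, 1],
              [0, 1, 5, 4],
              [1, 0, 4, 5],
              [5, 0, 0, 0]], dtype=float)

S = item_similarity_matrix(R)
top_for_last_user = recommend_item_based(R, S, user=4, k=1)
```

The last user has rated only item 0. Item 1 is rated most similarly to item 0 by the other users, so it comes out as the top recommendation.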
Is collaborative filtering supervised learning?
No, collaborative filtering is a form of unsupervised learning in which we make predictions from ratings supplied by people. In the rating matrix, each row represents one person's ratings of movies and each column holds the ratings given to one movie.
In collaborative filtering, we do not know the feature set beforehand. Instead, we try to learn it. As with handwritten digit recognition on MNIST, we do not know which features to extract at the beginning, but the program eventually learns the latent features (edges, corners, circles) itself.
What is the difference between content based and collaborative filtering?
Content-based filtering makes recommendations based on a user's preferences for product features. Collaborative filtering mimics user-to-user recommendations: it predicts a user's preferences as a linear, weighted combination of other users' preferences.
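To make the contrast concrete, the sketch below scores items for one user using only item features and that user's own ratings, with no other users involved. The feature columns, the rating-weighted profile and the name content_based_scores are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical item features; columns are [action, comedy, romance].
F = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

def content_based_scores(F, user_ratings):
    """Score items by how well their features match the user's profile."""
    rated = user_ratings > 0
    # Profile: rating-weighted sum of the features of the items the user rated.
    profile = user_ratings[rated] @ F[rated]
    norm = np.linalg.norm(profile)
    if norm == 0:
        return np.zeros(F.shape[0])      # no ratings yet, nothing to match
    scores = F @ (profile / norm)
    return np.where(rated, -np.inf, scores)  # hide already rated items

# The user loved item 0 (action) and disliked item 2 (romance); 0 = not rated.
user_ratings = np.array([5.0, 0.0, 1.0, 0.0])
scores = content_based_scores(F, user_ratings)
best = int(np.argmax(scores))
```

Because the profile leans heavily toward the action feature, the unrated action-comedy (item 1) scores higher than the unrated comedy-romance (item 3).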
Both methods have limitations. Content-based filtering can recommend a new item, but needs more data about the user's preferences in order to find the best match. Similarly, collaborative filtering needs a large dataset with active users who have rated a product before in order to make accurate predictions. Combinations of these different recommendation systems are called hybrid systems.