You have probably bought a few books from online bookstores such as Amazon or Barnes and Noble. If you are a movie lover, you may have come across the DVD rental service Netflix. These services bring the old principle of recommendation and collaborative word-of-mouth exchange into electronic form.
Collaborative filtering is one of the largest subfields of information retrieval. In general, it consists of two steps: collaborating and filtering. In the collaborating step, preference information from many users is pooled; this pooled data is the input for filtering, the process of making automated predictions. Collaborative filtering is characterized by a large set of options to choose from, user-specific preferences, and the assumption that users who agreed on certain items in the past will continue to do so in the future, i.e. the preference correlations between users are static over time. Let us look at Netflix as an example.
Every day Netflix ships 1.6 million DVDs to more than seven million customers. Subscribers can choose from 90,000 DVD titles and 6,000 movies and TV episodes across 200+ genres. The core of Netflix is its movie rating system. This recommendation system, a form of collaborative filtering, works as follows: customers rate the movies they have seen on a five-star scale from "strongly dislike" to "strongly like". These ratings are stored in a huge database and are used to recommend what each customer should rent next. Netflix currently holds about 2 billion member ratings, with the average member having rated about 200 movies.
The recommendations rely on predictions of how much a customer will appreciate a movie, based on that customer's movie preferences. The secret lies in the machine learning techniques that are applied to individual ratings to forecast how much you will like related movies; Netflix uses those insights to suggest movies you are likely to enjoy and thus to keep customers happy and loyal. Customer satisfaction is at 90%, and about 60% of Netflix customers choose their movies based on their individual recommendations. Prediction accuracy depends mostly on the number of ratings a movie has received and on the number of users with similar preferences. It also depends on how well Netflix can judge you from the number of ratings you have provided, and on the noise in those ratings: user ratings vary with mood, time, and whether you entered ratings on behalf of family members, among other things.
Scores from Cinematch, the Netflix recommendation system, are accurate to within half a star approximately 75% of the time. To improve this performance, Netflix launched the Netflix Prize: a Grand Prize of $1 million for whoever manages to improve the prediction accuracy by 10%, plus annual progress awards of $50,000. The competition data set contains 100,480,507 ratings of 17,770 movies from 480,189 anonymized Netflix members.
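To get a feeling for the size of the competition data set, the short sketch below works out how sparse it is. It uses only the three figures quoted above; the variable names are illustrative, and the calculation assumes that every member could in principle rate every movie.

```python
# Figures quoted in the text for the Netflix Prize data set.
num_ratings = 100_480_507
num_movies = 17_770
num_members = 480_189

# Number of cells in the full member x movie rating table.
possible_ratings = num_movies * num_members

# Share of cells that actually contain a rating.
density = num_ratings / possible_ratings

# Average number of ratings per member and per movie.
ratings_per_member = num_ratings / num_members
ratings_per_movie = num_ratings / num_movies

print(f"filled cells: {density:.2%}")
print(f"ratings per member: {ratings_per_member:.0f}")
print(f"ratings per movie: {ratings_per_movie:.0f}")
```

Only a little over 1% of the table is filled in, so nearly every customer-movie pair has to be predicted rather than looked up.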
The best contest entry so far achieved an improvement of slightly less than 9%. Yehuda Koren, Robert Bell and Chris Volinsky from AT&T Research Labs, winners of the 2007 progress award, used a collaborative filtering approach based on matrix factorization and on the k-nearest neighbours algorithm, in which a customer-movie preference is interpolated from the ratings of similar movies and customers.
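The nearest-neighbour idea can be illustrated with a minimal sketch. This is not the AT&T team's method: it is a plain user-based variant that assumes cosine similarity over co-rated movies and a similarity-weighted average of the k most similar users. The function names, the user names and all ratings are made up for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, movie, k=2):
    """Interpolate a rating from the k most similar users who rated `movie`."""
    candidates = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[movie])
        for other, other_ratings in ratings.items()
        if other != user and movie in other_ratings
    ]
    neighbours = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    total_weight = sum(similarity for similarity, _ in neighbours)
    if total_weight == 0:
        return None  # nobody comparable has rated this movie
    return sum(similarity * stars for similarity, stars in neighbours) / total_weight


# Hypothetical ratings on the 1-5 star scale.
ratings = {
    "ann": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "cid": {"m1": 4, "m2": 4, "m3": 2, "m4": 4},
    "dee": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}

prediction = predict_rating(ratings, "ann", "m4", k=2)
print(prediction)
```

With k=2 the prediction for "ann" is driven by "bob" and "cid", whose tastes are closest to hers, while the dissimilar "dee" is ignored. Practical systems typically also correct for each user's average rating and combine such neighbourhood estimates with other models such as matrix factorization.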
The images above show tables representing the user-item matrix: each row represents a user and each column holds the users' ratings for a particular movie. The goal is to predict the rating a customer would assign to a movie once they have watched and evaluated it. Prediction is based on user profiles that indicate similar interests, as inferred from which movies were watched and how they were rated. In the example above, Jack and Tom share similar movie preferences, and we try to predict our own rating for movie I4. It turns out to be 4.5 in this case, the average of the ratings 4 and 5 that Jack and Tom gave this movie, respectively.
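The arithmetic behind this prediction can be written out in a few lines. The sketch assumes that Jack and Tom are the only neighbours considered and that both count equally, which is consistent with the value of 4.5; only their two ratings for I4 are taken from the example, and the variable names are illustrative.

```python
from statistics import mean

# Ratings for movie I4 from the two users with similar preferences.
neighbour_ratings_i4 = {"Jack": 4, "Tom": 5}

# Unweighted average: each neighbour contributes equally to the prediction.
predicted_i4 = mean(neighbour_ratings_i4.values())

print(predicted_i4)
```

If one neighbour's taste were closer to ours than the other's, the plain average could be replaced by a weighted one so that the closer neighbour has more influence.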