During my recent sabbatical I had the privilege of joining some of the lectures from the Masters in Artificial Intelligence at the University of Leuven. They say that ‘you haven’t understood something until you teach it’, so I thought I would try to share, via this blog, some of what I learnt in the ‘Data Mining’ course.
The first of these occasional posts is about recommendation systems, which is timely as I’ve recently started to enjoy the results of one of the most famous such systems, Netflix. In 2006 Netflix offered a $1m prize for an algorithm that predicted how users would rate films, based on a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies (wikipedia). Competitors used the data to devise an algorithm, and this was tested against an additional set of data retained by Netflix. The winning entrant improved accuracy over Netflix’s existing model by 10%, and presumably has been the cause of much ‘consumer surplus’ ever since.
There are two basic ways to approach the data:
- Content-based filtering: Find patterns in an individual user’s data. For example, if someone reports liking a series of films that are all in the ‘romantic comedy’ genre, you might recommend more of the same.
- Collaborative filtering: Here we try to find correlations between users who have rated one or more of the same items. A correlation measure (such as the Pearson coefficient) returns a value between -1 and 1 for each pair of users, where -1 means opposite preferences, 1 means identical preferences and 0 indicates no correlation between the two. A recommendation is made from the (possibly weighted) aggregate scores that the most similar users gave to items the original user has not seen (and this might be augmented by looking at the films disliked by anti-correlated users).
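To make the collaborative approach concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the ratings are invented, the similarity measure is assumed to be the Pearson correlation over co-rated films, and the names `ratings`, `pearson` and `recommend` are my own rather than anything from the course or from Netflix.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {film: rating}
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1, "D": 2},
    "bob":  {"A": 4, "B": 5, "C": 2, "E": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "F": 4},
}

def pearson(u, v):
    """Pearson correlation between two users over the films both have rated."""
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0  # too little overlap to say anything
    xs = [u[film] for film in common]
    ys = [v[film] for film in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same has no correlation
    return cov / sqrt(var_x * var_y)

def recommend(user, ratings, top_n=3):
    """Score unseen films by the similarity-weighted ratings of similar users."""
    target = ratings[user]
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(target, other_ratings)
        if sim <= 0:
            continue  # this sketch ignores uncorrelated and anti-correlated users
        for film, rating in other_ratings.items():
            if film not in target:
                totals[film] = totals.get(film, 0.0) + sim * rating
                weights[film] = weights.get(film, 0.0) + sim
    scores = {film: totals[film] / weights[film] for film in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(pearson(ratings["ann"], ratings["bob"]))   # positive: similar tastes
print(pearson(ratings["ann"], ratings["cara"]))  # negative: opposite tastes
print(recommend("ann", ratings))
```

With this toy data, ann and bob correlate positively while ann and cara are anti-correlated, so the only suggestion for ann is film E, which bob rated highly. A fuller version might also use the anti-correlated users, for example by treating the films they disliked as weak positive evidence.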
Content-based filtering excels when an individual has unique tastes, because the ‘similarity’ test in the collaborative approach will then find few matches.
Content filters also lead to recommendations that are easy to explain to users (“because you liked several ‘romantic comedies’ and this is another one”), while the results of the collaborative filter cannot readily be interpreted. On the other hand, a content filter needs the right metadata about each item to be able to find the matches, whereas collaborative filters can work without this extra information.
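A content filter can be sketched just as briefly. The example below assumes each film carries a set of genre tags (exactly the metadata the approach depends on) and scores unseen films by how often their tags appear among the films a user liked. The catalogue, the tags and the function names are all hypothetical, and a real system would likely use richer features and weighting.

```python
from collections import Counter

# Hypothetical catalogue: film -> set of genre tags (the required metadata)
catalogue = {
    "Film A": {"romance", "comedy"},
    "Film B": {"romance", "comedy"},
    "Film C": {"action", "thriller"},
    "Film D": {"romance", "comedy", "french"},
    "Film E": {"action", "comedy"},
}

def build_profile(liked, catalogue):
    """Count how often each tag appears among the films the user liked."""
    profile = Counter()
    for film in liked:
        profile.update(catalogue[film])
    return profile

def recommend_by_content(liked, catalogue, top_n=3):
    """Rank unseen films by tag overlap with the user's profile."""
    profile = build_profile(liked, catalogue)
    scored = []
    for film, tags in catalogue.items():
        if film in liked:
            continue
        score = sum(profile[tag] for tag in tags)
        if score > 0:
            # The matched tags double as the explanation shown to the user
            matched = sorted(tag for tag in tags if profile[tag] > 0)
            scored.append((film, score, matched))
    scored.sort(key=lambda row: (-row[1], row[0]))
    return scored[:top_n]

for film, score, matched in recommend_by_content(["Film A", "Film B"], catalogue):
    print(f"{film} (score {score}): because you liked films tagged {', '.join(matched)}")
```

The matched tags give the ‘because you liked…’ explanation for free, which is the interpretability advantage described above. Note that no other user’s ratings are involved at any point.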
One criticism of online systems is the so-called ‘filter bubble’, and the parallel concern here could be that one ends up being recommended very similar films. In practice, neither system is terrible and neither is perfect: a content filter will, for example, recommend a lot of ‘romantic comedies’, but these could include items in other languages. An aggressive collaborative filter might lead to a narrow set of suggestions, but aggregating the preferences of ‘broadly similar’ users could throw up some wonderful apparent serendipity.