Seminar: Multi-Label Learning with Millions of Categories

SpeakerManik Varma
AffiliationMicrosoft
DateMonday, 23 Jul 2012
Time11:00 - 12:00
LocationRoberts G08 Sir David Davies LT
Event seriesCSML Joint Seminar Series
Description

Our objective is to build an algorithm for classifying a data point into a set of labels when the output space contains millions of categories. This is a relatively novel setting in supervised learning and brings forth interesting challenges such as efficient training and prediction, learning from only positively labeled data with missing and incorrect labels and handling label correlations. We propose a random forest based solution for jointly tackling these issues. We develop a novel extension of random forests for multi-label classification which can learn from positive data alone and can scale to large data sets. We generate real valued beliefs indicating the state of labels and adapt our classifier to train on these belief vectors so as to compensate for missing and noisy labels. In addition, we modify the random forest cost function to avoid overfitting in high dimensional feature spaces and learn short, balanced trees. Finally, we write highly efficient training routines which let us train on problems with more than a hundred million data points, over a million dimensional sparse feature vector and over ten million categories. Extensive experiments reveal that our proposed solution is not only significantly better than other multi-label classification algorithms but also more than 10\% better than the state-of-the-art NLP based techniques for suggesting bid phrases for online search advertisers.

iCalendar csml_id_104.ics