Wednesday, April 02, 2008

In data mining, the more data the better

In the movie recommendation problem, a simple algorithm given more data performed better than a more powerful algorithm working from the original data alone.

Data mining.

Datawocky: More data usually beats better algorithms

Different student teams in my class adopted different approaches to
the problem, using both published algorithms and novel ideas. Of these,
the results from two of the teams illustrate a broader point. Team A
came up with a very sophisticated algorithm using the Netflix data.
Team B used a very simple algorithm, but they added in additional data
beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?



Team B got much better results, close to the best results on the
Netflix leaderboard!! I'm really happy for them, and they're going to
tune their algorithm and take a crack at the grand prize. But the
bigger point is, adding more, independent data usually beats out
designing ever-better algorithms to analyze an existing data set. I'm
often surprised that many people in the business, and even in academia,
don't realize this.
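
To make the idea concrete, here is a minimal sketch in Python. It is not either team's method, which the quoted post does not describe. The ratings, the genre table, and every function name below are hypothetical, and the toy data is deliberately contrived so that genre is informative, so it illustrates the mechanism rather than proving the claim. The baseline predicts each user's overall mean rating; the second predictor is just as simple but can also look up a genre for each movie.

```python
from collections import defaultdict

# Hypothetical toy ratings: (user, movie, rating). This is NOT Netflix data.
train = [
    ("ann", "Alien", 5), ("ann", "Aliens", 5), ("ann", "Notting Hill", 1),
    ("bob", "Alien", 2), ("bob", "Notting Hill", 5), ("bob", "Love Actually", 4),
]
test = [("ann", "Love Actually", 1), ("ann", "Predator", 5), ("bob", "Aliens", 1)]

# Hypothetical side data standing in for genre information from IMDB.
genre = {
    "Alien": "scifi", "Aliens": "scifi", "Predator": "scifi",
    "Notting Hill": "romance", "Love Actually": "romance",
}


def mean(values):
    return sum(values) / len(values)


def fit_user_means(ratings):
    """Average rating per user, using the ratings alone."""
    by_user = defaultdict(list)
    for user, _movie, rating in ratings:
        by_user[user].append(rating)
    return {user: mean(rs) for user, rs in by_user.items()}


def fit_user_genre_means(ratings, genre_of):
    """Average rating per (user, genre), using the extra genre table."""
    by_key = defaultdict(list)
    for user, movie, rating in ratings:
        by_key[(user, genre_of[movie])].append(rating)
    return {key: mean(rs) for key, rs in by_key.items()}


user_means = fit_user_means(train)
user_genre_means = fit_user_genre_means(train, genre)


def predict_baseline(user, movie):
    # Ignores the movie entirely: every movie gets the user's mean rating.
    return user_means[user]


def predict_with_genre(user, movie):
    # Uses the user's mean for this genre, falling back to the overall mean.
    return user_genre_means.get((user, genre[movie]), user_means[user])


def rmse(predict, data):
    """Root-mean-square error of a predictor on held-out ratings."""
    return mean([(predict(u, m) - r) ** 2 for u, m, r in data]) ** 0.5


baseline_rmse = rmse(predict_baseline, test)
genre_rmse = rmse(predict_with_genre, test)
print("ratings only:", round(baseline_rmse, 3))
print("ratings + genre:", round(genre_rmse, 3))
```

Both predictors are plain averages; the only difference is that the second one has access to an independent signal about the movies. On real data the size of the gain, or whether there is one at all, depends on how much the added data tells you that the ratings do not.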
