The movie recommendation problem
Product recommendation is big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Better recommendations lead to better sales. When an online store sells to millions of customers every year, there is a lot of potential money to be made by selling more items to each of them.
Product recommendations have been researched for many years; however, the field gained a significant boost when Netflix ran its Netflix Prize between 2006 and 2009. This competition aimed to determine whether anyone could predict a user's rating of a film better than Netflix was doing at the time. The prize went to a team that was just over 10 percent better than Netflix's existing solution. While this may not seem like a large improvement, it would be worth millions to Netflix in revenue from better movie recommendations.
Obtaining the dataset
Since the inception of the Netflix Prize, GroupLens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. Among them is a movie rating dataset that comes in several sizes: one version with 100,000 reviews, one with 1 million reviews, and one with 10 million reviews.
The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this chapter is the MovieLens 100K dataset, the version with 100,000 reviews. Download this dataset and unzip it in your data folder. Start a new IPython Notebook and type the following code:
import os
import pandas as pd

# Build the path to the ratings file inside the unzipped ml-100k folder
data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")
Ensure that ratings_filename points to the u.data file in the unzipped folder.
Loading with pandas
The MovieLens dataset is in good shape; however, we need to change some of the default options in pandas.read_csv. To start with, the data is separated by tabs, not commas. Next, there is no heading line. This means the first line in the file is actually data and we need to manually set the column names.
When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None), and set the column names. Let's look at the following code:
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, names = ["UserID", "MovieID", "Rating", "Datetime"])
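To see what these options do without needing the downloaded file, here is a minimal sketch that applies the same call to a small in-memory string. The three rows are made up for illustration and are not taken from u.data; the variable names sample and sample_ratings are likewise assumptions.

```python
import io

import pandas as pd

# Three hypothetical rows in the same layout: tab-separated, no header line,
# with the columns user, movie, rating, and Unix timestamp.
sample = "1\t10\t4\t881250949\n2\t10\t5\t881251000\n1\t20\t3\t881252000\n"

sample_ratings = pd.read_csv(
    io.StringIO(sample),   # read from the string instead of a file on disk
    delimiter="\t",        # fields are separated by tabs, not commas
    header=None,           # the first line is data, not column names
    names=["UserID", "MovieID", "Rating", "Datetime"],
)

print(sample_ratings.shape)
print(list(sample_ratings.columns))
```

Without header=None, pandas would treat the first row as column names and silently drop one rating from the data.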
While we won't use it in this chapter, you can properly parse the date timestamp using the following line:
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')
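As a quick check of what unit='s' means, the following sketch converts a single value that is interpreted as seconds since January 1, 1970 (UTC). The timestamp is a hypothetical value, chosen here because it falls on December 4, 1997.

```python
import pandas as pd

# A hypothetical Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
timestamps = pd.Series([881250949])

# unit="s" tells pandas that the integers count seconds, not nanoseconds
parsed = pd.to_datetime(timestamps, unit="s")

print(parsed.iloc[0])
```

If the unit is left out, pandas interprets integers as nanoseconds, which would place every rating within the first second of 1970.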
You can view the first few records by running the following in a new cell:
all_ratings[:5]
The result is a table showing the first five ratings, one per row, under the UserID, MovieID, Rating, and Datetime columns.
Sparse data formats
This dataset is in a sparse format. Each row can be thought of as a cell in a large feature matrix of the type used in previous chapters, where rows are users and columns are individual movies. The first column would be each user's review of the first movie, the second column would be each user's review of the second movie, and so on.
There are about 1,000 users and 1,700 movies in this dataset, which means that the full matrix would be quite large. We may run into issues storing the whole matrix in memory, and computing on it would be troublesome. However, most cells in this matrix are empty; that is, most users have not reviewed most movies. For example, there is no review of movie #675 by user #213, nor is there one for most other combinations of user and movie.
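A rough calculation shows how empty the full matrix is. This sketch uses the rounded user and movie counts from the text and assumes the 100,000-review version of the dataset (the one the ml-100k folder name suggests), so the percentages are approximate.

```python
# Rounded figures from the text; the review count assumes the ml-100k version
n_users = 1000
n_movies = 1700
n_reviews = 100000

# Every user/movie pair is one cell in the full matrix
n_cells = n_users * n_movies
fraction_filled = n_reviews / n_cells

print(f"Cells in the full matrix: {n_cells:,}")
print(f"Fraction filled: {fraction_filled:.1%}")
print(f"Fraction empty: {1 - fraction_filled:.1%}")
```

With these figures, only about 6 percent of the cells hold a review.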
The format given here represents the full matrix, but in a more compact way. The first row indicates that user #196 reviewed movie #242, giving it a rating of 3 (out of 5) on December 4, 1997.
Any combination of user and movie that isn't in this database is assumed to not exist. This saves significant space, as opposed to storing a bunch of zeroes in memory. This type of format is called a sparse matrix format. As a rule of thumb, if you expect about 60 percent or more of your dataset to be empty or zero, a sparse format will take less space to store.
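The difference between the two representations can be sketched in plain Python. The example below is illustrative only: it uses a tiny made-up set of ratings and a dictionary keyed by (user, movie), which is one simple way to store a sparse matrix and not necessarily the one used later in this chapter.

```python
# Hypothetical ratings as (user, movie, rating) triples, like rows in the file
ratings = [(0, 1, 3), (2, 4, 5), (3, 0, 1)]
n_users, n_movies = 4, 5

# Dense form: one cell per user/movie pair, with 0 meaning "no rating"
dense = [[0] * n_movies for _ in range(n_users)]
for user, movie, rating in ratings:
    dense[user][movie] = rating

# Sparse form: store only the combinations that actually exist
sparse = {(user, movie): rating for user, movie, rating in ratings}

dense_cells = n_users * n_movies
print(f"Dense cells stored: {dense_cells}")
print(f"Sparse entries stored: {len(sparse)}")

# A missing key means that the user has not rated the movie
print(sparse.get((1, 1)))
```

Here the dense form stores 20 cells to hold 3 ratings, while the sparse form stores only the 3 entries plus their keys. The keys are the overhead that makes a sparse format worthwhile only when most cells are empty.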
When computing on sparse matrices, the focus isn't usually on the data we don't have, such as comparing all of the zeroes. We usually focus on the data we do have and compare only those values.
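As a small illustration of this idea, the following sketch compares two users using only the movies that both of them have rated. The ratings, the user labels, and the mean_abs_difference helper are all hypothetical, and the mean absolute difference is used here only as a simple example measure.

```python
# Hypothetical ratings: each user maps MovieID -> Rating
ratings_by_user = {
    "A": {10: 4, 20: 5, 30: 1},
    "B": {20: 4, 30: 2, 40: 5},
}


def mean_abs_difference(first, second):
    """Average rating gap over the movies that both users have rated."""
    shared = first.keys() & second.keys()  # ignore movies either user skipped
    if not shared:
        return None  # nothing in common to compare
    return sum(abs(first[m] - second[m]) for m in shared) / len(shared)


difference = mean_abs_difference(ratings_by_user["A"], ratings_by_user["B"])
print(difference)
```

Movies 10 and 40 play no part in the result, because only one of the two users rated each of them.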