Finding Similar Movies with Python
- Dependencies
- Import & Preview the Data
- Pick A Movie
- Similar Movies By Ratings
- Create A "Pivoted" View Of the Data: Movie-Rating-By-User
- Extract Movie-of-Choice Only Ratings Data
- Similar-Scored-Movies: Correlate Movie-Of-Choice Ratings with Other Movie Ratings
- Sort Similar-Movie Correlation Scores
- Cleanup: Get Movie-Rating counts and average rating score
- Cleanup: Limiting By review Count
- Merge Rating-Score Data With Similarity-Score Data
- Review
Similar Movies
- import movie data
- import user-movie-rating data
- "wrangle" the data a bit
- merge the two together
- create a "view" of the data where movie ratings are organized by user_id
- isolate the movie-of-choice's ratings across all users
- correlate the movie-of-choice's ratings with every other movie's ratings
In [13]:
import pandas as pd
import numpy as np
In [14]:
r_cols = ['user_id', 'movie_id', 'rating']
# m_cols = ['movie_id', 'title']
m_cols = ["movie_id",
"title",
"release_date",
"video_release_date",
"IMDb_URL",
"unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]
movieFileName = 'ml-100k/u.item'
ratingFileName = 'ml-100k/u.data'
ratings = pd.read_csv(ratingFileName, sep='\t', names=r_cols, usecols=range(len(r_cols)), encoding="ISO-8859-1")
movies = pd.read_csv(movieFileName, sep='|', names=m_cols, usecols=range(len(m_cols)), encoding="ISO-8859-1")
# Join on the shared movie_id column so each rating row gains its movie's metadata
moviesAndRatings = pd.merge(movies, ratings, on='movie_id')
moviesAndRatings.head()
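The `names` / `usecols` combination in the cell above can be tried without the dataset on disk. The following is a minimal sketch that assumes the tab-separated `u.data` layout of user id, item id, rating, timestamp; the two sample rows and the names `sample` and `toy_ratings` are illustrative, not part of the notebook.

```python
import io

import pandas as pd

# Two rows in the assumed u.data layout: user id, item id, rating, timestamp.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

r_cols = ["user_id", "movie_id", "rating"]

# usecols keeps only the first len(r_cols) columns, so the trailing
# timestamp column is never loaded; names labels the columns that remain.
toy_ratings = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=r_cols,
    usecols=range(len(r_cols)),
)

print(toy_ratings)
```

The same pattern is what lets the notebook read `u.item` with 24 named columns while ignoring anything beyond them.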
Merging Observation
Each row represents a single user-to-movie rating:
- Many rows per movie.
- Many rows per user_id.
- One row per user/movie pair.
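These row-count observations can be checked on a toy pair of tables. This is a sketch with made-up values; `toy_movies`, `toy_ratings`, and `toy_merged` are hypothetical names standing in for the MovieLens frames.

```python
import pandas as pd

# Made-up stand-ins for the movies and ratings tables.
toy_movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
})
toy_ratings = pd.DataFrame({
    "user_id": [10, 10, 11],
    "movie_id": [1, 2, 1],
    "rating": [5, 3, 4],
})

# movie_id is the only column the two frames share; naming it in `on`
# makes the join key explicit.
toy_merged = pd.merge(toy_movies, toy_ratings, on="movie_id")

print(toy_merged)

# Movie A appears twice and user 10 appears twice,
# but no (user_id, movie_id) pair repeats.
has_repeats = toy_merged.duplicated(subset=["user_id", "movie_id"]).any()
print(has_repeats)
```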
In [15]:
# MORE inspection
print('RATINGS...')
print(ratings.head())
In [16]:
# MOVIE_OF_CHOICE = 'Star Wars (1977)'
MOVIE_OF_CHOICE = 'Toy Story (1995)'
Create A "Pivoted" View Of the Data: Movie-Rating-By-User
The DataFrame's pivot_table method builds a user-by-movie rating matrix: one row per user_id and one column per title. NaN marks missing data, that is, movies a given user didn't rate.
In [17]:
movieRatingsByUserId = moviesAndRatings.pivot_table(index='user_id', columns='title', values='rating')
# Preview: users with no rating for a movie show NaN for that user/movie
movieRatingsByUserId.head()
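To see the shape this produces, here is a small sketch on made-up ratings (`toy` and `toy_matrix` are hypothetical names). It also shows where the NaN values come from.

```python
import pandas as pd

# Long format: one row per (user, movie) rating. Values are made up.
toy = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [5, 3, 4, 2],
})

# Wide format: one row per user, one column per title.
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")

# User 11 never rated Movie B and user 12 never rated Movie A,
# so those two cells are NaN.
print(toy_matrix)
```

One detail worth knowing: `pivot_table` aggregates with the mean by default, so if a user had rated the same title twice, the cell would hold the average of those ratings.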
In [18]:
movieOfChoiceRatingsByUser = movieRatingsByUserId[MOVIE_OF_CHOICE]
movieOfChoiceRatingsByUser.head()
Similar-Scored-Movies: Correlate Movie-Of-Choice Ratings with Other Movie Ratings
pandas' corrwith computes the pairwise correlation between the chosen movie's vector of user ratings and every other movie's ratings column.
In [19]:
movieSimilarityScores = movieRatingsByUserId.corrwith(movieOfChoiceRatingsByUser)
movieSimilarityScores = movieSimilarityScores.dropna()
# Temporary Data-Frame for previewing with head()
movieSimilarityScoresDF = pd.DataFrame(movieSimilarityScores)
movieSimilarityScoresDF.head(10)
# NOTE: The printed warning is safe to ignore
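The following sketch runs `corrwith` on a made-up four-user matrix to show how missing ratings are handled. Each column is correlated with the chosen column using only the users who rated both movies. The warning mentioned above is plausibly the RuntimeWarning NumPy emits when that overlap is too small to compute a correlation; that is an assumption, since the notebook does not show the warning text. All names and values here are illustrative.

```python
import warnings

import numpy as np
import pandas as pd

toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0],
        "Movie B": [4.0, 5.0, 2.0, 1.0],           # rated by all four users
        "Movie C": [1.0, np.nan, 5.0, np.nan],     # shares two raters with A
        "Movie D": [np.nan, np.nan, np.nan, 3.0],  # shares one rater with A
    },
    index=pd.Index([10, 11, 12, 13], name="user_id"),
)

with warnings.catch_warnings():
    # A single shared rater cannot produce a correlation; silence the
    # resulting numeric warning for this demo only.
    warnings.simplefilter("ignore")
    toy_scores = toy_matrix.corrwith(toy_matrix["Movie A"])

print(toy_scores)

# Movie D comes back as NaN and is removed here.
toy_scores_clean = toy_scores.dropna()
print(toy_scores_clean)
```

Movie C scores exactly -1.0 from only two shared raters, which illustrates why correlations built on very few common ratings are not trustworthy.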
Sort Similar-Movie Correlation Scores
Sorting the results by similarity score should surface the movies most similar to the movie of choice.
In [20]:
movieSimilarityScores.sort_values(ascending=False)
Cleanup: Get Movie-Rating counts and average rating score
These results are probably skewed by movies that were rated by only a handful of people who also happened to rate the movie of choice: with so few shared ratings, a correlation can be high by chance. The steps here:
- count how many ratings exist for each movie
- remove movies that were rated by only a few people
- get average rating per movie (extra detail for now)
In [21]:
# String aggregation names give the same 'size' and 'mean' columns as np.size / np.mean
movieStats = moviesAndRatings.groupby('title').agg({'rating': ['size', 'mean']})
movieStats.head()
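As a variation on the cell above, pandas' named aggregation yields flat column names instead of a two-level header. This is a sketch on made-up data, not the notebook's approach; `rating_count` and `rating_mean` are hypothetical column names chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B"],
    "rating": [5, 4, 3, 2],
})

# Each keyword becomes an output column: (source column, aggregation).
toy_stats = toy.groupby("title").agg(
    rating_count=("rating", "size"),
    rating_mean=("rating", "mean"),
)

print(toy_stats)
```

With flat names there is no need to flatten the columns later, at the cost of diverging from the `rating|size` and `rating|mean` labels used further down.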
Cleanup: Limiting By review Count
Let's drop any movie rated by fewer than 100 people and check the top-rated ones that are left. 100 might still be too low, but these results look pretty good as far as "well-rated movies that people have heard of."
In [22]:
MINIMUM_NUMBER_OF_RATINGS = 100
popularMovies = movieStats['rating']['size'] >= MINIMUM_NUMBER_OF_RATINGS
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False).head(15)
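The effect of the cutoff is easiest to see on a tiny example. The sketch below uses made-up ratings and a threshold of 2 instead of 100; `toy`, `toy_stats`, `toy_popular`, and `toy_kept` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie C", "Movie C", "Movie C"],
    "rating": [5, 4, 5, 3, 4, 5],
})

# Same two-level column layout as movieStats: ('rating', 'size'), ('rating', 'mean').
toy_stats = toy.groupby("title").agg({"rating": ["size", "mean"]})

TOY_MINIMUM_NUMBER_OF_RATINGS = 2
toy_popular = toy_stats["rating"]["size"] >= TOY_MINIMUM_NUMBER_OF_RATINGS

# Movie B has the highest mean (5.0) but only one rating, so the mask drops it.
toy_kept = toy_stats[toy_popular].sort_values([("rating", "mean")], ascending=False)

print(toy_kept)
```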
Merge Rating-Score Data With Similarity-Score Data
Joining the data:
- a dataset with title, rating|size, rating|mean
- a dataset with title, similarity
- those two merged
In [23]:
# Explicit copy: the column rename below should apply to this filtered frame only
mappedColumnsMoviestat = movieStats[popularMovies].copy()
# Flatten the two-level column labels: ('rating', 'size') -> 'rating|size'
mappedColumnsMoviestat.columns = [f'{i}|{j}' if j != '' else f'{i}' for i, j in mappedColumnsMoviestat.columns]
# COLUMNS: title, rating|size, rating|mean
# mappedColumnsMoviestat.head()
similarityScoreDF = movieSimilarityScores.to_frame(name='similarity')
# COLUMNS: title, similarity
# similarityScoreDF.head()
# Left join on the title index: only movies that passed the ratings cutoff are kept
mappedColumnsMoviestatDF = mappedColumnsMoviestat.join(similarityScoreDF)
mappedColumnsMoviestatDF.head()
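The flatten-then-join step can be sketched in isolation. All numbers below are invented, and `toy_stats`, `toy_similarity`, and `toy_joined` are hypothetical names; the point is only how the index-based left join behaves.

```python
import pandas as pd

# Stats for movies that passed the cutoff, with two-level column labels.
toy_stats = pd.DataFrame(
    {("rating", "size"): [120, 150], ("rating", "mean"): [4.1, 3.7]},
    index=pd.Index(["Movie A", "Movie B"], name="title"),
)

# ('rating', 'size') -> 'rating|size'
toy_stats.columns = ["|".join(col) for col in toy_stats.columns]

# Similarity scores exist for more movies than survived the cutoff.
toy_similarity = pd.Series(
    {"Movie A": 1.0, "Movie B": 0.35, "Movie Z": 0.9},
    name="similarity",
)

# join aligns on the index and keeps only the left frame's rows,
# so Movie Z (no stats row) does not appear in the result.
toy_joined = toy_stats.join(toy_similarity)

print(toy_joined)
```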
Now sort these new results by similarity score. That's more like it!
In [24]:
mappedColumnsMoviestatDF.sort_values(['similarity'], ascending=False).head(15)
Ideally we'd also filter out the movie we started from, since the movie of choice correlates perfectly (1.0) with itself. Otherwise these results aren't bad.
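One way to do that filtering is to drop the movie's row by its index label before sorting. This is a sketch on a made-up results table, assuming (as in the notebook) that titles form the index; `toy_results` and `toy_recommendations` are illustrative names.

```python
import pandas as pd

MOVIE_OF_CHOICE = "Toy Story (1995)"

# Invented numbers, shaped like the joined results table.
toy_results = pd.DataFrame(
    {"rating|size": [400, 300, 250], "similarity": [1.0, 0.45, 0.52]},
    index=pd.Index([MOVIE_OF_CHOICE, "Movie X", "Movie Y"], name="title"),
)

# errors="ignore" keeps this safe even if the title was already filtered out.
toy_recommendations = (
    toy_results
    .drop(index=MOVIE_OF_CHOICE, errors="ignore")
    .sort_values("similarity", ascending=False)
)

print(toy_recommendations)
```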
Review
Above, similar movies are calculated from:
- the correlation between users' movie ratings
- a minimum "cutoff" number of ratings per movie (100)
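The two ingredients above can be folded into a single helper. The function below is a hypothetical condensation of the notebook's steps, not code from the notebook; `similar_movies`, its parameters, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def similar_movies(ratings_long, title, min_ratings):
    """Rank movies by rating correlation with `title`.

    ratings_long: DataFrame with user_id, title, and rating columns.
    min_ratings: minimum number of ratings a movie needs to be kept.
    """
    # User-by-movie matrix; NaN where a user did not rate a movie.
    matrix = ratings_long.pivot_table(index="user_id", columns="title", values="rating")
    # Correlate every movie's column with the chosen movie's column.
    scores = matrix.corrwith(matrix[title]).dropna().rename("similarity")
    # Number of ratings per movie.
    counts = ratings_long.groupby("title")["rating"].size().rename("rating_count")
    # Keep only titles present in both, then apply the cutoff.
    result = pd.concat([counts, scores], axis=1, join="inner")
    result = result[result["rating_count"] >= min_ratings]
    # Remove the chosen movie itself and rank the rest.
    return result.drop(index=title, errors="ignore").sort_values("similarity", ascending=False)


# Made-up ratings: Movie C has only two ratings and is removed by the cutoff.
rows = [
    (1, "Movie A", 5), (2, "Movie A", 4), (3, "Movie A", 1), (4, "Movie A", 2),
    (1, "Movie B", 4), (2, "Movie B", 5), (3, "Movie B", 2), (4, "Movie B", 1),
    (1, "Movie C", 1), (3, "Movie C", 5),
]
toy_long = pd.DataFrame(rows, columns=["user_id", "title", "rating"])

toy_ranked = similar_movies(toy_long, "Movie A", min_ratings=3)
print(toy_ranked)
```

On the real data the equivalent call would use the merged ratings frame, the movie of choice, and the cutoff of 100.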