Finding More-Specific Similar Movies using Python
- Dependencies
- Import & Preview the Data
- Pick A Movie
- Similar Movies By Genre AND Ratings
- Create A "Pivoted" View Of the Data: Movie-Rating-By-User
- Extract Movie-of-Choice Only Ratings Data
- Correlate Movie-Of-Choice Ratings with Other Movie Ratings
- Sort Similar-Movie Correlation Scores
- Cleanup: Grouping
- Cleanup: Limiting By Review Count
- Merge Rating-Score Data With Similarity-Score Data
Similar Movies
We'll start by loading the MovieLens dataset. Using Pandas, we can quickly load the columns of the u.data and u.item files that we care about, and merge them together so we can work with movie names instead of IDs. (In a real production job, you'd stick with IDs and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)
In [46]:
import pandas as pd
import numpy as np
In [47]:
r_cols = ['user_id', 'movie_id', 'rating']
# m_cols = ['movie_id', 'title']
m_cols = ["movie_id",
"title",
"release_date",
"video_release_date",
"IMDb_URL",
"unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]
genre_column_names = ["unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]
movieFileName = 'ml-100k/u.item'
ratingFileName = 'ml-100k/u.data'
ratings = pd.read_csv(ratingFileName, sep='\t', names=r_cols, usecols=range(len(r_cols)), encoding="ISO-8859-1")
movies = pd.read_csv(movieFileName, sep='|', names=m_cols, usecols=range(len(m_cols)), encoding="ISO-8859-1")
print('MOVIES...')
print(pd.DataFrame(movies).head())
In [48]:
# MORE inspection
print('RATINGS...')
print(pd.DataFrame(ratings).head())
In [49]:
# MOVIE_OF_CHOICE = 'Star Wars (1977)'
MOVIE_OF_CHOICE = 'Toy Story (1995)'
In [50]:
moviesDF = pd.DataFrame(movies)
moviesLessSelected = moviesDF[moviesDF['title'] != MOVIE_OF_CHOICE]
moviesLessSelected.head()
Out [50]:
Get The Movie Genre(s)
The movie data is laid out as follows:
- there are many "genre" columns, one column per genre name
- when a movie has a given genre, that genre column's value is 1, else 0
- a movie may have more than one genre
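As a minimal sketch of that layout, the snippet below builds a tiny stand-in table and reads off the genres of one movie. The titles, the flags, and the helper name `genres_of` are made up for illustration and are not part of the MovieLens files. Restricting the lookup to the genre columns is one way to avoid picking up a non-genre column (such as movie_id) that happens to hold the value 1.

```python
import pandas as pd

# Toy stand-in for u.item: one row per movie, one 0/1 column per genre.
# Titles and flags are illustrative only, not taken from MovieLens.
toy_movies = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "title": ["Movie A", "Movie B", "Movie C"],
        "Animation": [1, 0, 0],
        "Comedy": [1, 1, 0],
        "Drama": [0, 0, 1],
    }
)
toy_genre_columns = ["Animation", "Comedy", "Drama"]


def genres_of(movies_df, title, genre_columns):
    """Return the names of the genre columns set to 1 for the given title."""
    # Select only the genre columns of the first row matching the title.
    row = movies_df.loc[movies_df["title"] == title, genre_columns].iloc[0]
    # Keep the column names whose flag is 1.
    return row[row == 1].index.tolist()


print(genres_of(toy_movies, "Movie A", toy_genre_columns))  # ['Animation', 'Comedy']
```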
In [51]:
# Focus on the given movie
currentMovieRow = movies.loc[movies['title'] == MOVIE_OF_CHOICE]
genresWithOnes = currentMovieRow.apply(lambda row: currentMovieRow.columns[row == 1].tolist(), axis=1, result_type="reduce")
# .iloc[0] takes the first (only) matching row regardless of its index label
selected_movie_genres = list(filter(lambda x: (x != 'movie_id'), genresWithOnes.iloc[0]))
print(f'MOVIE:\t{MOVIE_OF_CHOICE}')
print(f'MOVIE Genres:\t{selected_movie_genres}')
In [52]:
moviesOfMatchingGenres = pd.DataFrame()
for genreColumn in selected_movie_genres:
    moviesOfMatchingGenres = pd.concat([moviesOfMatchingGenres, movies[movies[genreColumn] == 1]])
print(f'moviesOfMatchingGenres has {len(moviesOfMatchingGenres.index)} matching rows')
moviesOfMatchingGenres.head()
Out [52]:
At-Least-2 Matching Genres
Below, only movies that share at least two genres with the movie of choice are kept (for Toy Story, MATCHING_GENRE_REQUIRED_COUNT works out to 2).
In [53]:
MATCHING_GENRE_REQUIRED_COUNT = len(selected_movie_genres) - 1
print(f'MATCHING_GENRE_REQUIRED_COUNT: {MATCHING_GENRE_REQUIRED_COUNT}')
moviesWithAtLeastXMatchingGenres = []
for index, row in movies.iterrows():
    # count how many of the chosen movie's genres this movie shares
    count_ones = sum(row[col] for col in selected_movie_genres)
    if count_ones >= MATCHING_GENRE_REQUIRED_COUNT:
        moviesWithAtLeastXMatchingGenres.append(row)
selected_df = pd.DataFrame(moviesWithAtLeastXMatchingGenres)
selected_df.head()
Out [53]:
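The row-by-row loop above could also be written as a vectorized sum across the genre columns. The following is a sketch on made-up data, not the notebook's own code; `toy_movies`, `selected_genres` and `required_matches` are illustrative names.

```python
import pandas as pd

# Toy one-hot genre table; titles and flags are illustrative only.
toy_movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "Animation": [1, 1, 0, 0],
        "Children's": [1, 0, 0, 1],
        "Comedy": [1, 1, 1, 0],
        "Drama": [0, 0, 1, 1],
    }
)
selected_genres = ["Animation", "Children's", "Comedy"]  # genres of the chosen movie
required_matches = 2  # assumed threshold for this sketch

# Sum the 0/1 flags across the selected genre columns, one total per movie.
match_counts = toy_movies[selected_genres].sum(axis=1)

# Keep the movies that share at least `required_matches` genres.
toy_matches = toy_movies[match_counts >= required_matches]
print(toy_matches["title"].tolist())  # ['Movie A', 'Movie B']
```

On a table the size of u.item the difference is small, but the vectorized form avoids building a list of rows and keeps the original column dtypes.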
In [54]:
# based on 1 matching genre
# moviesAndRatings = pd.merge(moviesOfMatchingGenres, ratings)
# print(f'moviesAndRatings has {len(moviesAndRatings.index)} matching rows')
# moviesAndRatings.head()
# based on more-than-one matching genre
moviesAndRatings = pd.merge(pd.DataFrame(moviesWithAtLeastXMatchingGenres), ratings)
print(f'moviesAndRatings has {len(moviesAndRatings.index)} matching rows')
moviesAndRatings.head()
Out [54]:
Create A "Pivoted" View Of the Data: Movie-Rating-By-User
Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix. Note how NaN indicates missing data - movies that specific users didn't rate.
In [55]:
movieRatingsByUserId = moviesAndRatings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatingsByUserId.head()
Out [55]:
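To make the shape of that matrix concrete, here is a small sketch on made-up ratings (the users, titles and scores are invented). It goes from "long" rows of (user, title, rating) to one row per user and one column per title, with NaN wherever a user did not rate a movie.

```python
import pandas as pd

# Long-format toy ratings: one row per (user, movie) rating. Illustrative data only.
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
        "rating": [5, 3, 4, 2, 1],
    }
)

# One row per user, one column per title; unrated combinations become NaN.
toy_matrix = toy_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(toy_matrix)
```

Note that pivot_table aggregates with the mean by default, so if a user had two ratings under the same title they would be averaged into one cell.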
In [56]:
movieOfChoiceRatingsByUser = movieRatingsByUserId[MOVIE_OF_CHOICE]
movieOfChoiceRatingsByUser.head()
Out [56]:
Correlate Movie-Of-Choice Ratings with Other Movie Ratings
Pandas' corrwith can be used to compute the pairwise correlation of the chosen movie's vector of user ratings with every other movie's ratings.
In [57]:
movieSimilarityScores = movieRatingsByUserId.corrwith(movieOfChoiceRatingsByUser)
movieSimilarityScores = movieSimilarityScores.dropna()
# Temporary Data-Frame for previewing with head()
movieSimilarityScoresDF = pd.DataFrame(movieSimilarityScores)
movieSimilarityScoresDF.head(10)
# NOTE: The printed warning is safe to ignore
Out [57]:
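A tiny sketch of what corrwith does, using invented ratings: each column is correlated with the chosen column, using only the users who rated both. The column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; values are illustrative only.
toy_matrix = pd.DataFrame(
    {
        "Chosen": [1.0, 2.0, 3.0, 4.0],
        "Same taste": [2.0, 4.0, np.nan, 8.0],   # rises with "Chosen"; one user skipped it
        "Opposite taste": [4.0, 3.0, 2.0, 1.0],  # falls as "Chosen" rises
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Pearson correlation of every column with the chosen movie's column.
# Users with a NaN in either column are left out of that pair's calculation.
toy_scores = toy_matrix.corrwith(toy_matrix["Chosen"])
print(toy_scores)
```

"Same taste" should score at (or within rounding of) 1 and "Opposite taste" at -1, which is the sense in which the score measures similarity of rating behaviour.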
Sort Similar-Movie Correlation Scores
Let's sort the results by similarity score, and we should have the movies most similar to the movie of choice!
In [58]:
movieSimilarityScores.sort_values(ascending=False)
Out [58]:
Cleanup: Grouping
Those results make no sense at all! This is why it's important to know your data: clearly we missed something important. The results are probably being skewed by movies that were rated by only a handful of people who also happened to like the movie of choice. So we need to get rid of movies with too few ratings, since they produce spurious correlations. Let's construct a new DataFrame that counts how many ratings exist for each movie, along with the average rating while we're at it, which could come in handy later.
In [59]:
# 'size' counts the ratings per title, 'mean' averages them
movieStats = moviesAndRatings.groupby('title').agg({'rating': ['size', 'mean']})
movieStats.head()
Out [59]:
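To see why a handful of raters is a problem, the sketch below uses invented data in which an "Obscure" movie shares only two raters with the chosen one. Two points always lie on a straight line, so the correlation comes out at plus or minus 1 no matter how meaningful it is. As an alternative to the rating-count filter used in this notebook (not a replacement for it), the sketch also shows `Series.corr` with `min_periods`, which returns NaN when too few users rated both movies.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; values are illustrative only.
toy_matrix = pd.DataFrame(
    {
        "Chosen": [5.0, 3.0, 4.0, 2.0, 1.0],
        "Widely rated": [4.0, 3.0, 5.0, 1.0, 2.0],
        "Obscure": [np.nan, np.nan, 5.0, 3.0, np.nan],  # only two raters
    }
)
chosen = toy_matrix["Chosen"]

# Plain correlation: "Obscure" gets a perfect score from just two shared raters.
raw_scores = toy_matrix.corrwith(chosen)

# Guarded correlation: require at least 3 users who rated both movies,
# otherwise the score is NaN. The threshold of 3 is an arbitrary choice here.
guarded_scores = toy_matrix.apply(lambda col: col.corr(chosen, min_periods=3))

print(raw_scores)
print(guarded_scores)
```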
Cleanup: Limiting By Review Count
Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left. 100 might still be too low, but these results look pretty good as far as "well rated movies that people have heard of."
In [60]:
MINIMUM_NUMBER_OF_RATINGS = 100
popularMovies = movieStats['rating']['size'] >= MINIMUM_NUMBER_OF_RATINGS
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]
Out [60]:
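The count-then-filter step can be tried on a few invented rows. This sketch uses pandas named aggregation, which yields flat column names directly; that is a variation on the notebook's approach, and the names `rating_count`, `rating_mean` and `toy_min_ratings` are made up for the example.

```python
import pandas as pd

# Toy long-format ratings; titles and scores are illustrative only.
toy = pd.DataFrame(
    {
        "title": ["Movie A"] * 3 + ["Movie B"] * 1 + ["Movie C"] * 2,
        "rating": [5, 4, 3, 5, 2, 4],
    }
)

# Per-title rating count and mean, with flat (single-level) column names.
toy_stats = toy.groupby("title")["rating"].agg(rating_count="count", rating_mean="mean")

# Keep titles with enough ratings, then rank them by average rating.
toy_min_ratings = 2  # stand-in for MINIMUM_NUMBER_OF_RATINGS
toy_popular = toy_stats[toy_stats["rating_count"] >= toy_min_ratings]
toy_popular = toy_popular.sort_values("rating_mean", ascending=False)
print(toy_popular)
```

"Movie B" has the highest average but only one rating, so it drops out, which is the effect the threshold is meant to have.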
Merge Rating-Score Data With Similarity-Score Data
Joining the data:
- a dataset with title, rating|size, rating|mean
- a dataset with title, similarity
- those two merged
In [61]:
mappedColumnsMoviestat = movieStats[popularMovies]
# flatten the two-level column index into single names such as 'rating|size'
mappedColumnsMoviestat.columns = [f'{i}|{j}' if j != '' else f'{i}' for i, j in mappedColumnsMoviestat.columns]
# COLUMNS: title, rating|size, rating|mean
mappedColumnsMoviestat.head()
similarityScoreDF = pd.DataFrame(movieSimilarityScores, columns=['similarity'])
# COLUMNS: title, similarity
similarityScoreDF.head()
mappedColumnsMoviestatDF = mappedColumnsMoviestat.join(similarityScoreDF)
# preview the joined result
mappedColumnsMoviestatDF.head()
Out [61]:
Finally, sort these new results by similarity score. That's more like it!
In [62]:
similarMoviesWithGenres = mappedColumnsMoviestatDF.sort_values(['similarity'], ascending=False)
similarMoviesWithGenres.index.name = 'Similar Movies by Genre & Rating'
similarMoviesWithGenres[:15]
Out [62]:
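The join-and-sort step can be checked in isolation on invented numbers. In this sketch the stats table holds only the "popular" titles, and joining on the title index keeps just those rows, so a high-scoring but rarely rated title never reaches the ranking. All names and values are illustrative.

```python
import pandas as pd

# Stats for titles that passed the rating-count filter (illustrative values).
toy_stats = pd.DataFrame(
    {"rating_count": [120, 150], "rating_mean": [3.9, 4.2]},
    index=pd.Index(["Movie A", "Movie B"], name="title"),
)

# Similarity scores for every title, including one that was filtered out above.
toy_similarity = pd.Series({"Movie A": 0.35, "Movie B": 0.62, "Movie C": 0.99})

# join() is a left join on the index by default, so "Movie C" is dropped.
toy_joined = toy_stats.join(toy_similarity.to_frame("similarity"))

# Rank the remaining titles by similarity, most similar first.
toy_ranked = toy_joined.sort_values("similarity", ascending=False)
print(toy_ranked)
```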