Finding More-Specific Similar Movies using Python

We'll start by loading up the MovieLens dataset. Using Pandas, we can very quickly load the columns of the u.data and u.item files that we care about, and merge them together so we can work with movie names instead of IDs. (In a real production job, you'd stick with IDs and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)
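As a rough illustration of the "IDs first, names at the display layer" idea, here is a minimal sketch on made-up data. The tiny frames and every variable name in it are assumptions for illustration only; they are not part of MovieLens or of the notebook below.

```python
import pandas as pd

# Hypothetical toy data standing in for u.data and u.item
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 20, 10],
    "rating": [5, 3, 4, 2, 5],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# Do the heavy work keyed on the compact integer ID...
mean_by_id = toy_ratings.groupby("movie_id")["rating"].mean()

# ...and swap in human-readable titles only when presenting the result
id_to_title = toy_movies.set_index("movie_id")["title"].to_dict()
display_view = mean_by_id.rename(index=id_to_title)
print(display_view)
```

The computation never touches the title strings; they are looked up once, at the very end.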

In [46]:
import pandas as pd
import numpy as np
In [47]:
r_cols = ['user_id', 'movie_id', 'rating']
# m_cols = ['movie_id', 'title']
m_cols = ["movie_id",
"title",
"release_date",
"video_release_date",
"IMDb_URL",
"unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]

genre_column_names = ["unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]

movieFileName = 'ml-100k/u.item'
ratingFileName = 'ml-100k/u.data'

ratings = pd.read_csv(ratingFileName, sep='\t', names=r_cols, usecols=range(len(r_cols)), encoding="ISO-8859-1")
movies = pd.read_csv(movieFileName, sep='|', names=m_cols, usecols=range(len(m_cols)), encoding="ISO-8859-1")
print('MOVIES...')
print(pd.DataFrame(movies).head())
MOVIES...
   movie_id              title release_date  video_release_date  \
0         1   Toy Story (1995)  01-Jan-1995                 NaN   
1         2   GoldenEye (1995)  01-Jan-1995                 NaN   
2         3  Four Rooms (1995)  01-Jan-1995                 NaN   
3         4  Get Shorty (1995)  01-Jan-1995                 NaN   
4         5     Copycat (1995)  01-Jan-1995                 NaN   

                                            IMDb_URL  unknown  Action  \
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0       0   
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...        0       1   
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...        0       0   
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...        0       1   
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)        0       0   

   Adventure  Animation  Children's  ...  Fantasy  Film-Noir  Horror  Musical  \
0          0          1           1  ...        0          0       0        0   
1          1          0           0  ...        0          0       0        0   
2          0          0           0  ...        0          0       0        0   
3          0          0           0  ...        0          0       0        0   
4          0          0           0  ...        0          0       0        0   

   Mystery  Romance  Sci-Fi  Thriller  War  Western  
0        0        0       0         0    0        0  
1        0        0       0         1    0        0  
2        0        0       0         1    0        0  
3        0        0       0         0    0        0  
4        0        0       0         1    0        0  

[5 rows x 24 columns]
In [48]:
# MORE inspection
print('RATINGS...')
print(pd.DataFrame(ratings).head())
RATINGS...
   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
In [49]:
# MOVIE_OF_CHOICE = 'Star Wars (1977)'
MOVIE_OF_CHOICE = 'Toy Story (1995)'
In [50]:
moviesDF = pd.DataFrame(movies)
moviesLessSelected = moviesDF[moviesDF['title'] != MOVIE_OF_CHOICE]
moviesLessSelected.head()
Out [50]:
movie_id title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(... 0 1 1 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%... 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%... 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
5 6 Shanghai Triad (Yao a yao yao dao waipo qiao) ... 01-Jan-1995 NaN http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai... 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 24 columns

Get The Movie Genre(s)

The way that the movie data is laid out is such that

  • there are many "genre" columns, 1 column per genre name
  • when the movie has a given genre, the genre column/cell value is 1, else 0
  • movies may have more than one genre
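To make that layout concrete, here is a small sketch on a made-up frame. The helper name genres_of and the toy columns are hypothetical; the idea is only to show how the 1/0 cells translate back into genre names.

```python
import pandas as pd

# Hypothetical miniature of the one-column-per-genre layout
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "Animation": [1, 0],
    "Comedy": [1, 1],
    "Drama": [0, 1],
})
toy_genre_columns = ["Animation", "Comedy", "Drama"]


def genres_of(frame, title, genre_columns):
    """Return the names of the genre columns set to 1 for the given title."""
    # Restrict to the genre columns so ID or date columns can never match
    row = frame.loc[frame["title"] == title, genre_columns].iloc[0]
    return row[row == 1].index.tolist()


print(genres_of(toy, "Movie A", toy_genre_columns))
print(genres_of(toy, "Movie B", toy_genre_columns))
```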
In [51]:
# Focus on the given movie
currentMovieRow = movies.loc[movies['title'] == MOVIE_OF_CHOICE]

# Collect the genre columns whose value is 1 for this movie.
# .iloc[0] picks the first matching row by position, so this also works for movies that are not at index label 0.
selected_movie_genres = [genre for genre in genre_column_names if currentMovieRow.iloc[0][genre] == 1]
print(f'MOVIE:\t{MOVIE_OF_CHOICE}')
print(f'MOVIE Genres:\t{selected_movie_genres}')
MOVIE:	Toy Story (1995)
MOVIE Genres:	['Animation', "Children's", 'Comedy']

Get Movies of Similar Genres

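At least 1 matching genre

Below, movies with at least one matching genre are included. A movie that matches several genres is appended once per matching genre, so the row count includes duplicates.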

In [52]:
moviesOfMatchingGenres = pd.DataFrame()
for genreColumn in selected_movie_genres:
    moviesOfMatchingGenres = pd.concat([moviesOfMatchingGenres, movies[movies[genreColumn] == 1]])
print(f'moviesOfMatchingGenres has {len(moviesOfMatchingGenres.index)} matching rows')
moviesOfMatchingGenres.head()
moviesOfMatchingGenres has 669 matching rows
Out [52]:
movie_id title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 0 0
70 71 Lion King, The (1994) 01-Jan-1994 NaN http://us.imdb.com/M/title-exact?Lion%20King,%... 0 0 0 1 1 ... 0 0 0 1 0 0 0 0 0 0
94 95 Aladdin (1992) 01-Jan-1992 NaN http://us.imdb.com/M/title-exact?Aladdin%20(1992) 0 0 0 1 1 ... 0 0 0 1 0 0 0 0 0 0
98 99 Snow White and the Seven Dwarfs (1937) 01-Jan-1937 NaN http://us.imdb.com/M/title-exact?Snow%20White%... 0 0 0 1 1 ... 0 0 0 1 0 0 0 0 0 0
100 101 Heavy Metal (1981) 08-Mar-1981 NaN http://us.imdb.com/M/title-exact?Heavy%20Metal... 0 1 1 1 0 ... 0 0 1 0 0 0 1 0 0 0

5 rows × 24 columns
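The loop above concatenates one filtered frame per genre, so a movie is repeated for every genre it shares with the movie of choice. One possible way to get each matching movie exactly once is a row-wise any() over the genre columns. This is a sketch on made-up data, and the names toy and wanted_genres are assumptions for illustration:

```python
import pandas as pd

# Hypothetical miniature of the genre columns
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "Animation": [1, 0, 0],
    "Comedy": [1, 1, 0],
    "Drama": [0, 1, 1],
})
wanted_genres = ["Animation", "Comedy"]

# Per-genre concat: Movie A matches both genres, so it shows up twice
concatenated = pd.concat([toy[toy[g] == 1] for g in wanted_genres])

# Row-wise "any": every movie with at least one wanted genre, once each
at_least_one = toy[toy[wanted_genres].eq(1).any(axis=1)]

print(len(concatenated), len(at_least_one))
```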

At least 2 matching genres

Below, only movies with at least two matching genres are included.

In [53]:
MATCHING_GENRE_REQUIRED_COUNT = len(selected_movie_genres) - 1
print(f'MATCHING_GENRE_REQUIRED_COUNT: {MATCHING_GENRE_REQUIRED_COUNT}')

moviesWithAtLeastXMatchingGenres = []
for index, row in movies.iterrows():
    # Count how many of the selected movie's genres this movie shares
    count_ones = sum(row[col] for col in selected_movie_genres)
    if count_ones >= MATCHING_GENRE_REQUIRED_COUNT:
        moviesWithAtLeastXMatchingGenres.append(row)

selected_df = pd.DataFrame(moviesWithAtLeastXMatchingGenres)
selected_df.head()
MATCHING_GENRE_REQUIRED_COUNT: 2
Out [53]:
movie_id title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 0 0
7 8 Babe (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Babe%20(1995) 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
62 63 Santa Clause, The (1994) 01-Jan-1994 NaN http://us.imdb.com/M/title-exact?Santa%20Claus... 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
70 71 Lion King, The (1994) 01-Jan-1994 NaN http://us.imdb.com/M/title-exact?Lion%20King,%... 0 0 0 1 1 ... 0 0 0 1 0 0 0 0 0 0
90 91 Nightmare Before Christmas, The (1993) 01-Jan-1993 NaN http://us.imdb.com/M/title-exact?Nightmare%20B... 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 24 columns
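The iterrows loop above is easy to follow, but the same "at least N matching genres" filter can probably be written without an explicit loop by summing the genre columns row-wise. A minimal sketch on made-up data (the frame toy and the names selected and required are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the genre columns
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Animation": [1, 0, 0, 0],
    "Children's": [1, 1, 0, 0],
    "Comedy": [1, 1, 1, 0],
})
selected = ["Animation", "Children's", "Comedy"]
required = len(selected) - 1  # same rule as MATCHING_GENRE_REQUIRED_COUNT

# Each genre cell is 1 or 0, so the row sum is the number of shared genres
match_counts = toy[selected].sum(axis=1)
result = toy[match_counts >= required]

print(result["title"].tolist())
```

Because the result is produced by boolean indexing, it keeps the original index and column dtypes.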

In [54]:
# based on 1 matching genre
# moviesAndRatings = pd.merge(moviesOfMatchingGenres, ratings)
# print(f'moviesAndRatings has {len(moviesAndRatings.index)} matching rows')
# moviesAndRatings.head()



# based on more-than-one matching genre
moviesAndRatings = pd.merge(pd.DataFrame(moviesWithAtLeastXMatchingGenres), ratings)
print(f'moviesAndRatings has {len(moviesAndRatings.index)} matching rows')
moviesAndRatings.head()
moviesAndRatings has 5630 matching rows
Out [54]:
movie_id title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's ... Horror Musical Mystery Romance Sci-Fi Thriller War Western user_id rating
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 308 4
1 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 287 5
2 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 148 4
3 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 280 4
4 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 66 3

5 rows × 26 columns

Create A "Pivoted" View Of the Data: Movie-Rating-By-User

Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix. Note how NaN indicates missing data - movies that specific users didn't rate.

In [55]:
movieRatingsByUserId = moviesAndRatings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatingsByUserId.head()
Out [55]:
title 101 Dalmatians (1996) Air Bud (1997) Aladdin (1992) Aladdin and the King of Thieves (1996) Alice in Wonderland (1951) All Dogs Go to Heaven 2 (1996) Anastasia (1997) Angels in the Outfield (1994) Apple Dumpling Gang, The (1975) Aristocats, The (1970) ... Swan Princess, The (1994) Sword in the Stone, The (1963) That Darn Cat! (1965) That Darn Cat! (1997) Three Caballeros, The (1945) Toy Story (1995) Transformers: The Movie, The (1986) Willy Wonka and the Chocolate Factory (1971) Winnie the Pooh and the Blustery Day (1968) Wrong Trousers, The (1993)
user_id
1 2.0 1.0 4.0 NaN NaN 1.0 NaN NaN NaN 2.0 ... NaN NaN NaN NaN NaN 5.0 NaN 4.0 NaN 5.0
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN
5 2.0 NaN 4.0 4.0 3.0 NaN NaN NaN 1.0 3.0 ... NaN NaN NaN NaN NaN 4.0 3.0 3.0 NaN 5.0
6 NaN 3.0 2.0 NaN NaN NaN 2.0 NaN NaN NaN ... NaN NaN NaN NaN NaN 4.0 NaN 3.0 NaN 4.0
7 NaN NaN NaN NaN 5.0 NaN NaN 3.0 2.0 NaN ... NaN 3.0 NaN NaN 4.0 NaN NaN 4.0 NaN NaN

5 rows × 71 columns
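To see on a small scale how pivot_table turns long-format ratings into a user / movie matrix, and where the NaN values come from, here is a sketch with made-up ratings (toy_ratings and matrix are illustrative names only):

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; pairs with no rating become NaN
matrix = toy_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)

# How many users actually rated each movie
print(matrix.notna().sum())
```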

In [56]:
movieOfChoiceRatingsByUser = movieRatingsByUserId[MOVIE_OF_CHOICE]
movieOfChoiceRatingsByUser.head()
Out [56]:
user_id
1    5.0
2    4.0
5    4.0
6    4.0
7    NaN
Name: Toy Story (1995), dtype: float64

Correlate Movie-Of-Choice Ratings with Other Movie Ratings

Pandas' corrwith can be used to compute the pairwise correlation of the chosen movie's vector of user ratings with every other movie's ratings. For each pair of movies, only the users who rated both are taken into account.

In [57]:
movieSimilarityScores = movieRatingsByUserId.corrwith(movieOfChoiceRatingsByUser)
movieSimilarityScores = movieSimilarityScores.dropna()

# Temporary Data-Frame for previewing with head()
movieSimilarityScoresDF = pd.DataFrame(movieSimilarityScores)
movieSimilarityScoresDF.head(10)

# NOTE: The printed warning is safe to ignore
Out [57]:
                                               0
title
101 Dalmatians (1996)                   0.232118
Air Bud (1997)                          0.120034
Aladdin (1992)                          0.411731
Aladdin and the King of Thieves (1996) -0.129334
Alice in Wonderland (1951)              0.249077
All Dogs Go to Heaven 2 (1996)          0.297753
Anastasia (1997)                        0.266331
Angels in the Outfield (1994)           0.423242
Apple Dumpling Gang, The (1975)         0.006750
Aristocats, The (1970)                  0.419263
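For intuition about what corrwith is doing here, the following sketch runs it on a tiny made-up rating matrix. The movie names and ratings are invented, and the variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical user / movie rating matrix; NaN means "did not rate"
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0],
        "Movie B": [4.0, 5.0, 2.0, 1.0],
        "Movie C": [1.0, 2.0, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Correlate every column with the "Movie A" column.
# Users with a missing rating are skipped for that pair of movies only.
scores = toy_matrix.corrwith(toy_matrix["Movie A"])
print(scores.sort_values(ascending=False))
```

Movie B is rated much like Movie A and gets a positive score, while Movie C is rated in the opposite direction by the three users who rated both and gets a negative one.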

Sort Similar-Movie Correlation Scores

Let's sort the results by similarity score, and we should have the movies most similar to our movie of choice, Toy Story!

In [58]:
movieSimilarityScores.sort_values(ascending=False)
Out [58]:
title
Toy Story (1995)                       1.000000
Transformers: The Movie, The (1986)    0.753673
Mouse Hunt (1997)                      0.736826
Gumby: The Movie (1995)                0.717137
Home Alone 3 (1997)                    0.688875
                                         ...   
That Darn Cat! (1965)                 -0.130664
Herbie Rides Again (1974)             -0.213201
Jingle All the Way (1996)             -0.218227
Swan Princess, The (1994)             -0.262613
Three Caballeros, The (1945)          -0.346410
Length: 71, dtype: float64

Cleanup: Grouping

Those results make no sense at all! This is why it's important to know your data - clearly we missed something important. Our results are probably getting messed up by movies that have only been rated by a handful of people who also happened to like the movie of choice. So we need to get rid of movies that were rated by only a few people, since those are producing spurious results. Let's construct a new DataFrame that counts up how many ratings exist for each movie, and also the average rating while we're at it - that could also come in handy later.

In [59]:
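# String aggregation names avoid the deprecation warning newer pandas versions raise for np.size / np.mean
movieStats = moviesAndRatings.groupby('title').agg({'rating': ['size', 'mean']})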
movieStats.head()
Out [59]:
                                       rating
                                         size      mean
title
101 Dalmatians (1996)                     109  2.908257
Air Bud (1997)                             43  2.558140
Aladdin (1992)                            219  3.812785
Aladdin and the King of Thieves (1996)     26  2.846154
Alice in Wonderland (1951)                 81  3.666667
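To illustrate why sparsely rated movies produce spurious scores, here is a sketch with made-up ratings in which only two users rated both movies. The names are hypothetical, and the min_periods guard on Series.corr is shown as one possible safeguard, not as what this notebook does:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: six users rated "Popular", only two of them rated "Obscure"
toy_matrix = pd.DataFrame({
    "Popular": [5.0, 4.0, 3.0, 4.0, 2.0, 5.0],
    "Obscure": [5.0, np.nan, 3.0, np.nan, np.nan, np.nan],
})
target = toy_matrix["Popular"]
other = toy_matrix["Obscure"]

# Number of users who rated both movies
overlap = int((target.notna() & other.notna()).sum())

# Two distinct shared points lie on a straight line, so the score is +1 or -1
unguarded = target.corr(other)

# Series.corr returns NaN when fewer than min_periods shared ratings exist
guarded = target.corr(other, min_periods=3)

print(overlap, unguarded, guarded)
```

A "perfect" correlation built on two shared ratings says very little, which is why counting ratings per movie is a useful next step.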

Cleanup: Limiting By Review Count

Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left. 100 might still be too low, but these results look pretty good as far as "well rated movies that people have heard of."

In [60]:
MINIMUM_NUMBER_OF_RATINGS = 100
popularMovies = movieStats['rating']['size'] >= MINIMUM_NUMBER_OF_RATINGS
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]
Out [60]:
                                             rating
                                               size      mean
title
Close Shave, A (1995)                           112  4.491071
Wrong Trousers, The (1993)                      118  4.466102
Babe (1995)                                     219  3.995434
Toy Story (1995)                                452  3.878319
Aladdin (1992)                                  219  3.812785
Beauty and the Beast (1991)                     202  3.792079
Lion King, The (1994)                           220  3.781818
Fantasia (1940)                                 174  3.770115
Mary Poppins (1964)                             178  3.724719
Snow White and the Seven Dwarfs (1937)          172  3.709302
Pinocchio (1940)                                101  3.673267
Willy Wonka and the Chocolate Factory (1971)    326  3.631902
Nightmare Before Christmas, The (1993)          143  3.587413
Cinderella (1950)                               129  3.581395
Dumbo (1941)                                    123  3.495935
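As an alternative sketch of the count-and-filter step, pandas' named aggregation produces flat column names directly, which avoids the two-level ('rating', 'size') columns. The data and the names num_ratings, mean_rating and MIN_RATINGS below are made up for illustration:

```python
import pandas as pd

# Hypothetical long-format ratings
toy = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] * 1 + ["Movie C"] * 2,
    "rating": [5, 4, 3, 5, 2, 4],
})

# Named aggregation: output_column=(input_column, aggregation)
stats = toy.groupby("title").agg(
    num_ratings=("rating", "size"),
    mean_rating=("rating", "mean"),
)

# Keep only movies with enough ratings, best-rated first
MIN_RATINGS = 2
popular = stats[stats["num_ratings"] >= MIN_RATINGS].sort_values(
    "mean_rating", ascending=False
)
print(popular)
```

Movie B has the highest average but only a single rating, so the threshold removes it.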

Merge Rating-Score Data With Similarity-Score Data

Joining The Data:

  • a dataset with title, rating|size, rating|mean
  • a dataset with title, similarity
  • those two merged
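The join described in the bullets above relies on both datasets sharing the movie title as their index. A minimal sketch with made-up numbers (the frames and column names here are assumptions for illustration) shows how DataFrame.join lines the rows up:

```python
import pandas as pd

# Hypothetical per-movie statistics, indexed by title
stats = pd.DataFrame(
    {"num_ratings": [120, 150], "mean_rating": [3.9, 3.1]},
    index=pd.Index(["Movie A", "Movie B"], name="title"),
)

# Hypothetical similarity scores, also indexed by title
similarity = pd.Series(
    {"Movie A": 1.0, "Movie B": 0.4, "Movie C": 0.9},
    name="similarity",
)

# join is a left join on the index by default:
# only titles present in stats survive, so Movie C is dropped
merged = stats.join(similarity)
print(merged)
```

Since the left side has already been filtered by rating count, the left join also discards similarity scores for movies that fell below the threshold.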
In [61]:
# Copy the filtered frame so that renaming its columns does not touch a slice of movieStats
mappedColumnsMoviestat = movieStats[popularMovies].copy()
# Flatten the two-level column index into single names such as 'rating|size'
mappedColumnsMoviestat.columns = [f'{i}|{j}' if j != '' else f'{i}' for i, j in mappedColumnsMoviestat.columns]
# COLUMNS: title, rating|size, rating|mean

similarityScoreDF = movieSimilarityScores.to_frame(name='similarity')
# COLUMNS: title, similarity

# Join the two frames on their shared 'title' index and preview the merged result
mappedColumnsMoviestatDF = mappedColumnsMoviestat.join(similarityScoreDF)
mappedColumnsMoviestatDF.head()
Out [61]:
                                        rating|size  rating|mean  similarity
title
101 Dalmatians (1996)                           109     2.908257    0.232118
Aladdin (1992)                                  219     3.812785    0.411731
Babe (1995)                                     219     3.995434    0.247367
Beauty and the Beast (1991)                     202     3.792079    0.442960
Beavis and Butt-head Do America (1996)          156     2.788462    0.233535

And, sort these new results by similarity score. That's more like it!

In [62]:
similarMoviesWithGenres = mappedColumnsMoviestatDF.sort_values(['similarity'], ascending=False)
similarMoviesWithGenres.index.name = 'Similar Movies by Genre & Rating'
similarMoviesWithGenres[:15]
Out [62]:
                                        rating|size  rating|mean  similarity
Similar Movies by Genre & Rating
Toy Story (1995)                                452     3.878319    1.000000
Beauty and the Beast (1991)                     202     3.792079    0.442960
Cinderella (1950)                               129     3.581395    0.428372
Lion King, The (1994)                           220     3.781818    0.426778
Aladdin (1992)                                  219     3.812785    0.411731
Dumbo (1941)                                    123     3.495935    0.387716
Hunchback of Notre Dame, The (1996)             127     3.377953    0.334852
Snow White and the Seven Dwarfs (1937)          172     3.709302    0.315292
Pinocchio (1940)                                101     3.673267    0.304457
Home Alone (1990)                               137     3.087591    0.273866
Babe (1995)                                     219     3.995434    0.247367
Beavis and Butt-head Do America (1996)          156     2.788462    0.233535
101 Dalmatians (1996)                           109     2.908257    0.232118
George of the Jungle (1997)                     162     2.685185    0.231549
Jungle2Jungle (1997)                            132     2.439394    0.201407