Finding Similar Movies with Python

In [13]:
import pandas as pd
import numpy as np
In [14]:
r_cols = ['user_id', 'movie_id', 'rating']
# m_cols = ['movie_id', 'title']
m_cols = ["movie_id",
"title",
"release_date",
"video_release_date",
"IMDb_URL",
"unknown",
"Action",
"Adventure",
"Animation",
"Children's",
"Comedy",
"Crime",
"Documentary",
"Drama",
"Fantasy",
"Film-Noir",
"Horror",
"Musical",
"Mystery",
"Romance",
"Sci-Fi",
"Thriller",
"War",
"Western"]

movieFileName = 'ml-100k/u.item'
ratingFileName = 'ml-100k/u.data'

ratings = pd.read_csv(ratingFileName, sep='\t', names=r_cols, usecols=range(len(r_cols)), encoding="ISO-8859-1")
movies = pd.read_csv(movieFileName, sep='|', names=m_cols, usecols=range(len(m_cols)), encoding="ISO-8859-1")

moviesAndRatings = pd.merge(movies, ratings)
moviesAndRatings.head()
Out [14]:
movie_id title release_date video_release_date IMDb_URL unknown Action Adventure Animation Children's ... Horror Musical Mystery Romance Sci-Fi Thriller War Western user_id rating
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 308 4
1 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 287 5
2 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 148 4
3 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 280 4
4 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2... 0 0 0 1 1 ... 0 0 0 0 0 0 0 0 66 3

5 rows × 26 columns

Merging Observation

Each row represents one rating of one movie by one user.
Many rows per movie.
Many rows per user_id.
One row per user/movie pair.
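
To make the merge behaviour concrete, here is a minimal sketch on made-up data. The tiny `movies` and `ratings` frames below are illustrative stand-ins, not the MovieLens files. With no `on=` argument, `pd.merge` joins on the columns the two frames have in common (here `movie_id`) and defaults to an inner join.

```python
import pandas as pd

# Toy stand-ins for the movies and ratings tables (all values are made up).
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
})
ratings = pd.DataFrame({
    "user_id": [10, 11, 10],
    "movie_id": [1, 1, 2],
    "rating": [4, 5, 3],
})

# No `on=` given: merge uses the shared column, movie_id, as the join key.
merged = pd.merge(movies, ratings)
print(merged)
```

Each rating keeps its own row, and the movie's columns are repeated alongside it, which is the "many rows per movie" shape described above.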

In [15]:
# More inspection: ratings is already a DataFrame, so head() can be called directly
print('RATINGS...')
print(ratings.head())
RATINGS...
   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
In [16]:
# MOVIE_OF_CHOICE = 'Star Wars (1977)'
MOVIE_OF_CHOICE = 'Toy Story (1995)'

Create A "Pivoted" View Of the Data: Movie-Rating-By-User

The pivot_table method on a DataFrame builds a user/movie rating matrix: one row per user_id and one column per title. NaN marks missing data, meaning that user did not rate that movie.

In [17]:
movieRatingsByUserId = moviesAndRatings.pivot_table(index=['user_id'],columns=['title'],values='rating')


# preview: a user with no rating for a movie shows NaN for that user/movie
movieRatingsByUserId.head()
Out [17]:
title 'Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996) 12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996) 20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968) 3 Ninjas: High Noon At Mega Mountain (1998) 39 Steps, The (1935) ... Yankee Zulu (1994) Year of the Horse (1997) You So Crazy (1994) Young Frankenstein (1974) Young Guns (1988) Young Guns II (1990) Young Poisoner's Handbook, The (1995) Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 2.0 5.0 NaN NaN 3.0 4.0 NaN NaN ... NaN NaN NaN 5.0 3.0 NaN NaN NaN 4.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1664 columns
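
The same reshaping is easier to see on a handful of rows. The sketch below uses made-up ratings (the `long_ratings` name and its values are assumptions for illustration) and shows how a long table turns into a wide one with NaN where a user has no rating.

```python
import pandas as pd

# Long format: one row per (user, movie) rating. Values are made up.
long_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [5, 3, 4, 2],
})

# Wide format: one row per user, one column per title.
# pivot_table aggregates with the mean by default, so a duplicated
# (user, title) pair would be averaged rather than raise an error.
wide = long_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(wide)
```

User 2 never rated Movie B and user 3 never rated Movie A, so those two cells come out as NaN.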

In [18]:
movieOfChoiceRatingsByUser = movieRatingsByUserId[MOVIE_OF_CHOICE]
movieOfChoiceRatingsByUser.head()
Out [18]:
user_id
0    NaN
1    5.0
2    4.0
3    NaN
4    NaN
Name: Toy Story (1995), dtype: float64

Similar-Scored-Movies: Correlate Movie-Of-Choice Ratings with Other Movie Ratings

Pandas' corrwith computes the pairwise correlation (Pearson by default) between the chosen movie's vector of user ratings and every other movie's ratings. For each pair of movies, only users who rated both contribute to the score.

In [19]:
movieSimilarityScores = movieRatingsByUserId.corrwith(movieOfChoiceRatingsByUser)
movieSimilarityScores = movieSimilarityScores.dropna()

# Temporary Data-Frame for previewing with head()
movieSimilarityScoresDF = pd.DataFrame(movieSimilarityScores)
movieSimilarityScoresDF.head(10)

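# NOTE: The printed warnings are safe to ignore. They appear to come from movies that
# share too few raters with the movie of choice; those scores are NaN and dropna() removes them.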
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/numpy/lib/function_base.py:2846: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/numpy/lib/function_base.py:2705: RuntimeWarning: divide by zero encountered in divide
  c *= np.true_divide(1, fact)
Out [19]:
0
title
'Til There Was You (1997) 0.534522
101 Dalmatians (1996) 0.232118
12 Angry Men (1957) 0.334943
187 (1997) 0.651857
2 Days in the Valley (1996) 0.162728
20,000 Leagues Under the Sea (1954) 0.328472
2001: A Space Odyssey (1968) -0.069060
39 Steps, The (1935) 0.150055
8 1/2 (1963) -0.117259
8 Heads in a Duffel Bag (1997) 0.500000
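
As a small-scale illustration of what corrwith is doing, the sketch below correlates one column of a made-up rating matrix with all columns. The frame name `ratings_by_user` and every value in it are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up user-by-movie matrix; NaN means the user did not rate the movie.
ratings_by_user = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Movie C": [1.0, np.nan, 4.0, 5.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Correlate every column with Movie A. For each pair, only the users
# with a rating for both movies are used.
scores = ratings_by_user.corrwith(ratings_by_user["Movie A"])
print(scores)
```

Movie B tracks Movie A closely and gets a positive score, while Movie C is rated in roughly the opposite order and gets a negative one.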

Sort Similar-Movie Correlation Scores

Let's sort the results by similarity score, and we should have the movies most similar to the movie of choice (Toy Story here)!

In [20]:
movieSimilarityScores.sort_values(ascending=False)
Out [20]:
title
Roseanna's Grave (For Roseanna) (1997)     1.0
Substance of Fire, The (1996)              1.0
Stranger, The (1994)                       1.0
Wooden Man's Bride, The (Wu Kui) (1994)    1.0
Newton Boys, The (1998)                    1.0
                                          ... 
Slingshot, The (1993)                     -1.0
Heavy (1995)                              -1.0
Stalker (1979)                            -1.0
Feast of July (1995)                      -1.0
Love and Death on Long Island (1997)      -1.0
Length: 1370, dtype: float64
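
One likely reason for so many scores of exactly 1.0 and -1.0: when only two users have rated both movies, Pearson correlation is always perfectly positive or perfectly negative (as long as the ratings differ). A minimal sketch with made-up ratings; the `min_periods` call at the end is an optional extra on `Series.corr`, not something the notebook uses.

```python
import pandas as pd

# Ratings from the only two users who rated both movies (made-up values).
chosen = pd.Series([5.0, 3.0])
obscure_same = pd.Series([4.0, 2.0])      # moves in the same direction
obscure_opposite = pd.Series([1.0, 2.0])  # moves in the opposite direction

r_same = chosen.corr(obscure_same)
r_opposite = chosen.corr(obscure_opposite)
print(r_same, r_opposite)

# Series.corr can refuse to score pairs with too few shared observations.
r_guarded = chosen.corr(obscure_same, min_periods=3)
print(r_guarded)
```

Two points always lie on a straight line, so the score says nothing about how similar the movies really are.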

Cleanup: Get Movie-Rating counts and average rating score

These results are probably skewed by movies that only a handful of people rated, and who also happened to rate the movie of choice the same way. With so few shared ratings, a correlation of exactly 1.0 or -1.0 is easy to reach by chance.
To clean this up:

  • count how many ratings exist for each movie
  • remove movies that were only rated by a few people
  • get average rating per movie (extra detail for now)
In [21]:
# string aggregator names avoid the deprecation warning newer pandas raises for np.size / np.mean
movieStats = moviesAndRatings.groupby('title').agg({'rating': ['size', 'mean']})
movieStats.head()
Out [21]:
rating
size mean
title
'Til There Was You (1997) 9 2.333333
1-900 (1994) 5 2.600000
101 Dalmatians (1996) 109 2.908257
12 Angry Men (1957) 125 4.344000
187 (1997) 41 3.024390

Cleanup: Limiting By Rating Count

Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left. 100 might still be too low, but these results look pretty good as far as "well-rated movies that people have heard of."

In [22]:
MINIMUM_NUMBER_OF_RATINGS = 100
popularMovies = movieStats['rating']['size'] >= MINIMUM_NUMBER_OF_RATINGS
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]
Out [22]:
rating
size mean
title
Close Shave, A (1995) 112 4.491071
Schindler's List (1993) 298 4.466443
Wrong Trousers, The (1993) 118 4.466102
Casablanca (1942) 243 4.456790
Shawshank Redemption, The (1994) 283 4.445230
Rear Window (1954) 209 4.387560
Usual Suspects, The (1995) 267 4.385768
Star Wars (1977) 584 4.359589
12 Angry Men (1957) 125 4.344000
Citizen Kane (1941) 198 4.292929
To Kill a Mockingbird (1962) 219 4.292237
One Flew Over the Cuckoo's Nest (1975) 264 4.291667
Silence of the Lambs, The (1991) 390 4.289744
North by Northwest (1959) 179 4.284916
Godfather, The (1972) 413 4.283293

Merge Rating-Score Data With Similarity-Score Data

Joining The Data:

  • a dataset with title, rating|size, rating|mean
  • a dataset with title, similarity
  • those two merged
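
DataFrame.join matches rows on the index (here, the title) and by default keeps every row of the left-hand frame. A minimal sketch with made-up titles and numbers, showing what happens to titles that appear on only one side:

```python
import pandas as pd

# Left: per-movie stats for movies that passed the rating-count cutoff (made up).
stats = pd.DataFrame(
    {"rating|size": [150, 120], "rating|mean": [3.9, 4.1]},
    index=pd.Index(["Movie A", "Movie B"], name="title"),
)

# Right: similarity scores, which may cover a different set of titles (made up).
similarity = pd.DataFrame(
    {"similarity": [0.42, 0.95]},
    index=pd.Index(["Movie A", "Movie C"], name="title"),
)

# Default how="left": keep the rows of `stats`, look up similarity by title.
joined = stats.join(similarity)
print(joined)
```

Movie C is dropped because it is not in the left frame, and Movie B gets NaN because it has no similarity score.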
In [23]:
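# copy() so that renaming columns does not touch (or warn about) the movieStats slice
mappedColumnsMoviestat = movieStats[popularMovies].copy()
# flatten the ('rating', 'size') style MultiIndex columns into 'rating|size'
mappedColumnsMoviestat.columns = [f'{i}|{j}' if j != '' else f'{i}' for i, j in mappedColumnsMoviestat.columns]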
# COLUMNS: title, rating|size, rating|mean
# mappedColumnsMoviestat.head()


# to_frame names the column explicitly, whether or not the Series already has a name
similarityScoreDF = movieSimilarityScores.to_frame(name='similarity')
# COLUMNS: title, similarity
# similarityScoreDF.head()


mappedColumnsMoviestatDF = mappedColumnsMoviestat.join(similarityScoreDF)
mappedColumnsMoviestatDF.head()
Out [23]:
rating|size rating|mean similarity
title
101 Dalmatians (1996) 109 2.908257 0.232118
12 Angry Men (1957) 125 4.344000 0.334943
2001: A Space Odyssey (1968) 259 3.969112 -0.069060
Absolute Power (1997) 127 3.370079 0.318580
Abyss, The (1989) 151 3.589404 0.329058

Now sort these new results by similarity score. That's more like it!

In [24]:
mappedColumnsMoviestatDF.sort_values(['similarity'], ascending=False)[:15]
Out [24]:
rating|size rating|mean similarity
title
Toy Story (1995) 452 3.878319 1.000000
Craft, The (1996) 104 3.115385 0.549100
Down Periscope (1996) 101 2.702970 0.457995
Miracle on 34th Street (1994) 101 3.722772 0.456291
G.I. Jane (1997) 175 3.360000 0.454756
Amistad (1997) 124 3.854839 0.449915
Beauty and the Beast (1991) 202 3.792079 0.442960
Mask, The (1994) 129 3.193798 0.432855
Cinderella (1950) 129 3.581395 0.428372
That Thing You Do! (1996) 176 3.465909 0.427936
Lion King, The (1994) 220 3.781818 0.426778
Aladdin (1992) 219 3.812785 0.411731
Great Escape, The (1963) 124 4.104839 0.401238
African Queen, The (1951) 152 4.184211 0.397874
Dumbo (1941) 123 3.495935 0.387716

Ideally we'd also filter out the movie we started from, since the movie of choice is always perfectly correlated with itself. Otherwise these results aren't bad.
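
One way to filter out the starting movie is to drop its row before sorting. The sketch below rebuilds a three-row stand-in for the merged frame from the output above; the `results` name is an assumption for illustration.

```python
import pandas as pd

MOVIE_OF_CHOICE = "Toy Story (1995)"

# Stand-in for the merged frame, using three rows from the output above.
results = pd.DataFrame(
    {
        "rating|size": [452, 104, 101],
        "rating|mean": [3.878319, 3.115385, 2.702970],
        "similarity": [1.000000, 0.549100, 0.457995],
    },
    index=pd.Index(
        ["Toy Story (1995)", "Craft, The (1996)", "Down Periscope (1996)"],
        name="title",
    ),
)

# errors="ignore" avoids a KeyError if the title was already filtered out.
similar = results.drop(index=MOVIE_OF_CHOICE, errors="ignore")
top = similar.sort_values("similarity", ascending=False)
print(top)
```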

Review

Above, similar movies are found using:

  • the correlation between users' ratings of each movie and their ratings of the movie of choice
  • a minimum "cutoff" number of ratings per movie (100)
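
The whole approach can be sketched as one function. This is a condensed illustration on made-up data, not the notebook's exact code: the `similar_movies` name, the `rating_count` / `rating_mean` column names, and the inner join are all choices made for this example.

```python
import pandas as pd


def similar_movies(ratings_long, movie, min_ratings):
    """Rank movies by rating correlation with `movie`.

    `ratings_long` needs the columns user_id, title and rating.
    Movies with fewer than `min_ratings` ratings are excluded.
    """
    by_user = ratings_long.pivot_table(index="user_id", columns="title", values="rating")
    similarity = by_user.corrwith(by_user[movie]).dropna().rename("similarity")
    stats = ratings_long.groupby("title").agg(
        rating_count=("rating", "size"),
        rating_mean=("rating", "mean"),
    )
    popular = stats[stats["rating_count"] >= min_ratings]
    return (
        popular.join(similarity, how="inner")
        .drop(index=movie, errors="ignore")  # a movie always matches itself
        .sort_values("similarity", ascending=False)
    )


# Made-up ratings: four users, four movies; Movie D has only two ratings.
rows = [
    (1, "Movie A", 5), (2, "Movie A", 4), (3, "Movie A", 1), (4, "Movie A", 2),
    (1, "Movie B", 4), (2, "Movie B", 5), (3, "Movie B", 2), (4, "Movie B", 1),
    (1, "Movie C", 2), (2, "Movie C", 1), (3, "Movie C", 4), (4, "Movie C", 5),
    (1, "Movie D", 5), (2, "Movie D", 3),
]
toy = pd.DataFrame(rows, columns=["user_id", "title", "rating"])

result = similar_movies(toy, "Movie A", min_ratings=3)
print(result)
```

Movie D correlates perfectly with Movie A on its two shared ratings, but the rating-count cutoff removes it, which is the cleanup step described above.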