Using KNN to Find Similar Movies

As an example, let's look at some movie data:

  • try to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity

Scipy Spacial at the core

spatial.distance.cosine(arr1,arr2) is the function being used here to compare the "distance" between two flat arrays. This cosine function is used in the ComputeDistance function below.
In the case of this example, the two flat arrays are 0s and 1s that represent somewhat of a "one-hot-encoding" of each of two movie's genres, where 1 represents the presence of a genre.

In [1]:
import pandas as pd
import numpy as np
from scipy import spatial
import operator
In [2]:
ratingsCols = ['user_id', 'movie_id', 'rating']
ratingsDataFilePath = 'ml-100k/u.data'
ratingsData = pd.read_csv(ratingsDataFilePath, sep='\t', names=ratingsCols, usecols=range(3))
ratingsData.head()
Out [2]:
user_id movie_id rating
0 0 50 5
1 0 172 5
2 0 133 1
3 196 242 3
4 186 302 3

Group + Aggregate

- group everything by movie ID

  • compute the total number of ratings (each movie's popularity)
  • compute the the average rating for every movie
In [3]:
movieProperties = ratingsData.groupby('movie_id').agg({'rating': [np.size, 'mean']})
movieProperties.head()
Out [3]:
rating
size mean
movie_id
1 452 3.878319
2 131 3.206107
3 90 3.033333
4 209 3.550239
5 86 3.302326

Normalize Ratings: LAMBDA

The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the normalized number of ratings:

  • a value of 0 = nobody rated it
  • a value of 1 = the most popular movie
In [4]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNumRatings.head()
print("""
----------------------------
movieNumRatings
----------------------------
""")
print(movieNumRatings)

# normalize the number-of-ratings, the number of people who rated, between 0-1
def normalizeByX(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# labmda-version
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))

# movieNormalizedNumRatings = movieNumRatings.apply(normalizeByX)
print("""
----------------------------
movieNormalizedNumRatings
----------------------------
""")
movieNormalizedNumRatings.head()
----------------------------
movieNumRatings
----------------------------

          size
movie_id      
1          452
2          131
3           90
4          209
5           86
...        ...
1678         1
1679         1
1680         1
1681         1
1682         1

[1682 rows x 1 columns]

----------------------------
movieNormalizedNumRatings
----------------------------

Out [4]:
size
movie_id
1 0.773585
2 0.222985
3 0.152659
4 0.356775
5 0.145798

Re-Organize & Include Genres: LOOP

- the u.item file has 19 fields:

  • each corresponding to a specific genre
  • a value of '0' means it is not in that genre, and '1' means it is in that genre
  • A movie may have more than one genre associated with it

Next, create a Python dictionary called movieDict, where each entry will contain:

  • the movie name
  • a list of genre values
  • the normalized popularity score
  • the average rating for each movie
In [5]:
movieDict = {}
with open(r'ml-100k/u.item', encoding="ISO-8859-1") as f:
    for line in f:
        fieldsList = line.rstrip('\n').split('|')
        movieID = int(fieldsList[0])
        name = fieldsList[1]
        genres = fieldsList[5:25]
        genres = map(int, genres)
        listOfGenres = np.array(list(genres))
        movieDict[movieID] = (name, listOfGenres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))

For example, here's the record we end up with for movie ID 1, "Toy Story":

In [6]:
print(movieDict[1])
('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)

Compute Rating "distance" Between Two movies: FN

Here, a function that computes the "distance" between two movies based on how similar their genres are, and how similar their popularity is

In [7]:
# takes 2 movie IDs
def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    
    # meat & potatoes here
    genreDistance = spatial.distance.cosine(genresA, genresB)

    # popularity = rating
    popularityA = a[2]
    popularityB = b[2]
    
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance

# TESTING here with 2 movies:
print(f'the distance between {movieDict[2][0]} and {movieDict[4][0]} is {ComputeDistance(movieDict[2], movieDict[4])}')
print(movieDict[2])
print(movieDict[4])
the distance between GoldenEye (1995) and Get Shorty (1995) is 0.8004574042309892
('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076)
('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.3567753001715266, 3.550239234449761)

NOTE: the higher the distance, the less similar the movies are. Get Shorty & GoldenEye aren't that similar.

Get "nearest" neighbors: KNN + FN

Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. When the sort those by distance, and print out the K nearest neighbors:

In [13]:
MOVIE_INDEX_TO_FIND = 1;
KTen = 10
avgRating = 0

def getNeighbors(movieID, kCount):
    print(f'getting {kCount} KNN for {movieDict[movieID][0]}')
    # store KNN distances between current movie & other movies
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(kCount):
        neighbors.append(distances[x][0])
    return neighbors
    
neighborsK10 = getNeighbors(MOVIE_INDEX_TO_FIND, KTen)
print("""
---
K10 results:
---
""")
for neighbor in neighborsK10:
    thisMovie = movieDict[neighbor]
    avgRating += thisMovie[3]
    movieTitle = thisMovie[0]
    movieRating = thisMovie[3]
    print (movieTitle + ":\n\t\033[1m" + str(movieRating) + "\033[0m")
    
avgRating /= KTen
getting 10 KNN for Toy Story (1995)

---
K10 results:
---

Liar Liar (1997):
	3.156701030927835
Aladdin (1992):
	3.8127853881278537
Willy Wonka and the Chocolate Factory (1971):
	3.6319018404907975
Monty Python and the Holy Grail (1974):
	4.0664556962025316
Full Monty, The (1997):
	3.926984126984127
George of the Jungle (1997):
	2.685185185185185
Beavis and Butt-head Do America (1996):
	2.7884615384615383
Birdcage, The (1996):
	3.4436860068259385
Home Alone (1990):
	3.0875912408759123
Aladdin and the King of Thieves (1996):
	2.8461538461538463

While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:

In [9]:
avgRating
Out [9]:
3.3445905900235564

How does this compare to Toy Story's actual average rating?

In [10]:
movieDict[1]
Out [10]:
('Toy Story (1995)',
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 0.7735849056603774,
 3.8783185840707963)
  • Arbitrary "K": K being 10 is arbitrary. Changing K ?may? impact the output
  • Arbitrary distance metric the cosine method is just one way to calculate a "distance"
Page Tags:
python
data-science
jupyter
learning
numpy