Using KNN to Find Similar Movies
- Scipy Spacial at the core
- Dependencies
- Load + Preview the Data
- Group + Aggregate
- Normalize Ratings: LAMBDA
- Re-Organize & Include Genres: LOOP
- Compute Rating "distance" Between Two movies: FN
- Get "nearest" neighbors: KNN + FN
- Notes# KNN (K-Nearest-Neighbors) KNN is a simple concept:
- define some "distance metric" between the items in your dataset
- find the "K" closest items, "nearest neighbors", to a point-of-interest
- use those items to predict some property of a test item, by having them somehow "vote" on the "test" item
As an example, let's look at some movie data:
- try to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity
Scipy Spacial at the core
spatial.distance.cosine(arr1,arr2)
is the function being used here to compare the "distance" between two flat arrays. This cosine
function is used in the ComputeDistance
function below.In the case of this example, the two flat arrays are
0
s and 1
s that represent somewhat of a "one-hot-encoding" of each of two movie's genres, where 1
represents the presence of a genre.In [1]:
import pandas as pd
import numpy as np
from scipy import spatial
import operator
In [2]:
ratingsCols = ['user_id', 'movie_id', 'rating']
ratingsDataFilePath = 'ml-100k/u.data'
ratingsData = pd.read_csv(ratingsDataFilePath, sep='\t', names=ratingsCols, usecols=range(3))
ratingsData.head()
Out [2]:
Group + Aggregate
- group everything by movie ID- compute the total number of ratings (each movie's popularity)
- compute the the average rating for every movie
In [3]:
movieProperties = ratingsData.groupby('movie_id').agg({'rating': [np.size, 'mean']})
movieProperties.head()
Out [3]:
Normalize Ratings: LAMBDA
The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the normalized number of ratings:- a value of 0 = nobody rated it
- a value of 1 = the most popular movie
In [4]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNumRatings.head()
print("""
----------------------------
movieNumRatings
----------------------------
""")
print(movieNumRatings)
# normalize the number-of-ratings, the number of people who rated, between 0-1
def normalizeByX(x):
return (x - np.min(x)) / (np.max(x) - np.min(x))
# labmda-version
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
# movieNormalizedNumRatings = movieNumRatings.apply(normalizeByX)
print("""
----------------------------
movieNormalizedNumRatings
----------------------------
""")
movieNormalizedNumRatings.head()
Out [4]:
Re-Organize & Include Genres: LOOP
- theu.item
file has 19 fields:
- each corresponding to a specific genre
- a value of '0' means it is not in that genre, and '1' means it is in that genre
- A movie may have more than one genre associated with it
Next, create a Python dictionary called movieDict
, where each entry will contain:
- the movie name
- a list of genre values
- the normalized popularity score
- the average rating for each movie
In [5]:
movieDict = {}
with open(r'ml-100k/u.item', encoding="ISO-8859-1") as f:
for line in f:
fieldsList = line.rstrip('\n').split('|')
movieID = int(fieldsList[0])
name = fieldsList[1]
genres = fieldsList[5:25]
genres = map(int, genres)
listOfGenres = np.array(list(genres))
movieDict[movieID] = (name, listOfGenres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))
For example, here's the record we end up with for movie ID 1, "Toy Story":
In [6]:
print(movieDict[1])
Compute Rating "distance" Between Two movies: FN
Here, a function that computes the "distance" between two movies based on how similar their genres are, and how similar their popularity isIn [7]:
# takes 2 movie IDs
def ComputeDistance(a, b):
genresA = a[1]
genresB = b[1]
# meat & potatoes here
genreDistance = spatial.distance.cosine(genresA, genresB)
# popularity = rating
popularityA = a[2]
popularityB = b[2]
popularityDistance = abs(popularityA - popularityB)
return genreDistance + popularityDistance
# TESTING here with 2 movies:
print(f'the distance between {movieDict[2][0]} and {movieDict[4][0]} is {ComputeDistance(movieDict[2], movieDict[4])}')
print(movieDict[2])
print(movieDict[4])
NOTE: the higher the distance, the less similar the movies are. Get Shorty & GoldenEye aren't that similar.
Get "nearest" neighbors: KNN + FN
Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. When the sort those by distance, and print out the K nearest neighbors:In [13]:
MOVIE_INDEX_TO_FIND = 1;
KTen = 10
avgRating = 0
def getNeighbors(movieID, kCount):
print(f'getting {kCount} KNN for {movieDict[movieID][0]}')
# store KNN distances between current movie & other movies
distances = []
for movie in movieDict:
if (movie != movieID):
dist = ComputeDistance(movieDict[movieID], movieDict[movie])
distances.append((movie, dist))
distances.sort(key=operator.itemgetter(1))
neighbors = []
for x in range(kCount):
neighbors.append(distances[x][0])
return neighbors
neighborsK10 = getNeighbors(MOVIE_INDEX_TO_FIND, KTen)
print("""
---
K10 results:
---
""")
for neighbor in neighborsK10:
thisMovie = movieDict[neighbor]
avgRating += thisMovie[3]
movieTitle = thisMovie[0]
movieRating = thisMovie[3]
print (movieTitle + ":\n\t\033[1m" + str(movieRating) + "\033[0m")
avgRating /= KTen
While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:
In [9]:
avgRating
Out [9]:
How does this compare to Toy Story's actual average rating?
In [10]:
movieDict[1]
Out [10]:
- Arbitrary "K": K being
10
is arbitrary. Changing K ?may? impact the output - Arbitrary distance metric the
cosine
method is just one way to calculate a "distance"
Page Tags:
python
data-science
jupyter
learning
numpy