Scipy Spacial at the core
Dependencies
Load + Preview the Data
Group + Aggregate
Normalize Ratings: LAMBDA
Re-Organize & Include Genres: LOOP
Compute Rating "distance" Between Two movies: FN
Get "nearest" neighbors: KNN + FN
Notes# KNN (K-Nearest-Neighbors) KNN is a simple concept:
define some "distance metric" between the items in your dataset
find the "K" closest items, "nearest neighbors", to a point-of-interest
use those items to predict some property of a test item, by having them somehow "vote" on the "test" item

As an example, let's look at some movie data:

try to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity

Scipy Spacial at the core

spatial.distance.cosine(arr1,arr2) is the function being used here to compare the "distance" between two flat arrays. This cosine function is used in the ComputeDistance function below.
In the case of this example, the two flat arrays are 0s and 1s that represent somewhat of a "one-hot-encoding" of each of two movie's genres, where 1 represents the presence of a genre.

Dependencies

In [1]:

import pandas as pd
import numpy as np
from scipy import spatial
import operator

Load + Preview the Data

In [2]:

ratingsCols = ['user_id', 'movie_id', 'rating']
ratingsDataFilePath = 'ml-100k/u.data'
ratingsData = pd.read_csv(ratingsDataFilePath, sep='\t', names=ratingsCols, usecols=range(3))
ratingsData.head()

Out [2]:

	user_id	movie_id	rating
0	0	50	5
1	0	172	5
2	0	133	1
3	196	242	3
4	186	302	3

Group + Aggregate

- group everything by movie ID

compute the total number of ratings (each movie's popularity)
compute the the average rating for every movie

In [3]:

movieProperties = ratingsData.groupby('movie_id').agg({'rating': [np.size, 'mean']})
movieProperties.head()

Out [3]:

	rating
	size	mean
movie_id
1	452	3.878319
2	131	3.206107
3	90	3.033333
4	209	3.550239
5	86	3.302326

Normalize Ratings: LAMBDA

The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the normalized number of ratings:

a value of 0 = nobody rated it
a value of 1 = the most popular movie

In [4]:

movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNumRatings.head()
print("""
----------------------------
movieNumRatings
----------------------------
""")
print(movieNumRatings)

# normalize the number-of-ratings, the number of people who rated, between 0-1
def normalizeByX(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# labmda-version
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))

# movieNormalizedNumRatings = movieNumRatings.apply(normalizeByX)
print("""
----------------------------
movieNormalizedNumRatings
----------------------------
""")
movieNormalizedNumRatings.head()

----------------------------
movieNumRatings
----------------------------

          size
movie_id      
1          452
2          131
3           90
4          209
5           86
...        ...
1678         1
1679         1
1680         1
1681         1
1682         1

[1682 rows x 1 columns]

----------------------------
movieNormalizedNumRatings
----------------------------

Out [4]:

	size
movie_id
1	0.773585
2	0.222985
3	0.152659
4	0.356775
5	0.145798

Re-Organize & Include Genres: LOOP

- the u.item file has 19 fields:

each corresponding to a specific genre
a value of '0' means it is not in that genre, and '1' means it is in that genre
A movie may have more than one genre associated with it

Next, create a Python dictionary called movieDict, where each entry will contain:

the movie name
a list of genre values
the normalized popularity score
the average rating for each movie

In [5]:

movieDict = {}
with open(r'ml-100k/u.item', encoding="ISO-8859-1") as f:
    for line in f:
        fieldsList = line.rstrip('\n').split('|')
        movieID = int(fieldsList[0])
        name = fieldsList[1]
        genres = fieldsList[5:25]
        genres = map(int, genres)
        listOfGenres = np.array(list(genres))
        movieDict[movieID] = (name, listOfGenres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))

For example, here's the record we end up with for movie ID 1, "Toy Story":

In [6]:

print(movieDict[1])

('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)

Compute Rating "distance" Between Two movies: FN

Here, a function that computes the "distance" between two movies based on how similar their genres are, and how similar their popularity is

In [7]:

# takes 2 movie IDs
def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    
    # meat & potatoes here
    genreDistance = spatial.distance.cosine(genresA, genresB)

    # popularity = rating
    popularityA = a[2]
    popularityB = b[2]
    
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance

# TESTING here with 2 movies:
print(f'the distance between {movieDict[2][0]} and {movieDict[4][0]} is {ComputeDistance(movieDict[2], movieDict[4])}')
print(movieDict[2])
print(movieDict[4])

the distance between GoldenEye (1995) and Get Shorty (1995) is 0.8004574042309892
('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076)
('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.3567753001715266, 3.550239234449761)

NOTE: the higher the distance, the less similar the movies are. Get Shorty & GoldenEye aren't that similar.

Get "nearest" neighbors: KNN + FN

Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. When the sort those by distance, and print out the K nearest neighbors:

In [13]:

MOVIE_INDEX_TO_FIND = 1;
KTen = 10
avgRating = 0

def getNeighbors(movieID, kCount):
    print(f'getting {kCount} KNN for {movieDict[movieID][0]}')
    # store KNN distances between current movie & other movies
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(kCount):
        neighbors.append(distances[x][0])
    return neighbors
    
neighborsK10 = getNeighbors(MOVIE_INDEX_TO_FIND, KTen)
print("""
---
K10 results:
---
""")
for neighbor in neighborsK10:
    thisMovie = movieDict[neighbor]
    avgRating += thisMovie[3]
    movieTitle = thisMovie[0]
    movieRating = thisMovie[3]
    print (movieTitle + ":\n\t\033[1m" + str(movieRating) + "\033[0m")
    
avgRating /= KTen

getting 10 KNN for Toy Story (1995)

---
K10 results:
---

Liar Liar (1997):
	[1m3.156701030927835[0m
Aladdin (1992):
	[1m3.8127853881278537[0m
Willy Wonka and the Chocolate Factory (1971):
	[1m3.6319018404907975[0m
Monty Python and the Holy Grail (1974):
	[1m4.0664556962025316[0m
Full Monty, The (1997):
	[1m3.926984126984127[0m
George of the Jungle (1997):
	[1m2.685185185185185[0m
Beavis and Butt-head Do America (1996):
	[1m2.7884615384615383[0m
Birdcage, The (1996):
	[1m3.4436860068259385[0m
Home Alone (1990):
	[1m3.0875912408759123[0m
Aladdin and the King of Thieves (1996):
	[1m2.8461538461538463[0m

While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:

In [9]:

avgRating

Out [9]:

3.3445905900235564

How does this compare to Toy Story's actual average rating?

In [10]:

movieDict[1]

Out [10]:

('Toy Story (1995)',
 array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 0.7735849056603774,
 3.8783185840707963)

Notes

Arbitrary "K": K being 10 is arbitrary. Changing K ?may? impact the output
Arbitrary distance metric the cosine method is just one way to calculate a "distance"

Page Tags:

python

data-science

jupyter

learning

numpy