Introduction
k-Nearest Neighbors
Pros: High accuracy, insensitive to outliers, no assumptions about data
Cons: Computationally expensive, requires a lot of memory
Works with: Numeric values, nominal values
The first machine-learning algorithm we’ll look at is k-Nearest Neighbors (kNN). It
works like this: we have an existing set of example data, our training set. We have
labels for all of this data—we know what class each piece of the data should fall into.
When we’re given a new piece of data without a label, we compare that new piece of
data to every piece of existing data. We then take the most similar
pieces of data (the nearest neighbors) and look at their labels. We look at the top k
most similar pieces of data from our known dataset; this is where the k comes from. (k
is an integer and it’s usually less than 20.) Lastly, we take a majority vote from the k
most similar pieces of data, and the majority is the new class we assign to the data we
were asked to classify.
Example
Let’s run through a quick example classifying movies into romance or action movies. Someone watched a lot of movies and counted the number of kicks and kisses in each movie. I’ve plotted six movies by the number of kisses and kicks in each movie in figure 2.1. Now, you find a movie you haven’t seen yet and want to know if it’s a romance movie or an action movie. To determine this, we’ll use the kNN algorithm.
We find the movie in question and see how many kicks and kisses it has. It’s plotted as
a large question mark along with a few other movies in figure 2.1. These values are
listed in table 2.1.
We don’t know what type of movie the question mark movie is, but we have a way of figuring that out. First, we calculate the distance to all the other movies.
Now that we have all the distances to our unknown movie, we need to find the k nearest movies by sorting the distances in increasing order. Let’s assume k=3. Then, the three closest movies are He’s Not Really into Dudes, Beautiful Woman, and California Man. The kNN algorithm says to take the majority vote from these three movies to determine the class of the mystery movie. Because all three movies are romances, we forecast that the mystery movie is a romance movie.
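To see the arithmetic, here’s a small sketch in plain NumPy. The exact kick and kiss counts below are illustrative values read off the plot in figure 2.1 (treat them as assumptions), but the three nearest neighbors come out as the same three romance movies:

from numpy import array, sqrt

# Illustrative (kicks, kisses) counts for the six known movies
movies = [
    ('California Man',             array([3, 104]),  'Romance'),
    ("He's Not Really into Dudes", array([2, 100]),  'Romance'),
    ('Beautiful Woman',            array([1, 81]),   'Romance'),
    ('Kevin Longblade',            array([101, 10]), 'Action'),
    ('Robo Slayer 3000',           array([99, 5]),   'Action'),
    ('Amped II',                   array([98, 2]),   'Action'),
]
unknown = array([18, 90])   # the mystery movie's (kicks, kisses)

# Euclidean distance from the mystery movie to every known movie
dists = sorted((sqrt(((pt - unknown) ** 2).sum()), name, label)
               for name, pt, label in movies)
for d, name, label in dists[:3]:   # the k=3 nearest neighbors
    print('%5.1f  %s (%s)' % (d, name, label))

All three nearest neighbors are romances, so the majority vote labels the mystery movie a romance.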
We’ll work through a real machine-learning algorithm in this chapter, and along the way I’ll introduce the Python tools and machine-learning terminology. First, however, we’ll build a simple dataset and make sure we can run the kNN algorithm on it correctly.
""""
from numpy import *
import operator
def createDataSet():
    # Four sample points and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
In this code, we import two modules. The first one is NumPy, which is our scientific computing package. The second module is the operator module, which is used later in the kNN algorithm for sorting.
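Assuming you’ve saved this code in a file named kNN.py (the module name used in the calls later in this chapter), you can check that everything loads correctly from the Python prompt:

>>> import kNN
>>> group, labels = kNN.createDataSet()
>>> labels
['A', 'A', 'B', 'B']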
Putting the kNN classification algorithm into action
Next, we’ll create a function, classify0(), to run the kNN algorithm on one piece of data. I’ll first show the function in pseudocode and then in actual Python, followed by a detailed explanation of what everything in the code does. Remember, the goal of this function is to use the kNN algorithm to classify one piece of data called inX. Pseudocode for this function would look like this:

For every point in our dataset:
    calculate the distance between inX and the current point
    sort the distances in increasing order
    take the k items with the lowest distances to inX
    find the majority class among these items
    return the majority class as our prediction for the class of inX
""""
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Distance calculation: Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    # Voting: tally the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
The function classify0() takes four inputs: the input vector to classify, called inX; our full matrix of training examples, called dataSet; a vector of labels, called labels; and, finally, k, the number of nearest neighbors to use in the voting. The labels vector should have as many elements in it as there are rows in the dataSet matrix. You calculate the distances using the Euclidean distance, where the distance between two vectors, xA and xB, each with two elements, is given by

d = √((xA₀ − xB₀)² + (xA₁ − xB₁)²)

For example, the distance between the points (0,0) and (1,2) is √((1 − 0)² + (2 − 0)²) = √5.
Following the distance calculation, the distances are sorted from least to greatest (this is argsort()’s default). Next, the first k, or lowest, distances are used to vote on the class of inX. The input k should always be a positive integer. Lastly, you take the classCount dictionary and decompose it into a list of tuples, and then sort the tuples by the second item in each tuple using the itemgetter() function from the operator module imported in the second line of the program. This sort is done in reverse so you have largest to smallest. Finally, you return the label of the item occurring most frequently.
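To make these steps concrete, here’s a trace of the intermediate values (a sketch assuming the group matrix from createDataSet() and inX = [0, 0]):

from numpy import array, tile

inX = array([0, 0])
dataSet = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
diffMat = tile(inX, (4, 1)) - dataSet      # inX copied into 4 rows, minus each point
distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
print(distances)            # approximately [1.49, 1.41, 0.0, 0.1]
print(distances.argsort())  # [2 3 1 0]: rows 2 and 3 (both labeled 'B') are closest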
To predict the class, type the following text at the Python prompt:
""""
>>> kNN.classify0([0,0], group, labels, 3)
'B'
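The answer should be 'B': the point [0,0] coincides with one 'B' example and is only 0.1 away from the other, so with k=3 the vote is two 'B' labels against one 'A'. Try changing [0,0] to other values and see how the prediction changes.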