Logistic Regression

Posted on JUNE 15, 2022 by Zulqarnain

Introduction

In logistic regression we have a bunch of data, and with that data we try to build an equation that does classification for us. You'll see the exact math in the next part of the book, but the regression aspect means that we try to find a best-fit set of parameters. Finding that best fit is how we train our classifier, and it's similar to how regression works: we use optimization algorithms to find the best-fit parameters. This search for a best fit is where the name regression comes from. We'll also look at the math behind turning this into a classifier that puts out one of two values.

General approach to logistic regression

  1. Collect: Any method.
  2. Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
  3. Analyze: Any method.
  4. Train: We'll spend most of the time training, where we try to find the optimal coefficients to classify our data.
  5. Test: Classification is quick and easy once the training step is done.
  6. Use: The application needs to take some input data and output structured numeric values. It then applies the regression calculation to this input data, determines which class the input belongs to, and takes some action on the calculated class (a minimal sketch of this step follows the list).
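To make step 6 concrete, here is a minimal sketch of what the use step could look like once training has produced a weight vector. The function name classifyVector, the example numbers, and the 0.5 threshold are illustrative assumptions rather than something taken from the listing later in this post (the threshold comes from the sigmoid squashing its input into the range 0 to 1).

    import numpy as np

    def classifyVector(inX, weights):
        # Hypothetical helper: apply the regression calculation w^T x to one
        # input vector, squash it with the sigmoid, and pick a class.
        prob = 1.0 / (1.0 + np.exp(-np.dot(inX, weights)))
        return 1 if prob > 0.5 else 0

    # Illustrative call: inX[0] is the constant 1.0 that goes with the bias term w0.
    print(classifyVector(np.array([1.0, 0.2, 7.1]), np.array([4.0, 0.5, -0.6])))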

Using optimization to find the best regression coefficients

The input to the sigmoid function described will be \( z \), where \( z \) is given by the following:

\( z = w_0 x_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n \)

In vector notation we can write this as \( z = w^T x \). All that means is that we have two vectors of numbers and we'll multiply each element and add them up to get one number. The vector \( x \) is our input data, and we want to find the best coefficients \( w \), so that this classifier will be as successful as possible. In order to do that, we need to consider some ideas from optimization theory.
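For reference, the sigmoid function mentioned above (and implemented in the code further down) squashes \( z \) into the range 0 to 1:

\( \sigma(z) = \frac{1}{1 + e^{-z}} \)

Anything above 0.5 is treated as class 1 and anything below 0.5 as class 0, which is what turns the regression output into a classifier with two possible values.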

We'll first look at optimization with gradient ascent. We'll then see how we can use this method of optimization to find the best parameters to model our dataset. Next, we'll show how to plot the decision boundary generated with gradient ascent; this will help you visualize how successful gradient ascent is. Finally, you'll learn about stochastic gradient ascent and how to modify it to yield better results.

Gradient ascent

The first optimization algorithm we're going to look at is called gradient ascent. Gradient ascent is based on the idea that if we want to find the maximum point of a function, the best way to move is in the direction of the gradient. We write the gradient with the symbol \( \nabla \), and the gradient of a function \( f(x, y) \) is given by the equation

\( \nabla f(x, y) = \begin{pmatrix} \frac{\partial f(x, y)}{\partial x} \\ \frac{\partial f(x, y)}{\partial y} \end{pmatrix} \)

This is one of the aspects of machine learning that can be confusing. The math isn't difficult; you just need to keep track of what the symbols mean. This gradient means that we'll move in the \( x \) direction by the amount \( \frac{\partial f(x, y)}{\partial x} \) and in the \( y \) direction by the amount \( \frac{\partial f(x, y)}{\partial y} \). The function \( f(x, y) \) needs to be defined and differentiable around the points where it's being evaluated.
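As a quick illustration (this example is ours, not from the original text), take \( f(x, y) = -(x - 2)^2 - (y - 3)^2 \). Its gradient is

\( \nabla f(x, y) = \begin{pmatrix} -2(x - 2) \\ -2(y - 3) \end{pmatrix}, \)

which from any starting point points toward the single maximum at \( (2, 3) \), so repeatedly stepping along the gradient climbs toward that peak.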

The gradient ascent algorithm moves in the direction of the gradient evaluated at each point. Starting with point P0, the gradient is evaluated and the function moves to the next point, P1. The gradient is then reevaluated at P1, and the function moves to P2. This cycle repeats until a stopping condition is met. The gradient operator always ensures that we're moving in the best possible direction.
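How far we move on each step is controlled by a step-size parameter, usually written \( \alpha \) (the code below uses alpha = 0.001). With that parameter, the gradient ascent update for the weights can be written as

\( w := w + \alpha \nabla_w f(w) \)

which is the same form as the weights update inside gradAscent() below.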

Let's put this into action on our logistic regression classifier, using some Python. First, we need a dataset.

Our simple dataset. We're going to attempt to use gradient ascent to find the best weights for a logistic regression classifier on this dataset.

    from numpy import mat, shape, ones, exp

    def loadDataSet():
        dataMat = []; labelMat = []
        fr = open('testSet.txt')
        for line in fr.readlines():
            lineArr = line.strip().split()
            # prepend a constant 1.0 so w0 acts as the bias (the "0th feature")
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
        return dataMat, labelMat

    def sigmoid(inX):
        return 1.0 / (1 + exp(-inX))

    def gradAscent(dataMatIn, classLabels):
        dataMatrix = mat(dataMatIn)              # convert input data to a NumPy matrix
        labelMat = mat(classLabels).transpose()  # row vector of labels -> column vector
        m, n = shape(dataMatrix)
        alpha = 0.001                            # step size
        maxCycles = 500                          # number of iterations
        weights = ones((n, 1))
        for k in range(maxCycles):
            h = sigmoid(dataMatrix * weights)    # column vector of predictions
            error = (labelMat - h)               # difference from the true labels
            weights = weights + alpha * dataMatrix.transpose() * error  # gradient ascent step
        return weights
        
        

The real work is done in the function gradAscent(), which takes two inputs. The first input, dataMatIn, is a 2D NumPy array, where the columns are the different features and the rows are the different training examples. Our example data has two features plus the 0th feature and 100 examples, so it will be a 100x3 matrix. At the start of gradAscent() you take the input arrays and convert them to NumPy matrices. This is the first time in this book where you're using NumPy matrices, and if you're not familiar with matrix math, some calculations can seem strange. NumPy can operate on both 2D arrays and matrices, and the results will be different if you assume the wrong data type. Please see appendix A for an introduction to NumPy matrices. The second input, classLabels, is a 1x100 row vector, and for the matrix math to work, you need it to be a column vector, so you take its transpose and assign that to the variable labelMat. Next, you get the size of the matrix and set some parameters for the gradient ascent algorithm.
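If you want to try this out, a short session along the following lines should work, assuming the functions from the listing above are defined and testSet.txt (the 100-example dataset described above) is in the working directory:

    # Assumes loadDataSet(), sigmoid(), and gradAscent() from the listing above
    # are already defined, and that testSet.txt is in the working directory.
    dataArr, labelMat = loadDataSet()
    weights = gradAscent(dataArr, labelMat)
    print(weights)   # a 3x1 matrix holding the fitted coefficients w0, w1, w2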