Introduction
Naïve Bayes
Pros: Works with a small amount of data, handles multiple classes
Cons: Sensitive to how the input data is prepared
Works with: Nominal values
Naïve Bayes is a subset of Bayesian decision theory, so we need to talk about Bayesian decision theory quickly before we get to naïve Bayes.
We have the data shown in the figure, and we have a friend who has read this book; she found
the statistical parameters of the two classes of data. We have an equation for the
probability of a piece of data belonging to Class 1 (the circles), p1(x, y), and an
equation for the probability of it belonging to Class 2 (the triangles), p2(x, y).
To classify a new measurement with features (x, y),
we use the following rules:
If p1(x, y) > p2(x, y), then the class is 1.
If p2(x, y) > p1(x, y), then the class is 2.
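As a minimal sketch, here is what that rule could look like in Python; p1 and p2 are hypothetical stand-ins for the two probability functions, which we haven't defined yet:

def classify(x, y, p1, p2):
    """Return the class with the higher probability for the point (x, y).

    p1 and p2 are placeholder functions giving the probability that
    (x, y) belongs to Class 1 and Class 2, respectively.
    """
    if p1(x, y) > p2(x, y):
        return 1
    else:
        return 2

For example, classify(1.0, 2.0, p1, p2) returns 1 whenever p1(1.0, 2.0) is larger than p2(1.0, 2.0).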
Put simply, we choose the class with the higher probability. That’s Bayesian decision
theory in a nutshell: choosing the decision with the highest probability. Let’s get back
to the data in the figure. We could classify it with kNN or a decision tree, but if we can
represent the data with just six floating-point numbers, and the code to calculate the
probability is two lines of Python, which would you rather do?
The decision tree wouldn't be very successful, and kNN would require a lot of calculations compared to the simple probability calculation. Given this problem, the best choice is the probability comparison we just discussed.
We're going to have to expand on the p1 and p2 probability measures given here. To calculate p1 and p2, we need to discuss conditional probability. If you feel that you have a good handle on conditional probability, you can skip the next section.
Bayes?
This interpretation of probability that we use belongs to the category called Bayesian probability; it’s popular and it works well. Bayesian probability is named after Thomas Bayes, who was an eighteenth-century theologian. Bayesian probability allows prior knowledge and logic to be applied to uncertain statements. There’s another interpretation called frequency probability, which only draws conclusions from data and doesn’t allow for logic and prior knowledge.
Conditional probability
Let’s spend a few minutes talking about probability and conditional probability. If
you’re comfortable with the p(x,y|c1) symbol, you may want to skip this section.
Let’s assume for a moment that we have a jar containing seven stones. Three of these
stones are gray and four are black, as shown in the figure.
If we stick a hand into this jar
and randomly pull out a stone, what are the chances that the stone will be gray? There
are seven possible stones and three are gray, so the probability is 3/7. What is the
probability of grabbing a black stone? It's 4/7. We write the probability of gray as P(gray); we calculated it by counting the number of gray stones and dividing by the total number of stones. Now, what if the seven stones were split between two buckets, as shown in the figure? If you wanted the probability of drawing a gray stone given that it comes from bucket B, you could probably work that out.
This is known as conditional probability: we're calculating the probability of a gray stone, given that the unknown stone comes from bucket B. We can write this as P(gray|bucketB), and this would be read as "the probability of gray given bucket B." It's not hard to see that P(gray|bucketA) is 2/4 and P(gray|bucketB) is 1/3.
To formalize how to calculate the conditional probability, we can say
P(gray|bucketB) = P(gray and bucketB)/P(bucketB)
Let's check that this makes sense. P(gray and bucketB) is the probability of drawing a stone that is both gray and from bucket B: 1 of the 7 stones, so 1/7. P(bucketB) is 3/7, because bucket B holds 3 of the 7 stones. Plugging these in gives
P(gray|bucketB) = P(gray and bucketB)/P(bucketB) = (1/7) / (3/7) = 1/3
This formal definition may seem like too much work for this simple example, but it will be useful when we have more features. It's also handy if we ever need to manipulate conditional probabilities algebraically.
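As a quick sketch, we can check these numbers in Python. The bucket contents below are an assumption consistent with the counts above: bucket A holds two gray and two black stones, and bucket B holds one gray and two black.

from fractions import Fraction

# Assumed bucket contents, matching the counts in the text
bucket_a = ['gray', 'gray', 'black', 'black']   # 2 gray, 2 black
bucket_b = ['gray', 'black', 'black']           # 1 gray, 2 black
all_stones = bucket_a + bucket_b                # 7 stones in total

# P(gray): gray stones divided by all stones
p_gray = Fraction(all_stones.count('gray'), len(all_stones))       # 3/7

# P(gray and bucketB): stones that are gray and in bucket B, over all stones
p_gray_and_b = Fraction(bucket_b.count('gray'), len(all_stones))   # 1/7

# P(bucketB): stones in bucket B over all stones
p_b = Fraction(len(bucket_b), len(all_stones))                     # 3/7

# P(gray|bucketB) = P(gray and bucketB) / P(bucketB)
p_gray_given_b = p_gray_and_b / p_b                                # 1/3

print(p_gray, p_gray_given_b)                                      # prints 3/7 1/3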
Another useful way to manipulate conditional probabilities is known as Bayes’ rule. Bayes’ rule tells us how to swap the symbols in a conditional probability statement. If we have P(x|c) but want to have P(c|x), we can find it with the following:
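P(c|x) = P(x|c) P(c) / P(x)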