Classifying with probability theory: naïve Bayes

Posted on JUNE 15, 2022 by Zulqarnain

Introduction

Naïve Bayes
Pros: Works with a small amount of data, handles multiple classes
Cons: Sensitive to how the input data is prepared
Works with: Nominal values

Naïve Bayes is a subset of Bayesian decision theory, so we need to talk about Bayesian decision theory quickly before we get to naïve Bayes.

We have the data shown in the figure: points belonging to two classes, circles (class 1) and triangles (class 2). We also have a friend who read this book; she worked out the statistical parameters of the two classes of data. So we have an equation for the probability that a piece of data with features (x, y) belongs to class 1 (the circles), p1(x, y), and an equation for the probability that it belongs to class 2 (the triangles), p2(x, y). To classify a new measurement with features (x, y),
we use the following rules:
If p1(x, y) > p2(x, y), then the class is 1.
If p2(x, y) > p1(x, y), then the class is 2.
Put simply, we choose the class with the higher probability. That’s Bayesian decision theory in a nutshell: choosing the decision with the highest probability. Let’s get back to the data in the figure. We have at least three options for classifying a new point: use kNN and do a distance calculation against every training point, use a decision tree and make splits along the x- and y-axes, or compute the probability of the point belonging to each class and compare them. If you can represent the data with six floating-point numbers, and the code to calculate the probability is two lines of Python, which would you rather do?

The decision tree wouldn’t be very successful here, and kNN would require a lot of calculation compared to the simple probability comparison. Given this problem, the best choice is the probability comparison we just discussed.
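In code, that comparison really is about two lines. Here’s a minimal sketch; the p1 and p2 functions below are made-up placeholders standing in for whatever class-probability equations our friend fitted:

def classify(x, y, p1, p2):
    # Pick the class whose probability function scores the point higher.
    return 1 if p1(x, y) > p2(x, y) else 2

# Hypothetical stand-ins for the fitted class-probability equations.
p1 = lambda x, y: 0.7  # placeholder: probability that (x, y) belongs to class 1
p2 = lambda x, y: 0.3  # placeholder: probability that (x, y) belongs to class 2

print(classify(1.0, 2.0, p1, p2))  # prints 1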

We’re going to have to expand on the p1 and p2 probability measures I provided here. In order to calculate p1 and p2, we need to discuss conditional probability. If you feel that you have a good handle on conditional probability, you can skip the next section.

Bayes?

This interpretation of probability that we use belongs to the category called Bayesian probability; it’s popular and it works well. Bayesian probability is named after Thomas Bayes, who was an eighteenth-century theologian. Bayesian probability allows prior knowledge and logic to be applied to uncertain statements. There’s another interpretation called frequency probability, which only draws conclusions from data and doesn’t allow for logic and prior knowledge.

Conditional probability

Let’s spend a few minutes talking about probability and conditional probability. If you’re comfortable with notation like p(x, y|c1), you may want to skip this section. Let’s assume for a moment that we have a jar containing seven stones: three gray and four black, as shown in the figure.
If we stick a hand into this jar and randomly pull out a stone, what are the chances that the stone will be gray? There are seven possible stones and three are gray, so the probability is 3/7. What is the probability of grabbing a black stone? It’s 4/7. We write the probability of gray as P(gray), and we calculated it by counting the number of gray stones and dividing by the total number of stones.
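To make the counting concrete, here’s a quick Python sketch of the same calculation; the list simply encodes the seven stones:

# The seven stones in the jar: three gray, four black.
stones = ["gray"] * 3 + ["black"] * 4

p_gray = stones.count("gray") / len(stones)    # 3/7
p_black = stones.count("black") / len(stones)  # 4/7

print(p_gray, p_black)  # 0.428... 0.571...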

What if the seven stones were in two buckets, as shown in the figure: bucket A holding two gray and two black stones, and bucket B holding one gray and two black? If you want to calculate P(gray) or P(black), does knowing the bucket change the answer? If you wanted to calculate the probability of drawing a gray stone from bucket B, you could probably figure out how to do that.
This is known as conditional probability. We’re calculating the probability of a gray stone, given that the unknown stone comes from bucket B. We can write this as P(gray|bucketB), and it’s read as “the probability of gray given bucket B.” It’s not hard to see that P(gray|bucketA) is 2/4 and P(gray|bucketB) is 1/3.
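The bucket version is just as short to sketch. Here the stones are encoded as (color, bucket) pairs matching the counts above:

# Seven stones as (color, bucket) pairs: A holds 2 gray + 2 black, B holds 1 gray + 2 black.
stones = [
    ("gray", "A"), ("gray", "A"), ("black", "A"), ("black", "A"),
    ("gray", "B"), ("black", "B"), ("black", "B"),
]

def p_color_given_bucket(color, bucket):
    # P(color | bucket): count matching stones within that bucket only.
    in_bucket = [c for c, b in stones if b == bucket]
    return in_bucket.count(color) / len(in_bucket)

print(p_color_given_bucket("gray", "A"))  # 2/4 = 0.5
print(p_color_given_bucket("gray", "B"))  # 1/3 = 0.333...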

To formalize how to calculate the conditional probability, we can say

P(gray|bucketB) = P(gray and bucketB)/P(bucketB)

Let’s see if that makes sense. P(gray and bucketB) = 1/7; this was calculated by taking the number of gray stones in bucket B and dividing by the total number of stones. And P(bucketB) is 3/7, because three of the seven stones are in bucket B. Finally,
P(gray|bucketB) = P(gray and bucketB)/P(bucketB) = (1/7) / (3/7) = 1/3
This formal definition may seem like too much work for such a simple example, but it will pay off when we have more features, and it’s also useful if we ever need to manipulate the conditional probability algebraically.
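We can check the formula against the direct count with the same encoding of the stones:

# Same seven stones as before, encoded as (color, bucket) pairs.
stones = [
    ("gray", "A"), ("gray", "A"), ("black", "A"), ("black", "A"),
    ("gray", "B"), ("black", "B"), ("black", "B"),
]

n = len(stones)
p_gray_and_B = sum(1 for c, b in stones if c == "gray" and b == "B") / n  # 1/7
p_B = sum(1 for _, b in stones if b == "B") / n                           # 3/7

print(p_gray_and_B / p_B)  # (1/7) / (3/7) = 0.333..., matching the direct count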
Another useful way to manipulate conditional probabilities is known as Bayes’ rule. Bayes’ rule tells us how to swap the symbols in a conditional probability statement. If we have P(x|c) but want to have P(c|x), we can find it with the following:

P(c|x) = P(x|c) P(c) / P(x)
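To see the swap in action with numbers we already know, here’s the stone example run in the other direction, recovering P(bucketB|gray) from P(gray|bucketB):

# Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x), with the stone numbers plugged in.
p_gray_given_B = 1 / 3  # P(gray | bucketB), from the direct count above
p_B = 3 / 7             # P(bucketB): three of the seven stones sit in bucket B
p_gray = 3 / 7          # P(gray): three of the seven stones are gray

p_B_given_gray = p_gray_given_B * p_B / p_gray
print(p_B_given_gray)   # 0.333...: and indeed one of the three gray stones is in bucket B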

Next post: Classifying with conditional probabilities.