Naive Bayes Algorithm Explained with an Interesting Example: Step-by-Step Guide

 The Naive Bayes algorithm is a simple yet powerful machine learning technique used for classification problems. It is based on Bayes' Theorem, leveraging probabilities to predict class membership. Despite its simplicity, Naive Bayes is widely used in spam detection, sentiment analysis, and medical diagnosis, among other fields.


What is the Naive Bayes Algorithm?

Naive Bayes is a probabilistic classifier that assumes all features are conditionally independent, given the class label. While this "naive" assumption may not hold in all scenarios, the algorithm still performs remarkably well in practice.

Key Features of Naive Bayes:

  • Fast and scalable for large datasets.
  • Works well with categorical and text data.
  • Handles multi-class classification efficiently.

Mathematics of Naive Bayes

Naive Bayes is based on Bayes’ Theorem:

P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}

Where:

  • P(C|X): Posterior probability of class C given feature X.
  • P(X|C): Likelihood of feature X given class C.
  • P(C): Prior probability of class C.
  • P(X): Evidence (total probability of X).

The naive assumption simplifies the likelihood calculation:

P(X|C) = P(x_1|C) \cdot P(x_2|C) \cdot \ldots \cdot P(x_n|C)

Thus, the posterior probability becomes:

P(C|X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i|C)
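
To make this proportionality concrete, here is a minimal sketch (with made-up numbers, not tied to any dataset) that scores a single class by multiplying its prior by the per-feature likelihoods:

```python
import math

def naive_bayes_score(prior, likelihoods):
    """Unnormalized posterior: P(C) * P(x_1|C) * ... * P(x_n|C)."""
    return prior * math.prod(likelihoods)

# Illustrative numbers only: prior P(C) = 0.4 and three feature likelihoods
score = naive_bayes_score(prior=0.4, likelihoods=[0.9, 0.5, 0.7])
print(score)  # 0.4 * 0.9 * 0.5 * 0.7 = 0.126
```

The class with the largest such score is the predicted class; normalizing the scores over all classes recovers actual probabilities.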


Step-by-Step Example: Classifying Emails as Spam or Not Spam

Let’s walk through an example to classify emails as spam or not spam using the Naive Bayes algorithm.

Dataset

We have the following training data, where the features are words in the email:

| Email   | Contains "Free"? | Contains "Win"? | Contains "Offer"? | Spam? |
|---------|------------------|-----------------|-------------------|-------|
| Email 1 | Yes              | Yes             | Yes               | Yes   |
| Email 2 | Yes              | No              | Yes               | Yes   |
| Email 3 | No               | Yes              | No                | No    |
| Email 4 | Yes              | No              | No                | No    |
| Email 5 | No               | No              | Yes               | No    |

Goal: Predict whether an email containing the words "Free" and "Offer" (but not "Win") is spam.


Step 1: Calculate Prior Probabilities

The prior probabilities represent the proportion of each class in the dataset:

P(\text{Spam}) = \frac{\text{Number of Spam Emails}}{\text{Total Emails}} = \frac{2}{5} = 0.4

P(\text{Not Spam}) = \frac{\text{Number of Not Spam Emails}}{\text{Total Emails}} = \frac{3}{5} = 0.6
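
As a quick sanity check, the same priors can be computed directly from the label list (a small sketch, assuming the five labels from the table above):

```python
from collections import Counter

labels = ["Spam", "Spam", "Not Spam", "Not Spam", "Not Spam"]
priors = {cls: count / len(labels) for cls, count in Counter(labels).items()}
print(priors)  # {'Spam': 0.4, 'Not Spam': 0.6}
```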


Step 2: Calculate Likelihoods

The likelihood represents the probability of each feature given the class. For example:

P(\text{Free}|\text{Spam}) = \frac{\text{Number of Spam Emails with "Free"}}{\text{Total Spam Emails}} = \frac{2}{2} = 1.0

Similarly:

P(\text{Free}|\text{Not Spam}) = \frac{\text{Number of Not Spam Emails with "Free"}}{\text{Total Not Spam Emails}} = \frac{1}{3} \approx 0.33

Repeat this process for each word:

| Feature | P(Feature \| Spam) | P(Feature \| Not Spam) |
|---------|--------------------|------------------------|
| Free    | 1.0                | 0.33                   |
| Win     | 0.5                | 0.33                   |
| Offer   | 1.0                | 0.33                   |
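
These likelihoods can also be tallied programmatically. The sketch below encodes the five training emails as binary feature dictionaries and reproduces the table (values rounded to two decimals):

```python
# Each row mirrors the training table: 1 = word present, 0 = word absent
emails = [
    {"Free": 1, "Win": 1, "Offer": 1, "label": "Spam"},
    {"Free": 1, "Win": 0, "Offer": 1, "label": "Spam"},
    {"Free": 0, "Win": 1, "Offer": 0, "label": "Not Spam"},
    {"Free": 1, "Win": 0, "Offer": 0, "label": "Not Spam"},
    {"Free": 0, "Win": 0, "Offer": 1, "label": "Not Spam"},
]

for word in ["Free", "Win", "Offer"]:
    for label in ["Spam", "Not Spam"]:
        in_class = [e for e in emails if e["label"] == label]
        likelihood = sum(e[word] for e in in_class) / len(in_class)
        print(f"P({word} | {label}) = {likelihood:.2f}")
```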


Step 3: Apply Bayes’ Theorem

We are predicting for an email with the following features:

  • Contains "Free": Yes
  • Contains "Win": No
  • Contains "Offer": Yes

Using the Naive Bayes formula:

P(\text{Spam}|X) \propto P(\text{Spam}) \cdot P(\text{Free}|\text{Spam}) \cdot P(\text{No Win}|\text{Spam}) \cdot P(\text{Offer}|\text{Spam})

Because the email does not contain "Win", the corresponding factor is the complement P(\text{No Win}|\text{Class}) = 1 - P(\text{Win}|\text{Class}).

Substitute the probabilities:

P(\text{Spam}|X) \propto 0.4 \cdot 1.0 \cdot 0.5 \cdot 1.0 = 0.2

For P(\text{Not Spam}|X):

P(\text{Not Spam}|X) \propto P(\text{Not Spam}) \cdot P(\text{Free}|\text{Not Spam}) \cdot P(\text{No Win}|\text{Not Spam}) \cdot P(\text{Offer}|\text{Not Spam})

Substitute the probabilities:

P(\text{Not Spam}|X) \propto 0.6 \cdot 0.33 \cdot 0.66 \cdot 0.33 \approx 0.043
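
A couple of lines of arithmetic confirm the two unnormalized scores (using the rounded likelihoods from Step 2):

```python
# P(Spam) * P(Free|Spam) * P(No Win|Spam) * P(Offer|Spam)
spam_score = 0.4 * 1.0 * 0.5 * 1.0         # = 0.2
# P(Not Spam) * P(Free|Not Spam) * P(No Win|Not Spam) * P(Offer|Not Spam)
not_spam_score = 0.6 * 0.33 * 0.66 * 0.33  # ≈ 0.043
print(spam_score, round(not_spam_score, 3))
```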


Step 4: Normalize Probabilities

To make the probabilities sum to 1:

P(\text{Spam}|X) = \frac{0.2}{0.2 + 0.043} \approx 0.82

P(\text{Not Spam}|X) = \frac{0.043}{0.2 + 0.043} \approx 0.18

The email is classified as Spam because P(\text{Spam}|X) > P(\text{Not Spam}|X).
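
The same normalization and decision can be written in a few lines, continuing from the scores computed in Step 3:

```python
spam_score, not_spam_score = 0.2, 0.043
total = spam_score + not_spam_score

p_spam = spam_score / total          # ≈ 0.82
p_not_spam = not_spam_score / total  # ≈ 0.18

prediction = "Spam" if p_spam > p_not_spam else "Not Spam"
print(prediction, round(p_spam, 2), round(p_not_spam, 2))  # Spam 0.82 0.18
```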


Python Implementation

Here’s how to implement this example in Python:


```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Training data
emails = [
    "Free Win Offer",  # Spam
    "Free Offer",      # Spam
    "Win",             # Not Spam
    "Free",            # Not Spam
    "Offer",           # Not Spam
]
labels = ["Spam", "Spam", "Not Spam", "Not Spam", "Not Spam"]

# Convert text to binary word-presence features
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)

# Train a Bernoulli Naive Bayes model (it models each word as present/absent,
# matching the hand calculation above)
model = BernoulliNB()
model.fit(X, labels)

# Predict for a new email
new_email = ["Free Offer"]
new_email_features = vectorizer.transform(new_email)
prediction = model.predict(new_email_features)
print(f"Prediction: {prediction[0]}")
```
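
Note that the hand calculation above treats each word as a binary present/absent feature, which corresponds to scikit-learn's BernoulliNB event model rather than MultinomialNB (which models word counts). Also, scikit-learn applies Laplace smoothing by default (alpha=1.0), so the predicted class probabilities will differ slightly from the hand-computed 0.82 and 0.18, but the predicted label for "Free Offer" should still come out as Spam.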

Conclusion

The Naive Bayes algorithm is a powerful yet intuitive approach to classification tasks. Its reliance on probabilities and assumptions of feature independence make it both computationally efficient and interpretable. By following the step-by-step breakdown, you can apply Naive Bayes to a variety of datasets confidently.
