Day 15 of Advent of Cyber 2023!
Backstory:
Recently, the people working at Best Festival Company have been getting a lot of annoying and potentially harmful emails. These emails are trying to trick employees into clicking on links and sharing their login information. The tool that's supposed to catch these spam emails seems to have been turned off or broken on purpose. People think it might be McGreedy, who's not too pleased about the company merger, that's behind it.
Problem Statement:
McSkidy has been tasked with creating a machine learning-based system for identifying spam emails. She has a dataset gathered from various sources that she will use to train the model to recognize spam.
Learning Objectives:
In this task, we will explore:
Different steps in a generic Machine Learning pipeline
Machine Learning classification and training models
How to split the dataset into training and testing data
How to prepare the Machine Learning model
How to evaluate the model's effectiveness
Introduction to Jupyter Notebook:
Jupyter Notebook is a powerful tool that provides an interactive environment for writing and executing code in real-time. It's widely used for tasks like data analysis, machine learning, and scientific research. The interface is divided into cells, each of which can contain code, text, or visualizations. To execute code, you can use the run button or the shortcut Shift+Enter.
Exploring Machine Learning Pipeline:
A Machine Learning pipeline encompasses the steps involved in constructing and deploying an ML model. These steps facilitate the smooth transition of data from its raw form to generating predictions and insights. The typical pipeline includes collecting data, preprocessing it, performing feature extraction, splitting it into training and testing sets, and applying ML models for predictions.
Step 0: Importing Required Libraries: Before diving into data collection, the essential libraries are imported. In this case, NumPy and pandas are imported using the following code:
import numpy as np
import pandas as pd
Step 1: Data Collection: Data collection involves gathering raw data from various sources. The Pandas library is used to load data from a CSV file in this example. The dataset comprises spam and ham (non-spam) emails.
df = pd.read_csv("emails_dataset.csv")  # the later steps refer to this DataFrame as df
Step 2: Data Preprocessing: Data preprocessing is crucial for converting raw data into a format suitable for ML. Techniques include cleaning, normalization, and feature extraction. The example showcases the use of CountVectorizer to transform text data into a numerical format.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Message'])
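To make the bag-of-words idea behind CountVectorizer more concrete, here is a minimal standard-library sketch of the same transformation. It is not scikit-learn's implementation: it only mimics the default behaviour of lowercasing the text and keeping tokens of two or more word characters, and the helper names build_vocabulary and to_count_vector as well as the toy messages are made up for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"\b\w\w+\b"  # tokens of two or more word characters

def build_vocabulary(messages):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = set()
    for text in messages:
        tokens.update(re.findall(TOKEN_PATTERN, text.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}

def to_count_vector(text, vocabulary):
    """Return one row of word counts; words outside the vocabulary are ignored."""
    counts = Counter(re.findall(TOKEN_PATTERN, text.lower()))
    return [counts.get(token, 0) for token in vocabulary]

# Hypothetical toy messages, not taken from emails_dataset.csv.
messages = ["Win a free prize now", "Meeting at noon", "Free free vouchers"]
vocabulary = build_vocabulary(messages)
matrix = [to_count_vector(text, vocabulary) for text in messages]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, so the third message gets a 2 in the "free" column. The single-letter word "a" is dropped because the token pattern requires at least two characters.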
Step 3: Train/Test Split Dataset: To evaluate the model's performance, the dataset is split into training and testing sets. The code below achieves this using the train_test_split function from sklearn:
from sklearn.model_selection import train_test_split
y = df['Classification']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
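As a rough picture of what an 80/20 split does, the sketch below shuffles row indices with a fixed seed and sets aside the first 20% for testing. It is a simplified stand-in and not how scikit-learn implements train_test_split; the function name split_train_test, the seed value, and the toy rows are assumptions made for this example.

```python
import math
import random

def split_train_test(rows, labels, test_size=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out a test_size fraction."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: ten messages with alternating labels.
rows = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 2 else "ham" for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(rows, labels)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

The split in the walkthrough is not seeded, so the rows that land in the test set, and therefore the evaluation scores, can differ slightly from run to run. train_test_split accepts a random_state argument if a repeatable split is wanted.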
Step 4: Model Training: The selected model, Naive Bayes in this case, is trained using the MultinomialNB class from sklearn:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
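For readers curious about what the classifier is doing, here is a small standard-library sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It works on lists of words rather than the sparse matrix used above, and the class name TinyMultinomialNB and the training messages are invented for illustration, so treat it as a teaching aid under those assumptions rather than a replacement for MultinomialNB.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes over tokenized documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is add-one smoothing

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for words, label in zip(documents, labels):
            self.word_counts[label].update(words)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, words):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in words:
                if word in self.vocabulary:  # unseen words are skipped
                    score += math.log((counts[word] + self.alpha) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training messages, already split into lowercase words.
train_docs = [
    ["free", "prize", "now"],
    ["claim", "free", "vouchers"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict_one(["free", "vouchers", "now"]))
print(model.predict_one(["meeting", "noon"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.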
Step 5: Model Evaluation: The model's performance is evaluated using metrics like precision, recall, and accuracy. The classification_report function from sklearn is employed for this purpose.
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
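The "weighted avg" row of the report is the per-class metric averaged with each class weighted by its support (the number of true examples of that class). The sketch below works this out by hand on a handful of made-up labels; the function names and the sample labels are assumptions for illustration and do not come from the actual dataset.

```python
from collections import Counter

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive class."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # everything flagged as label
    actual = sum(1 for t in y_true if t == label)     # everything that truly is label
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

def weighted_avg_precision(y_true, y_pred):
    """Average per-class precision, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        precision_recall(y_true, y_pred, label)[0] * count / total
        for label, count in support.items()
    )

# Hypothetical labels: two spam and four ham messages.
y_true_demo = ["spam", "spam", "ham", "ham", "ham", "ham"]
y_pred_demo = ["spam", "ham", "ham", "ham", "ham", "spam"]

spam_precision, spam_recall = precision_recall(y_true_demo, y_pred_demo, "spam")
ham_precision, ham_recall = precision_recall(y_true_demo, y_pred_demo, "ham")
weighted = weighted_avg_precision(y_true_demo, y_pred_demo)
print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f}")
print(f"ham precision={ham_precision:.2f} recall={ham_recall:.2f}")
print(f"weighted avg precision={weighted:.2f}")
```

In this toy case spam precision is 0.50 and ham precision is 0.75. Because ham has twice the support, the weighted average lands closer to the ham score, at about 0.67.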
Step 6: Testing the Model: Finally, the trained model is applied to classify new messages. A sample message is used for demonstration:
message = vectorizer.transform(["Today's Offer! Claim ur £150 worth of discount vouchers! Text YES to 85023 now! SavaMob, member offers mobile! T Cs 08717898035. £3.00 Sub. 16 . Unsub reply X "])
prediction = clf.predict(message)
print("The email is :", prediction[0])
Testing with New Data: McSkidy wants to test the model with new emails from a file named test_emails.csv. The process involves loading the new data, transforming it with the already fitted CountVectorizer (transform, not fit_transform, so the vocabulary stays the same as in training), and making predictions:
test_data = pd.read_csv("test_emails.csv")
X_new = vectorizer.transform(test_data['Messages'])
new_predictions = clf.predict(X_new)
results_df = pd.DataFrame({'Messages': test_data['Messages'], 'Prediction': new_predictions})
print(results_df)
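Rather than counting spam rows in the printed table by eye, the predictions can be tallied programmatically. The sketch below uses a short made-up list standing in for new_predictions, and it assumes the labels are the strings "spam" and "ham"; with the real DataFrame, results_df['Prediction'].value_counts() gives the same kind of summary.

```python
from collections import Counter

# Hypothetical predictions standing in for the new_predictions array.
sample_predictions = ["ham", "spam", "ham", "spam", "ham"]

label_counts = Counter(sample_predictions)
spam_count = label_counts["spam"]
print(f"{spam_count} of {len(sample_predictions)} messages flagged as spam")
```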
Conclusion: Building a spam email detector involves a series of steps, from data collection to model evaluation and testing. The Naive Bayes model trained on the provided dataset performs well on the held-out 20% test split, with a weighted average precision of 0.98 in this run. Continuous monitoring, user feedback, and potential deployment into production are essential considerations for the effectiveness and reliability of the model.
Task 1:
What is the key first step in the Machine Learning pipeline?
Answer: data collection
Task 2:
Which data preprocessing feature is used to create new features or modify existing ones to improve model performance?
Answer: feature engineering
Task 3:
During the data splitting step, 20% of the dataset was split for testing. What is the percentage weightage avg of precision of spam detection?
Answer: 0.98
Task 4:
How many of the test emails are marked as spam?
Answer: 3
Task 5:
One of the emails that is detected as spam contains a secret code. What is the code?
Answer: I_HaTe_Best_FestiVal