AI Email Classifier
1. Introduction
Email classification is a common application of Natural Language Processing (NLP): it sorts emails into categories such as spam, work, and personal. This project builds a small email classifier in Python with scikit-learn, using TF-IDF features and a Multinomial Naive Bayes model.
2. Prerequisites
• Python: Install Python 3.x from the official Python website.
• Required Libraries:
- scikit-learn (imported as sklearn): Install using `pip install scikit-learn`
- nltk: Install using `pip install nltk`
- pandas: Install using `pip install pandas`
• Basic knowledge of Python, machine learning, and NLP.
3. Project Setup
1. Create a Project Directory:
- Name your project folder, e.g., `AIEmailClassifier`.
- Inside this folder, create the Python script file (`email_classifier.py`).
2. Install Required Libraries:
Ensure scikit-learn, nltk, and pandas are installed using `pip`.
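A quick way to confirm the setup is to check that each library can be found before running the classifier. The sketch below is only illustrative: the `REQUIRED` mapping and the `find_missing` helper are hypothetical names, not part of the project script.

```python
import importlib.util

# Import name -> pip package name (the two differ for scikit-learn).
REQUIRED = {"sklearn": "scikit-learn", "nltk": "nltk", "pandas": "pandas"}


def find_missing(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

Run it with the same Python interpreter that will run `email_classifier.py`, since packages are installed per environment.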
4. Writing the Code
Below is the Python code for the AI Email Classifier:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import nltk
# Sample dataset
data = {
    'email': [
        'Win a free lottery now!',
        'Meeting scheduled at 10 AM',
        'Hi Mom, how are you?',
        'Earn money from home, no investment needed!',
        'Project updates are due tomorrow',
        'Dinner plans for tonight?'
    ],
    'category': ['spam', 'work', 'personal', 'spam', 'work', 'personal']
}
# Load data into DataFrame
df = pd.DataFrame(data)
# Preprocessing
nltk.download('stopwords')
from nltk.corpus import stopwords
# Recent scikit-learn releases require stop_words to be a list, not a set
stop_words = stopwords.words('english')
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(
    df['email'], df['category'], test_size=0.3, random_state=42)
# Building the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words=stop_words)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
# Training the model
pipeline.fit(X_train, y_train)
# Predictions and evaluation
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
# Testing the classifier with new email
new_email = ["Your Amazon order has been shipped."]
prediction = pipeline.predict(new_email)
print(f"Category: {prediction[0]}")
5. Key Components
• Text Preprocessing: CountVectorizer tokenizes each email and removes the NLTK English stopwords.
• Feature Extraction: CountVectorizer turns each email into word counts, and TfidfTransformer reweights those counts into TF-IDF features.
• Classification: A Multinomial Naive Bayes classifier assigns each email to a category.
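To make the feature extraction step concrete, the following sketch runs the two vectorizing stages on their own and prints what each produces. It is a minimal illustration, not the project script: the two example emails are made up, and it assumes scikit-learn's built-in `'english'` stop word list instead of the NLTK list so that it runs without a download.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up emails, used only for illustration.
emails = ["Win a free lottery now", "Free lunch at the team meeting"]

# Step 1: tokenize and count words, dropping scikit-learn's built-in
# English stop words (assumption: the main script uses the NLTK list).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(emails)

# Step 2: reweight the raw counts so that words shared by many emails
# carry less weight than words specific to one email.
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))  # words kept as features
print(counts.toarray())                # raw counts, one row per email
print(tfidf.toarray().round(2))        # TF-IDF weights, same shape as counts
```

Each row of the TF-IDF matrix is what the Naive Bayes classifier receives as input for one email.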
6. Testing
1. Run the script:
python email_classifier.py
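2. Review the classification report printed for the test split. With only six sample emails, the 30% split holds out just two, so the scores are illustrative rather than meaningful.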
3. Test the model with new emails to observe predictions.
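When testing new emails, it can help to look at the probability the model assigns to each category and not only the winning label. The sketch below is one way to do that, under a few assumptions: it trains on all six sample emails, uses scikit-learn's built-in `'english'` stop word list so that no NLTK download is needed, and the second test email is invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same six sample emails as the main script.
df = pd.DataFrame({
    'email': [
        'Win a free lottery now!',
        'Meeting scheduled at 10 AM',
        'Hi Mom, how are you?',
        'Earn money from home, no investment needed!',
        'Project updates are due tomorrow',
        'Dinner plans for tonight?',
    ],
    'category': ['spam', 'work', 'personal', 'spam', 'work', 'personal'],
})

pipeline = Pipeline([
    # Assumption: built-in stop word list instead of the NLTK one.
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
pipeline.fit(df['email'], df['category'])

new_emails = [
    "Your Amazon order has been shipped.",
    "Free money, claim your lottery prize",  # invented example
]
probabilities = pipeline.predict_proba(new_emails)

# Print one probability per category for each new email.
for text, row in zip(new_emails, probabilities):
    scores = dict(zip(pipeline.classes_, row.round(2)))
    print(f"{text!r} -> {scores}")
```

An email whose words never appeared in training gives the model nothing to work with, so its probabilities simply reflect how common each category was in the training data.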
7. Enhancements
• Expand Dataset: Use a larger, real-world dataset for improved accuracy.
• Add Categories: Include more categories like promotions, social, etc.
• Use Advanced Models: Experiment with deep learning models like LSTMs or Transformers.
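As a starting point for the first two enhancements, the sketch below loads emails from CSV data and splits them so that every category appears in both the training and test sets. Everything specific here is an assumption made for illustration: the inline CSV stands in for a real file (for example a hypothetical `emails.csv`), the `email` and `category` column names are assumed, and the rows, including the extra `promotions` category, are invented.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for a real CSV file; replace with pd.read_csv("emails.csv")
# (hypothetical file name) once a real dataset is available.
csv_text = io.StringIO(
    "email,category\n"
    "Win a free lottery now!,spam\n"
    "Earn money from home today,spam\n"
    "Claim your prize before midnight,spam\n"
    "Meeting scheduled at 10 AM,work\n"
    "Project updates are due tomorrow,work\n"
    "Please review the attached report,work\n"
    "Hi Mom how are you?,personal\n"
    "Dinner plans for tonight?,personal\n"
    "Happy birthday to you!,personal\n"
    "50% off all shoes this weekend,promotions\n"
    "New arrivals in our summer sale,promotions\n"
    "Your exclusive coupon is inside,promotions\n"
)
df = pd.read_csv(csv_text)

# stratify keeps the category proportions the same in both splits,
# so no category is missing from the test set.
X_train, X_test, y_train, y_test = train_test_split(
    df['email'], df['category'],
    test_size=1/3, random_state=42,
    stratify=df['category'],
)

print(f"{len(df)} emails, {df['category'].nunique()} categories")
print(f"train: {len(X_train)}, test: {len(X_test)}")
print(sorted(y_test.unique()))
```

The resulting `X_train` and `y_train` can be passed to `pipeline.fit` exactly as in the main script.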
8. Troubleshooting
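• Poor Accuracy: Ensure proper data preprocessing and a balanced dataset.
• Module Not Found: Verify all required libraries are installed in the same Python environment that runs the script.
• Stopwords Errors: Ensure the NLTK stopwords corpus is downloaded by running `nltk.download('stopwords')` before calling `stopwords.words('english')`.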
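For the poor accuracy case, a quick check of how the emails are spread across categories often shows whether the dataset is balanced. The sketch below is illustrative only: the label list is made up, and the 20% threshold (`MIN_SHARE`) is a hypothetical rule of thumb, not a recommended value.

```python
import pandas as pd

# Made-up labels: 6 spam, 3 work, 1 personal.
labels = pd.Series(['spam'] * 6 + ['work'] * 3 + ['personal'] * 1)

# Share of the dataset held by each category.
shares = labels.value_counts(normalize=True)
print(shares)

# Hypothetical rule of thumb: flag categories holding under 20% of the emails.
MIN_SHARE = 0.20
underrepresented = shares[shares < MIN_SHARE].index.tolist()

if underrepresented:
    print("Consider collecting more emails for:", ", ".join(underrepresented))
else:
    print("No category falls below the chosen threshold.")
```

In the main script, the same check can be run on `df['category']` before splitting the data.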
9. Conclusion
This project demonstrates how to build an AI-powered email classifier using Python. With a larger dataset and the enhancements described above, it can serve as the backbone for automated email management systems.