Engineering & IT Projects and Resources: AI Email Classifier

AI Email Classifier

1. Introduction

Email classification is a vital application of Natural Language Processing (NLP) that helps organize emails into categories such as spam, work, personal, and more. This project uses machine learning techniques to build an AI-powered email classifier.

2. Prerequisites

• Python: Install Python 3.x from the official Python website.
• Required Libraries:
- scikit-learn (imported as `sklearn`): Install using `pip install scikit-learn`
- nltk: Install using `pip install nltk`
- pandas: Install using `pip install pandas`
• Basic knowledge of Python, machine learning, and NLP.

3. Project Setup

1. Create a Project Directory:

- Name your project folder, e.g., `AIEmailClassifier`.
- Inside this folder, create the Python script file (`email_classifier.py`).

2. Install Required Libraries:

Ensure sklearn, nltk, and pandas are installed using `pip`.
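
To confirm the setup before running the classifier, a short check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the project script.

```python
import importlib.util

# Maps the import name to the name used with pip (they differ for scikit-learn).
REQUIRED = {"sklearn": "scikit-learn", "nltk": "nltk", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be found in this environment."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

Run it with the same Python interpreter you plan to use for `email_classifier.py`, since each environment has its own set of installed packages.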

4. Writing the Code

Below is the Python code for the AI Email Classifier:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import nltk

# Sample dataset
data = {
    'email': [
        'Win a free lottery now!',
        'Meeting scheduled at 10 AM',
        'Hi Mom, how are you?',
        'Earn money from home, no investment needed!',
        'Project updates are due tomorrow',
        'Dinner plans for tonight?'
    ],
    'category': ['spam', 'work', 'personal', 'spam', 'work', 'personal']
}

# Load data into DataFrame
df = pd.DataFrame(data)

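# Preprocessing: fetch NLTK's English stop-word list
nltk.download('stopwords')
from nltk.corpus import stopwords
# Keep this as a list: recent scikit-learn versions reject a set for stop_words
stop_words = stopwords.words('english')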

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(df['email'], df['category'], test_size=0.3, random_state=42)

# Building the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words=stop_words)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# Training the model
pipeline.fit(X_train, y_train)

# Predictions and evaluation
y_pred = pipeline.predict(X_test)
# zero_division=0 reports 0 instead of warning when a category is never predicted
print(classification_report(y_test, y_pred, zero_division=0))

# Testing the classifier with a new email
new_email = ["Your Amazon order has been shipped."]
prediction = pipeline.predict(new_email)
print(f"Category: {prediction[0]}")

5. Key Components

• Text Preprocessing: Tokenizes the email text and removes English stop words (taken from NLTK); both steps happen inside CountVectorizer.
• Feature Extraction: CountVectorizer turns each email into word counts, and TfidfTransformer reweights those counts by TF-IDF.
• Classification: A Multinomial Naive Bayes classifier assigns each email to a category.
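
To make these components more concrete, here is a from-scratch sketch of the same idea in plain Python. It is not how scikit-learn implements the pipeline: it works on raw word counts, skips the TF-IDF weighting step, and uses a tiny hand-picked stop-word list instead of NLTK's. The names `tokenize`, `train`, and `predict` are illustrative.

```python
import math
import re
from collections import Counter, defaultdict

# A tiny illustrative stop-word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"a", "are", "at", "for", "from", "how", "is", "the", "you"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def train(emails, labels, alpha=1.0):
    """Collect per-category word counts and per-category email counts."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter(labels)        # category -> number of emails
    for text, label in zip(emails, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab, alpha


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab, alpha = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: share of training emails that belong to this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Laplace-smoothed log-likelihood of the word given the category.
            score += math.log(
                (word_counts[label][token] + alpha) / (total_words + alpha * len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


emails = [
    "Win a free lottery now!",
    "Meeting scheduled at 10 AM",
    "Hi Mom, how are you?",
    "Earn money from home, no investment needed!",
    "Project updates are due tomorrow",
    "Dinner plans for tonight?",
]
labels = ["spam", "work", "personal", "spam", "work", "personal"]

model = train(emails, labels)
print(predict(model, "Free lottery money"))        # spam
print(predict(model, "Project meeting tomorrow"))  # work
```

The smoothing constant `alpha` keeps a word that never appeared in a category from driving that category's probability to zero, which matters a great deal with a training set this small.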

6. Testing

1. Run the script:

python email_classifier.py

2. Review the classification report printed for the test set (precision, recall, and F1 score per category).

3. Test the model with new emails to observe predictions.
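
When reading the evaluation output, it helps to know how the per-category numbers are derived. The sketch below computes precision, recall, and F1 for one category in plain Python; the function name `per_class_metrics` and the example labels are illustrative, and the zero fallback for empty denominators is an assumption chosen for simplicity.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single category."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)

    # Fall back to 0.0 when a denominator is empty (category never predicted or never present).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical true and predicted categories for four emails.
y_true = ["spam", "work", "spam", "personal"]
y_pred = ["spam", "spam", "spam", "personal"]

for category in ["spam", "work", "personal"]:
    precision, recall, f1 = per_class_metrics(y_true, y_pred, category)
    print(f"{category}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example one work email is misclassified as spam, so spam keeps perfect recall but loses precision, while work scores zero on all three measures.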

7. Enhancements

• Expand Dataset: Use a larger, real-world dataset for improved accuracy.
• Add Categories: Include more categories like promotions, social, etc.
• Use Advanced Models: Experiment with deep learning models like LSTMs or Transformers.
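
As one possible starting point for a larger dataset, the sketch below reads labeled emails from CSV data using the standard library. The column names `email` and `category` are assumptions chosen to mirror the sample dataset, and `load_emails` is a hypothetical helper; a real dataset may use a different layout.

```python
import csv
import io


def load_emails(csv_file, text_column="email", label_column="category"):
    """Read labeled emails from an open CSV file, skipping incomplete rows."""
    emails, labels = [], []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        label = (row.get(label_column) or "").strip()
        if text and label:
            emails.append(text)
            labels.append(label)
    # Same shape as the sample `data` dictionary in the script.
    return {"email": emails, "category": labels}


# An in-memory file stands in for a real CSV file here.
sample_csv = io.StringIO(
    "email,category\n"
    "Huge discounts this weekend only,promotions\n"
    "Your friend tagged you in a photo,social\n"
    "This row has no label,\n"
)

data = load_emails(sample_csv)
print(data["category"])
```

For a file on disk, the same function can be called as `load_emails(open(path, newline="", encoding="utf-8"))`, and the returned dictionary can be passed to `pd.DataFrame` in place of the sample data.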

8. Troubleshooting

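• Poor Accuracy: Check the preprocessing and make sure the dataset is balanced across categories; the six-email sample is too small for reliable results.
• Module Not Found: Verify that all required libraries are installed in the active Python environment.
• Stopwords Errors: Ensure the NLTK stopwords corpus has been downloaded with `nltk.download('stopwords')`.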
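
For the balance check mentioned under Poor Accuracy, a quick look at the label distribution is often enough. The sketch below uses only the standard library; the function names and the `max_ratio` threshold of 3.0 are illustrative assumptions, not an established rule.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that fall into each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_imbalanced(labels, max_ratio=3.0):
    """Flag the dataset when the largest category outnumbers the smallest by more than max_ratio."""
    counts = Counter(labels)
    if not counts:
        return False  # nothing to compare in an empty dataset
    return max(counts.values()) / min(counts.values()) > max_ratio


labels = ["spam", "work", "personal", "spam", "work", "personal"]
print(class_shares(labels))
print(is_imbalanced(labels))
```

With the sample dataset every category holds one third of the emails, so the check reports no imbalance; a dataset dominated by one category would be flagged.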

9. Conclusion

This project demonstrates how to build an AI-powered email classifier using Python. With proper enhancements, it can serve as the backbone for automated email management systems.
