NLP Support Ticket Auto-Routing System

Production-Ready ML Application for Intelligent Ticket Classification


Live Demo: https://nlp-support-ticket.up.railway.app
GitHub: https://github.com/ijonathans/nlp-support-ticket

Executive Summary

This project delivers a production-ready NLP system that automatically routes customer support tickets to the correct department with 67.2% overall accuracy and 1.28ms inference latency. The system implements a human-in-the-loop strategy using confidence thresholding, achieving 78% accuracy on the 70% of tickets it routes automatically while deferring uncertain cases to human agents.

  • Overall Accuracy: 67.2%
  • Inference Latency: 1.28ms
  • Automation Rate: 70%

Key Achievements

  • End-to-end ML pipeline from data exploration to deployment
  • Production web application with modern UI deployed on Railway
  • Real-time predictions with sub-2ms latency
  • Balanced training approach addressing severe class imbalance (21:1 ratio)
  • Operational risk management through confidence thresholding

Problem Statement

Business Context

Manual ticket routing in customer support is:

  • Slow: Human agents take 30-60 seconds per ticket
  • Expensive: Requires dedicated routing staff
  • Inconsistent: Subject to human error and bias
  • Unscalable: Bottleneck during high-volume periods

Technical Challenge

Build a text classification system that:

  • Routes tickets to 7 departments with high accuracy
  • Handles severe class imbalance (21:1 ratio)
  • Provides sub-second inference latency
  • Manages operational risk through confidence-based routing
  • Deploys as a production-ready web application

Dataset Analysis

Dataset Overview

  • Source: Multilingual support ticket dataset
  • Language: English tickets only
  • Total Samples: 16,338 tickets
  • Features: Subject + Body text
  • Text Length: 35-1,189 characters (avg: 403 chars)
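
Subject and body are concatenated and encoded as a fixed-length sequence of token ids before reaching the model. Below is a minimal sketch of that preprocessing with an assumed tokenizer and vocabulary; the project's actual preprocessing code is not shown in this write-up.

```python
import re
import torch

MAX_LEN = 200      # maximum tokens per ticket
PAD, UNK = 0, 1    # assumed special token ids

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def encode(subject: str, body: str, vocab: dict[str, int]) -> torch.Tensor:
    """Concatenate subject + body, map tokens to ids, pad/truncate to MAX_LEN."""
    tokens = tokenize(subject + " " + body)[:MAX_LEN]
    ids = [vocab.get(tok, UNK) for tok in tokens]
    ids += [PAD] * (MAX_LEN - len(ids))
    return torch.tensor(ids, dtype=torch.long)
```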

Class Distribution

| Department       | Count | Percentage | Imbalance Ratio   |
|------------------|-------|------------|-------------------|
| Tech Support     | 7,343 | 44.9%      | 1.00x (majority)  |
| Product Support  | 3,073 | 18.8%      | 2.39x             |
| Customer Service | 2,646 | 16.2%      | 2.77x             |
| Billing          | 1,595 | 9.8%       | 4.60x             |
| Returns          | 820   | 5.0%       | 8.95x             |
| Sales            | 513   | 3.1%       | 14.31x            |
| HR               | 348   | 2.1%       | 21.10x (minority) |

Key Challenge: Severe class imbalance with a 21:1 ratio between the majority and minority classes required specialized training techniques.

Model Architecture

TextCNN Architecture

Input Text (max 200 tokens)
        ↓
Embedding Layer (vocab_size=6,490, embed_dim=200)
        ↓
Parallel CNN Layers (filters=256, kernels=[3,4,5])
        ↓
Batch Normalization + ReLU + Max Pooling
        ↓
Concatenate Features (768 dims)
        ↓
Dropout (0.3)
        ↓
Fully Connected (768 → 384)
        ↓
Dropout (0.3)
        ↓
Output Layer (384 → 7 classes)

Architecture Details

  • Embedding Layer: 6,490 vocabulary size with 200-dimensional embeddings
  • Convolutional Layers: 256 filters per kernel with sizes [3, 4, 5] to capture 3-5 word phrases
  • Batch Normalization: Applied after each convolution for training stability
  • Fully Connected Layers: 768 → 384 → 7 with ReLU activation and dropout
  • Total Parameters: ~2.5M parameters
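
For concreteness, here is a minimal PyTorch sketch of this architecture using the hyperparameters listed above. Layer and variable names are illustrative, not the project's actual training code.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """TextCNN with parallel convolutions over 3/4/5-word windows."""

    def __init__(self, vocab_size=6490, embed_dim=200, num_filters=256,
                 kernel_sizes=(3, 4, 5), num_classes=7, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.bns = nn.ModuleList(nn.BatchNorm1d(num_filters) for _ in kernel_sizes)
        concat_dim = num_filters * len(kernel_sizes)        # 256 * 3 = 768
        self.fc1 = nn.Linear(concat_dim, concat_dim // 2)   # 768 -> 384
        self.fc2 = nn.Linear(concat_dim // 2, num_classes)  # 384 -> 7
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                           # x: (batch, seq_len) token ids
        emb = self.embedding(x).transpose(1, 2)     # (batch, embed_dim, seq_len)
        feats = []
        for conv, bn in zip(self.convs, self.bns):
            h = torch.relu(bn(conv(emb)))           # (batch, filters, L')
            feats.append(h.max(dim=2).values)       # global max pooling per filter
        h = torch.cat(feats, dim=1)                 # (batch, 768)
        h = self.dropout(h)
        h = torch.relu(self.fc1(h))
        h = self.dropout(h)
        return self.fc2(h)                          # logits over 7 departments
```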

Training Strategy

  • Class Weighting: Square-root softened weights to address imbalance
  • Optimizer: Adam with learning rate 0.001
  • Batch Size: 32
  • Epochs: 30 with early stopping (patience 5)
  • Loss Function: Cross-entropy with class weights
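
A sketch of that training setup, assuming the TextCNN class above and standard DataLoaders (`train_loader`, `val_loader`, and the `evaluate` helper are placeholders; the exact square-root weighting formula is an assumption based on the description):

```python
import numpy as np
import torch
import torch.nn as nn

# Square-root softening: full inverse-frequency weights over-correct on a 21:1
# imbalance, so the square root dampens the penalty on majority classes.
class_counts = np.array([7343, 3073, 2646, 1595, 820, 513, 348], dtype=np.float64)
weights = np.sqrt(class_counts.sum() / class_counts)
weights = torch.tensor(weights / weights.mean(), dtype=torch.float32)

model = TextCNN()
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(30):
    model.train()
    for batch_x, batch_y in train_loader:     # assumed DataLoader, batch_size=32
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)    # assumed helper returning mean loss
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "textcnn_best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping, patience 5
            break
```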

Model Performance

Overall Metrics

  • Test Accuracy: 67.2%
  • Macro F1-Score: 58.2%
  • Inference Latency: 1.28ms per sample
  • Throughput: ~780 predictions/second
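
As a rough illustration of how per-sample latency and throughput figures like these can be measured on CPU (the model and an encoded sample are assumed; this is not the project's benchmark script):

```python
import time
import torch

model.eval()
sample = torch.randint(1, 6490, (1, 200))    # assumed: one encoded, padded ticket

with torch.no_grad():
    for _ in range(10):                      # warm-up runs
        model(sample)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(sample)
    elapsed = time.perf_counter() - start

print(f"latency: {elapsed / n * 1000:.2f} ms/sample, "
      f"throughput: {n / elapsed:.0f} predictions/s")
```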

Per-Department Performance

| Department       | Precision | Recall | F1-Score | Performance          |
|------------------|-----------|--------|----------|----------------------|
| Billing          | 0.91      | 0.75   | 0.82     | ⭐⭐⭐⭐⭐ Excellent |
| Tech Support     | 0.71      | 0.84   | 0.77     | ⭐⭐⭐⭐ Strong      |
| Customer Service | 0.53      | 0.56   | 0.54     | ⭐⭐⭐ Moderate      |
| Product Support  | 0.56      | 0.51   | 0.53     | ⭐⭐⭐ Moderate      |
| HR               | 1.00      | 0.36   | 0.53     | ⚠ Low Recall         |
| Returns          | 0.70      | 0.38   | 0.49     | ⚠ Low Recall         |
| Sales            | 0.65      | 0.27   | 0.39     | ⚠ Poor               |

Comparison: Baseline vs Deep Learning

| Metric   | Baseline (TF-IDF + LR) | Deep Learning (CNN) | Improvement |
|----------|------------------------|---------------------|-------------|
| Accuracy | 49.2%                  | 67.2%               | +18.0 pts   |
| Macro F1 | 48.9%                  | 58.2%               | +9.3 pts    |
| Latency  | N/A                    | 1.28ms              | ⚡ Fast     |
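
For reference, a baseline of the kind compared here might look like the following scikit-learn sketch; `train_texts`, `train_labels`, and the vectorizer settings are illustrative assumptions, not the project's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(max_features=20_000, ngram_range=(1, 2)),  # illustrative settings
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(train_texts, train_labels)      # assumed: lists of strings / label ids

preds = baseline.predict(test_texts)
print("accuracy:", accuracy_score(test_labels, preds))
print("macro F1:", f1_score(test_labels, preds, average="macro"))
```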

Human-in-the-Loop Strategy

The system routes low-confidence predictions to human agents, balancing automation and accuracy.

Confidence Thresholding Analysis

| Threshold | Coverage | Reject Rate | Auto Accuracy | Strategy               |
|-----------|----------|-------------|---------------|------------------------|
| 0.50      | 92.4%    | 7.6%        | 70.2%         | Aggressive automation  |
| 0.60      | 83.1%    | 16.9%       | 73.6%         | Moderate automation    |
| 0.70      | 74.5%    | 25.5%       | 76.2%         | Balanced               |
| 0.75      | 70.0%    | 30.0%       | 78.1%         | ⭐ Recommended          |
| 0.80      | 63.7%    | 36.3%       | 79.4%         | Conservative           |

Recommended Configuration (Threshold = 0.75):

  • Automates 70% of tickets
  • Achieves 78.1% accuracy on automated tickets
  • Routes the remaining 30% to humans for review
  • Reduces routing workload by 70%
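
At inference time the routing rule itself is a small amount of code. Here is a sketch using the recommended 0.75 threshold; the department label order and the `route_ticket` name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.75
DEPARTMENTS = ["Tech Support", "Product Support", "Customer Service",
               "Billing", "Returns", "Sales", "HR"]   # assumed label order

def route_ticket(model, token_ids):
    """Return (department, confidence), deferring to a human below the threshold."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(token_ids.unsqueeze(0)), dim=1).squeeze(0)
    confidence, idx = probs.max(dim=0)
    if confidence.item() < CONFIDENCE_THRESHOLD:
        return "HUMAN_REVIEW", confidence.item()   # low confidence: route to an agent
    return DEPARTMENTS[idx.item()], confidence.item()
```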

Production Deployment

Technology Stack

  • Backend Framework: Flask 3.0.0 - Web server & API
  • Production Server: Gunicorn 21.2.0 - WSGI server
  • ML Framework: PyTorch 2.5.1+cpu - Model inference
  • Numerical Computing: NumPy 2.0.2 - Array operations
  • Frontend: Vanilla JavaScript - Interactive UI
  • Deployment Platform: Railway - Cloud hosting

Deployment Challenges & Solutions

Challenge 1: Docker Image Size (5.1 GB > 4 GB limit)

Problem: PyTorch with CUDA support is 2.5 GB

Solution: Switched to PyTorch CPU-only (205 MB)

Result: Image size reduced to ~2 GB

Challenge 2: Model Architecture Mismatch

Problem: app.py had wrong parameters (embed_dim=150, filters=128)

Solution: Updated to match trained model (embed_dim=200, filters=256)

Result: Model loads successfully

Challenge 3: 'NoneType' Object Not Subscriptable

Problem: Gunicorn workers didn't inherit global variables

Solution: Added --preload flag and lazy loading check

Result: Model accessible in all workers
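
A simplified sketch of that fix: a lazy-loading guard in app.py plus a `--preload` start command so the model is loaded once before workers fork. `load_model`, `encode_text`, and the file paths are illustrative placeholders, not the project's exact code.

```python
# app.py (simplified)
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None

def get_model():
    """Lazy-loading guard: (re)load the model if a worker starts without it."""
    global model
    if model is None:
        model = load_model("model/textcnn_best.pt")   # assumed helper
    return model

# Load once at import time; with `gunicorn --preload` this runs in the master
# process before workers fork, and the guard above covers any worker that
# still starts without a loaded model.
get_model()

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True).get("text", "")
    token_ids = encode_text(text)                     # assumed text -> tensor helper
    department, confidence = route_ticket(get_model(), token_ids)
    return jsonify({"department": department, "confidence": confidence})

# Start command (illustrative):
#   gunicorn --preload --workers 2 --bind 0.0.0.0:$PORT app:app
```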

Production Metrics

  • Image Size: ~2 GB (under 4 GB limit)
  • Build Time: 3-5 minutes
  • Cold Start: ~5 seconds
  • Memory Usage: ~500 MB
  • Inference Latency: 1.28ms per prediction
  • Throughput: ~780 predictions/second (theoretical)

Business Impact

Operational Improvements

Before (Manual Routing):

  • 30-60 seconds per ticket
  • 70-80% accuracy (human error)
  • Bottleneck during peak hours
  • Requires dedicated routing staff

After (AI-Assisted Routing):

  • 1.28ms per ticket (automated)
  • 78% accuracy on automated tickets
  • Scales linearly with compute
  • 70% workload reduction

Workforce Impact

  • Routing staff: Reduced by 70% or reassigned to complex cases
  • Human agents: Focus on 30% uncertain cases (higher value work)
  • Quality: More consistent routing decisions
  • Speed: Instant routing vs 30-60 second delays

Future Enhancements

Short-term (1-3 months)

  • Collect more data for minority classes (HR, Sales, Returns)
  • Implement data augmentation (back-translation, paraphrasing)
  • Experiment with pre-trained embeddings (Word2Vec, GloVe)
  • Add monitoring and logging (MLflow, Weights & Biases)
  • Implement A/B testing framework

Medium-term (3-6 months)

  • Fine-tune transformer models (DistilBERT, RoBERTa)
  • Implement multi-task learning (urgency + department)
  • Extract metadata features (time of day, ticket length)
  • Add attention visualization for explainability
  • Implement LIME/SHAP for prediction explanations

Long-term (6-12 months)

  • Multi-language support (German, French, Spanish)
  • Active learning with human-corrected examples
  • Auto-suggest responses based on ticket content
  • Predict resolution time
  • Identify duplicate/related tickets

Key Learnings

Technical Lessons

  • Class Imbalance is Critical: Naive training fails on imbalanced data; class weighting significantly improves minority class performance
  • Simple Models Can Be Competitive: TF-IDF + Logistic Regression achieved 49% accuracy; deep learning improved this to 67% (+18 percentage points)
  • Deployment is Non-Trivial: Docker image size constraints require optimization; CPU-only PyTorch is sufficient for inference
  • Confidence Thresholding is Powerful: Enables risk management in production and provides clear business value

Business Lessons

  • Human-in-the-Loop is Pragmatic: 100% automation is unrealistic; confidence thresholding manages risk
  • Measurable Impact: Clear efficiency gains with 70% automation rate

Skills Demonstrated

Machine Learning

Text classification, class imbalance handling, model evaluation

Deep Learning

CNN architecture, PyTorch, training optimization

Software Engineering

Flask API, frontend development, Git version control

MLOps & Deployment

Docker, cloud deployment, production optimization

Data Science

EDA, preprocessing, feature engineering, visualization

Product Thinking

Problem scoping, ROI analysis, risk management

Conclusion

This project successfully delivers a production-ready NLP system that automates 70% of support ticket routing with 78% accuracy. The system demonstrates technical excellence through an end-to-end ML pipeline, delivers clear business value through significant efficiency gains, and maintains production quality with sub-2ms latency.

Project Success Criteria

| Criterion    | Target            | Achieved            | Status   |
|--------------|-------------------|---------------------|----------|
| Accuracy     | >60%              | 67.2%               | Exceeded |
| Latency      | <10ms             | 1.28ms              | Exceeded |
| Deployment   | Production-ready  | Deployed on Railway | Complete |
| Automation   | >50%              | 70%                 | Exceeded |
| Code Quality | Clean, documented | Well-structured     | Complete |

This project showcases my ability to deliver complete machine learning solutions from conception to production deployment, demonstrating end-to-end data science and software engineering capabilities.
