NLP Support Ticket Auto-Routing System

Production-Ready ML Application for Intelligent Ticket Classification


Live Demo: https://nlp-support-ticket.up.railway.app
GitHub: https://github.com/ijonathans/nlp-support-ticket

Executive Summary

This project delivers a production-ready NLP system that automatically routes customer support tickets to the correct department with 67.2% overall accuracy and 1.28ms inference latency. The system implements a human-in-the-loop strategy using confidence thresholding, achieving 78% accuracy on the 70% of tickets it routes automatically while deferring uncertain cases to human agents.

  • Overall Accuracy: 67.2%
  • Inference Latency: 1.28ms
  • Automation Rate: 70%

Key Achievements

  • End-to-end ML pipeline from data exploration to deployment
  • Production web application with modern UI deployed on Railway
  • Real-time predictions with sub-2ms latency
  • Balanced training approach addressing severe class imbalance (21:1 ratio)
  • Operational risk management through confidence thresholding

Problem Statement

Business Context

Manual ticket routing in customer support is:

  • Slow: Human agents take 30-60 seconds per ticket
  • Expensive: Requires dedicated routing staff
  • Inconsistent: Subject to human error and bias
  • Unscalable: Bottleneck during high-volume periods

Technical Challenge

Build a text classification system that:

  • Routes tickets to 7 departments with high accuracy
  • Handles severe class imbalance (21:1 ratio)
  • Provides sub-second inference latency
  • Manages operational risk through confidence-based routing
  • Deploys as a production-ready web application

Dataset Analysis

Dataset Overview

  • Source: Multilingual support ticket dataset
  • Language: English tickets only
  • Total Samples: 16,338 tickets
  • Features: Subject + Body text
  • Text Length: 35-1,189 characters (avg: 403 chars)
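
Subject and body are concatenated and encoded as a fixed-length sequence of token ids before reaching the model. Below is a minimal sketch of that preprocessing with an assumed tokenizer and vocabulary; the project's actual preprocessing code is not shown in this write-up.

```python
import re
import torch

MAX_LEN = 200      # maximum tokens per ticket
PAD, UNK = 0, 1    # assumed special token ids

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def encode(subject: str, body: str, vocab: dict[str, int]) -> torch.Tensor:
    """Concatenate subject + body, map tokens to ids, pad/truncate to MAX_LEN."""
    tokens = tokenize(subject + " " + body)[:MAX_LEN]
    ids = [vocab.get(tok, UNK) for tok in tokens]
    ids += [PAD] * (MAX_LEN - len(ids))
    return torch.tensor(ids, dtype=torch.long)
```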

Class Distribution

| Department       | Count | Percentage | Imbalance Ratio   |
|------------------|-------|------------|-------------------|
| Tech Support     | 7,343 | 44.9%      | 1.00x (majority)  |
| Product Support  | 3,073 | 18.8%      | 2.39x             |
| Customer Service | 2,646 | 16.2%      | 2.77x             |
| Billing          | 1,595 | 9.8%       | 4.60x             |
| Returns          | 820   | 5.0%       | 8.95x             |
| Sales            | 513   | 3.1%       | 14.31x            |
| HR               | 348   | 2.1%       | 21.10x (minority) |

Key Challenge: Severe class imbalance with a 21:1 ratio between the majority and minority classes required specialized training techniques.

Model Architecture

TextCNN Architecture

Input Text (max 200 tokens)
        ↓
Embedding Layer (vocab_size=6,490, embed_dim=200)
        ↓
Parallel CNN Layers (filters=256, kernels=[3,4,5])
        ↓
Batch Normalization + ReLU + Max Pooling
        ↓
Concatenate Features (768 dims)
        ↓
Dropout (0.3)
        ↓
Fully Connected (768 → 384)
        ↓
Dropout (0.3)
        ↓
Output Layer (384 → 7 classes)

Architecture Details

  • Embedding Layer: 6,490 vocabulary size with 200-dimensional embeddings
  • Convolutional Layers: 256 filters per kernel with sizes [3, 4, 5] to capture 3-5 word phrases
  • Batch Normalization: Applied after each convolution for training stability
  • Fully Connected Layers: 768 → 384 → 7 with ReLU activation and dropout
  • Total Parameters: ~2.5M parameters
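
For concreteness, here is a minimal PyTorch sketch of this architecture using the hyperparameters listed above. Layer and variable names are illustrative, not the project's actual training code.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """TextCNN with parallel convolutions over 3/4/5-word windows."""

    def __init__(self, vocab_size=6490, embed_dim=200, num_filters=256,
                 kernel_sizes=(3, 4, 5), num_classes=7, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.bns = nn.ModuleList(nn.BatchNorm1d(num_filters) for _ in kernel_sizes)
        concat_dim = num_filters * len(kernel_sizes)        # 256 * 3 = 768
        self.fc1 = nn.Linear(concat_dim, concat_dim // 2)   # 768 -> 384
        self.fc2 = nn.Linear(concat_dim // 2, num_classes)  # 384 -> 7
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                           # x: (batch, seq_len) token ids
        emb = self.embedding(x).transpose(1, 2)     # (batch, embed_dim, seq_len)
        feats = []
        for conv, bn in zip(self.convs, self.bns):
            h = torch.relu(bn(conv(emb)))           # (batch, filters, L')
            feats.append(h.max(dim=2).values)       # global max pooling per filter
        h = torch.cat(feats, dim=1)                 # (batch, 768)
        h = self.dropout(h)
        h = torch.relu(self.fc1(h))
        h = self.dropout(h)
        return self.fc2(h)                          # logits over 7 departments
```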

Training Strategy

  • Class Weighting: Square-root softened weights to address imbalance
  • Optimizer: Adam with learning rate 0.001
  • Batch Size: 32
  • Epochs: 30 with early stopping (patience 5)
  • Loss Function: Cross-entropy with class weights
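
A sketch of that training setup, assuming the TextCNN class above and standard DataLoaders (`train_loader`, `val_loader`, and the `evaluate` helper are placeholders; the exact square-root weighting formula is an assumption based on the description):

```python
import numpy as np
import torch
import torch.nn as nn

# Square-root softening: full inverse-frequency weights over-correct on a 21:1
# imbalance, so the square root dampens the penalty on majority classes.
class_counts = np.array([7343, 3073, 2646, 1595, 820, 513, 348], dtype=np.float64)
weights = np.sqrt(class_counts.sum() / class_counts)
weights = torch.tensor(weights / weights.mean(), dtype=torch.float32)

model = TextCNN()
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(30):
    model.train()
    for batch_x, batch_y in train_loader:     # assumed DataLoader, batch_size=32
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

    val_loss = evaluate(model, val_loader)    # assumed helper returning mean loss
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "textcnn_best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping, patience 5
            break
```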

Model Performance

Overall Metrics

  • Test Accuracy: 67.2%
  • Macro F1-Score: 58.2%
  • Inference Latency: 1.28ms per sample
  • Throughput: ~780 predictions/second
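
As a rough illustration of how per-sample latency and throughput figures like these can be measured on CPU (the model and an encoded sample are assumed; this is not the project's benchmark script):

```python
import time
import torch

model.eval()
sample = torch.randint(1, 6490, (1, 200))    # assumed: one encoded, padded ticket

with torch.no_grad():
    for _ in range(10):                      # warm-up runs
        model(sample)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(sample)
    elapsed = time.perf_counter() - start

print(f"latency: {elapsed / n * 1000:.2f} ms/sample, "
      f"throughput: {n / elapsed:.0f} predictions/s")
```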

Per-Department Performance

| Department       | Precision | Recall | F1-Score | Performance          |
|------------------|-----------|--------|----------|----------------------|
| Billing          | 0.91      | 0.75   | 0.82     | ⭐⭐⭐⭐⭐ Excellent |
| Tech Support     | 0.71      | 0.84   | 0.77     | ⭐⭐⭐⭐ Strong      |
| Customer Service | 0.53      | 0.56   | 0.54     | ⭐⭐⭐ Moderate      |
| Product Support  | 0.56      | 0.51   | 0.53     | ⭐⭐⭐ Moderate      |
| HR               | 1.00      | 0.36   | 0.53     | ⚠ Low Recall         |
| Returns          | 0.70      | 0.38   | 0.49     | ⚠ Low Recall         |
| Sales            | 0.65      | 0.27   | 0.39     | ⚠ Poor               |

Comparison: Baseline vs Deep Learning

| Metric   | Baseline (TF-IDF + LR) | Deep Learning (CNN) | Improvement |
|----------|------------------------|---------------------|-------------|
| Accuracy | 49.2%                  | 67.2%               | +18.0 pts   |
| Macro F1 | 48.9%                  | 58.2%               | +9.3 pts    |
| Latency  | N/A                    | 1.28ms              | ⚡ Fast     |
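
For reference, a baseline of the kind compared here might look like the following scikit-learn sketch; `train_texts`, `train_labels`, and the vectorizer settings are illustrative assumptions, not the project's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(max_features=20_000, ngram_range=(1, 2)),  # illustrative settings
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(train_texts, train_labels)      # assumed: lists of strings / label ids

preds = baseline.predict(test_texts)
print("accuracy:", accuracy_score(test_labels, preds))
print("macro F1:", f1_score(test_labels, preds, average="macro"))
```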

Human-in-the-Loop Strategy

The system routes low-confidence predictions to human agents, balancing automation and accuracy.

Confidence Thresholding Analysis

| Threshold | Coverage | Reject Rate | Auto Accuracy | Strategy               |
|-----------|----------|-------------|---------------|------------------------|
| 0.50      | 92.4%    | 7.6%        | 70.2%         | Aggressive automation  |
| 0.60      | 83.1%    | 16.9%       | 73.6%         | Moderate automation    |
| 0.70      | 74.5%    | 25.5%       | 76.2%         | Balanced               |
| 0.75      | 70.0%    | 30.0%       | 78.1%         | ⭐ Recommended          |
| 0.80      | 63.7%    | 36.3%       | 79.4%         | Conservative           |

Recommended Configuration (Threshold = 0.75):

  • Automates 70% of tickets
  • Achieves 78.1% accuracy on automated tickets
  • Routes the remaining 30% to humans for review
  • Reduces routing workload by 70%
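
At inference time the routing rule itself is a small amount of code. Here is a sketch using the recommended 0.75 threshold; the department label order and the `route_ticket` name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.75
DEPARTMENTS = ["Tech Support", "Product Support", "Customer Service",
               "Billing", "Returns", "Sales", "HR"]   # assumed label order

def route_ticket(model, token_ids):
    """Return (department, confidence), deferring to a human below the threshold."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(token_ids.unsqueeze(0)), dim=1).squeeze(0)
    confidence, idx = probs.max(dim=0)
    if confidence.item() < CONFIDENCE_THRESHOLD:
        return "HUMAN_REVIEW", confidence.item()   # low confidence: route to an agent
    return DEPARTMENTS[idx.item()], confidence.item()
```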

Production Deployment

Technology Stack

  • Backend Framework: Flask 3.0.0 - Web server & API
  • Production Server: Gunicorn 21.2.0 - WSGI server
  • ML Framework: PyTorch 2.5.1+cpu - Model inference
  • Numerical Computing: NumPy 2.0.2 - Array operations
  • Frontend: Vanilla JavaScript - Interactive UI
  • Deployment Platform: Railway - Cloud hosting

Deployment Challenges & Solutions

Challenge 1: Docker Image Size (5.1 GB > 4 GB limit)

Problem: PyTorch with CUDA support is 2.5 GB

Solution: Switched to PyTorch CPU-only (205 MB)

Result: Image size reduced to ~2 GB

Challenge 2: Model Architecture Mismatch

Problem: app.py had wrong parameters (embed_dim=150, filters=128)

Solution: Updated to match trained model (embed_dim=200, filters=256)

Result: Model loads successfully

Challenge 3: 'NoneType' Object Not Subscriptable

Problem: Gunicorn workers didn't inherit global variables

Solution: Added --preload flag and lazy loading check

Result: Model accessible in all workers
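
A simplified sketch of that fix: a lazy-loading guard in app.py plus a `--preload` start command so the model is loaded once before workers fork. `load_model`, `encode_text`, and the file paths are illustrative placeholders, not the project's exact code.

```python
# app.py (simplified)
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None

def get_model():
    """Lazy-loading guard: (re)load the model if a worker starts without it."""
    global model
    if model is None:
        model = load_model("model/textcnn_best.pt")   # assumed helper
    return model

# Load once at import time; with `gunicorn --preload` this runs in the master
# process before workers fork, and the guard above covers any worker that
# still starts without a loaded model.
get_model()

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True).get("text", "")
    token_ids = encode_text(text)                     # assumed text -> tensor helper
    department, confidence = route_ticket(get_model(), token_ids)
    return jsonify({"department": department, "confidence": confidence})

# Start command (illustrative):
#   gunicorn --preload --workers 2 --bind 0.0.0.0:$PORT app:app
```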

Production Metrics

  • Image Size: ~2 GB (under 4 GB limit)
  • Build Time: 3-5 minutes
  • Cold Start: ~5 seconds
  • Memory Usage: ~500 MB
  • Inference Latency: 1.28ms per prediction
  • Throughput: ~780 predictions/second (theoretical)

Business Impact

Operational Improvements

Before (Manual Routing):

  • 30-60 seconds per ticket
  • 70-80% accuracy (human error)
  • Bottleneck during peak hours
  • Requires dedicated routing staff

After (AI-Assisted Routing):

  • 1.28ms per ticket (automated)
  • 78% accuracy on automated tickets
  • Scales linearly with compute
  • 70% workload reduction

Workforce Impact

  • Routing staff: Reduced by 70% or reassigned to complex cases
  • Human agents: Focus on 30% uncertain cases (higher value work)
  • Quality: More consistent routing decisions
  • Speed: Instant routing vs 30-60 second delays

Future Enhancements

Short-term (1-3 months)

  • Collect more data for minority classes (HR, Sales, Returns)
  • Implement data augmentation (back-translation, paraphrasing)
  • Experiment with pre-trained embeddings (Word2Vec, GloVe)
  • Add monitoring and logging (MLflow, Weights & Biases)
  • Implement A/B testing framework

Medium-term (3-6 months)

  • Fine-tune transformer models (DistilBERT, RoBERTa)
  • Implement multi-task learning (urgency + department)
  • Extract metadata features (time of day, ticket length)
  • Add attention visualization for explainability
  • Implement LIME/SHAP for prediction explanations

Long-term (6-12 months)

  • Multi-language support (German, French, Spanish)
  • Active learning with human-corrected examples
  • Auto-suggest responses based on ticket content
  • Predict resolution time
  • Identify duplicate/related tickets

Key Learnings

Technical Lessons

  • Class Imbalance is Critical: Naive training fails on imbalanced data; class weighting significantly improves minority class performance
  • Simple Models Can Be Competitive: TF-IDF + Logistic Regression achieved 49% accuracy; deep learning improved this to 67% (+18 percentage points)
  • Deployment is Non-Trivial: Docker image size constraints require optimization; CPU-only PyTorch is sufficient for inference
  • Confidence Thresholding is Powerful: Enables risk management in production and provides clear business value

Business Lessons

  • Human-in-the-Loop is Pragmatic: 100% automation is unrealistic; confidence thresholding manages risk
  • Measurable Impact: Clear efficiency gains with 70% automation rate

Skills Demonstrated

Machine Learning

Text classification, class imbalance handling, model evaluation

Deep Learning

CNN architecture, PyTorch, training optimization

Software Engineering

Flask API, frontend development, Git version control

MLOps & Deployment

Docker, cloud deployment, production optimization

Data Science

EDA, preprocessing, feature engineering, visualization

Product Thinking

Problem scoping, ROI analysis, risk management

Conclusion

This project successfully delivers a production-ready NLP system that automates 70% of support ticket routing with 78% accuracy. The system demonstrates technical excellence through an end-to-end ML pipeline, delivers clear business value through significant efficiency gains, and maintains production quality with sub-2ms latency.

Project Success Criteria

| Criterion    | Target            | Achieved            | Status   |
|--------------|-------------------|---------------------|----------|
| Accuracy     | >60%              | 67.2%               | Exceeded |
| Latency      | <10ms             | 1.28ms              | Exceeded |
| Deployment   | Production-ready  | Deployed on Railway | Complete |
| Automation   | >50%              | 70%                 | Exceeded |
| Code Quality | Clean, documented | Well-structured     | Complete |

This project showcases my ability to deliver complete machine learning solutions from conception to production deployment, demonstrating end-to-end data science and software engineering capabilities.
