Harsh Gupta
Open to Full-Time Roles
> whoami

Harsh Gupta


I build the systems between raw data and real decisions: pipelines that process millions of records, models that surface signal from noise, and AI agents that automate what used to take hours.

Ex-Deloitte · Ex-Samsung Research · MS Data Science @ IU · 4.0 GPA
// by the numbers

Impact at Scale

$1M–$3M
Revenue Impact
pricing elasticity model across 5,000+ stores · Deloitte
12–15M
Records / Day
nightly batch ETL pipeline · AWS Glue + PySpark · Deloitte
75%
Forecast Error Cut
dining swipe predictions from ±2,000 to ±500/day · $2K–$3K daily savings · IU
Pipeline Throughput
distributed Azure compute · forecasting pipeline · IU
250+
Staff Served
RAG knowledge base · lookup time cut from 20–25 min to <10 min · IU
210K+
YouTube Views
Streamlit tutorial series · 13.5K hrs watch time
// full stack flow

What I Build

End-to-end systems, from raw ingestion to deployed intelligence.

Layer 0

Ingestion & ETL

Reliable pipelines to ingest, clean, and unify data from structured and unstructured sources at scale.

AWS Glue · PySpark · ADF · Airflow
Layer 1

Processing & Analytics

Transforming raw data into meaningful features, metrics, and insights that drive downstream modeling.

Spark · SQL · Tableau · Power BI
Layer 2

Modeling & Intelligence

Training ML models to predict, classify, and forecast outcomes, from elasticity models to time-series forecasting.

Scikit-learn · PyTorch · XGBoost · Time Series
Layer 3

GenAI & Agentic Systems

LLM-powered apps, RAG pipelines, and multi-agent systems that reason, act, and automate workflows end-to-end.

LangChain · LangGraph · RAG · MCP
// background

Education

Indiana University Bloomington
MS in Data Science
Aug 2024 – May 2026
GPA: 4.0 / 4.0
Bloomington, IN, USA
Manipal Institute of Technology
B.Tech in Electronics & Communication
Minor in Data Science
Jul 2017 – May 2021
GPA: 8.13 / 10
Manipal, India
// work history

Experience

Indiana University Bloomington
Oct 2024 – Present
Current
Part-Time AI/ML Engineer
Azure · Python · LangChain · LLM · RAG · NLP
  • Built and deployed Chat AIU, a multi-tenant RAG platform powering chatbots on 10+ IU department websites, with self-service knowledge ingestion from Confluence, documents, sitemaps, and URLs.
  • Improved retrieval quality via semantic reranking and LLM evaluation pipelines; extended with MCP, A2A, and OpenAPI agent integrations for tool-augmented conversational AI.
  • Building an ML/LLM decision support system for parking appeals that classifies violations, extracts policy clauses, and generates reviewer rationale with a fine-tuned SmolLM (135M), cutting adjudication time from weeks to days.
Indiana University Campus Auxiliaries
Jun 2025 – Aug 2025
AI & Data Engineer Intern
Azure AI Search · Azure ML · Python · FastAPI · LangChain · Time Series
  • Designed and built a document ingestion pipeline onboarding 12,000+ Confluence pages into Azure AI Search, automating extraction, chunking, embedding, and index updates.
  • Improved retrieval accuracy from 76% → 93% via metadata filtering and reranking; deployed a RAG system enabling 250+ staff to cut document lookup from 20–25 min to <10 min.
  • Built time-series forecasting models reducing dining footfall forecast error from ±2,000 to ±500 swipes/day, driving $2K–$3K daily operational cost savings.
Deloitte – Strategy & Analytics
Sep 2021 – Jul 2024
Data Engineer – FSI Domain
AWS Glue · PySpark · OracleDB · Postgres · Airflow
  • Engineered nightly batch ETL pipelines processing 12–15M insurance records/batch into a Landing → Staging → ODS architecture with row-level audit tracking and reusable transformation modules.
  • Orchestrated pipelines with AWS Glue Workflows + EventBridge + SNS alerting, reducing issue resolution time by 25–30%.
  • Built CodETL, a platform-agnostic ETL engine enabling 25+ transformations and reducing development effort by 40%. Presented at Deloitte AI & DE Summit.
Data Scientist – Customer Strategy & Pricing
Python · R · ML · AWS S3 · Tableau
  • Developed a linear mixed-effects regression model across 5,000+ store locations, generating a $1M–$3M profit increase while limiting customer churn to 1%.
  • Built DataLens, an LLM-powered RAG portal for natural language querying over structured and unstructured enterprise data.
  • Earned 4+ firm awards for successful delivery of 3 internal systems.
Samsung Research Institute – PRISM
Jan 2021 – Jun 2021
Computer Vision Research Intern
Python · OpenCV · NumPy · Image Processing
  • Developed a CV algorithm to enhance low-light astrophotography on mobile phones, improving signal-to-noise ratio using advanced image processing techniques.
  • Co-authored IEEE research paper; received Samsung Excellence Award for outstanding contributions to the PRISM internship program.
// selected work

Projects & Publications

GenAI · Enterprise

DataLens

LLM-powered RAG portal for natural language querying over structured and unstructured enterprise data. Presented at Deloitte AI & DE Summit.

LangChain · AWS S3 · Flask · Postgres
GenAI · DevTools

SprintlessAI

Generates Agile user stories from requirements docs + codebase context via RAG. Outputs structured stories and supports upload to Jira and GitHub.

RAG · Python · FAISS · Streamlit
GenAI · Multi-Agent

Semantic Intent Router

FAISS-based multi-agent routing pipeline that classifies user intent, retrieves the relevant domain, and dispatches to the correct agent using open-source embeddings.

FAISS · Embeddings · Python · LangChain
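
A minimal sketch of the routing idea described above, not the project's actual code. It assumes a toy bag-of-words vector in place of an open-source embedding model and a plain cosine-similarity scan in place of a FAISS index; the domain names, example phrases, and the `route` function are hypothetical.

```python
import math
from collections import Counter

# Hypothetical domains with example phrases (illustrative only).
DOMAIN_EXAMPLES = {
    "billing": "invoice payment refund charge",
    "tech_support": "error crash bug login password",
    "sales": "pricing plan upgrade demo quote",
}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real router would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One vector per domain; this lookup table stands in for the vector index.
DOMAIN_VECTORS = {name: embed(text) for name, text in DOMAIN_EXAMPLES.items()}


def route(query: str, threshold: float = 0.1) -> str:
    """Return the closest domain, or 'fallback' if no domain is similar enough."""
    q = embed(query)
    best, score = max(
        ((name, cosine(q, vec)) for name, vec in DOMAIN_VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return best if score >= threshold else "fallback"
```

For example, `route("I need a refund for this charge")` selects the `billing` domain, while a query with no overlapping terms falls through to `fallback`. The threshold keeps low-confidence queries away from a specialist agent.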
Data Eng · Tooling

CodETL

Platform-agnostic ETL engine with Airflow-orchestrated topologically sorted schedules, achieving a 40% reduction in dev time across 25+ transformations.

PySpark · Airflow · Flask · Postgres
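
A minimal sketch of topologically sorted scheduling, assuming transformations are declared with their upstream dependencies. It uses Python's standard-library `graphlib` rather than Airflow, and the step names and `execution_order` helper are hypothetical, not taken from CodETL.

```python
from graphlib import TopologicalSorter

# Hypothetical transformation steps mapped to the steps they depend on.
DEPENDENCIES = {
    "load_landing": set(),
    "clean_staging": {"load_landing"},
    "dedupe_staging": {"clean_staging"},
    "build_ods": {"clean_staging", "dedupe_staging"},
}


def execution_order(deps: dict) -> list:
    """Return a run order where every step follows its dependencies.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(DEPENDENCIES)
```

Here `order` places `load_landing` first and `build_ods` last. In an orchestrator, the same ordering would decide which tasks can be scheduled once their upstream tasks finish.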
ML · Pricing

Retail Sales Price Optimization

Price elasticity modeling across 5,000+ store locations driving $1M–$3M profit increase while limiting churn to 1%.

Python · R · Mixed-Effects Regression · AWS S3
Data Eng · Visualization

Retail Store Pricing Dashboard

Power BI dashboard visualizing product and city-level pricing trends and scenarios across retail locations.

Power BI · Python · SQL
ML · Time Series

Online Sign Recognition

Time-series handwriting recognition and fraud detection using sequential ML models on pen-stroke data.

Python · Time Series · Scikit-learn
ML · Audio

Music Emotion Recognition

Deep learning classifier that maps audio spectral features to emotional categories.

Python · TensorFlow · Audio DSP
ML · Prediction

Flight Price Prediction

Scraped flight data and trained ensemble models to predict ticket prices across routes and date ranges.

Python · Scikit-learn · XGBoost
Computer Vision · Research

Astrophotography Enhancement

Samsung R&D collaboration on a CV algorithm for low-light mobile astrophotography. IEEE published. Samsung Excellence Award.

Python · OpenCV · Image Processing
Computer Vision · Research

Flood Region Estimation

Cross-geography generalization of ML models for classifying flooded regions in UAV aerial imagery. arXiv published.

Python · Deep Learning · Segmentation
Computer Vision · Safety

Drowsy Driver Assistant

Real-time drowsiness detection that autonomously guides the car to the shoulder and sends an emergency alert on critical detection.

Python · OpenCV · Deep Learning
Computer Vision · HCI

Virtual Mouse

Gesture-based virtual mouse using OpenCV hand tracking for full cursor and click control without hardware.

Python · OpenCV · MediaPipe
Content · Education

Streamlit Tutorial Series

210K+ views and 13.5K hrs watch time on a YouTube series covering how to build data apps with Streamlit.

Streamlit · Python · YouTube
Tools · Automation

Sneaker Update Discord Bot

Real-time Discord bot for sneaker drops, supporting a client's resale business that generated $300K in sales across 1,000+ pairs.

Python · Discord API · Web Scraping
NLP · Chatbot

University Assistant Chatbot

NLP-powered FAQ chatbot for university queries. Custom dataset + multi-model intent classification. Served clients across 5+ countries.

Python · NLP · Intent Classification
// tech stack

Skills

Programming Foundations

Python · SQL · PySpark · R · TypeScript

LLM & AI Systems

RAG · LangChain · LangGraph · Vector Search · Hybrid Search (BM25 + Vector) · Cross-Encoder Reranking · Embeddings · Azure AI Search · Semantic Kernel · MCP · Ollama · Hugging Face

Data Engineering & Platforms

Apache Spark · ETL / ELT Pipelines · AWS Glue · Azure Data Factory · Airflow · dbt · Kafka · Snowflake · MongoDB · Data Modeling (Fact / Dim) · Data Quality & Lineage

Machine Learning & MLOps

Scikit-learn · PyTorch · TensorFlow · Time Series Forecasting · Regression & Clustering · Computer Vision · A/B Testing · Statistical Inference · Model Fine-tuning · Docker · CI/CD

Cloud, Storage & Analytics

AWS (Glue, S3, Redshift, EC2) · Azure (ML, ADF, AI Search) · GCP · OracleDB · Postgres · Amazon Aurora · Hadoop · Tableau · Power BI

Backend & APIs

FastAPI · REST APIs · Flask · Git / GitHub · OpenAPI

Actively seeking full-time roles in Data Engineering, ML Engineering, AI/LLM Systems, or Data Science.