Optimizing LLM Training and Inference Performance on GPUs
This 1-hour lecture-based workshop explores how to optimize GPU performance and reduce operational costs when training and serving large language models. Attendees will learn practical strategies for high-throughput training and low-latency inference, including modern parallelism techniques and disaggregated serving architectures used in production-scale LLM systems.

Technical Marketing Engineer - NVIDIA
Drives optimization of large-scale LLM inference and AI systems performance. Holds an M.S. in Computer Science from the University of Chicago, with thesis research on GNN-based cache system acceleration; related work was presented at PyTorch Conference 2025. Currently works on the NVIDIA AI Platform Software team, focusing on LLM optimization, inference systems, and developer-facing AI infrastructure.
Workshop Overview
Part 1: Optimizing Distributed LLM Training on GPUs
This section focuses on achieving peak training efficiency for large language models using modern distributed training techniques. We’ll explore how Data Parallelism (DP), Tensor Parallelism (TP), Sequence Parallelism (SP), Expert Parallelism (EP), and Context Parallelism (CP) work together in systems like Megatron-LM. You’ll gain an intuition for when and why to combine these strategies, how they affect memory, communication, and throughput, and how to choose the right parallelism mix as models and context lengths scale. The goal is to build a clear mental model for designing cost-efficient, high-performance LLM training pipelines on multi-GPU and multi-node systems.
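To preview the kind of memory arithmetic behind these trade-offs, here is a minimal back-of-the-envelope sketch, not part of the workshop materials, that estimates per-GPU model-state memory under a TP x PP x DP split. The 70B-parameter model and the 8 x 2 x 4 layout are illustrative assumptions; the byte counts follow the standard mixed-precision-plus-Adam accounting of roughly 16 bytes per parameter.

```python
# Back-of-the-envelope estimator for per-GPU model-state memory under a
# DP x TP x PP parallelism mix. All model numbers below are illustrative
# assumptions, not the workshop's reference configuration.

def training_memory_per_gpu_gb(
    n_params: float,               # total model parameters
    tp: int,                       # tensor-parallel degree
    pp: int,                       # pipeline-parallel degree
    dp: int,                       # data-parallel degree
    shard_optimizer: bool = True,  # shard optimizer state over DP (ZeRO-1 style)
) -> float:
    """Rough per-GPU memory for weights, grads, and Adam state in mixed precision.

    Standard accounting: 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 master
    weights and Adam moments = ~16 B per parameter. Activations, KV buffers,
    and communication buffers are ignored here.
    """
    params_per_gpu = n_params / (tp * pp)      # TP and PP both shard the weights
    weights_and_grads = params_per_gpu * 4     # 2 B weights + 2 B grads
    optimizer_state = params_per_gpu * 12      # fp32 master weights + Adam m, v
    if shard_optimizer:
        optimizer_state /= dp                  # optimizer state shards over DP ranks
    return (weights_and_grads + optimizer_state) / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-parameter dense model on 64 GPUs: TP=8, PP=2, DP=4.
    gpus = 8 * 2 * 4
    gb = training_memory_per_gpu_gb(70e9, tp=8, pp=2, dp=4)
    print(f"{gpus} GPUs -> ~{gb:.1f} GB of model state per GPU (before activations)")
```

Activation memory, sequence/context-parallel effects, and communication overlap are deliberately left out of this sketch; the session covers how those shift the picture.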
Part 2: High-Performance LLM Inference & Disaggregated Serving
The second section shifts to inference optimization, breaking down the KV cache lifecycle across the prefill and decode phases and examining how each phase stresses GPU resources differently. We’ll analyze how modern inference engines optimize these stages and why naive serving leads to prefill-decode interference and poor tail latency. You’ll then learn how disaggregated serving with Dynamo separates prefill and decode onto specialized GPU pools. This architecture dramatically reduces P99 latency, improves throughput, and maximizes hardware utilization, especially for long-context and multi-tenant workloads. We’ll also discuss how context parallelism and caching strategies evolve as context windows grow.
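As a taste of why prefill and decode stress GPUs so differently, the sketch below estimates how quickly per-request KV cache grows with context length. The Llama-70B-style shape (80 layers, 8 grouped-query KV heads, head dimension 128, bf16 cache) is an assumption for illustration, not a workshop artifact.

```python
# Minimal sketch of the KV cache arithmetic behind prefill/decode planning.
# The model shape used in __main__ is an illustrative assumption; plug in
# your own architecture's numbers.

def kv_cache_gb_per_request(
    num_layers: int,
    num_kv_heads: int,         # grouped-query attention uses fewer KV heads than query heads
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,   # bf16/fp16 cache
) -> float:
    """GB of K and V that decode must keep resident for one sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * context_len / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (8_192, 32_768, 128_000):
        gb = kv_cache_gb_per_request(80, 8, 128, ctx)
        print(f"context {ctx:>7,} tokens -> ~{gb:5.1f} GB of KV cache per request")
```

For this configuration a single 128K-token request already needs roughly 40 GB of cache on top of the model weights, and decode must re-read that cache for every generated token. That bandwidth pressure, versus the compute-heavy prefill pass, is exactly what motivates separate prefill and decode pools and smarter cache placement.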
What You’ll Walk Away With
- A practical framework for combining DP, TP, SP, EP, and context parallelism
- A clear understanding of KV cache behavior during prefill vs decode
- When and why to use disaggregated serving in production LLM systems
- Architectural intuition for scaling LLM training and inference efficiently on GPUs
Time and Location
March 31, 2026
3:15pm - 4:15pm
Cobb Galleria
Who Should Attend
- ML & MLOps Engineers training and serving large language models in production
- AI Infrastructure & Platform Engineers optimizing GPU performance, cost, and scalability
- Systems Engineers & CUDA Practitioners working with distributed training and inference
- AI Architects & Technical Leads designing efficient, low-latency LLM systems
