Optimizing LLM Training and Inference Performance on GPUs
This 1-hour lecture-based workshop explores how to optimize GPU performance and reduce operational costs when training and serving large language models. Attendees will learn practical strategies for high-throughput training and low-latency inference, including modern parallelism techniques and disaggregated serving architectures used in production-scale LLM systems.

Technical Marketing Engineer - NVIDIA
Drives optimization of large-scale LLM inference and AI systems performance. Holds an M.S. in Computer Science from the University of Chicago, with thesis research on GNN-based cache system acceleration; related work was presented at PyTorch Conference 2025. Currently works on the NVIDIA AI Platform Software team, focusing on LLM optimization, inference systems, and developer-facing AI infrastructure.
Workshop Overview
Part 1: Optimizing Distributed LLM Training on GPUs
This section focuses on achieving peak training efficiency for large language models using modern distributed training techniques. We’ll explore how Data Parallelism (DP), Tensor Parallelism (TP), Sequence Parallelism (SP), Expert Parallelism (EP), and Context Parallelism (CP) work together in systems like Megatron-LM. You’ll gain an intuition for when and why to combine these strategies, how they affect memory, communication, and throughput, and how to choose the right parallelism mix as models and context lengths scale. The goal is to build a clear mental model for designing cost-efficient, high-performance LLM training pipelines on multi-GPU and multi-node systems.
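To preview the kind of memory arithmetic behind these trade-offs, here is a minimal back-of-the-envelope sketch, not part of the workshop materials, that estimates per-GPU model-state memory under a TP x PP x DP split. The 70B-parameter model and the 8 x 2 x 4 layout are illustrative assumptions; the byte counts follow the standard mixed-precision-plus-Adam accounting of roughly 16 bytes per parameter.

```python
# Back-of-the-envelope estimator for per-GPU model-state memory under a
# DP x TP x PP parallelism mix. All model numbers below are illustrative
# assumptions, not the workshop's reference configuration.

def training_memory_per_gpu_gb(
    n_params: float,               # total model parameters
    tp: int,                       # tensor-parallel degree
    pp: int,                       # pipeline-parallel degree
    dp: int,                       # data-parallel degree
    shard_optimizer: bool = True,  # shard optimizer state over DP (ZeRO-1 style)
) -> float:
    """Rough per-GPU memory for weights, grads, and Adam state in mixed precision.

    Standard accounting: 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 master
    weights and Adam moments = ~16 B per parameter. Activations, KV buffers,
    and communication buffers are ignored here.
    """
    params_per_gpu = n_params / (tp * pp)      # TP and PP both shard the weights
    weights_and_grads = params_per_gpu * 4     # 2 B weights + 2 B grads
    optimizer_state = params_per_gpu * 12      # fp32 master weights + Adam m, v
    if shard_optimizer:
        optimizer_state /= dp                  # optimizer state shards over DP ranks
    return (weights_and_grads + optimizer_state) / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-parameter dense model on 64 GPUs: TP=8, PP=2, DP=4.
    gpus = 8 * 2 * 4
    gb = training_memory_per_gpu_gb(70e9, tp=8, pp=2, dp=4)
    print(f"{gpus} GPUs -> ~{gb:.1f} GB of model state per GPU (before activations)")
```

Activation memory, sequence/context-parallel effects, and communication overlap are deliberately left out of this sketch; the session covers how those shift the picture.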
Part 2: High-Performance LLM Inference & Disaggregated Serving
The second section shifts to inference optimization, breaking down the KV cache lifecycle across the prefill and decode phases and examining how each phase stresses GPU resources differently. We’ll analyze how modern inference engines optimize these stages and why naive serving leads to prefill-decode interference and poor tail latency. You’ll then learn how disaggregated serving with Dynamo separates prefill and decode onto specialized GPU pools. This architecture dramatically reduces P99 latency, improves throughput, and maximizes hardware utilization, especially for long-context and multi-tenant workloads. We’ll also discuss how context parallelism and caching strategies evolve as context windows grow.
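As a taste of why prefill and decode stress GPUs so differently, the sketch below estimates how quickly per-request KV cache grows with context length. The Llama-70B-style shape (80 layers, 8 grouped-query KV heads, head dimension 128, bf16 cache) is an assumption for illustration, not a workshop artifact.

```python
# Minimal sketch of the KV cache arithmetic behind prefill/decode planning.
# The model shape used in __main__ is an illustrative assumption; plug in
# your own architecture's numbers.

def kv_cache_gb_per_request(
    num_layers: int,
    num_kv_heads: int,         # grouped-query attention uses fewer KV heads than query heads
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,   # bf16/fp16 cache
) -> float:
    """GB of K and V that decode must keep resident for one sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * context_len / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-class config: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (8_192, 32_768, 128_000):
        gb = kv_cache_gb_per_request(80, 8, 128, ctx)
        print(f"context {ctx:>7,} tokens -> ~{gb:5.1f} GB of KV cache per request")
```

For this configuration a single 128K-token request already needs roughly 40 GB of cache on top of the model weights, and decode must re-read that cache for every generated token. That bandwidth pressure, versus the compute-heavy prefill pass, is exactly what motivates separate prefill and decode pools and smarter cache placement.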
What You’ll Walk Away With
- A practical framework for combining DP, TP, SP, EP, and context parallelism
- A clear understanding of KV cache behavior during prefill vs decode
- When and why to use disaggregated serving in production LLM systems
- Architectural intuition for scaling LLM training and inference efficiently on GPUs
Time and Location
March 31, 2026
3:15pm - 4:15pm
Cobb Galleria
Who Should Attend
- ML & MLOps Engineers training and serving large language models in production
- AI Infrastructure & Platform Engineers optimizing GPU performance, cost, and scalability
- Systems Engineers & CUDA Practitioners working with distributed training and inference
- AI Architects & Technical Leads designing efficient, low-latency LLM systems
