Model Parallelism: Building and Deploying Large Neural Networks

Very large deep neural networks (DNNs), whether applied to natural language processing (e.g., GPT-3), computer vision (e.g., huge Vision Transformers), or speech AI (e.g., wav2vec 2.0), have certain properties that set them apart from their smaller counterparts. As DNNs become larger and are trained on progressively larger datasets, they can adapt to new tasks with just a handful of training examples, accelerating the route toward general artificial intelligence. Training models that contain tens to hundreds of billions of parameters on vast datasets isn’t trivial and requires a unique combination of AI, high-performance computing (HPC), and systems knowledge.

Mark Moyou, PhD

Sr. Data Scientist

NVIDIA

Time and Location

April 16

9:00am - 3:30pm

Cobb Galleria

Curriculum

Learning Objectives

In this workshop, participants will learn how to:

Train neural networks across multiple servers
Use techniques such as activation checkpointing, gradient accumulation, and various forms of model parallelism to overcome the challenges associated with large-model memory footprint
Capture and understand training performance characteristics to optimize model architecture
Deploy very large multi-GPU models to production using NVIDIA Triton™ Inference Server

Topics Covered
The goal of this course is to demonstrate how to train the largest of neural networks and deploy them to production.

Course Outline
Below is a suggested timeline for the course. Please work with the instructor to find the best timeline for your session.

Introduction (15 mins)

Meet the instructor.
Create an account at courses.nvidia.com/join

Introduction to Training of Large Models (1.5 hours)

Learn about the motivation behind and key challenges of training large models.
Get an overview of the basic techniques and tools needed for large-scale training.
Get an introduction to distributed training and the Slurm job scheduler.
Train a GPT model using data parallelism.
Profile the training process and understand execution performance.

Break (15 mins)

Model Parallelism: Advanced Topics (2 hours)

Increase the model size using a range of memory-saving techniques.
Get an introduction to tensor and pipeline parallelism.
Go beyond natural language processing and get an introduction to DeepSpeed.
Auto-tune model performance.
Learn about mixture-of-experts models.

Break (1 hour)

Inference of Large Models (2 hours)

Understand the challenges of deployment associated with large models.
Explore techniques for model reduction.
Learn how to use TensorRT-LLM.
Learn how to use Triton Inference Server.
Understand the process of deploying a GPT checkpoint to production.
See an example of prompt engineering.

Final Review (15 mins)

Review key learnings and answer questions.
Complete the assessment and earn a certificate.
Complete the workshop survey.

Workshop Requirements

Prerequisites:
Good understanding of PyTorch
Good understanding of deep learning and data-parallel training concepts
Practice with deep learning and data-parallel training is useful but optional
Tools, libraries, frameworks used: PyTorch, Megatron-LM, DeepSpeed, Slurm, Triton Inference Server

This workshop is intended for those who have a good understanding of deep neural networks and would like to enhance their understanding of scaling training and inference.

Who is your Instructor?

Dr. Mark Moyou is a Senior Data Scientist at NVIDIA, where he works with enterprise clients on AI strategy and deploying machine learning applications to production. He is the host of The AI Portfolio Podcast and the Caribbean Tech Pioneers Podcast, and the Director of the Optimized AI Conference.
