Performance#

As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs trained with NeMo RL across different GPU systems and configurations. The YAML recipes to reproduce these runs can be found under this folder.

Nomenclature#

  • GBS: Global Batch Size

  • MBS: Micro Batch Size

  • TP: Tensor Parallel Size

  • PP: Pipeline Parallel Size

  • CP: Context Parallel Size

  • VP: Virtual Pipeline Parallel Size

  • EP: Expert Parallel Size

  • T-: Training related

  • G-: Generation related

  • Training backend: NeMo RL has two training backends, Megatron and PyTorch DTensor. This performance summary currently shows numbers from the Megatron backend only.
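The parallelism columns in the tables below use a bracketed shorthand, for example Training [TP,CP,EP,PP,VPP] = [2,2,16,8,n/a], where n/a means that axis is not used. Below is a minimal sketch of how to read the five-element training shorthand into named sizes; the dataclass and parser are illustrative helpers, not NeMo RL APIs.

```python
# Hypothetical reader for the bracketed parallelism shorthand used in
# the tables below; illustrative only, not a NeMo RL API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingParallelism:
    tp: Optional[int]   # Tensor Parallel Size
    cp: Optional[int]   # Context Parallel Size
    ep: Optional[int]   # Expert Parallel Size
    pp: Optional[int]   # Pipeline Parallel Size
    vpp: Optional[int]  # Virtual Pipeline Parallel Size ("n/a" -> None)

def parse_training_cell(cell: str) -> TrainingParallelism:
    """Parse a cell such as "[2,2,16,8,n/a]" into named sizes."""
    parts = [p.strip() for p in cell.strip("[]").split(",")]
    sizes = [None if p == "n/a" else int(p) for p in parts]
    return TrainingParallelism(*sizes)

print(parse_training_cell("[2,2,16,8,n/a]"))
# TrainingParallelism(tp=2, cp=2, ep=16, pp=8, vpp=None)
```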

Performance Metrics#

Since reinforcement learning consists of training, generation, and the transition between the two, performance measurement reflects this structure. Specifically, we track the following metrics:

  • Step time: Time for each step, which includes training, generation, policy logprobs, and refit time.

  • Tokens/sec/GPU: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU (see the worked sketch after this list):

    \[ \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}} \]
  • Training MFU: Model FLOPs Utilization, computed from the model floating-point operations per second achieved per GPU during training relative to the GPU's peak throughput
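As a concrete illustration of the Tokens/sec/GPU formula above, the short sketch below computes the metric from raw stage measurements. The helper function and the round numbers are illustrative only, not NeMo RL APIs or benchmark results.

```python
# Minimal sketch of the Tokens/sec/GPU formula; illustrative only.
def tokens_per_sec_per_gpu(total_tokens: int, stage_seconds: float, num_gpus: int) -> float:
    """Tokens/sec/GPU = total tokens processed / (stage time * number of GPUs)."""
    return total_tokens / (stage_seconds * num_gpus)

# Example: a generation stage producing 2,048 rollouts that average
# 1,024 tokens each, finishing in 64 s on 16 GPUs.
print(tokens_per_sec_per_gpu(2_048 * 1_024, 64.0, 16))  # 2048.0
```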

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models, organized by release version. These results were obtained using the performance recipes available here.

The performance data includes:

  • RL Performance: Performance metrics for various model sizes and architectures on different RL algorithms (GRPO, DAPO, and in the future PPO), for both on-policy and asynchronous (off-policy) training

  • System Configurations: Results across different GPU systems (DGX-H100 and GB200-NVL72, and in the future DGX-B200)

  • Precision Options: Performance comparisons between different precision modes (BF16, FP8)


NeMo RL v0.5#

H100 BF16 Benchmarks#

  • GRPO Dataset: OpenMathInstruct-2; DAPO Dataset: DAPOMath17k

  • System: DGX-H100

  • Precision: Training BF16, Generation BF16

  • Training Backend: Megatron-core.

| Algorithm | Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | LLAMA3.1_8B | On policy | 4,096 | 1,019 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,2,n/a] | 1,581 | 92.8 |
| GRPO | LLAMA3.1_8B | 1-step Off | 4,096 | 1,123 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,1,n/a] | 2,478 | 64.8 |
| GRPO | DeepSeek V3 | On policy | 1,536 | 744 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.7 | 134 |
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 738 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 13.1 | 64.9 |
| DAPO | DeepSeek V3 | On policy | 1,536 | 974 | 512 | 512 | 512 | [64,1] | [8,4,32,8,n/a] | 2.45 | 974 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,700 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 54.1 | 431 |
| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,707 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 58.7 | 203 |
| GRPO | Qwen3-30B3A | On policy | 4,096 | 3,196 | 32 | 2,048 | 512 | [2,1] | [1,1,8,1,n/a] | 1,066 | 198 |
| GRPO | Qwen3-30B3A | 1-step Off | 4,096 | 3,201 | 32 | 2,048 | 512 | [2,1] | [1,1,8,2,n/a] | 1,391 | 154 |
| GRPO | Qwen3-32B | On policy | 4,096 | 3,251 | 32 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 571 | 376 |
| GRPO | Qwen3-32B | 1-step Off | 4,096 | 3,252 | 64 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 538 | 200 |

H100 FP8 Benchmarks#

  • GRPO Dataset: OpenMathInstruct-2

  • System: DGX-H100

  • Precision: Training FP8, Generation FP8

  • Training Backend: Megatron-core.

| Algorithm | Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | LLAMA3.1_8B | 1-step Off | 4,096 | 1,128 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,1,n/a] | 3,052 | 53.0 |
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 761 | 512 | 512 | 512 | [16,1] | [1,1,16,16,n/a] | 14.1 | 67.6 |

GB200 BF16 Benchmarks#

  • GRPO Dataset: OpenMathInstruct-2

  • System: GB200-NVL72

  • Precision: Training BF16, Generation BF16

  • Training Backend: Megatron-core.

| Algorithm | Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | LLAMA3.1_8B | On policy | 4,096 | 1,066 | 8 | 2,048 | 512 | [1,1] | [1,1,1,1,1,1,n/a] | 3,359 | 91.0 |
| GRPO | LLAMA3.1_8B | 1-step Off | 4,096 | 1,107 | 8 | 2,048 | 512 | [1,1] | [1,1,1,1,1,1,n/a] | 4,463 | 71.1 |
| GRPO | DeepSeek V3 | On policy | 1,536 | 996 | 128 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 34.3 | 128 |
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 994 | 256 | 512 | 512 | [16,1] | [1,1,16,8,n/a] | 31.7 | 64.5 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,711 | 64 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 140 | 332 |
| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,711 | 128 | 512 | 512 | [8,1] | [4,1,16,4,n/a] | 87.9 | 268 |
| GRPO | Qwen3-30B3A | On policy | 4,096 | 3,198 | 16 | 2,048 | 512 | [1,1] | [1,1,16,1,n/a] | 1,822 | 232 |
| GRPO | Qwen3-30B3A | 1-step Off | 4,096 | 3,204 | 32 | 2,048 | 512 | [1,1] | [1,1,16,1,n/a] | 1,558 | 136 |
| GRPO | Qwen3-32B | On policy | 4,096 | 3,253 | 16 | 2,048 | 512 | [1,1] | [2,1,1,1,n/a] | 1,127 | 381 |
| GRPO | Qwen3-32B | 1-step Off | 4,096 | 3,258 | 32 | 2,048 | 512 | [1,1] | [2,1,1,1,n/a] | 1,025 | 210 |

Note:

  • All Mixture-of-Experts (MoE) model training uses dropless token routing (no tokens are dropped).

  • The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, and Total Step Time (s). Because of this averaging, the numbers in the tables do not exactly match the equation stated in Performance Metrics above, but the difference is small (see the sketch below).
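The mismatch mentioned in the last note comes from averaging per-step ratios rather than recomputing the ratio from totals. A minimal sketch, assuming the tables average the per-step values; the numbers below are made up for illustration, not benchmark data:

```python
# Illustrative numbers only; not benchmark data.
tokens = [1_000_000, 1_200_000, 900_000, 1_100_000, 1_050_000]  # tokens per step
secs = [60.0, 75.0, 55.0, 68.0, 64.0]                           # stage time per step (s)
gpus = 16

# Average of per-step Tokens/sec/GPU (what per-step averaging yields).
per_step = [t / (s * gpus) for t, s in zip(tokens, secs)]
avg_of_ratios = sum(per_step) / len(per_step)

# Ratio recomputed from totals (what the formula in Performance Metrics yields).
ratio_of_totals = sum(tokens) / (sum(secs) * gpus)

print(f"{avg_of_ratios:.1f} vs {ratio_of_totals:.1f}")  # close, but not identical
```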