Available Environments

NeMo Gym includes a curated collection of environments for training and evaluation across multiple domains. This page is generated from docs/data/environments.yaml. To update it, run:

python scripts/generate_environments_yaml.py

Example Environment Patterns

Multi Step example

Multi-step tool calling

Session State Mgmt example

Session state management (in-memory)

Single Tool Call example

Basic single-step tool calling

Environments for Training & Evaluation

Arc Agi 1 config
knowledge
arc_agi.yaml | knowledge | validation
readme README
domain knowledge
Aviary 4 configs
agent coding math
aviary.yaml | math | train | validation | Apache 2.0
readme README
domain math
readme README
domain coding
gsm8k_aviary.yaml | math | train | validation | Apache 2.0
readme README
domain math
hotpotqa_aviary.yaml | agent | train | validation | Apache 2.0
readme README
domain agent
Calendar 1 config
agent
calendar.yaml | agent | train | validation | Apache 2.0
Circle Click 1 config
other
readme README
domain other
description Click on circles in images
Code Gen 1 config
coding
code_gen.yaml | coding | train | validation | Apache 2.0
Equivalence Llm Judge 4 configs
agent knowledge
readme README
domain knowledge
description Short answer questions with LLM-as-a-judge
value Improve knowledge-related benchmarks like GPQA / HLE
lc.yaml | knowledge
config lc.yaml
readme README
domain knowledge
lc_judge.yaml | knowledge
readme README
domain knowledge
nl2bash-equivalency.yaml | agent | train | validation | GNU General Public License v3.0
readme README
domain agent
description Short bash command generation questions with LLM-as-a-judge
value Improve foundational bash and IF capabilities
Ether0 1 config
knowledge
ether0.yaml | knowledge | validation
readme README
domain knowledge
description ether0 chemistry benchmark verifiers
value Evaluate chemistry knowledge and reasoning with the ether0 benchmark
Genrm Compare 1 config
Google Search 1 config
agent
google_search.yaml | agent | train | Apache 2.0
readme README
domain agent
description Multi-choice question answering problems with search tools integrated
value Improve knowledge-related benchmarks with search tools
Instruction Following 1 config
instruction_following
instruction_following.yaml | instruction_following | train | Apache 2.0
readme README
domain instruction_following
description Instruction following datasets targeting IFEval and IFBench style instruction following capabilities
value Improve IFEval and IFBench
Jailbreak Detection 1 config
safety
readme README
domain safety
description Jailbreak detection with Nemotron judge + combined reward
Math Advanced Calculations 1 config
agent
readme README
domain agent
description An instruction following math environment with counter-intuitive calculators
value Improve instruction following capabilities in specific math environments
Math Formal Lean 6 configs
math
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
readme README
domain math
description Lean4 formal proof verification environment with multi-turn self-correction
value Improve formal theorem proving capabilities
nemotron_clean_easy.yaml | math | train | Apache 2.0
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
nemotron_first_try_hard.yaml | math | train | Apache 2.0
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
nemotron_medium_500.yaml | math | train | Apache 2.0
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
nemotron_very_easy.yaml | math | train | Apache 2.0
readme README
domain math
description Lean4 formal proof verification environment
value Improve formal theorem proving capabilities
Math With Code 1 config
math
math_with_code.yaml | math | train | Apache 2.0
readme README
domain math
Math With Judge 7 configs
math
bytedtsinghua_dapo17k.yaml | math | train | validation | Apache 2.0
readme README
domain math
dapo17k.yaml | math | train | validation | Apache 2.0
readme README
domain math
dapo17k_filtered_qwen330ba3binstruct.yaml | math | train | validation | Apache 2.0
math_stack_overflow.yaml | math | train | validation | Creative Commons Attribution-ShareAlike 4.0 International
math_with_judge.yaml | math | train | validation | Creative Commons Attribution 4.0 International
readme README
domain math
description Math dataset with math-verify and LLM-as-a-judge
value Improve math capabilities including AIME 24 / 25
readme README
domain math
Mcqa 1 config
knowledge
mcqa.yaml | knowledge | train | validation | Apache 2.0
config mcqa.yaml
readme README
domain knowledge
description Multi-choice question answering problems
value Improve benchmarks like MMLU / GPQA / HLE
Mini Swe Agent 1 config
coding
mini_swe_agent.yaml | coding | train | validation | MIT
readme README
domain coding
description A software development environment with mini-swe-agent orchestration
value Improve software development capabilities, like SWE-bench
dataset SWE-Gym
Multichallenge 2 configs
knowledge
multichallenge.yaml | knowledge | train | TBD
readme README
domain knowledge
description MultiChallenge benchmark evaluation with LLM judge
multichallenge_nrl.yaml | knowledge | train | TBD
readme README
domain knowledge
description MultiChallenge benchmark evaluation with LLM judge
Newton Bench 1 config
math
newton_bench.yaml | math | train | Apache 2.0
readme README
domain math
Ns Tools 1 config
agent
readme README
domain agent
description NeMo Skills tool execution with math verification
Over Refusal Detection 3 configs
safety
readme README
domain safety
description Over-refusal detection - monitors if model responds helpfully to safe prompts
Reasoning Gym 2 configs
knowledge
reasoning_gym.yaml | knowledge | train | Apache 2.0
readme README
domain knowledge
readme README
domain knowledge
Structured Outputs 1 config
instruction_following
structured_outputs_json.yaml | instruction_following | train | validation | Apache 2.0
readme README
domain instruction_following
description Check whether responses follow the structured output requirements given in prompts
value Improve instruction following capabilities
Swerl Gen 1 config
coding
swerl_gen.yaml | coding | train | validation | Apache 2.0
readme README
domain coding
description Sandboxed evaluation for SWE-style tasks (either patch generation or reproduction test generation)
value Improve SWE capabilities useful for benchmarks like SWE-bench
Swerl Llm Judge 1 config
coding
swerl_llm_judge.yaml | coding | train | validation | MIT
readme README
domain coding
description SWE-style multiple-choice LLM-judge tasks scored via <solution>...</solution> choice.
value Improve SWE capabilities useful for benchmarks like SWE-bench
Tavily Search 2 configs
agent
tavily_search_judge_openai_model.yaml | agent | train | validation | Apache 2.0
tavily_search_judge_vllm_model.yaml | agent | train | validation | Apache 2.0
Terminus Judge 2 configs
agent
terminus_judge.yaml | agent | train | validation | Apache 2.0
readme README
domain agent
description Single-step terminal-based task (rubrics v4 judge prompt)
value Improve on terminal-style tasks
terminus_judge_simple.yaml | agent | train | validation | Apache 2.0
readme README
domain agent
description Single-step terminal-based task (simple judge prompt)
value Improve on terminal-style tasks
Text To Sql 1 config
coding
readme README
domain coding
description Text-to-SQL generation with LLM-as-a-judge equivalence checking
value Improve text-to-SQL capabilities across multiple dialects
Workplace Assistant 1 config
agent
workplace_assistant.yaml | agent | train | validation | Apache 2.0
readme README
domain agent
description Workplace assistant multi-step tool-using environment
value Improve multi-step tool use capability
Xlam Fc 1 config
agent
xlam_fc.yaml | agent | train | validation | Apache 2.0
readme README
domain agent
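
When choosing environments from this catalog, it can help to treat each config as a small record and filter by domain or by available dataset split. The sketch below is illustrative only. The config and domain values are copied from entries on this page, but the record layout, the `splits` key, and both helper functions are hypothetical and do not reflect the actual schema of docs/data/environments.yaml or any NeMo Gym API.

```python
from collections import defaultdict

# Hypothetical in-memory records. "config" and "domain" mirror the labels
# used on this page; "splits" is an assumed name for the train/validation
# badges shown next to each config.
ENVIRONMENTS = [
    {"config": "mcqa.yaml", "domain": "knowledge", "splits": ["train", "validation"]},
    {"config": "ether0.yaml", "domain": "knowledge", "splits": ["validation"]},
    {"config": "google_search.yaml", "domain": "agent", "splits": ["train"]},
    {"config": "swerl_gen.yaml", "domain": "coding", "splits": ["train", "validation"]},
]


def configs_by_domain(records):
    """Group config file names by their domain, preserving input order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["domain"]].append(record["config"])
    return dict(grouped)


def configs_with_split(records, split):
    """Return config file names that provide the given dataset split."""
    return [r["config"] for r in records if split in r["splits"]]


if __name__ == "__main__":
    print(configs_by_domain(ENVIRONMENTS))
    print(configs_with_split(ENVIRONMENTS, "validation"))
```

Grouping by domain reflects how the catalog tags each environment, and filtering by split separates configs usable for training from those that only ship validation data, such as ether0.yaml.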