nemo_gym.train_data_utils#
Module Contents#
Classes#
- TrainDataProcessorConfig: Prepare and validate training data, generating metrics and statistics for datasets.

Functions#
- aggregate_other_metrics: Combines misc items (those other than response/response create params) into current metrics.
- postprocess_other_metrics: Aggregates metrics and merges current metrics (containing only AvgMinMax) with StringMetrics.
- Check if required env variables are present for the chosen backend.
API#
- class nemo_gym.train_data_utils.TrainDataProcessorConfig(/, **data: typing.Any)[source]#
Bases: nemo_gym.config_types.BaseNeMoGymCLIConfig
Prepare and validate training data, generating metrics and statistics for datasets.
Examples:
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +output_dirpath=data/example_multi_step +mode=example_validation
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
`self` is explicitly positional-only to allow `self` as a field name.
- output_dirpath: str#
‘Field(…)’
- mode: Union[Literal[train_preparation], Literal[example_validation]]#
‘Field(…)’
- should_download: bool#
‘Field(…)’
- data_source: Literal[gitlab, huggingface]#
‘Field(…)’
- property in_scope_dataset_types: List[nemo_gym.config_types.DatasetType]#
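For orientation, here is a standard-library stand-in mirroring the field shape above. This is illustrative only: the real class is a pydantic model with `Field(…)` defaults, and the defaults and validation rules shown here are assumptions, not the actual API.

```python
from dataclasses import dataclass

# Allowed values taken from the Literal annotations documented above.
_ALLOWED_MODES = {"train_preparation", "example_validation"}
_ALLOWED_SOURCES = {"gitlab", "huggingface"}


@dataclass
class TrainDataProcessorConfigSketch:
    """Stdlib stand-in for TrainDataProcessorConfig (defaults are assumed)."""

    output_dirpath: str
    mode: str
    should_download: bool = False
    data_source: str = "huggingface"

    def __post_init__(self):
        # Imitate the Literal[...] constraints with explicit checks.
        if self.mode not in _ALLOWED_MODES:
            raise ValueError(f"mode must be one of {_ALLOWED_MODES}, got {self.mode!r}")
        if self.data_source not in _ALLOWED_SOURCES:
            raise ValueError(f"data_source must be one of {_ALLOWED_SOURCES}")


cfg = TrainDataProcessorConfigSketch(
    output_dirpath="data/example_multi_step",
    mode="example_validation",
)
```

The real pydantic model would additionally coerce and report errors via ValidationError; this sketch only shows the field shape.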
- class nemo_gym.train_data_utils.Accumulator(/, **data: typing.Any)[source]#
Bases: pydantic.BaseModel
- is_aggregated: bool#
‘Field(…)’
- class nemo_gym.train_data_utils.AvgMinMax(/, **data: typing.Any)[source]#
Bases: nemo_gym.train_data_utils.Accumulator
- model_config#
‘ConfigDict(…)’
- total: int#
‘Field(…)’
- average: float#
‘Field(…)’
- min: float#
‘Field(…)’
- max: float#
‘Field(…)’
- median: float#
‘Field(…)’
- stddev: float#
‘Field(…)’
- mean: float#
‘Field(…)’
- M2: float#
‘Field(…)’
- tdigest: tdigest.TDigest#
‘Field(…)’
T-Digest is used to estimate the median without storing and sorting all values: the median is approximated as the 50th-percentile estimate, which is typically very close to the true median.
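The total/mean/M2/stddev fields are the classic ingredients of Welford's online algorithm for streaming mean and variance. A standard-library stand-in for such an accumulator is sketched below; it is illustrative only, since the real class is a pydantic model and estimates the median with a TDigest, which this sketch omits.

```python
import math


class StreamingStats:
    """Illustrative stand-in for an AvgMinMax-style accumulator.

    Tracks count, min, max, and a running mean/variance via Welford's
    online algorithm (the mean/M2 pair mirrors the fields above).
    """

    def __init__(self):
        self.total = 0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self.M2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.total += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.total
        self.M2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        # Population standard deviation; 0.0 until two samples arrive.
        return math.sqrt(self.M2 / self.total) if self.total > 1 else 0.0


stats = StreamingStats()
for v in [1.0, 2.0, 3.0, 4.0]:
    stats.update(v)
print(stats.mean, stats.min, stats.max)  # 2.5 1.0 4.0
```

The appeal of this formulation is that each sample is seen once and nothing but five scalars is stored, which is why a TDigest is needed separately for the median.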
- class nemo_gym.train_data_utils.StringMetrics(/, **data: typing.Any)[source]#
Bases: pydantic.BaseModel
- unique_count: int#
None
- total_count: int#
None
- class nemo_gym.train_data_utils.DatasetMetrics(/, **data: typing.Any)[source]#
Bases: nemo_gym.train_data_utils.Accumulator
- model_config#
‘ConfigDict(…)’
- number_of_examples: int#
‘Field(…)’
- number_of_tools: nemo_gym.train_data_utils.AvgMinMax#
‘Field(…)’
- json_dumped_number_of_words: nemo_gym.train_data_utils.AvgMinMax#
‘Field(…)’
- number_of_turns: nemo_gym.train_data_utils.AvgMinMax#
‘Field(…)’
- temperature: nemo_gym.train_data_utils.AvgMinMax#
‘Field(…)’
- nemo_gym.train_data_utils.aggregate_other_metrics(
- metrics: Dict[str, Any],
- sample: Dict[str, Any],
)#
Combines misc items (those other than response/response create params) into current metrics.
- nemo_gym.train_data_utils.postprocess_other_metrics(
- metrics: nemo_gym.train_data_utils.DatasetMetrics,
- other_metrics: Dict[str, Any],
)#
Aggregates metrics and merges current metrics (containing only AvgMinMax) with StringMetrics.
- nemo_gym.train_data_utils.compute_sample_metrics(
- sample_dict_str: str,
)#
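Given that compute_sample_metrics takes a single JSON-serialized sample string, a plausible sketch of per-sample metric extraction follows. The key names ("tools", "messages", "temperature") are assumptions about the sample schema and are not confirmed by this module; the metric names mirror the DatasetMetrics fields above.

```python
import json


def compute_sample_metrics_sketch(sample_dict_str: str) -> dict:
    """Illustrative sketch only; the sample keys used here are assumed."""
    sample = json.loads(sample_dict_str)
    return {
        "number_of_tools": len(sample.get("tools", [])),
        # Rough word count over the serialized sample, mirroring the
        # json_dumped_number_of_words field documented above.
        "json_dumped_number_of_words": len(json.dumps(sample).split()),
        "number_of_turns": len(sample.get("messages", [])),
        "temperature": sample.get("temperature"),
    }


line = (
    '{"tools": [{"name": "search"}], '
    '"messages": [{"role": "user", "content": "hi"}], '
    '"temperature": 0.7}'
)
metrics = compute_sample_metrics_sketch(line)
```

Each per-sample dict like this could then be folded into the streaming AvgMinMax accumulators rather than kept in memory.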
- class nemo_gym.train_data_utils.DatasetValidatorState(/, **data: typing.Any)[source]#
Bases: pydantic.BaseModel
- model_config#
‘ConfigDict(…)’
- metrics: nemo_gym.train_data_utils.DatasetMetrics#
‘Field(…)’
- key_counts: collections.Counter#
‘Field(…)’
- offending_example_idxs: List[int]#
‘Field(…)’
- other_metrics: Dict[str, Any]#
‘Field(…)’
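The validator state above pairs a Counter of observed keys with a list of offending example indices. A minimal sketch of how such state might be accumulated while scanning samples is shown below; the required-key rule is an assumption for illustration, not the actual validation logic.

```python
from collections import Counter


def validate_samples_sketch(samples):
    """Illustrative only: count key frequencies across samples and record
    the indices of samples missing a (hypothetical) required key."""
    key_counts = Counter()
    offending_example_idxs = []
    for idx, sample in enumerate(samples):
        key_counts.update(sample.keys())
        # Assumed requirement; the real validator's rules are not shown here.
        if "responses_create_params" not in sample:
            offending_example_idxs.append(idx)
    return key_counts, offending_example_idxs


samples = [
    {"responses_create_params": {}, "response": {}},
    {"response": {}},
]
counts, offending = validate_samples_sketch(samples)
```

Keeping only counts and indices (rather than the samples themselves) matches the streaming flavor of the rest of the module.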
- class nemo_gym.train_data_utils.TrainDataProcessor(/, **data: typing.Any)[source]#
Bases: pydantic.BaseModel
- run(global_config_dict: omegaconf.DictConfig)[source]#
See the README section “How To: Prepare and validate data for PR submission or RL training”.
- load_and_validate_server_instance_configs(
- config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
- global_config_dict: omegaconf.DictConfig,
)#
- load_datasets(
- config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
- server_instance_configs: List[nemo_gym.config_types.ServerInstanceConfig],
)#
- _validate_samples_and_aggregate_metrics_single_sample(
- state: nemo_gym.train_data_utils.DatasetValidatorState,
- sample_idx: int,
- sample_dict_str: str,
)#
- _iter_dataset_lines(
- dataset_config: nemo_gym.config_types.DatasetConfig,
)#
- _validate_samples_and_aggregate_metrics_single_dataset(
- dataset_config: nemo_gym.config_types.DatasetConfig,
)#
- _validate_aggregate_metrics(
- aggregate_metrics_dict: Dict,
- metrics_fpath: pathlib.Path,
)#
Returns the path of the conflicting metrics file if validation fails; otherwise returns None.
- validate_samples_and_aggregate_metrics(
- server_instance_configs: List[nemo_gym.config_types.ServerInstanceConfig],
)#
- _collate_samples_single_type(
- type: nemo_gym.config_types.DatasetType,
- server_instance_configs: List[nemo_gym.config_types.ServerInstanceConfig],
)#
- collate_samples(
- config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
- server_instance_configs: List[nemo_gym.config_types.ServerInstanceConfig],
- dataset_type_to_aggregate_metrics: Dict[str, nemo_gym.train_data_utils.DatasetMetrics],
)#
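The _validate_aggregate_metrics docstring suggests comparing freshly computed metrics against a metrics file already on disk. One plausible reading is sketched below with hypothetical logic; it is not the actual implementation.

```python
import json
import pathlib
import tempfile


def validate_aggregate_metrics_sketch(aggregate_metrics_dict, metrics_fpath):
    """Illustrative reading of the docstring above: if a metrics file
    already exists and disagrees with the freshly computed metrics,
    return its path; otherwise return None. Assumed logic, not actual."""
    if metrics_fpath.exists():
        on_disk = json.loads(metrics_fpath.read_text())
        if on_disk != aggregate_metrics_dict:
            return metrics_fpath
    return None


# Usage: a conflicting on-disk file is flagged by returning its path.
with tempfile.TemporaryDirectory() as d:
    fpath = pathlib.Path(d) / "metrics.json"
    fpath.write_text(json.dumps({"number_of_examples": 10}))
    conflict = validate_aggregate_metrics_sketch({"number_of_examples": 99}, fpath)
```

Returning the offending path (rather than raising) lets the caller report every conflicting dataset before failing, which fits the batch-validation flow of this class.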