NvTritonExt

NVIDIA Triton Inference Server components. This extension is intended for use with Triton 2.49.0 on x86_64 and Triton 2.40.0 on JetPack 6.1.

Refer to the official NVIDIA Triton documentation for the support matrix and other details.

  • UUID: a3c95d1c-c06c-4a4e-a2f9-8d9078ab645c

  • Version: 0.5.0

  • Author: NVIDIA

  • License: Proprietary

Components

nvidia::triton::TritonServer

Triton inference server component using the Triton C API.

  • Component ID: 26228984-ffc4-4162-9af5-6e3008aa2982

  • Base Type: nvidia::gxf::Component

Parameters

log_level

Logging level for Triton.

Valid values:

0: Error

1: Warn

2: Info

3+: Verbose

  • Flags: GXF_PARAMETER_FLAGS_NONE (1 = default)

  • Type: GXF_PARAMETER_TYPE_UINT32


enable_strict_model_config

Enables strict model configuration, which requires a model configuration file to be present. If disabled, TensorRT, TensorFlow SavedModel, and ONNX models do not require a model configuration file, because Triton can derive all the required settings automatically.

  • Flags: GXF_PARAMETER_FLAGS_NONE (true = default)

  • Type: GXF_PARAMETER_TYPE_BOOL


min_compute_capability

Minimum compute capability required of the GPU. Refer to https://developer.nvidia.com/cuda-gpus.

  • Flags: GXF_PARAMETER_FLAGS_NONE (6.0 = default)

  • Type: GXF_PARAMETER_TYPE_FLOAT64
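The check implied by min_compute_capability can be sketched as follows. The helper name is hypothetical, and the choice to compare (major, minor) integer pairs rather than raw floats is an assumption made for illustration. This reference does not show how the component actually performs the comparison.

```python
def meets_min_compute_capability(major: int, minor: int, minimum: float = 6.0) -> bool:
    """Check a GPU's compute capability (major.minor) against a float minimum.

    Hypothetical helper: the float minimum (e.g. 6.0 or 8.7) is split into
    integer (major, minor) parts so the comparison is done on version tuples.
    """
    # One decimal place: CUDA compute capability minor versions are single digits.
    min_major_str, min_minor_str = f"{minimum:.1f}".split(".")
    return (major, minor) >= (int(min_major_str), int(min_minor_str))
```

For example, a GPU reporting compute capability 8.6 passes the default minimum of 6.0 but fails a minimum of 8.7.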


model_repository_paths

List of Triton model repository paths. Refer to the Triton Inference Server model repository documentation (triton-inference-server/server).

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING


tf_gpu_memory_fraction

Fraction of GPU memory to reserve for TensorFlow models.

  • Flags: GXF_PARAMETER_FLAGS_NONE (0.0 = default)

  • Type: GXF_PARAMETER_TYPE_FLOAT64


tf_disable_soft_placement_

Allows TensorFlow to use CPU operations when a GPU implementation is not available.

  • Flags: GXF_PARAMETER_FLAGS_NONE (true = default)

  • Type: GXF_PARAMETER_TYPE_BOOL


backend_directory_path

Path to Triton backend directory.

  • Flags: GXF_PARAMETER_FLAGS_NONE ("" = default)

  • Type: GXF_PARAMETER_TYPE_STRING


model_control_mode

Triton model control mode.

Valid values:

  • “none”: Load all models in the model repository at startup.

  • “explicit”: Load models only when they are explicitly requested, rather than at startup.

  • Flags: GXF_PARAMETER_FLAGS_NONE (“explicit” = default)

  • Type: GXF_PARAMETER_TYPE_STRING
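The difference between the two model_control_mode values can be sketched as a function that returns the models loaded at startup. The helper is hypothetical and only models the behavior described above. It does not call Triton.

```python
from typing import Iterable, Set

# The two values documented for model_control_mode.
VALID_MODEL_CONTROL_MODES = {"none", "explicit"}


def models_loaded_at_startup(mode: str, repository_models: Iterable[str]) -> Set[str]:
    """Return the set of models that would be loaded when the server starts.

    Hypothetical helper: "none" loads every model in the repository up front,
    while "explicit" defers loading until a model is requested.
    """
    if mode not in VALID_MODEL_CONTROL_MODES:
        raise ValueError(f"unsupported model_control_mode: {mode!r}")
    if mode == "none":
        return set(repository_models)
    return set()  # "explicit": nothing is loaded until it is needed
```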


backend_configs

Triton backend configurations, each in the format backend,setting=value. Refer to the backend-specific documentation: triton-inference-server/tensorflow_backend, triton-inference-server/python_backend.

  • Flags: GXF_PARAMETER_FLAGS_OPTIONAL

  • Type: GXF_PARAMETER_TYPE_STRING
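The backend,setting=value format splits naturally into three fields. These fields correspond to the backend name, setting, and value arguments of Triton's C API function TRITONSERVER_ServerOptionsSetBackendConfig. The parser below is a hypothetical sketch and is not the extension's code. It assumes each entry is a separate string, and the sample entry "tensorflow,version=2" is illustrative only.

```python
from typing import Iterable, List, Tuple


def parse_backend_configs(entries: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Split each 'backend,setting=value' entry into (backend, setting, value)."""
    parsed: List[Tuple[str, str, str]] = []
    for entry in entries:
        # Everything before the first comma is the backend name.
        backend, comma, setting_value = entry.partition(",")
        if not comma or not backend.strip():
            raise ValueError(f"expected 'backend,setting=value', got {entry!r}")
        # Everything before the first '=' is the setting; the rest is the value.
        setting, eq, value = setting_value.partition("=")
        if not eq or not setting.strip():
            raise ValueError(f"expected 'backend,setting=value', got {entry!r}")
        parsed.append((backend.strip(), setting.strip(), value.strip()))
    return parsed
```

Splitting on the first "=" only keeps any further "=" characters inside the value.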


nvidia::triton::TritonInferencerInterface

Helper component that provides an interface for Triton inferencing.

  • Component ID: 1661c015-6b1c-422d-a6f0-248cdc197b1a

  • Base Type: nvidia::gxf::Component

nvidia::triton::TritonInferencerImpl

Component that implements the TritonInferencerInterface to obtain inferences from the TritonServer component or from an external Triton instance.

  • Component ID: b84cf267-b223-4df5-ac82-752d9fae1014

  • Base Type: nvidia::triton::TritonInferencerInterface

Parameters

server

Triton server. This optional handle must be specified if the inference_mode of this component is Direct.

  • Flags: GXF_PARAMETER_FLAGS_OPTIONAL

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::triton::TritonServer


model_name

Name of the Triton model to run inference on.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING


model_version

Version of the model named by model_name to run inference on.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_INT64


max_batch_size

Maximum batch size for inference. This should match the max_batch_size value in the model's configuration in the Triton model repository.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_UINT32
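In a Triton model configuration, a max_batch_size greater than 0 means the first tensor dimension is the batch dimension. A value of 0 means the model does not support batching. The hypothetical check below applies that convention to an input shape. Whether this component enforces the limit in exactly this way is an assumption.

```python
from typing import Sequence


def check_batch_size(shape: Sequence[int], max_batch_size: int) -> None:
    """Raise ValueError if `shape` violates Triton's max_batch_size convention.

    With max_batch_size > 0, the leading dimension is the batch size and must
    lie in [1, max_batch_size]. With max_batch_size == 0, no dimension is
    treated as a batch dimension, so there is nothing to check.
    """
    if max_batch_size == 0:
        return
    if len(shape) == 0:
        raise ValueError("batched input needs a leading batch dimension")
    batch = shape[0]
    if not 1 <= batch <= max_batch_size:
        raise ValueError(f"batch size {batch} is outside [1, {max_batch_size}]")
```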


num_concurrent_requests

Maximum number of concurrent inference requests for this model version. This value sets the size of the pool of reusable requests.

  • Flags: GXF_PARAMETER_FLAGS_NONE (1 = default)

  • Type: GXF_PARAMETER_TYPE_UINT32
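A request pool sized by num_concurrent_requests can be sketched as a fixed set of slots. A slot is taken when a request is sent and returned when its response has been consumed. The RequestPool class below is hypothetical and only illustrates the idea. It is not the extension's internal data structure.

```python
import threading
from typing import List, Optional


class RequestPool:
    """Fixed pool of request slots sized by num_concurrent_requests (sketch)."""

    def __init__(self, num_concurrent_requests: int = 1) -> None:
        if num_concurrent_requests < 1:
            raise ValueError("num_concurrent_requests must be at least 1")
        self._size = num_concurrent_requests
        self._free: List[int] = list(range(num_concurrent_requests))
        self._lock = threading.Lock()

    def has_free_slot(self) -> bool:
        """True if another request could be started right now."""
        with self._lock:
            return len(self._free) > 0

    def acquire(self) -> Optional[int]:
        """Take a free slot, or return None if every slot is in flight."""
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, slot: int) -> None:
        """Return a slot once its response has been consumed."""
        with self._lock:
            if not 0 <= slot < self._size or slot in self._free:
                raise ValueError(f"slot {slot} is not currently acquired")
            self._free.append(slot)
```

With the default value of 1, at most one request is in flight, so each new request waits until the previous response has been retrieved.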


async_scheduling_term

Asynchronous scheduling term that determines when a response is ready.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::gxf::AsynchronousSchedulingTerm


inference_mode

Triton inferencing mode.

Valid values:

Direct: This mode requires a TritonServer component handle to be passed to the optional server parameter.

RemoteGrpc: This mode requires the optional server_endpoint parameter to point to an external Triton gRPC server URL.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING


server_endpoint

Server endpoint URL for an external Triton instance. This optional string must be specified if the inference_mode of this component is a remote mode, such as RemoteGrpc.

  • Flags: GXF_PARAMETER_FLAGS_OPTIONAL

  • Type: GXF_PARAMETER_TYPE_STRING
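The pairing rules between inference_mode, server, and server_endpoint can be expressed as a validation step. This is a hypothetical sketch. Where the real component performs this check, and what error it reports, are not described here. The endpoint "localhost:8001" in the usage note assumes Triton's default gRPC port.

```python
from typing import Optional


def validate_inference_mode(
    inference_mode: str,
    server: Optional[object] = None,
    server_endpoint: Optional[str] = None,
) -> None:
    """Raise ValueError if the optional parameters do not match the mode."""
    if inference_mode == "Direct":
        # Direct inference goes through an in-process TritonServer component.
        if server is None:
            raise ValueError("inference_mode 'Direct' requires the 'server' handle")
    elif inference_mode == "RemoteGrpc":
        # Remote inference goes to an external Triton instance over gRPC.
        if not server_endpoint:
            raise ValueError("inference_mode 'RemoteGrpc' requires 'server_endpoint'")
    else:
        raise ValueError(f"unknown inference_mode: {inference_mode!r}")
```

For example, validate_inference_mode("RemoteGrpc", server_endpoint="localhost:8001") passes, while validate_inference_mode("Direct") raises because no server handle was given.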


nvidia::triton::TritonInferenceRequest

Generic codelet that requests a Triton inference. It uses a handle to an inferencer implementation, such as TritonInferencerImpl, to interface with Triton.

  • Component ID: 34395920-232c-446f-b5b7-46f642ce84df

  • Base Type: nvidia::gxf::Codelet

Parameters

inferencer

Handle to Triton inference implementation. This is used to request an inference.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::triton::TritonInferencerInterface


rx

List of receivers to take input tensors.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::gxf::Receiver


input_tensor_names

Names of input tensors that exist in the ordered receivers in rx.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING


input_binding_names

Names of the input bindings corresponding to the inputs in Triton's model configuration, in the same order as input_tensor_names.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING
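Because input_tensor_names and input_binding_names are matched by position, the pairing can be sketched as an index-wise zip with sanity checks. The helper and the names "image", "mask", "input_0", and "input_1" are hypothetical.

```python
from typing import Dict, Sequence


def pair_inputs(
    input_tensor_names: Sequence[str], input_binding_names: Sequence[str]
) -> Dict[str, str]:
    """Map each GXF tensor name to the Triton input binding at the same index."""
    if len(input_tensor_names) != len(input_binding_names):
        raise ValueError("input_tensor_names and input_binding_names differ in length")
    # Duplicates would make the positional pairing ambiguous.
    if len(set(input_tensor_names)) != len(input_tensor_names):
        raise ValueError("input_tensor_names contains duplicates")
    if len(set(input_binding_names)) != len(input_binding_names):
        raise ValueError("input_binding_names contains duplicates")
    return dict(zip(input_tensor_names, input_binding_names))
```

For example, pair_inputs(["image", "mask"], ["input_0", "input_1"]) sends the tensor "image" to the binding "input_0" and "mask" to "input_1".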


nvidia::triton::TritonInferenceResponse

Generic codelet that obtains a response from a Triton inference. It uses a handle to an inferencer implementation, such as TritonInferencerImpl, to interface with Triton.

  • Component ID: 4dd957a7-aa55-4117-90d3-9a98e31ee176

  • Base Type: nvidia::gxf::Codelet

Parameters

inferencer

Handle to Triton inference implementation. This is used to obtain the inference response.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::triton::TritonInferencerInterface


output_tensor_names

Names of the output tensors, in the order in which they are retrieved from the model.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING


output_binding_names

Names of the output bindings in the model, in the same order as output_tensor_names.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_STRING
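On the response side, the same positional rule renames each model output from its binding name to the tensor name that is published on tx. The sketch below assumes the response is available as a mapping from binding name to data. That representation, the helper, and the names in the usage note are hypothetical.

```python
from typing import Any, List, Mapping, Sequence, Tuple


def collect_outputs(
    results: Mapping[str, Any],
    output_tensor_names: Sequence[str],
    output_binding_names: Sequence[str],
) -> List[Tuple[str, Any]]:
    """Rename model outputs from binding names to tensor names, keeping order."""
    if len(output_tensor_names) != len(output_binding_names):
        raise ValueError("output_tensor_names and output_binding_names differ in length")
    missing = [b for b in output_binding_names if b not in results]
    if missing:
        raise KeyError(f"model response is missing outputs: {missing}")
    return [(t, results[b]) for t, b in zip(output_tensor_names, output_binding_names)]
```

For example, with output_binding_names ["output_0", "output_1"] and output_tensor_names ["scores", "boxes"], the data for "output_0" is published as "scores".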


tx

Single transmitter to publish output tensors.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::gxf::Transmitter


nvidia::triton::TritonOptions

Generic struct that represents Triton inference options for model control and sequence control.

  • Component ID: 087696ed-229d-4199-876f-05b92d3887f0

nvidia::triton::TritonRequestReceptiveSchedulingTerm

Triton scheduling term that schedules the request codelet when the inferencer can accept a new request.

  • Component ID: f8602412-1242-4e43-9dbf-9c559d496b84

  • Base Type: nvidia::gxf::SchedulingTerm

Parameters

inferencer

Handle to Triton inference implementation. This is used to check whether a new request can be accepted.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_HANDLE

  • Handle Type: nvidia::triton::TritonInferencerInterface
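The decision this scheduling term makes can be sketched as a comparison between the number of requests in flight and the num_concurrent_requests limit of the inferencer. The Condition enum and the helper function are hypothetical stand-ins. They are not the GXF scheduling-condition types.

```python
from enum import Enum


class Condition(Enum):
    READY = "ready"  # the request codelet may tick and send a new request
    WAIT = "wait"    # hold the codelet until an in-flight request completes


def request_receptive_condition(in_flight: int, num_concurrent_requests: int) -> Condition:
    """READY while fewer requests are in flight than the inferencer allows."""
    return Condition.READY if in_flight < num_concurrent_requests else Condition.WAIT
```

Gating the request codelet this way keeps it from ticking when no request slot is free, instead of letting it fail to start a request.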