Performance#

The DeepStream application is benchmarked across various NVIDIA TAO Toolkit and open source models. The measured performance is the end-to-end performance of the entire video analytics application, including video capture and decode, pre-processing, batching, inference, and post-processing to generate metadata. Output rendering is turned off to achieve peak inference performance. For information on disabling output rendering, see the DeepStream Reference Application - deepstream-app chapter.

TAO Pre-trained models#

TAO Toolkit provides a set of pretrained models, listed in the tables below. If these models satisfy your requirements, start with one of them. They can be used for various smart city and smart space applications. If your application is beyond the scope of these models, you can re-train one of the popular model architectures using TAO Toolkit. The tables below show the end-to-end performance of highly accurate pre-trained models from TAO Toolkit. All models are available on NGC. These models are natively integrated with DeepStream, and the instructions to run them are in /opt/nvidia/deepstream/deepstream/samples/configs/tao_pretrained_models/. The following numbers were obtained with sample_1080p_h265.mp4.

Performance jetson- pretrained models#

				Jetson Thor
Model Arch	Inference resolution	Precision	Tracker	GPU (FPS)
C-RADIO-B	3*224*224	FP16	No Tracker	1258
C-RADIO-L	3*224*224	FP16	No Tracker	547
NV-DinoV2-L	3*224*224	FP16	No Tracker	431
RT-DETR	3*640*640	FP16	No Tracker	195
RT-DETR	3*640*640	FP16	NvDCF Tracker	171
RT-DETR	3*640*640	FP16	MV3DT Tracker	90
Peoplenet 2.6.3	3*640*640	FP16	MV3DT Tracker	363
Grounding-DINO	3*544*960	FP16	No Tracker	23
TrafficCamnet Transformer Lite	3*544*960	FP16	NvDCF Tracker	144
SegFormer	3*640*640	FP16	No Tracker	253
Mask2Former + SWIN	3*800*800	FP16	No Tracker	26
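As a reading aid for the tracker rows, the relative throughput cost of enabling a tracker can be worked out from the published numbers. The sketch below uses the RT-DETR rows of the Jetson Thor table (195 FPS without a tracker, 171 FPS with NvDCF, 90 FPS with MV3DT). The helper name `tracker_overhead_percent` is hypothetical and not part of DeepStream.

```python
def tracker_overhead_percent(fps_no_tracker: float, fps_with_tracker: float) -> float:
    """Return the throughput drop, in percent, caused by enabling a tracker."""
    return (fps_no_tracker - fps_with_tracker) / fps_no_tracker * 100.0


# RT-DETR on Jetson Thor, values taken from the table above.
nvdcf_drop = tracker_overhead_percent(195, 171)
mv3dt_drop = tracker_overhead_percent(195, 90)

print(f"NvDCF tracker overhead: {nvdcf_drop:.1f}%")
print(f"MV3DT tracker overhead: {mv3dt_drop:.1f}%")
```

For these rows the drop is roughly 12% with NvDCF and roughly 54% with MV3DT. The same calculation applies to the other tables.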

Performance dgx-spark - pretrained models#
				DGX Spark
Model Arch	Inference resolution	Precision	Tracker	GPU (FPS)
C-RADIO-B	3*224*224	FP16	No Tracker	969
C-RADIO-L	3*224*224	FP16	No Tracker	337
NV-DinoV2-L	3*224*224	FP16	No Tracker	239
RT-DETR	3*640*640	FP16	No Tracker	160
RT-DETR	3*640*640	FP16	NvDCF Tracker	153
RT-DETR	3*640*640	FP16	MV3DT Tracker	89
Peoplenet 2.6.3	3*640*640	FP16	MV3DT Tracker	350
Grounding-DINO	3*544*960	FP16	No Tracker	21
TrafficCamnet Transformer Lite	3*544*960	FP16	NvDCF Tracker	139
SegFormer	3*640*640	FP16	No Tracker	201
Mask2Former + SWIN	3*800*800	FP16	No Tracker	29

Performance dgpu- pretrained models#
					RTX 4500	PRO 6000 WS	PRO 6000 SE	L40s	B200	GB200
Model Arch	Inference resolution	Precision	Inference Engine	Tracker	GPU (FPS)	GPU (FPS)	GPU (FPS)	GPU (FPS)	GPU (FPS)	GPU (FPS)
C-RADIO-B	3*224*224	FP16	TRT	No Tracker	2050	4131	3754	2989	8586	9531
C-RADIO-L	3*224*224	FP16	TRT	No Tracker	647	1497	1303	1097	3664	4077
NV-DinoV2-L	3*224*224	FP16	TRT	No Tracker	533	1292	1173	873	3626	3594
RT-DETR	3*640*640	FP16	TRT	No Tracker	345	659	978	649	1328	1405
RT-DETR	3*640*640	FP16	TRT	NvDCF Tracker	316	640	955	615	1298	1395
RT-DETR	3*640*640	FP16	TRT	MV3DT Tracker	257	645	537	237	932	1010
Peoplenet 2.6.3	3*640*640	FP16	TRT	MV3DT Tracker	595	1225	763	724	2971	3224
Grounding-DINO	3*544*960	FP16	TRT	No Tracker	63	103	102	101	219	230
TrafficCamNet Transformer Lite	3*544*960	FP16	TRT	NvDCF Tracker	369	650	918	665	1138	1180
SegFormer	3*640*640	FP16	TRT	No Tracker	435	1236	1061	1008	1320	1348
Mask2Former + SWIN	3*800*800	FP16	TRT	No Tracker	70	75	99	76	118	188
MaskGroundingDINO V2	3*544*960	FP16	TRT	No Tracker	63	102	101	102	220	230

TAO Fine-tuned models#

The TAO Toolkit Finetuning Microservice provides a new interface for accelerating model training and automating model fine-tuning flows. The fine-tuned models can be used by DeepStream SDK 9.0 out of the box via Inference Builder. The tables in the TAO Pre-trained models section show the end-to-end performance of fine-tuned models from TAO Toolkit.

DeepStream reference model and tracker#

DeepStream SDK ships with a reference DetectNet_v2-ResNet10 model and three ResNet18 classifier models. Detailed instructions to run these models with DeepStream are provided in the next section. DeepStream provides four reference trackers: IOU, NvSORT, NvDeepSORT, and NvDCF. For more information about trackers, see the Gst-nvtracker section.

Configuration File Settings for Performance Measurement#

To achieve peak performance, make sure the devices are properly cooled. For Turing and Ampere GPUs, make sure you use a server that meets the thermal and airflow requirements. Along with the hardware setup, a few other options in the config file need to be set to achieve the published performance. Make the required changes to one of the config files from DeepStream SDK to replicate the peak performance.

Turn off output rendering, OSD, and tiler

OSD (on-screen display) draws bounding boxes, masks, and labels on the screen. If output rendering is disabled, drawing bounding boxes is not required unless the output needs to be streamed over RTSP or saved to disk. The tiler displays the output in an NxM tiled grid and is not needed if rendering is disabled. Output rendering, OSD, and the tiler each consume some compute resources, which can reduce inference performance.

To disable OSD, tiled display and output sink, make the following changes in the DeepStream config file.
To disable OSD, change enable to 0
[osd]
enable=0
To disable tiling, change enable to 0
[tiled-display]
enable=0
To turn off output rendering, change the sink type to fakesink.
[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
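
When several config files need the same three changes, they can be applied programmatically. The following is a minimal sketch, assuming the deepstream-app config is plain `key=value` text under `[section]` headers that Python's standard `configparser` can read. The function name `disable_rendering` is hypothetical, and rewriting the file this way drops any comments, so treat it as an illustration and not as a supported tool.

```python
import configparser
import io


def disable_rendering(config_text: str) -> str:
    """Return config text with OSD and tiler disabled and sink0 set to fakesink."""
    parser = configparser.ConfigParser(
        strict=False,            # tolerate repeated sections or keys
        interpolation=None,      # keep '%' characters in values untouched
        delimiters=("=",),
        comment_prefixes=("#",),
    )
    parser.optionxform = str     # preserve the original key spelling
    parser.read_string(config_text)

    # Disable on-screen display and tiled display.
    for section in ("osd", "tiled-display"):
        if not parser.has_section(section):
            parser.add_section(section)
        parser[section]["enable"] = "0"

    # Keep the sink enabled, but make it a fakesink (type=1) with sync off.
    if not parser.has_section("sink0"):
        parser.add_section("sink0")
    parser["sink0"].update({"enable": "1", "type": "1", "sync": "0"})

    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


sample = (
    "[osd]\nenable=1\n\n"
    "[tiled-display]\nenable=1\nrows=2\n\n"
    "[sink0]\nenable=1\ntype=2\nsync=1\n"
)
patched = disable_rendering(sample)
print(patched)
```

Unrelated keys, such as `rows` in the sample, are carried over unchanged.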

Use the max_perf setting for tracker

DeepStream SDK 6.2 and later include a reference low-level tracker library, NvMultiObjectTracker, along with a set of configuration files:

config_tracker_IOU.yml

config_tracker_NvDCF_max_perf.yml

config_tracker_NvDCF_perf.yml

config_tracker_NvDCF_accuracy.yml

To achieve the peak performance shown in the tables above when using the NvDCF tracker, make sure the max_perf configuration is used with the video frame resolution matched to that of the inference module. For example, if the inference module uses 480x272 resolution, it is recommended to use a similarly low resolution (e.g., 480x288) for the tracker module, as follows:

[tracker]
enable=1
tracker-width=480
tracker-height=288
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
#ll-config-file=/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/config_tracker_IOU.yml
ll-config-file=/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/config_tracker_NvDCF_max_perf.yml
gpu-id=0
enable-batch-process=1
display-tracking-id=1

When the IOU tracker is used, the video frame resolution doesn’t matter, and the default config_tracker_IOU.yml can be used.
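
The 480x272 to 480x288 example above is consistent with rounding each inference dimension up to the next multiple of 32. That rule is an assumption inferred from the example, not something this section states, so confirm the constraint in the Gst-nvtracker documentation before relying on it. Under that assumption, a hypothetical helper `tracker_resolution` could derive `tracker-width` and `tracker-height`:

```python
def tracker_resolution(infer_width: int, infer_height: int, multiple: int = 32):
    """Round each inference dimension up to the nearest multiple (assumed to be 32)."""
    def round_up(value: int) -> int:
        # Ceiling division followed by multiplication.
        return -(-value // multiple) * multiple

    return round_up(infer_width), round_up(infer_height)


width, height = tracker_resolution(480, 272)
print(f"tracker-width={width}")
print(f"tracker-height={height}")
```

For a 480x272 inference resolution this yields 480x288, matching the values in the config snippet above.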

The cudaDeviceScheduleBlockingSync flag is set by default on dGPU

On dGPU only, the cudaDeviceScheduleBlockingSync flag is set by default on the GPU where the DeepStream pipeline runs. In general, for pipelines with multiple streams, this reduces CPU utilization without much effect on performance.

When sub-batches are enabled in the tracker, setting the cudaDeviceScheduleBlockingSync flag significantly reduces CPU utilization, with similar performance or a negligible dip.

When the environment variable NVDS_DISABLE_CUDADEV_BLOCKINGSYNC is set to 1, the cudaDeviceScheduleBlockingSync flag is not set.

There is a remote possibility that setting the cudaDeviceScheduleBlockingSync flag negatively affects pipeline performance when the pipeline already runs with GPU utilization close to 100%. Hence, if a DeepStream pipeline is GPU bound but its GPU utilization does not reach close to 100%, try setting NVDS_DISABLE_CUDADEV_BLOCKINGSYNC to 1 and check whether pipeline performance improves.
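
One way to run that experiment without changing the shell environment is to launch the pipeline from a small script that sets the variable only for the child process. This is a sketch: the helper names are hypothetical, and it assumes `deepstream-app` is on `PATH` and is started with `-c <config file>`.

```python
import os
import subprocess


def env_without_blocking_sync(base_env=None):
    """Return a copy of the environment with NVDS_DISABLE_CUDADEV_BLOCKINGSYNC=1."""
    env = dict(os.environ if base_env is None else base_env)
    env["NVDS_DISABLE_CUDADEV_BLOCKINGSYNC"] = "1"
    return env


def run_pipeline(config_path: str) -> subprocess.CompletedProcess:
    """Run deepstream-app with the blocking-sync flag disabled for this run only."""
    return subprocess.run(
        ["deepstream-app", "-c", config_path],
        env=env_without_blocking_sync(),
        check=False,
    )
```

Calling `run_pipeline` with the path to a config file, once with this helper and once with the default environment, allows the FPS and CPU utilization of the two runs to be compared.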