GXF Stream Sync#

GXF Stream Sync provides synchronization between two CUDA codelets without involving a CPU wait. When two CUDA codelets are used, the first codelet, which generates the data or launches the CUDA kernels, is called the signaler. The second codelet, which waits for that data or for the CUDA work submitted by the upstream codelet, is called the waiter. Signaling and waiting are based on a single synchronization object that both the signaler and the waiter use, and a CUDA stream is associated with each of them. The synchronization object provides the APIs for the signaling and waiting mechanisms.

Signaler#

Upon submitting all of its work on a specific CUDA stream, the signaler codelet calls the signalSemaphore API of the synchronization object. Internally, GXF Stream Sync uses a fence to track completion of the tasks submitted on that CUDA stream. Signaling happens asynchronously on the GPU, and the signalSemaphore API returns immediately; it uses the same CUDA stream on which the work was submitted. The signaler is also responsible for allocating the synchronization object and passing it to the waiter as part of the message entity.
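
As an illustration of this ordering, here is a minimal sketch; only the signalSemaphore call comes from the description above, while the kernel, stream, and handle names are placeholders:

// Sketch only: every name except signalSemaphore() is a placeholder.
// 1. Submit all of the signaler's work on its CUDA stream.
my_kernel<<<grid, block, 0, cuda_stream>>>(/* kernel arguments */);
// 2. Signal on that same stream. The call returns immediately; the fence
//    is signaled on the GPU once the work submitted above completes.
stream_sync->signalSemaphore();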

Waiter#

The waiter codelet issues a call to waitSemaphore and then submits its own work, either on the same CUDA stream on which the signaler codelet submitted its work or on another CUDA stream. GXF Stream Sync waits until the fence is signaled, which ensures that the work submitted by the signaler codelet is complete. Waiting happens asynchronously on the GPU, and the waitSemaphore API returns immediately.
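
The waiter-side counterpart, again as a minimal sketch; only the waitSemaphore call comes from the description above, and the kernel, stream, and handle names are placeholders:

// Sketch only: every name except waitSemaphore() is a placeholder.
// 1. Enqueue the wait first. The call returns immediately; the GPU holds
//    back work queued behind it on this stream until the signaler's fence
//    is signaled.
stream_sync->waitSemaphore();
// 2. Submit the waiter's work on the same stream (or on another stream
//    that the synchronization object was configured with).
consumer_kernel<<<grid, block, 0, cuda_stream>>>(/* kernel arguments */);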

The figure below depicts the signaler and waiter concept.

Figure: Synchronization across two CUDA codelets

GxfStreamExtension#

Extension for synchronization across two CUDA modules without a CPU wait.

  • UUID: 918e6ad7-8e1a-43aa-9b49-251d4b6072b0

  • Version: 0.5.0

  • Author: NVIDIA

  • License: LICENSE

Components#

nvidia::gxf::GxfStreamSync#

Component that provides synchronization across two CUDA codelets without involving a CPU wait. It holds a synchronization object that can be used by the signaler and the waiter.

  • Component ID: 0011bee7-5d53-43ee-aafa-61485a436bc4

  • Base Type: nvidia::gxf::Component

  • Defined in: gxf/stream/stream_nvscisync.hpp

Parameters#

signaler

Parameter indicating the type of signaler.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_INT32


waiter

Parameter indicating the type of waiter.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_INT32


signaler_device_id

Device ID on which the signaler is running.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_INT32


waiter_device_id

Device ID on which the waiter is running.

  • Flags: GXF_PARAMETER_FLAGS_NONE

  • Type: GXF_PARAMETER_TYPE_INT32
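
For reference, a component entry that sets all four parameters could look like the following sketch; the signaler and waiter values are taken from the application example below, while the device IDs shown here are illustrative assumptions:

- name: stream_sync
  type: nvidia::gxf::StreamSync
  parameters:
    signaler: 1            # CUDA signaler
    waiter: 3              # CUDA waiter
    signaler_device_id: 0  # assumed: GPU on which the signaler runs
    waiter_device_id: 0    # assumed: GPU on which the waiter runs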

GXF Stream Sync Workflow#

CUDA-to-CUDA codelet communication happens with the help of a message entity that carries the StreamSync handle. The steps on each side are listed below, each followed by an illustrative code sketch.

At the Signaler Codelet#

  • Add the StreamSync handle to the message.

  • Get the StreamSync handle.

  • Initialize the StreamSync object.

  • Allocate the sync object based on the signaler and waiter types.

  • Set the CUDA stream for the signaler and the waiter.

  • Submit the signaler codelet's work on the CUDA stream.

  • Signal the semaphore (asynchronous call).

  • Publish the message.
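
A C++ sketch of these signaler-side steps inside a codelet's tick() follows. Apart from signalSemaphore, the StreamSync method names (allocate_sync_object, setCudaStream), the helper add_stream_sync_to_message, and the member names are assumptions made for illustration; the actual API is defined in gxf/stream/stream_nvscisync.hpp.

// Illustrative sketch of the signaler-side workflow. Only signalSemaphore()
// is taken from the step list above; the other StreamSync method names,
// helpers, and members are assumptions.
gxf_result_t tick() override {
  // Create the outgoing message and get the StreamSync handle (here taken
  // from the codelet's 'stream_sync' parameter, as in the YAML example).
  auto message = nvidia::gxf::Entity::New(context());
  if (!message) { return GXF_FAILURE; }
  nvidia::gxf::Handle<nvidia::gxf::StreamSync> sync = stream_sync_.get();

  // Attach the handle to the message so the waiter can locate the same
  // synchronization object (hypothetical helper).
  add_stream_sync_to_message(message.value(), sync);

  // Allocate the sync object for the configured signaler/waiter pair and
  // set the CUDA stream (assumed method names).
  sync->allocate_sync_object();
  sync->setCudaStream(signaler_stream_);

  // Submit the signaler codelet's work on that CUDA stream.
  generate_tensors(signaler_stream_);  // placeholder for the real workload

  // Signal asynchronously: returns immediately; the fence fires on the GPU
  // once the work submitted above has completed.
  sync->signalSemaphore();

  // Publish the message downstream.
  tx_->publish(message.value());
  return GXF_SUCCESS;
}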

At the Waiter Codelet#

  • Receive the message.

  • Find the StreamSync handle.

  • Wait on the semaphore (asynchronous call).

  • Submit the waiter codelet's work on the CUDA stream.

  • The wait happens asynchronously on the GPU, so the waiter's work executes only after the signaler's fence is signaled.
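
The corresponding waiter-side sketch, under the same caveats: only waitSemaphore comes from the step list above, and the helper find_stream_sync plus the member names are assumptions made for illustration.

// Illustrative sketch of the waiter-side workflow. Only waitSemaphore() is
// taken from the step list above; the helper and member names are assumptions.
gxf_result_t tick() override {
  // Receive the message and locate the StreamSync handle inside it.
  auto message = rx_->receive();
  if (!message) { return GXF_FAILURE; }
  nvidia::gxf::Handle<nvidia::gxf::StreamSync> sync =
      find_stream_sync(message.value());  // hypothetical helper

  // Enqueue the wait first: the call returns immediately, and the GPU holds
  // back work queued behind it until the signaler's fence is signaled.
  sync->waitSemaphore();

  // Submit the waiter codelet's work on its CUDA stream. It executes only
  // after the signaler's work has completed on the GPU.
  run_dot_product(message.value(), waiter_stream_);  // placeholder workload

  tx_->publish(message.value());
  return GXF_SUCCESS;
}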

Example#

The example below shows how to use GXF Stream Sync in an application.

YAML file#

---
name: global
components:
- name: cuda_pool  # assumed: pool referenced by the generator below; not present in the original listing
  type: nvidia::gxf::BlockMemoryPool
  parameters:
    storage_type: 1 # cuda
    block_size: 16384
    num_blocks: 10
- name: cuda_dot_pool
  type: nvidia::gxf::BlockMemoryPool
  parameters:
    storage_type: 1 # cuda
    block_size: 16384
    num_blocks: 10
- name: stream_sync_cuda_to_cuda
  type: nvidia::gxf::StreamSync
  parameters:
    signaler: 1 # CUDA signaler
    waiter: 3   # CUDA waiter
---
name: stream_tensor_generator
components:
- name: cuda_out
  type: nvidia::gxf::DoubleBufferTransmitter
- name: generator
  type: nvidia::gxf::stream::test::StreamTensorGeneratorNew
  parameters:
    cuda_tx: cuda_out
    cuda_tensor_pool: global/cuda_pool
    stream_sync: global/stream_sync_cuda_to_cuda
- type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
  parameters:
    transmitter: cuda_out
    min_size: 1
- type: nvidia::gxf::CountSchedulingTerm
  parameters:
    count: 50
---
components:
- type: nvidia::gxf::Connection
  parameters:
    source: stream_tensor_generator/cuda_out
    target: cuda_dotproduct/rx
---
name: cuda_dotproduct
components:
- name: rx
  type: nvidia::gxf::DoubleBufferReceiver
  parameters:
    capacity: 2
- name: tx
  type: nvidia::gxf::DoubleBufferTransmitter
- type: nvidia::gxf::MessageAvailableSchedulingTerm
  parameters:
    receiver: rx
    min_size: 1
- type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
  parameters:
    transmitter: tx
    min_size: 1
- type: nvidia::gxf::stream::test::CublasDotProductNew
  parameters:
    rx: rx
    tx: tx
    tensor_pool: global/cuda_dot_pool
---