UCX Extension
Description
The UCX extension uses the Unified Communication X (UCX) library to disaggregate a graph in the GXF framework. It distributes a graph across multiple hosts so that the graph can use distributed GPU resources. UCX is an open-source library that accelerates data transfer over high-performance networks. It can use GPUDirect RDMA to reduce network latency and increase throughput for traffic between distributed GPUs. As a result, users of this extension can combine the processing power of multiple GPUs on different hosts, which can improve the speed and efficiency of workflows. For more details about UCX, visit https://openucx.org.
For Example
The following diagram illustrates a disaggregated graph composed of two tensor generators and a tensor comparator. The tensor comparator compares the outputs produced by the tensor generators. The UCX extension lets each entity run on a distinct host.
To do this, every graph that uses the UCX extension needs a UcxContext component. This component holds the UCP context, manages all connections and data transfers, and ensures that all operations are closed properly at deinitialization. When setting up your graph, replace each entity’s standard transmitter and receiver with the UcxTransmitter and UcxReceiver components. Be sure to configure all their parameters, including the IP address and port, so that the connection can be established.
Currently, UCX only supports sending a message whose data resides in a single type of memory (host or device). This is a limitation of UCX, not of the extension.
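The memory-type restriction can be checked before a message is handed to a transmitter. The following is a minimal Python sketch of such a check, not part of the GXF or UCX API: `MemoryType` and `uniform_memory_type` are hypothetical names introduced only for illustration, and the sketch assumes the caller already knows where each component of the message is stored.

```python
from enum import Enum


class MemoryType(Enum):
    """Hypothetical labels for where a component's data lives."""
    HOST = "host"
    DEVICE = "device"


def uniform_memory_type(component_types):
    """Return the single memory type shared by all components of a message.

    Raises ValueError if the list is empty or mixes host and device memory.
    """
    kinds = set(component_types)
    if len(kinds) != 1:
        found = sorted(kind.value for kind in kinds)
        raise ValueError(f"expected exactly one memory type, got {found}")
    return kinds.pop()
```

A message made only of device tensors passes the check, while a message that mixes a host tensor with a device tensor is rejected before any send is attempted.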
UUID: 525f8a1a-dfb5-426b-8ddb-00c3ac839994
Version: 0.7.0
Author: NVIDIA
License: LICENSE
Requirements
NVIDIA ConnectX-6 Dx NIC or later.
For more information on installing and configuring NICs, see: https://docs.nvidia.com/networking/display/ConnectX6VPI/Introduction
Mellanox Open Fabrics Enterprise Distribution (MLNX_OFED) - version 5.5 or later, see https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
For installation instructions, see https://docs.nvidia.com/networking/display/MLNXOFEDv551032/Installing+MLNX_OFED
If installing the Mellanox OFED within a container:
Make sure to install the kernel drivers in the host OS by passing the --all flag to the mlnxofedinstall script.
In the container, you can only install the user-space libraries, by passing the --user-space-only flag to the mlnxofedinstall script.
UCX - version 1.13 or later. UCX must either be compiled with CUDA support or be installed from the CUDA-enabled UCX packages published in the git repository, see https://github.com/openucx/ucx/releases
For installation instructions, follow the Release build instructions from here: https://github.com/openucx/ucx#release-builds.
Note that the UCX library should be compiled with CUDA support as follows:
$ ./contrib/configure-release --prefix=/install/path --enable-examples --with-java=no --with-cuda=/path/to/cuda --enable-mt
Components
UcxContext
UcxContext is the central component of the GXF UCX extension. It initializes the UCX context, runs the listeners, and handles connection requests and incoming data for UcxReceivers. It also sets up the connections and resources for UcxTransmitters. All connections, for both UcxReceivers and UcxTransmitters, are managed within UcxContext. When the graph completes, UcxContext closes all connections and releases all resources.
Component ID: 755d20a5-d794-467d-a86c-290eb2c32052
Base Type: nvidia::gxf::NetworkContext
Defined in: extensions/ucx/ucx_context.hpp
Parameters
serializer
The entity serializer used by the component. Should be of type UcxEntitySerializer.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: nvidia::gxf::EntitySerializer
reconnect
Try to reconnect if a connection is closed during a run. A UcxReceiver waits for a new connect request in order to establish a new connection. A UcxTransmitter sends a new connect request to the server in order to establish a new connection.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_BOOL
Default: true
Optional GPU device resource
Optional resource for GPU device.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: nvidia::gxf::GPUDevice
UcxTransmitter
Transmitter component for the GXF UCX extension. This component replaces the standard transmitter of an entity. At the initialization stage, it sends a connect request to establish the connection. When the Network Router executes the SyncOutbox function, it invokes the sync_io method of the UcxTransmitter. This method, in turn, transmits the message using the UCX Active Message Rendezvous protocol.
Component ID: 58165d03-78b7-4696-b200-71621f90aee7
Base Type: nvidia::gxf::Transmitter
Defined in: extensions/ucx/ucx_transmitter.hpp
Parameters
capacity
Capacity of the transmitter’s queue.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_UINT64
policy
Queue’s policy for handling data. Valid values: 0: pop, 1: reject, 2: fault
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_UINT64
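The three policy values describe what happens when a message arrives at a queue that is already at capacity. The following Python sketch is a toy model, not the GXF implementation. It assumes the usual reading of the names: pop discards the oldest queued item to make room, reject drops the incoming item, and fault reports an error. `BoundedQueue` and the `POLICY_*` constants are hypothetical names used only here.

```python
from collections import deque

# Policy codes as listed in the parameter description.
POLICY_POP, POLICY_REJECT, POLICY_FAULT = 0, 1, 2


class BoundedQueue:
    """Toy fixed-capacity queue that models the three policies."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy
        self.items = deque()

    def push(self, item):
        """Return True if the item was stored, False if it was rejected."""
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True
        if self.policy == POLICY_POP:
            self.items.popleft()  # assumed behavior: discard the oldest item
            self.items.append(item)
            return True
        if self.policy == POLICY_REJECT:
            return False  # assumed behavior: drop the incoming item
        if self.policy == POLICY_FAULT:
            raise OverflowError("queue is full")  # assumed behavior: report an error
        raise ValueError(f"unknown policy {self.policy}")
```

With a capacity of 2 and policy 0, pushing 1, 2, 3 leaves the queue holding 2 and 3. With policy 1 the queue keeps 1 and 2, and with policy 2 the third push raises an error.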
receiver_address
Receiver address to connect to.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_STRING
port
Port of the receiver.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32
buffer
Serialization Buffer to hold serialized data.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: Handle<UcxSerializationBuffer>
maximum_connection_retries
Maximum retries for connection establishment.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32
gpu_device
Optional GPU device resource.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: Handle<GPUDevice>
UcxReceiver
Receiver component for the GXF UCX extension. This component replaces the standard receiver of an entity. When an entity sends a message to this receiver, the UcxContext receives the message header, prompting the router to execute the SyncInbox function. The SyncInbox function then triggers the sync_io method of the UcxReceiver. This method uses the UCX Active Message Rendezvous protocol to receive the data content of the message.
Component ID: e961132b-45d5-48b8-ac5d-2bb1a4a42279
Base Type: nvidia::gxf::Receiver
Defined in: extensions/ucx/ucx_receiver.hpp
Parameters
capacity
Capacity of the receiver’s queue.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_UINT64
Default: 10
policy
Queue’s policy for handling data. 0: pop, 1: reject, 2: fault
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_UINT64
Default: 2
address
Listener address to receive data.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_STRING
Default: “0.0.0.0”
port
Listener’s port for receiving data.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32
Default: 13337
buffer
Serialization Buffer to hold serialized data.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: UcxSerializationBuffer
Optional GPU device resource
Optional resource for GPU device.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: nvidia::gxf::GPUDevice
UcxComponentSerializer
Serializer for the components in the GXF UCX extension. It currently supports serializing Timestamp, Tensor, Video Buffer, Audio Buffer, and integer components. It is only valid for sharing data between devices with the same endianness.
Component ID: 64994305-4260-4f5c-ac5f-69da6dd6cfa5
Base Type: nvidia::gxf::ComponentSerializer
Defined in: extensions/ucx/ucx_component_serializer.hpp
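The same-endianness requirement can be illustrated without the extension itself. The sketch below uses only Python's standard `struct` module and assumes, for illustration, that a value is copied between hosts as raw bytes with no byte-order conversion. It does not reproduce the UcxComponentSerializer wire format.

```python
import struct

value = 1

# Bytes as a little-endian host would write a 32-bit unsigned integer.
little = struct.pack("<I", value)

# Reading those bytes with the same byte order recovers the value.
same_order = struct.unpack("<I", little)[0]

# Reading them as big-endian silently produces a different number.
misread = struct.unpack(">I", little)[0]

print(same_order, misread)  # prints: 1 16777216
```

The bytes on the wire are identical in both reads. Only the interpretation differs, which is why both ends must agree on byte order.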
Parameters
allocator
Memory allocator for tensor components.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: nvidia::gxf::Allocator
UcxEntitySerializer
Serializer for the entities in the GXF UCX extension.
Component ID: 14997aa4-4a01-4cd4-86ab-687f85a13f10
Base Type: nvidia::gxf::EntitySerializer
Defined in: extensions/ucx/ucx_entity_serializer.hpp
Parameters
component_serializers
List of serializers for serializing and deserializing components.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: FixedVector<nvidia::gxf::Handle<nvidia::gxf::ComponentSerializer>, kMaxTempComponents>
verbose_warning
Whether to print verbose warnings.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_BOOL
Default: true
UcxSerializationBuffer
Serialization buffer for the GXF UCX extension.
Component ID: 1d9fcaf7-1db1-4992-93ec-714979f7d78d
Base Type: nvidia::gxf::Endpoint
Defined in: extensions/ucx/ucx_serialization_buffer.hpp
Parameters
allocator
Memory allocator for tensor components.
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_HANDLE
Handle Type: nvidia::gxf::Handle<nvidia::gxf::Allocator>
buffer_size
Size of the buffer in bytes (4kB by default).
Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_SIZE
Default: 4096 (4kB)
Example
This section shows how to use the UCX extension in a simple graph. The graph consists of two subgraphs connected through the UCX extension. The server side and the client side are each configured in their own YAML file, shown below.
Server side - test_ping_rx.yaml file:
name: rx
components:
- name: allocator
  type: nvidia::gxf::test::MockAllocator
- name: serialization_buffer
  type: nvidia::gxf::UcxSerializationBuffer
  parameters:
    allocator: allocator
- name: signal
  type: nvidia::gxf::UcxReceiver
  parameters:
    address: 5.5.5.5
    port: 13337
    buffer: serialization_buffer
- type: nvidia::gxf::MessageAvailableSchedulingTerm
  parameters:
    receiver: signal
    min_size: 1
- type: nvidia::gxf::PingRx
  parameters:
    signal: signal
- type: nvidia::gxf::test::StepCount
  parameters:
    expected_count: 10
- type: nvidia::gxf::CountSchedulingTerm
  parameters:
    count: 10
---
name: ucx
components:
- name: allocator
  type: nvidia::gxf::test::MockAllocator
- name: component_serializer
  type: nvidia::gxf::UcxComponentSerializer
  parameters:
    allocator: allocator
- name: entity_serializer
  type: nvidia::gxf::UcxEntitySerializer
  parameters:
    component_serializers: [ component_serializer ]
- name: ucx_context
  type: nvidia::gxf::UcxContext
  parameters:
    serializer: entity_serializer
---
name: scheduler
components:
- name: clock
  type: nvidia::gxf::RealtimeClock
- type: nvidia::gxf::GreedyScheduler
  parameters:
    max_duration_ms: 1000000
    clock: clock
    stop_on_deadlock: False
---
name: gpu_resource_entity_0
components:
- type: nvidia::gxf::GPUDevice
  name: gpu_resource_0
  parameters:
    dev_id: 0
---
EntityGroups:
- name: entity_group_0
  target:
  - "rx"
  - "ucx"
  - "gpu_resource_entity_0"
Client side - test_ping_tx.yaml file:
name: tx
components:
- name: allocator
  type: nvidia::gxf::test::MockAllocator
- name: serialization_buffer
  type: nvidia::gxf::UcxSerializationBuffer
  parameters:
    allocator: allocator
- name: signal
  type: nvidia::gxf::UcxTransmitter
  parameters:
    receiver_address: 5.5.5.5
    port: 13337
    buffer: serialization_buffer
- type: nvidia::gxf::PingTx
  parameters:
    signal: signal
- type: nvidia::gxf::CountSchedulingTerm
  parameters:
    count: 10
- type: nvidia::gxf::test::StepCount
  parameters:
    expected_count: 10
---
name: ucx
components:
- name: allocator
  type: nvidia::gxf::test::MockAllocator
- name: component_serializer
  type: nvidia::gxf::UcxComponentSerializer
  parameters:
    allocator: allocator
- name: entity_serializer
  type: nvidia::gxf::UcxEntitySerializer
  parameters:
    component_serializers: [ component_serializer ]
- name: ucx_context
  type: nvidia::gxf::UcxContext
  parameters:
    serializer: entity_serializer
---
name: scheduler
components:
- name: clock
  type: nvidia::gxf::RealtimeClock
- type: nvidia::gxf::GreedyScheduler
  parameters:
    stop_on_deadlock: false
    max_duration_ms: 1000000
    clock: clock
---
name: gpu_resource_entity_0
components:
- type: nvidia::gxf::GPUDevice
  name: gpu_resource_0
  parameters:
    dev_id: 0
---
EntityGroups:
- name: entity_group_0
  target:
  - "tx"
  - "ucx"
  - "gpu_resource_entity_0"
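For the two subgraphs to connect, the UcxTransmitter parameters in the client file must point at the address and port on which the UcxReceiver in the server file listens. The following Python sketch expresses that consistency check on plain dictionaries that mirror the parameter blocks of the two files. `endpoints_match` is a hypothetical helper, not part of GXF. It treats a listener address of 0.0.0.0 (the documented UcxReceiver default) as matching any target address, following the usual socket convention of listening on all local interfaces.

```python
def endpoints_match(receiver_params, transmitter_params):
    """Check that a transmitter's target matches a receiver's listener."""
    same_port = receiver_params["port"] == transmitter_params["port"]
    # Fall back to the documented UcxReceiver default when no address is set.
    listen_address = receiver_params.get("address", "0.0.0.0")
    # 0.0.0.0 listens on all local interfaces, so any target address can match.
    address_ok = listen_address in ("0.0.0.0", transmitter_params["receiver_address"])
    return same_port and address_ok


# Values copied from test_ping_rx.yaml and test_ping_tx.yaml.
rx = {"address": "5.5.5.5", "port": 13337}
tx = {"receiver_address": "5.5.5.5", "port": 13337}
print(endpoints_match(rx, tx))  # prints: True
```

A check like this only confirms that the two configurations agree. It does not verify that the address is reachable over the network.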