============================
UCX Extension
============================

Description
============

The UCX extension leverages the Unified Communication X (UCX) library to disaggregate a graph in
the GXF framework. This extension facilitates graph distribution across multiple hosts, enabling
the utilization of distributed GPU resources.

UCX, an open-source library, accelerates data transfer across high-performance networks. It can
use GPUDirect RDMA technology to reduce network latency and maximize distributed GPU traffic
throughput. As a result, users of this extension can harness the combined processing power of
multiple GPUs across different hosts, which can substantially improve the speed and efficiency of
workflows.

For more UCX details, visit https://openucx.org.

**For Example**

The following diagram illustrates a disaggregated graph composed of two tensor generators and a
tensor comparator. The tensor comparator assesses the outputs produced by the tensor generators.
The UcxExtension makes it possible to execute each entity on a distinct host.

.. image:: /content/Ucx_extension_example.png
   :align: center
   :alt: Graph Example UCX Extension

For this, every graph that uses the UCX extension needs a UcxContext component. This component
hosts the UCP context, takes care of all connections, manages the data, and ensures that all
operations close properly at deinitialization.

When setting up your graph, replace your entity's standard transmitter and receiver with the
UcxTransmitter and UcxReceiver components. Be sure to configure all the parameters, including the
IP address and port, so that the connection is established properly.

Currently, UCX only supports sending messages that use the same type of memory (host or device).
This is a limitation of UCX, not of the extension.

* UUID: 525f8a1a-dfb5-426b-8ddb-00c3ac839994
* Version: 0.0.5
* Author: NVIDIA
* License: LICENSE

Requirements
============

* NVIDIA ConnectX6-DX NIC or later.
  For more information on installing and configuring NICs, see:
  https://docs.nvidia.com/networking/display/ConnectX6VPI/Introduction
* Mellanox Open Fabrics Enterprise Distribution (MLNX_OFED) - version 5.5 or later, see
  https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

  * For installation instructions, see
    https://docs.nvidia.com/networking/display/MLNXOFEDv551032/Installing+MLNX_OFED
  * If installing the Mellanox OFED within a container:

    * Make sure to install the kernel drivers in the host OS by passing the ``--all`` flag to the
      mlnxofedinstall script.
    * In the container, you can only install the user space libraries, by passing the
      ``--user-space-only`` flag to the mlnxofedinstall script.

* UCX - version 1.13 or later - must either be compiled with CUDA support or installed from the
  CUDA-enabled UCX packages published in the git repository, see
  https://github.com/openucx/ucx/releases

  * For installation instructions, follow the Release build instructions at
    https://github.com/openucx/ucx#release-builds. Note that the UCX library should be compiled
    with CUDA as follows:

    .. code-block:: bash

       $ ./contrib/configure-release --prefix=/install/path --enable-examples --with-java=no --with-cuda=/path/to/cuda --enable-mt

Components
==========

UcxContext
^^^^^^^^^^

UcxContext is essential within the GXF UCX extension. It is responsible for initializing the UCX
context, running listeners, and managing connection requests and data receipts for UcxReceivers.
UcxContext also sets up UcxTransmitter connections and resources. All connections, for both
UcxReceivers and UcxTransmitters, are managed within UcxContext. Upon completion of the graph,
UcxContext closes all connections and releases all resources.

* Component ID: 755d20a5-d794-467d-a86c-290eb2c32052
* Base Type: nvidia::gxf::NetworkContext
* Defined in: extensions/ucx/ucx_context.hpp

Parameters
++++++++++++

**serializer**

The entity serializer used by the component. Should be of type UcxEntitySerializer.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: nvidia::gxf::EntitySerializer

|

**reconnect**

Try to reconnect if a connection is closed during the run. A UcxReceiver waits for a new connect
request to establish a new connection. A UcxTransmitter sends a new connect request to the server
to establish a new connection.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_BOOL
* Default: true

|

**Optional GPU device resource**

Optional resource for GPU device.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: nvidia::gxf::GPUDevice

UcxTransmitter
^^^^^^^^^^^^^^^^^

Transmitter component for the GXF UCX extension. This component replaces the standard transmitter
of an entity. At the initialization stage, it sends a connect request to establish the connection.
When the Network Router executes the SyncOutbox function, it invokes the sync_io method of the
UcxTransmitter. This method, in turn, transmits the message using the UCX Active Message
Rendezvous protocol.

* Component ID: 58165d03-78b7-4696-b200-71621f90aee7
* Base Type: nvidia::gxf::Transmitter
* Defined in: extensions/ucx/ucx_transmitter.hpp

Parameters
++++++++++++

**capacity**

Capacity of the transmitter's queue.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_UINT64

|

**policy**

Queue policy for handling data. Valid values: 0: pop, 1: reject, 2: fault.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_UINT64

|

**receiver_address**

Receiver address to connect to.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_STRING

|

**port**

Port of the receiver.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**buffer**

Serialization buffer to hold serialized data.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: Handle<UcxSerializationBuffer>

|

**maximum_connection_retries**

Maximum number of retries for connection establishment.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**gpu_device**

Optional GPU device resource.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: Handle<GPUDevice>

UcxReceiver
^^^^^^^^^^^^^^^

Receiver component for the GXF UCX extension. This component replaces the standard receiver of an
entity. When an entity sends a message to this receiver, the UcxContext receives the message
header, prompting the router to execute the SyncInbox function. The SyncInbox function then
triggers the sync_io method of the UcxReceiver. This method uses the UCX Active Message
Rendezvous protocol to receive the data content of the message.

* Component ID: e961132b-45d5-48b8-ac5d-2bb1a4a42279
* Base Type: nvidia::gxf::Receiver
* Defined in: extensions/ucx/ucx_receiver.hpp

Parameters
++++++++++

**capacity**

Capacity of the receiver's queue.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_UINT64
* Default: 10

|

**policy**

Queue policy for handling data. Valid values: 0: pop, 1: reject, 2: fault.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_UINT64
* Default: 2

|

**address**

Listener address to receive data.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_STRING
* Default: "0.0.0.0"

|

**port**

Listener port for receiving data.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32
* Default: 13337

|

**buffer**

Serialization buffer to hold serialized data.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: UcxSerializationBuffer

|

**Optional GPU device resource**

Optional resource for GPU device.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: nvidia::gxf::GPUDevice

UcxComponentSerializer
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serializer for the components in the GXF UCX extension. Currently supports serializing Timestamp,
Tensor, Video Buffer, Audio Buffer, and integer components. Valid for sharing data between devices
with the same endianness.

* Component ID: 64994305-4260-4f5c-ac5f-69da6dd6cfa5
* Base Type: nvidia::gxf::ComponentSerializer
* Defined in: extensions/ucx/ucx_component_serializer.hpp

Parameters
++++++++++

**allocator**

Memory allocator for tensor components.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: nvidia::gxf::Allocator

UcxEntitySerializer
^^^^^^^^^^^^^^^^^^^^^^^^^^

Serializer for the entities in the GXF UCX extension.

* Component ID: 14997aa4-4a01-4cd4-86ab-687f85a13f10
* Base Type: nvidia::gxf::EntitySerializer
* Defined in: extensions/ucx/ucx_entity_serializer.hpp

Parameters
++++++++++

**component_serializers**

List of serializers for serializing and deserializing components.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: FixedVector<nvidia::gxf::Handle<nvidia::gxf::ComponentSerializer>, kMaxTempComponents>

|

**verbose_warning**

Whether to print verbose warnings.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_BOOL
* Default: true

UcxSerializationBuffer
^^^^^^^^^^^^^^^^^^^^^^^^^^

Serialization buffer for the GXF UCX extension.

* Component ID: 1d9fcaf7-1db1-4992-93ec-714979f7d78d
* Base Type: nvidia::gxf::Endpoint
* Defined in: extensions/ucx/ucx_serialization_buffer.hpp

Parameters
++++++++++

**allocator**

Memory allocator for tensor components.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_HANDLE
* Handle Type: nvidia::gxf::Handle<nvidia::gxf::Allocator>

|

**buffer_size**

Size of the buffer in bytes (4kB by default).

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_SIZE
* Default: 4096 (4kB)

Example
========

This section provides an example of using the UCX extension within a simple graph. The graph
comprises two subgraphs connected through the UCX extension. The configuration for the server
side and the client side is given in the respective YAML files below.

**Server side - test_ping_rx.yaml file:**

.. code-block:: yaml

   name: rx
   components:
   - name: allocator
     type: nvidia::gxf::test::MockAllocator
   - name: serialization_buffer
     type: nvidia::gxf::UcxSerializationBuffer
     parameters:
       allocator: allocator
   - name: signal
     type: nvidia::gxf::UcxReceiver
     parameters:
       address: 5.5.5.5
       port: 13337
       buffer: serialization_buffer
   - type: nvidia::gxf::MessageAvailableSchedulingTerm
     parameters:
       receiver: signal
       min_size: 1
   - type: nvidia::gxf::PingRx
     parameters:
       signal: signal
   - type: nvidia::gxf::test::StepCount
     parameters:
       expected_count: 10
   - type: nvidia::gxf::CountSchedulingTerm
     parameters:
       count: 10
   ---
   name: ucx
   components:
   - name: allocator
     type: nvidia::gxf::test::MockAllocator
   - name: component_serializer
     type: nvidia::gxf::UcxComponentSerializer
     parameters:
       allocator: allocator
   - name: entity_serializer
     type: nvidia::gxf::UcxEntitySerializer
     parameters:
       component_serializers: [ component_serializer ]
   - name: ucx_context
     type: nvidia::gxf::UcxContext
     parameters:
       serializer: entity_serializer
   ---
   name: scheduler
   components:
   - name: clock
     type: nvidia::gxf::RealtimeClock
   - type: nvidia::gxf::GreedyScheduler
     parameters:
       max_duration_ms: 1000000
       clock: clock
       stop_on_deadlock: false
   ---
   name: gpu_resource_entity_0
   components:
   - type: nvidia::gxf::GPUDevice
     name: gpu_resource_0
     parameters:
       dev_id: 0
   ---
   EntityGroups:
   - name: entity_group_0
     target:
     - "rx"
     - "ucx"
     - "gpu_resource_entity_0"

**Client side - test_ping_tx.yaml file:**

.. code-block:: yaml

   name: tx
   components:
   - name: allocator
     type: nvidia::gxf::test::MockAllocator
   - name: serialization_buffer
     type: nvidia::gxf::UcxSerializationBuffer
     parameters:
       allocator: allocator
   - name: signal
     type: nvidia::gxf::UcxTransmitter
     parameters:
       receiver_address: 5.5.5.5
       port: 13337
       buffer: serialization_buffer
   - type: nvidia::gxf::PingTx
     parameters:
       signal: signal
   - type: nvidia::gxf::CountSchedulingTerm
     parameters:
       count: 10
   - type: nvidia::gxf::test::StepCount
     parameters:
       expected_count: 10
   ---
   name: ucx
   components:
   - name: allocator
     type: nvidia::gxf::test::MockAllocator
   - name: component_serializer
     type: nvidia::gxf::UcxComponentSerializer
     parameters:
       allocator: allocator
   - name: entity_serializer
     type: nvidia::gxf::UcxEntitySerializer
     parameters:
       component_serializers: [ component_serializer ]
   - name: ucx_context
     type: nvidia::gxf::UcxContext
     parameters:
       serializer: entity_serializer
   ---
   name: scheduler
   components:
   - name: clock
     type: nvidia::gxf::RealtimeClock
   - type: nvidia::gxf::GreedyScheduler
     parameters:
       stop_on_deadlock: false
       max_duration_ms: 1000000
       clock: clock
   ---
   name: gpu_resource_entity_0
   components:
   - type: nvidia::gxf::GPUDevice
     name: gpu_resource_0
     parameters:
       dev_id: 0
   ---
   EntityGroups:
   - name: entity_group_0
     target:
     - "tx"
     - "ucx"
     - "gpu_resource_entity_0"
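
The optional parameters listed under Components can also be set explicitly in the graph file. The
following fragment is a minimal sketch, assuming the same entity and component names as the ping
example above. It shows only the components whose parameters change; the allocator, serializers,
and serialization buffer are assumed to be declared as in the full client file. The values chosen
for ``capacity``, ``policy``, and ``maximum_connection_retries`` are illustrative assumptions, not
documented defaults of the UcxTransmitter.

.. code-block:: yaml

   # Sketch only: other components of each entity are omitted for brevity.
   name: tx
   components:
   - name: signal
     type: nvidia::gxf::UcxTransmitter
     parameters:
       receiver_address: 5.5.5.5        # address of the host running the UcxReceiver
       port: 13337                      # port of the receiver
       buffer: serialization_buffer     # declared elsewhere in this entity
       capacity: 10                     # illustrative queue capacity
       policy: 2                        # 0: pop, 1: reject, 2: fault
       maximum_connection_retries: 5    # illustrative retry limit
   ---
   name: ucx
   components:
   - name: ucx_context
     type: nvidia::gxf::UcxContext
     parameters:
       serializer: entity_serializer    # declared elsewhere in this entity
       reconnect: true                  # documented default, written out explicitly

With ``reconnect`` enabled, a UcxTransmitter sends a new connect request to the server if its
connection is closed during the run, as described for the UcxContext component.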