Point To Point Communication Functions
NCCL provides two types of point-to-point communication primitives: two-sided operations and one-sided operations.
Two-Sided Point-to-Point Operations
(Since NCCL 2.7) Two-sided point-to-point primitives are used when ranks need to exchange arbitrary data that cannot be expressed as a broadcast or allgather, i.e. when each rank sends and receives different data. Both sender and receiver must explicitly participate.
ncclSend
ncclResult_t ncclSend(const void *sendbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream)
Send data from sendbuff to rank peer.
Rank peer needs to call ncclRecv with the same datatype and the same count as this rank.
This operation is blocking for the GPU. If multiple ncclSend() and ncclRecv() operations need to progress concurrently to complete, they must be fused within a ncclGroupStart()/ncclGroupEnd() section.
Related links: Point-to-point communication.
ncclRecv
ncclResult_t ncclRecv(void *recvbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream)
Receive data from rank peer into recvbuff.
Rank peer needs to call ncclSend with the same datatype and the same count as this rank.
This operation is blocking for the GPU. If multiple ncclSend() and ncclRecv() operations need to progress concurrently to complete, they must be fused within a ncclGroupStart()/ncclGroupEnd() section.
Related links: Point-to-point communication.
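The matching rule for ncclSend and ncclRecv (same count and datatype on both sides) determines the receive counts of an all-to-all style exchange once the send counts are known. The sketch below is plain C and does not call NCCL, so it compiles without the library or a GPU. The helper derive_recv_counts, the flat row-major count matrices, and the buffer names in the trailing comment are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* send_count[r * nranks + p]: elements rank r passes to ncclSend for peer p.
 * recv_count[r * nranks + p]: elements rank r must pass to ncclRecv for peer p.
 * Because both sides must use the same count, recv_count is the transpose
 * of send_count. (Hypothetical helper, not part of NCCL.) */
void derive_recv_counts(int nranks, const size_t *send_count, size_t *recv_count)
{
    assert(nranks > 0);
    for (int r = 0; r < nranks; r++) {
        for (int p = 0; p < nranks; p++) {
            recv_count[r * nranks + p] = send_count[p * nranks + r];
        }
    }
}

/* On rank r, with NCCL available, the exchange would then be issued as one
 * group so that all sends and receives can progress concurrently:
 *
 *   ncclGroupStart();
 *   for each peer p:
 *     ncclSend(sendbuf_p, send_count[r][p], datatype, p, comm, stream);
 *     ncclRecv(recvbuf_p, recv_count[r][p], datatype, p, comm, stream);
 *   ncclGroupEnd();
 *
 * sendbuf_p and recvbuf_p are hypothetical per-peer buffers. */
```

In practice each rank only needs its own row of recv_count; the full matrix is computed here to make the symmetry between the two sides easy to check.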
One-Sided Point-to-Point Operations (RMA)
One-sided Remote Memory Access (RMA) operations enable ranks to directly access remote memory without
explicit participation from the target process. These operations require the target memory to be
pre-registered within a symmetric memory window using ncclCommWindowRegister().
ncclPutSignal
ncclResult_t ncclPutSignal(const void *localbuff, size_t count, ncclDataType_t datatype, int peer, ncclWindow_t peerWin, size_t peerWinOffset, int sigIdx, int ctx, unsigned int flags, ncclComm_t comm, cudaStream_t stream)
Write data from localbuff to rank peer’s registered memory window peerWin at offset peerWinOffset, then update a remote signal.
The target memory window peerWin must be registered using ncclCommWindowRegister().
The sigIdx is the signal index identifier for the operation. It must be set to 0 for now.
The ctx is the context identifier for the operation. It must be set to 0 for now.
The flags parameter is reserved for future use. It must be set to 0 for now.
The return of ncclPutSignal() to the CPU thread indicates that the operation has been successfully enqueued to the CUDA stream. At the completion of ncclPutSignal() on the CUDA stream, localbuff is safe to reuse or modify. When a signal is updated on the remote peer, it guarantees that the data from the corresponding ncclPutSignal() operation has been delivered to the remote memory. All prior ncclPutSignal() and ncclSignal() operations to the same peer and context have also completed their signal updates.
Related links: Point-to-point communication.
ncclSignal
ncclResult_t ncclSignal(int peer, int sigIdx, int ctx, unsigned int flags, ncclComm_t comm, cudaStream_t stream)
Send a signal to rank peer without transferring data.
The sigIdx is the signal index identifier for the operation. It must be set to 0 for now.
The ctx is the context identifier for the operation. It must be set to 0 for now.
The flags parameter is reserved for future use. It must be set to 0 for now.
When a signal is updated on the remote peer, all prior ncclPutSignal() and ncclSignal() operations to the same peer and context have also completed their signal updates.
Related links: Point-to-point communication.
ncclWaitSignal
type ncclWaitSignalDesc_t
Descriptor that specifies how many signal operations to wait for from a particular rank on a given signal index and context.
int opCnt
Number of signal operations to wait for.
int peer
Target peer to wait for signals from.
int sigIdx
Signal index identifier. Must be set to 0 for now.
int ctx
Context identifier. Must be set to 0 for now.
ncclResult_t ncclWaitSignal(int nDesc, ncclWaitSignalDesc_t *signalDescs, ncclComm_t comm, cudaStream_t stream)
Wait for signals as described in the signal descriptor array.
The nDesc parameter specifies the number of signal descriptors in the signalDescs array. Each descriptor indicates how many signals (opCnt) to expect from a specific peer on a particular signal index (sigIdx) and context (ctx).
The return of ncclWaitSignal() to the CPU thread indicates that the operation has been successfully enqueued to the CUDA stream. At the completion of ncclWaitSignal() on the CUDA stream, all specified signal operations have been received and the corresponding data is visible in local memory.
Related links: Point-to-point communication.
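As an illustration of how a descriptor array for ncclWaitSignal() might be filled, the sketch below builds one descriptor per remote rank, each expecting a fixed number of signals. It is plain C and does not include nccl.h: WaitDesc is a local stand-in that mirrors the fields listed for ncclWaitSignalDesc_t, and build_wait_descs is a hypothetical helper, not an NCCL function.

```c
#include <assert.h>

/* Local stand-in mirroring the documented fields of ncclWaitSignalDesc_t.
 * With NCCL available, the real type from nccl.h would be used instead. */
typedef struct {
    int opCnt;   /* number of signal operations to wait for */
    int peer;    /* rank the signals come from */
    int sigIdx;  /* signal index, must be 0 for now */
    int ctx;     /* context identifier, must be 0 for now */
} WaitDesc;

/* Fill descs with one entry per rank other than `rank`, each expecting
 * opsPerPeer signals. descs must have room for nranks - 1 entries.
 * Returns the number of descriptors written. (Hypothetical helper.) */
int build_wait_descs(int rank, int nranks, int opsPerPeer, WaitDesc *descs)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    int n = 0;
    for (int p = 0; p < nranks; p++) {
        if (p == rank) {
            continue;  /* no descriptor for the local rank */
        }
        descs[n].opCnt = opsPerPeer;
        descs[n].peer = p;
        descs[n].sigIdx = 0;
        descs[n].ctx = 0;
        n++;
    }
    return n;
}
```

With NCCL, the same loop would fill an array of ncclWaitSignalDesc_t, and the array together with the returned count would be passed as signalDescs and nDesc to ncclWaitSignal().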