Gst-nvds_text_to_speech (Alpha)
The Gst-nvds_text_to_speech plugin performs speech synthesis on the input text. It is supported on both x86 and Jetson platforms.
The plugin provides a mechanism to load a custom Text To Speech (TTS) low-level library at runtime.
By default, the plugin loads the DS-Riva TTS library (libnvds_riva_tts.so) to perform speech synthesis.
The library communicates with the TTS service of the NVIDIA Riva SDK for speech synthesis using optimized Riva TTS models.
Note
The Gst-nvds_text_to_speech plugin is being released as an alpha feature.
The DS-Riva Text To Speech library uses the gRPC API to access the Riva TTS service. The Riva TTS service must be started before using this plugin. The gRPC C++ libraries (v1.38) must be installed on the client side.
Note
The DS-Riva TTS library (libnvds_riva_tts.so) works with NVIDIA Riva Release 2.0.0 or later.
The plugin accepts text (UTF8) GStreamer buffers (GstBuffers) from the upstream component and transforms the text into audio GStreamer buffer output.
The DS-Riva TTS library (libnvds_riva_tts.so) generates raw audio data in S16LE format (signed 16-bit little-endian) at a 22050 Hz sample rate. Library settings are configured through a YAML file, which is passed to the library by setting a property on the nvds_text_to_speech GStreamer plugin. The file is split into multiple parts that cover plugin control and Riva TTS service configuration.
As shown in the diagram below, input text is sent to the Riva TTS service for speech synthesis. The final output is available as S16LE PCM audio at 22050 Hz.
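Because the output is headerless PCM, a file captured from the plugin cannot be opened directly by most audio players. The following is a minimal sketch, not part of DeepStream, that wraps such raw data in a WAV container using only the Python standard library. The function name is hypothetical, and a single (mono) channel is an assumption.

```python
import wave

SAMPLE_RATE_HZ = 22050   # sample rate stated in the text above
SAMPLE_WIDTH_BYTES = 2   # S16LE: signed 16-bit little-endian samples
CHANNELS = 1             # assumption: mono audio


def raw_s16le_to_wav(pcm: bytes, wav_path: str) -> float:
    """Write raw S16LE PCM bytes to a WAV file and return the duration in seconds."""
    bytes_per_frame = SAMPLE_WIDTH_BYTES * CHANNELS
    if len(pcm) % bytes_per_frame != 0:
        raise ValueError("PCM data is not a whole number of 16-bit samples")
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # WAV stores 16-bit samples little-endian, like S16LE
    return len(pcm) / (bytes_per_frame * SAMPLE_RATE_HZ)
```

For example, 44100 bytes of raw output correspond to 22050 samples, that is, one second of audio.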
Inputs and Outputs
This section summarizes the inputs, outputs, and communication facilities of the Gst-nvds_text_to_speech plugin with DS-Riva TTS implementation.
Input
Text GStreamer buffers
Control parameters
customlib-name: Set a custom TTS library that the plugin loads to perform speech synthesis. By default, the DS-Riva TTS library (libnvds_riva_tts.so) is set.
create-speech-ctx-func: Symbol name to create the TTS speech context. Default: create_text_to_speech_ctx
config-file: A text file to configure the plugin and the DS-Riva TTS service requests.
Output
Raw audio GStreamer buffers containing the synthesized speech
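As a rough illustration of how the control parameters above might be set, the sketch below assembles the element portion of a GStreamer pipeline description as a list of tokens. The helper name is hypothetical, and using nvds_text_to_speech as the element name is an assumption. The upstream text source and downstream audio sink are deliberately left out because they depend on the application.

```python
def tts_element_tokens(config_file, customlib_name=None, create_speech_ctx_func=None):
    """Return pipeline-description tokens for the TTS element (hypothetical helper).

    Only properties listed under "Control parameters" are used. Optional
    properties are omitted so that the plugin falls back to its defaults.
    """
    tokens = ["nvds_text_to_speech", f"config-file={config_file}"]
    if customlib_name is not None:
        tokens.append(f"customlib-name={customlib_name}")
    if create_speech_ctx_func is not None:
        tokens.append(f"create-speech-ctx-func={create_speech_ctx_func}")
    return tokens


# Example: default library, configuration taken from a local file
print(" ".join(tts_element_tokens("riva_tts_conf.yml")))
```

The tokens are plain strings, so values containing spaces would need additional quoting before being passed to a pipeline parser.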
Features
The following table summarizes the features of the plugin.
Feature | Description | Release |
---|---|---|
TTS template | The plugin provides a Text To Speech base which can support runtime loading of a custom TTS library | DS 6.0 |
DS-Riva TTS library and Context | Default TTS library based on the Riva TTS gRPC service | DS 6.0 |
Live speech synthesis | Supports speech synthesis in real time using the streaming mode of the Riva TTS service | DS 6.0 |
Languages support | English is supported at present | DS 6.0 |
Audio format | Outputs S16LE Linear PCM mono audio at 22050 Hz | DS 6.0 |
Frame size | Supports configurable output frame size | DS 6.0 |
x86 platform support | – | DS 6.0 |
Jetson platform support | – | DS 6.2 |
DS-Riva TTS YAML File Configuration Specifications
The DS-Riva TTS configuration file uses the YAML 1.2 file format: https://yaml.org/spec/1.2/spec.html.
There are multiple parts in the configuration file. An example is located at /opt/nvidia/deepstream/deepstream/sources/apps/audio_apps/deepstream_asr_tts_app/riva_tts_conf.yml.
Each part has a name indicating a unique part name and a detail indicating the setting details.
The name: riva_server part configures the Riva server URI in its corresponding detail: node.
The name: riva_tts_stream part configures the features supported by the Riva TTS service in its corresponding detail: node.
The name: ds_riva_tts_plugin part configures the DS-Riva TTS settings in its corresponding detail: node.
A separator line with --- is inserted between two neighboring parts, in accordance with the YAML specification.
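To make the multi-part layout concrete, the sketch below renders a three-part configuration as text, with a --- separator between neighboring parts. It is an illustration only and does not reproduce the shipped riva_tts_conf.yml. The helper names are hypothetical, and the values are the example values given in the property tables of this document.

```python
def yaml_scalar(value):
    """Format a value as a YAML scalar; strings containing ':' are double-quoted."""
    if isinstance(value, int):
        return str(value)
    text = str(value)
    return f'"{text}"' if ":" in text else text


def render_part(name, detail):
    """Render one part: a unique name plus a nested detail node."""
    lines = [f"name: {name}", "detail:"]
    for key, value in detail.items():
        lines.append(f"  {key}: {yaml_scalar(value)}")
    return "\n".join(lines)


def render_config(parts):
    """Join the parts with a '---' separator line between neighbors."""
    return "\n---\n".join(render_part(n, d) for n, d in parts) + "\n"


# Example values taken from the property tables (assumed typical, not defaults)
EXAMPLE_PARTS = [
    ("riva_server", {"server_uri": "localhost:50051"}),
    ("riva_tts_stream", {"encoding": "LINEAR_PCM",
                         "language_code": "en-US",
                         "voice_name": "ljspeech"}),
    ("ds_riva_tts_plugin", {"output_mode": 0,
                            "framing_mode": 2,
                            "frame_size": 2205}),
]

if __name__ == "__main__":
    print(render_config(EXAMPLE_PARTS), end="")
```

The rendered text can be written to a file and passed to the plugin through the config-file property.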
Gst Properties
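The following tables describe the properties of each part of the configuration file used by the Gst-nvds_text_to_speech plugin with the DS-Riva TTS library: riva_server, riva_tts_stream, and ds_riva_tts_plugin.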
Property | Meaning | Type and Range | Example Notes |
---|---|---|---|
name | Unique name | String | name: riva_server |
detail | Node for Riva server setting details | Node | detail: server_uri: "localhost:50051" |
server_uri | Part of detail node. Specify address of the Riva TTS service | String | server_uri: "localhost:50051" |

Property | Meaning | Type and Range | Example Notes |
---|---|---|---|
name | Unique name | String | Must be name: riva_tts_stream |
detail | Node for Riva TTS stream setting details | Node | detail: encoding: LINEAR_PCM |
encoding | Part of detail node. Specify output audio encoding format. Only LINEAR_PCM is supported | String | encoding: LINEAR_PCM |
language_code | Part of detail node. Specify which language is used for speech synthesis. Currently only en-US is supported | String | language_code: en-US |
voice_name | Part of detail node. Specify the voice name parameter used for speech synthesis | String | voice_name: ljspeech |

Property | Meaning | Type and Range | Example Notes |
---|---|---|---|
name | Unique name | String | Must be name: ds_riva_tts_plugin |
detail | Node for DS-Riva TTS library details | Node | detail: output_mode: 0 |
output_mode | Part of detail node. Specify output mode. Output mode 0: Default. Outputs audio as received from the Riva server. Suitable for non-real-time sinks like filesink. Output mode 1: Inserts silence in the output when audio from the server is not available. Suitable for real-time/live sinks like autoaudiosink. | Integer: 0 or 1 | output_mode: 1 |
framing_mode | Part of detail node. Specify framing mode. Framing mode 0: Default. Uses the output chunk size as received from the Riva server. Framing mode 1: Splits the audio received from the server into chunks of the size specified by frame_size. The last chunk is not padded if it has fewer than frame_size samples. Framing mode 2: Splits the audio into chunks of frame_size samples, with the last chunk padded to frame_size. | Integer: 0, 1, or 2 | framing_mode: 2 |
frame_size | Part of detail node. Specify output frame size in number of samples. Used with framing mode 1 or 2, or with output mode 1. | Integer: 1 to 65535 | frame_size: 2205 |
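The effect of framing_mode and frame_size can be sketched in a few lines. This is an illustration of the behavior described in the table, not the library's implementation. It treats mode 2 as the padded variant, and it assumes padding is done with zero-valued (silent) samples. The function names are hypothetical.

```python
def split_frames(samples, frame_size, framing_mode):
    """Split a list of samples into frames according to the framing mode.

    Mode 0: keep the chunk as received.
    Mode 1: frames of frame_size samples, last frame left short.
    Mode 2: frames of frame_size samples, last frame zero-padded (assumption).
    """
    if framing_mode == 0:
        return [list(samples)]
    if framing_mode not in (1, 2):
        raise ValueError("framing_mode must be 0, 1, or 2")
    if not 1 <= frame_size <= 65535:
        raise ValueError("frame_size must be between 1 and 65535")
    frames = [list(samples[i:i + frame_size])
              for i in range(0, len(samples), frame_size)]
    if framing_mode == 2 and frames and len(frames[-1]) < frame_size:
        frames[-1] += [0] * (frame_size - len(frames[-1]))
    return frames


def frame_duration_ms(frame_size, sample_rate_hz=22050):
    """Duration of one frame in milliseconds at the given sample rate."""
    return 1000.0 * frame_size / sample_rate_hz
```

With the example value frame_size: 2205 and a 22050 Hz sample rate, each frame holds 100 ms of audio, which is 4410 bytes of 16-bit samples.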
Riva TTS Service Initiation
Refer to https://docs.nvidia.com/deeplearning/riva/user-guide/docs/quick-start-guide.html#local-deployment-using-quick-start-scripts for the procedure to start Riva TTS service.
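Since the Riva TTS service must be running before the plugin is used, it can help to check that something is listening at the configured address before starting a pipeline. The sketch below is a plain TCP reachability check using the Python standard library. It does not verify that the listener is actually a Riva server. The function name is hypothetical, and localhost:50051 is the example server_uri from the configuration tables.

```python
import socket


def is_endpoint_listening(server_uri="localhost:50051", timeout_s=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    host, sep, port = server_uri.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {server_uri!r}")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts
        return False


if __name__ == "__main__":
    print("listening" if is_endpoint_listening() else "not reachable")
```

A successful check only shows that the port is open. A failed check is a strong hint that the service has not been started or that server_uri points to the wrong address.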
gRPC C++ Installation
The gRPC C++ shared libraries (v1.38) must be installed for the DS-Riva TTS library to access the Riva TTS gRPC service.
To install the libraries, follow the steps given at https://grpc.io/docs/languages/cpp/quickstart/ and add -DBUILD_SHARED_LIBS=ON to the cmake build options. (Using make -j4 instead of make -j is recommended.)
Alternatively, use the included script to install the gRPC C++ libraries. The script performs the same steps:
$ cd /opt/nvidia/deepstream/deepstream/sources/apps/audio_apps/deepstream_asr_app
$ sudo chmod +x gRPC_installation.sh
$ ./gRPC_installation.sh
Run the following command to add the installation path to the LD_LIBRARY_PATH environment variable:
$ export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
The gRPC C++ libraries are pre-installed on the DeepStream dGPU docker images. In the dGPU docker container, run the following command to add the installation path to the LD_LIBRARY_PATH environment variable:
$ export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
Sample Application
A sample application using the plugin is available here: sources/apps/audio_apps/deepstream_asr_tts_app. Follow the README to run the tests.