Successful Cases

Publish Date : 2021/07/10

NVIDIA A100 GPU helps improve intelligent voice technology. The company uses the NVIDIA A100 Tensor Core GPU to run speech recognition training on tens of thousands of hours of core enterprise data. In testing, the A100 roughly doubles training speed (under the same software conditions) and improves the efficiency of algorithm development and production.

Introduction and application background

In recent years, AI voice applications have been widely implemented. We have successively launched voice robots for human-machine conversation in outbound-call, inbound-call, and network audio scenarios, as well as Lingxi, an intelligent voice analysis platform for call centers, private calls, and micro-chats. Speech recognition technology plays an important role in all of these.

Speech recognition models are scene-dependent to a certain degree. To develop our own speech recognition engine and train our own models, we have accumulated tens of thousands of hours of business recording data, and the volume is still growing, which challenges our computing power. By using the most advanced high-performance computing hardware available as of mid-2021 together with distributed GPU training, we have successfully trained on tens of thousands of hours of voice data, run inference, and delivered the output to various end services.


As a leader in its industry, the company has in-depth plans for the AI field and is an experienced enterprise-class GPU user. Its business applications run on the older Pascal architecture, and the commonly used speech recognition frameworks are Kaldi, TensorFlow, and PyTorch. Training a speech recognition model on tens of thousands of hours of data usually takes nearly a month. The main limit is computing power, which is determined by the hardware architecture, memory bandwidth, and transistor count. In addition, the Pascal-architecture P40 has 24 GB of GDDR5 memory, so for larger models the amount of data that fits in a single iteration is limited. This delays development and production and drags down core productivity.

NVIDIA GPUs have long been leaders in accelerating deep learning training. The newly launched NVIDIA A100 is another breakthrough in peak AI computing power, and its 40 GB of HBM2 memory leaves more headroom for larger models, so the A100 was chosen as the key test platform for large-scale speech recognition training.


Of the commonly used speech recognition frameworks (Kaldi, TensorFlow, and PyTorch), we chose Kaldi for the test. Using a fixed amount of speech training data, we ran single-GPU training on different GPU models and recorded only the time spent on deep learning model training. The results are as follows:

GPU model                       Training data   Training time   GPU utilization
GeForce RTX 2080 Ti             145 hr          83 min          97%
Tesla T4                        145 hr          183 min         100%
Tesla P40                       145 hr          162 min         100%
Single Tesla V100 PCIe 32GB     145 hr          64 min          100%
Single Quadro RTX 6000          145 hr          74 min          100%
A100                            145 hr          78 min          100%

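The table shows that the A100 significantly speeds up speech recognition training compared with the Pascal-generation P40 used in production: 78 min against 162 min, roughly twice as fast. Under the same software and data scale conditions, the reported improvement is 43% over the previous generation and 73% over the earlier T4. These two percentages do not follow from the times as printed (78 min against the T4's 183 min is about 57% less time), so either the A100 row or the percentages may contain a transcription error.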
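The percentages quoted for a benchmark like this depend on which row is taken as the baseline and on whether "improvement" means time saved or speedup. The following is a minimal Python sketch for checking such figures, assuming the training times exactly as printed in the table. The helper names (`time_reduction`, `speedup`, `compare_to`) are illustrative and do not come from the original article.

```python
# Training times in minutes, copied from the table as printed
# (each run used the same 145 hr of training data).
TRAIN_MINUTES = {
    "GeForce RTX 2080 Ti": 83,
    "Tesla T4": 183,
    "Tesla P40": 162,
    "Tesla V100 PCIe 32GB": 64,
    "Quadro RTX 6000": 74,
    "A100": 78,
}


def time_reduction(baseline_min, new_min):
    """Fraction of training time saved relative to the baseline (negative if slower)."""
    return (baseline_min - new_min) / baseline_min


def speedup(baseline_min, new_min):
    """How many times faster the new run is than the baseline."""
    return baseline_min / new_min


def compare_to(target, table=TRAIN_MINUTES):
    """Return {gpu: (time_reduction, speedup)} for `target` against every other row."""
    target_min = table[target]
    return {
        gpu: (time_reduction(minutes, target_min), speedup(minutes, target_min))
        for gpu, minutes in table.items()
        if gpu != target
    }


if __name__ == "__main__":
    for gpu, (saved, factor) in compare_to("A100").items():
        print(f"A100 vs {gpu}: {saved:+.1%} time saved, {factor:.2f}x speed")
```

Keeping the two definitions separate matters because they give different numbers for the same pair of runs: halving the training time is a 50% time reduction but a 2x speedup.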

Effect and influence

With the NVIDIA A100 GPU, speech recognition model training keeps the GPU fully utilized and takes markedly less time, so multi-scenario speech recognition models can be trained quickly. These models power voice robots that identify high-intent potential customers and improve the efficiency of sales staff. Voice robots are also widely used in other business modules, such as information notifications and internal business alerts.