Over the next four years, the DeepSpeech team released newer iterations of the model that could “human-accurately” transcribe seminars, phone calls, television shows, radio shows, and other live streams.
However, Mozilla plans to wind down DeepSpeech development and maintenance in the coming months as it transitions into an advisory role, which will include launching a grant program to fund initiatives that demonstrate DeepSpeech applications.
DeepSpeech isn’t the only open source project of its kind, but it is one of the most well-developed. The model, which is based on Baidu research papers, is an end-to-end trainable, character-level architecture that can transcribe audio in a variety of languages.
One of Mozilla’s primary goals was to achieve a transcription word error rate below 10%, and the most recent versions of the pretrained English-language model meet that goal, averaging a word error rate of around 7.5%.
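For readers unfamiliar with the metric, word error rate is the word-level edit distance between a model’s transcript and a reference transcript, divided by the number of words in the reference. Here is a minimal Python illustration (not Mozilla’s evaluation code):

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed as a word-level edit distance via dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 0.1667 (one deletion)
```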
DeepSpeech, according to Mozilla, has progressed to the point where the next step is to work on developing applications. To that end, the company intends to hand over the project to “people and organizations” who are interested in continuing “use-case-based explorations.” Mozilla claims to have streamlined the continuous integration processes for getting DeepSpeech up and running with as few dependencies as possible.
In addition, as it cleans up the documentation and prepares to end staff maintenance of the codebase, Mozilla says it will publish a toolkit to help researchers, businesses, and anyone else interested in using DeepSpeech to build voice-based solutions.
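To give a sense of how little is involved in getting started, here is a hedged sketch of transcribing a WAV file with the released deepspeech Python package (installable via pip); the file names are placeholders, and the audio is assumed to be 16-bit, 16 kHz mono PCM, which is what the pretrained English model expects:

```python
import wave
import numpy as np
import deepspeech

# Load a pretrained acoustic model and (optionally) an external scorer
# that acts as a language model. File names here are placeholders.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# Read raw 16-bit PCM samples from a 16 kHz mono WAV file.
with wave.open("audio.wav", "rb") as wav:
    frames = wav.readframes(wav.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))  # prints the transcript
```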
A Short History of DeepSpeech
DeepSpeech development at Mozilla began in late 2017, with the goal of creating a model that takes audio features — speech — as input and outputs characters directly. The researchers hoped to create a system that could be trained using Google’s TensorFlow framework through supervised learning, in which the model learns to infer patterns from labeled speech datasets.
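The resulting models were trained with connectionist temporal classification (CTC), which lets a network that emits per-frame character probabilities learn from transcripts alone, with no frame-level alignments. The sketch below illustrates the idea in TensorFlow; it is not Mozilla’s actual architecture, and the layer sizes, feature count, and alphabet size are assumptions:

```python
import tensorflow as tf

NUM_FEATURES = 26   # e.g. MFCC coefficients per audio frame (assumption)
ALPHABET_SIZE = 29  # a-z, space, apostrophe, plus the CTC blank (assumption)

# A recurrent acoustic model: variable-length audio features in,
# per-frame character logits out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, NUM_FEATURES)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dense(ALPHABET_SIZE),
])

def ctc_loss(labels, logits, label_len, logit_len):
    # CTC marginalizes over all alignments between audio frames and the
    # character transcript, so only (audio, transcript) pairs are needed.
    return tf.nn.ctc_loss(labels=labels, logits=logits,
                          label_length=label_len, logit_length=logit_len,
                          logits_time_major=False, blank_index=-1)
```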
The most recent DeepSpeech model has tens of millions of parameters, the internal values a model learns during training. The Mozilla Research team began training it on a single machine with four Titan X Pascal GPUs, but gradually scaled up to two servers, each with eight Titan XP GPUs. In the early days of the project, training a high-performing model took around a week.
In the years since, Mozilla has worked to shrink the DeepSpeech model while improving its efficiency and staying below the 10% error rate goal. The English-language model’s memory footprint fell from 188MB to 47MB, roughly a fourfold reduction. As of December 2019, DeepSpeech could run “faster than real time” on a single core of a Raspberry Pi 4.
DeepSpeech was initially trained by Mozilla using freely available datasets like TED-LIUM and LibriSpeech, as well as paid corpora like Fisher and Switchboard, but these proved insufficient. As a result, the team contacted public television and radio stations, language study departments at universities, and others they suspected might have labeled speech data to share. They were able to more than double the amount of training data for the English-language DeepSpeech model as a result of this initiative.
Inspired by these data collection efforts, the Mozilla Research team partnered with Mozilla’s Open Innovation team to launch the Common Voice project, which collects and verifies speech contributions from volunteers. Common Voice includes not only voice snippets but also voluntarily submitted metadata, such as speaker age, gender, and accent, that can be used to train speech engines. It has also expanded to include target segments for particular purposes and use cases, such as the digits 0 through 9 and the phrases “yes,” “no,” “hey,” and “Firefox.”
With more than 9,000 hours of speech data in 60 languages, including less commonly spoken languages such as Welsh and Kinyarwanda, Common Voice is now one of the world’s largest multi-language public domain voice corpora. To date, more than 164,000 people have contributed to the dataset.
Nvidia today announced a $1.5 million investment in Common Voice to help the project engage more communities and volunteers and to support the hiring of new staff. Common Voice will now operate under the umbrella of the Mozilla Foundation, as part of the foundation’s efforts to make AI more trustworthy.
The Grant Program
As DeepSpeech development comes to a close, Mozilla says its upcoming grant program will prioritize projects that contribute to the core technology while also demonstrating its potential to “empower and enrich” areas that may not otherwise have a viable path toward speech-based interaction. In May, Mozilla will release a playbook to show people how to use DeepSpeech’s codebase as a starting point for voice-powered applications.