Google has announced its research results on the universal speech model
Google recently announced research results for the Universal Speech Model (USM), a project it began investing in last November. The model has 2 billion parameters and was trained on 12 million hours of audio and 28 billion sentences of text, covering more than 300 languages. It currently supports speech recognition in over 100 languages, and Google aims to extend it to more than 1,000 languages.
According to Google, the universal speech model is built through self-supervised learning followed by fine-tuning. In the first stage, the BEST-RQ algorithm lets the model learn the structure of speech from unlabeled audio, without external supervision. By Google’s account, this stage makes up about 80% of the training.
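To make the BEST-RQ step more concrete, the following is a minimal sketch of a random-projection quantizer, the component that turns speech frames into discrete training targets. It assumes the commonly described setup in which a randomly initialized projection matrix and codebook are frozen and each frame is labeled with the index of its nearest codebook vector. The function names, dimensions, and Gaussian initialization here are illustrative assumptions, not Google's implementation.

```python
import random

def make_quantizer(feat_dim, proj_dim, codebook_size, seed=0):
    """Create a frozen random projection and codebook (never trained)."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0, 1) for _ in range(feat_dim)]
                  for _ in range(proj_dim)]
    codebook = [[rng.gauss(0, 1) for _ in range(proj_dim)]
                for _ in range(codebook_size)]
    return projection, codebook

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)

def quantize(frame, projection, codebook):
    """Return the index of the codebook vector nearest to the projected frame."""
    projected = l2_normalize(
        [sum(w * x for w, x in zip(row, frame)) for row in projection]
    )

    def squared_distance(code):
        unit_code = l2_normalize(code)
        return sum((a - b) ** 2 for a, b in zip(projected, unit_code))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(codebook[i]))

# Hypothetical sizes: 4-dimensional frames, 3-dimensional projection, 8 codes.
projection, codebook = make_quantizer(feat_dim=4, proj_dim=3, codebook_size=8)
label = quantize([0.1, -0.2, 0.3, 0.4], projection, codebook)
```

During pre-training, some frames of the input are masked and the model is asked to predict labels like `label` for the masked positions. Because the quantizer is random and fixed, the targets come from the audio itself and no human transcription is needed.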
In a second stage, the model undergoes multi-objective supervised pre-training, which combines text injection, BEST-RQ, and supervised loss functions. By incorporating knowledge from additional text data, the model learns to capture the content and meaning of speech. In the final stage, it is fine-tuned on downstream tasks using supervised loss functions.
Google states that pre-training alone already gives the model a good level of language understanding, so only a small amount of supervised data is needed in the final fine-tuning stage. On YouTube captioning, the model achieves an average word error rate (WER) below 30% across 73 languages.
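For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows the standard edit-distance calculation. The function name and the example sentences are illustrative, and real evaluations usually also normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A WER below 30% therefore means fewer than 30 word-level errors per 100 reference words. Because insertions count as errors, WER can exceed 100% for very poor transcripts.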
For American English, Google says the universal speech model achieves a 6% relative reduction in word error rate compared with its current state-of-the-art internal model. Google also compared it with OpenAI’s large speech model Whisper on the 18 languages that Whisper can transcribe with a word error rate below 40%. On those languages, the universal speech model’s word error rate is on average 32.7% lower, in relative terms, than Whisper’s.
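Comparisons between speech models are often reported as a relative reduction in WER rather than as an absolute error rate, and the two are easy to confuse. The short sketch below shows the arithmetic. The function name and the WER values are hypothetical, chosen only to illustrate the calculation, and are not Google's measurements.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline at 20% WER and a new model at 13.46% WER.
reduction = relative_wer_reduction(0.20, 0.1346)
print(f"{reduction:.1%}")  # prints 32.7%
```

In this made-up case the absolute gap is only about 6.5 percentage points, yet the relative reduction is 32.7%. A relative figure says nothing about the absolute error rate of either model.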
Google also emphasizes that the model recognizes speech more accurately than Whisper on three public benchmarks: CORAAL, which covers African American English speakers; SpeechStew, a mixed collection of English speech datasets; and FLEURS, which spans 102 languages. In automatic speech translation, Google adds, the universal speech model achieves better BLEU scores than Whisper.
Google has published a research paper on the universal speech model and is offering API access to researchers for further research and applications.
In an earlier statement, Google said it believes that removing language barriers will open up more opportunities for application development and attract more people to its services.