Microsoft describes VALL-E as a “neural codec language model.” It builds on EnCodec, a neural audio codec that Meta announced in October 2022. Unlike other text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds, uses EnCodec to break that information into discrete components (called “tokens”), and then draws on its training data to predict how that voice would sound speaking other phrases.
In this way, the system captures both a person’s voice and their tone of speaking, and can then read out any written text in that person’s voice and speaking style.
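To make the idea of audio “tokens” concrete, here is a minimal toy sketch of turning a waveform into discrete codes and back. It is not EnCodec or VALL-E code: EnCodec uses a learned neural encoder and quantizer, whereas this example assumes a tiny hand-written codebook, and every name in it (`CODEBOOK`, `tokenize`, `detokenize`) is hypothetical.

```python
# Toy illustration only. A real neural codec learns its codebooks from data;
# here a hand-written codebook of 2-sample frames stands in for it.

def frame_signal(samples, frame_size):
    """Split samples into consecutive non-overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame
    (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(frame, code))
                 for code in codebook]
    return distances.index(min(distances))

def tokenize(samples, codebook):
    """Map a waveform to a list of discrete token ids."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook)
            for f in frame_signal(samples, frame_size)]

def detokenize(tokens, codebook):
    """Rebuild an approximate waveform from token ids."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out

# Hypothetical codebook: four entries, each a 2-sample frame.
CODEBOOK = [(0.0, 0.0), (0.5, 0.5), (-0.5, -0.5), (1.0, -1.0)]

samples = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.9, -0.8]
tokens = tokenize(samples, CODEBOOK)
approx = detokenize(tokens, CODEBOOK)
print(tokens)  # [0, 1, 2, 3]
```

A language model working on such tokens predicts sequences of small integers rather than raw waveform samples, which is the general idea behind the “neural codec language model” label.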
Microsoft trained VALL-E’s speech synthesis capabilities on LibriLight, an audio library assembled by Meta. It contains over 60,000 hours of English-language speech from more than 7,000 speakers, drawn mostly from LibriVox public domain audiobooks. For VALL-E to produce a good result, the voice in the three-second sample must closely match a voice in its training data.
Microsoft has not released the VALL-E code publicly, in order to keep it from being misused. The researchers appear to be aware of the potential social harm this technology could cause.
Disclaimer: This post has been auto-published from an agency/news feed without any modifications to the text and has not been reviewed by an editor.