The name recalls Pixar's friendly little robot, but Vall-E is in fact a Microsoft technology able to imitate any voice. To do so, the system needs a speech sample just three seconds long. It could be used in film dubbing, for example, although malicious uses are also possible.
Microsoft describes Vall-E as a “Neural Codec Language Model”. The name reflects the fact that the project is built on EnCodec, a technology from Meta (Facebook) that uses artificial intelligence to compress audio without noticeable loss of quality.
A technology like Vall-E aims to turn written text into speech. As Ars Technica points out, other systems of this kind usually synthesize speech by manipulating waveforms. Vall-E is different: Microsoft's technology generates “acoustic tokens” for this purpose.
Basically, Vall-E analyzes the sample (remember, it only needs to be three seconds long) and uses EnCodec to break this information into discrete components (those tokens). Then, based on its training data, the system estimates how that voice would sound speaking other lines.
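To make the idea of “acoustic tokens” more concrete, here is a toy sketch in Python. It is not EnCodec or Vall-E code. It only illustrates the general principle of replacing short chunks of a waveform with the index of the closest entry in a codebook. The tiny hand-written CODEBOOK and the function names (split_into_frames, tokenize, detokenize) are hypothetical and exist only for this illustration. A real neural codec learns its codebooks from data.

```python
import math

# Hypothetical toy codebook: each entry stands for one short chunk of audio.
# A real neural codec learns much larger codebooks from training data.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [-0.5, -0.5, -0.5, -0.5],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
]


def split_into_frames(samples, frame_size):
    """Cut the waveform into equal-length frames, dropping leftover samples."""
    usable = len(samples) - len(samples) % frame_size
    return [samples[i:i + frame_size] for i in range(0, usable, frame_size)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(samples, codebook):
    """Replace each frame with the index of its nearest codebook entry."""
    frame_size = len(codebook[0])
    tokens = []
    for frame in split_into_frames(samples, frame_size):
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        tokens.append(best)
    return tokens


def detokenize(tokens, codebook):
    """Rebuild an approximate waveform by concatenating codebook entries."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


# Example: a short sine wave becomes a short sequence of integers.
wave = [math.sin(2 * math.pi * i / 16) for i in range(32)]
tokens = tokenize(wave, CODEBOOK)
print(tokens)
print(len(wave), "samples ->", len(tokens), "tokens")
```

In this toy version, the waveform ends up as a list of small integers, which is the kind of discrete sequence a language model can work with. How Vall-E actually predicts tokens for new lines of speech is beyond what this sketch shows.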
Does Vall-E work?
According to Microsoft, Vall-E was trained with another Meta resource: Libri-Light, a library with 60,000 hours of English speech from more than 7,000 people.
This led to interesting results. On the Vall-E demo page, you can check some of the tests. In most of them, the audio generated by the technology is remarkably similar to the voice in the original sample (labeled there as Speaker Prompt).
In some results, a certain artificiality is noticeable. In others, however, it is practically impossible to tell that the audio was generated by artificial intelligence.
Vall-E's trump card is that it not only “absorbs” the timbre of the voice in the sample but also replicates the emotional tone it detects.
The technology can even mimic the acoustic environment. For example, if the sample comes from a phone call, Vall-E can generate results that sound like a phone call too.
Uses for good and for bad
A technology like this can be useful in several applications. Imagine, for example, a dubbed film that preserves the original actor's voice, or an end-of-year message delivered in multiple languages by a company's CEO to employees around the world.
These possibilities, incidentally, were addressed in Tecnocast 268, which discusses negative and positive uses of deepfakes.
Speaking of negative uses, perhaps you've already thought about the possible malicious implications of Vall-E. Imagine if the technology were used to attribute false statements to a politician, for example.
Perhaps this is why, at least for now, Microsoft has not publicly released the source code for Vall-E.
Anticipating these risks, the project's researchers mention the possibility of developing a detection model to indicate whether an audio clip was generated by Vall-E. They also mention following Microsoft's Artificial Intelligence Principles when creating models.