Microsoft’s AI app VASA-1 makes faces in pictures talk and sing: How does it work? (2024)

The AI system has been developed by a team of researchers from Microsoft Research Asia. (Image: Microsoft)

Not long ago, some apps could bring photographs to life with GIF-like motion. Now, an AI system can make photographs talk and sing. A team of AI researchers at Microsoft Research Asia has created an application that converts still images of people, paired with audio tracks, into animation. And it is not merely animation: reportedly, the output shows the people in the images speaking or singing along to the audio track, complete with apt facial expressions.

The application, VASA, is a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS) from a single static image and a speech audio clip. “Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness,” wrote the researchers in a paper describing the framework.

According to the team, the core innovations are a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of an expressive, disentangled face latent space built from videos. Through extensive experiments and evaluations on a set of new metrics, the team says, their method significantly outperforms previous methods along various dimensions.



“Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors,” the researchers wrote.

What is VASA-1?

The researchers at Microsoft claim that their new method is not only capable of producing lip-audio synchronisation but can also create a large spectrum of expressive facial nuances and natural head movements. “It can handle arbitrary-length audio and stably output seamless talking face videos.”

The researchers working on VASA-1 embarked on the ambitious task of bringing static images to life, making them talk, sing, and express emotions in sync with any audio track. VASA-1 is the outcome of those efforts: the AI system transforms motionless visuals, be they photographs, drawings, or paintings, into synchronised animations. As for control, the researchers claim that their diffusion model can accept optional signals as conditions, such as main eye gaze direction, head distance, and emotion offsets.
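The paper does not describe a public API, but the idea of optional conditioning signals can be pictured as a small set of parameters supplied alongside the image and audio, where any signal left unset is generated freely by the model. The sketch below is purely illustrative: the names `ControlSignals` and `build_condition_vector`, and the exact signal representations, are assumptions for the sake of the example, not VASA-1’s actual interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ControlSignals:
    """Optional conditioning signals of the kind the VASA-1 paper describes.
    All fields are hypothetical representations chosen for illustration."""
    gaze_direction: Optional[Tuple[float, float]] = None  # main eye gaze, e.g. (yaw, pitch)
    head_distance: Optional[float] = None                 # apparent distance from the camera
    emotion_offset: Optional[str] = None                  # e.g. "happy", "neutral"

def build_condition_vector(signals: ControlSignals) -> dict:
    """Keep only the signals the caller supplied; unset ones are left
    for the generative model to decide on its own."""
    return {name: value for name, value in vars(signals).items() if value is not None}

# Example: constrain gaze and emotion, leave head distance free.
conds = build_condition_vector(
    ControlSignals(gaze_direction=(0.1, -0.2), emotion_offset="happy")
)
print(conds)  # {'gaze_direction': (0.1, -0.2), 'emotion_offset': 'happy'}
```

The point of the sketch is the optionality: conditioning a diffusion model on a partial set of signals lets users pin down some attributes of the output while the rest are sampled.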

In the research paper, the team showcases the capabilities of the VASA-1 system through a host of video clips. In one, a cartoon version of the Mona Lisa springs to life and breaks into a rap song, her expressions and lip movements aligning with the lyrics. Another example shows a photograph of a woman transformed into a singing performance. In a third, a drawn portrait of a man delivers a speech, his expressions shifting naturally to emphasise the spoken words.


How was VASA-1 created?

According to the research paper, the breakthrough behind VASA-1 came through an extensive training process in which the AI system was exposed to thousands of images portraying a wide range of facial expressions. This vast data set reportedly allowed the system to learn and accurately recreate the nuances of human emotions and speech patterns. The current iteration of VASA-1 generates high-resolution visuals at 512×512 pixels with a frame rate of 45 fps, making the output appear smooth. Rendering these realistic animations reportedly takes an average of two minutes, using the computational power of a desktop-grade Nvidia RTX 4090 GPU.


The research paper does not explicitly mention a release date, but it states that VASA-1 brings the team closer to a future where AI avatars can engage in natural interactions, suggesting it is a research prototype for now. Even though VASA-1’s potential use cases are wide-ranging, the researchers have acknowledged its potential for misuse and, as a proactive measure, have reportedly decided to withhold public access to the system. They acknowledge the need for responsible stewardship of such advanced technology to mitigate unintended consequences or exploitation.

Although these animations seamlessly combine visuals and audio with a lifelike charm, the researchers say that, on closer examination, one can notice subtle flaws and telltale signs typical of AI-generated content. Nevertheless, the shared examples showcase the technical excellence of the team behind VASA-1.

Author: Aron Pacocha