Speech2Face AI Generates Faces Using Only a Person's Voice


Are you ever surprised when people don't look the way they sound? Bobby Caldwell, Sam Smith, and Amy Winehouse are artists often placed in this category. Speech2Face aims to change the game with AI-powered face generation that uses only a person's voice.

The researchers describe the task in their paper: "We consider the task of reconstructing an image of a person's face from a short input audio segment of speech. We show several results of our method on the VoxCeleb dataset. Our model takes only an audio waveform as input (the true faces are shown just for reference). Note that our goal is not to reconstruct an accurate image of the person, but rather to recover characteristic physical features that are correlated with the input speech."
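The overall flow described above can be sketched in code: turn the raw waveform into a spectrogram, then map it to a fixed-length face-feature vector. This is a minimal illustrative sketch, not the authors' actual architecture; the frame sizes, the 4096-dimensional feature vector, and the random-weight "encoder" are all assumptions standing in for the trained network.

```python
# Toy sketch of a Speech2Face-style pipeline. NOT the paper's model:
# the layer shapes and the random "encoder" weights are placeholders.
import numpy as np

def spectrogram(waveform, frame_len=512, hop=256):
    """Split a 1-D waveform into overlapping windowed frames and take
    the magnitude of each frame's FFT (a minimal spectrogram)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack(
        [waveform[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    window = np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames * window, axis=1))

def voice_encoder(spec, feature_dim=4096, seed=0):
    """Stand-in for a trained voice encoder: maps a spectrogram to a
    fixed-length face-feature vector. The real model learns this mapping
    from millions of videos; here the weights are random placeholders."""
    rng = np.random.default_rng(seed)
    pooled = spec.mean(axis=0)  # average over time frames
    weights = rng.standard_normal((pooled.size, feature_dim)) * 0.01
    return pooled @ weights     # one face-feature vector per utterance

# One second of fake 16 kHz audio in place of a real recording.
waveform = np.random.default_rng(1).standard_normal(16000)
features = voice_encoder(spectrogram(waveform))
print(features.shape)  # prints (4096,)
```

In the paper's full pipeline this feature vector would then be passed to a separately trained face decoder that renders a canonical frontal face image; the sketch stops at the feature stage.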


How does Speech2Face work?

To understand how Speech2Face AI works, you need to know four key structures in the brain's auditory pathway: the acoustic nerve, the cochlear nucleus, the auditory cortex, and the prefrontal cortex.

The acoustic nerve, also known as the cochlear nerve, acts as a busy highway. It transmits electrical data from the inner ear to the brain stem, where the signals are then relayed to other parts of the brain. The cochlear nerve is primarily responsible for transmitting the electrical impulses generated for hearing and sound localization. Its first stop is the cochlear nucleus, the brain stem's initial relay station for auditory information.

In the auditory cortex, as with other primary sensory areas, sound reaches conscious perception only once it has been received and processed there. And lastly, the prefrontal cortex is frequently linked to executive functions. In general, it suppresses short-sighted, reflexive behaviors so that we can engage in planning, decision-making, problem-solving, self-control, and acting with long-term goals in mind.

Anatomy of the brain areas involved in auditory and visual stimulation

Why Does Knowing The Four Main Functions Matter?

These four structures, which link auditory processing to visual perception, give us insight into how Speech2Face AI is able to formulate facial features using only audio clips.

We've seen this technology in the works in recent years. Below is a video from two years ago showing the early development of this type of advancement.

Thus, the main difference between two years ago and today is that Speech2Face generates the image entirely on its own, drawing on the four structures described above and on the millions of internet videos used to train its AI.

In conclusion, is this AI a bad idea or a good one? For now, that's unclear. But it's safe to say that we are witnessing a great renaissance in the tech world, and it's exciting.