Tom Cruise hasn’t posted on social media since last August, but the Mission Impossible star is now taking TikTok by storm. Clad in a Hawaiian shirt, he addresses the camera in the most recent of three uncharacteristically intimate videos that cumulatively attracted over 11 million views in a matter of days.
“I’m gonna show you some magic,” Cruise intones, brandishing a silver coin. “It’s the real thing,” he laughs, rotating the coin between his fingers. With a simple sleight of hand, the coin disappears. Cruise reveals his empty palm and then motions to his face. “It’s all the real thing.”
Except it isn’t. Created by anonymous TikTok user @deeptomcruise, the viral video is a deepfake—a piece of synthetic media generated by AI. The real Tom Cruise was not involved in the creation of this clip, although millions of photos and videos of him were. In a process called image-to-image translation, these photos and videos helped neural networks learn how to map Cruise’s features onto the face of someone else.
Pointing to small details like the mole on Cruise’s cheek or the unique alignment of his front teeth, some commenters argued that the video couldn’t be fake. Others registered its artifice with shock, amusement, confusion, and concerns about identity theft—risky business, indeed. Now that the video’s authenticity has been thoroughly debunked online, curious viewers are left with one question—how exactly was it made?
What Is Image-to-Image Translation & How Does It Work?
Deepfakes are fabricated via a process called image-to-image translation, which is a subdomain of computer vision, says Zuraiz Uddin, a Springboard mentor and data scientist involved with computer vision applications at Teradata, a leading cloud data analytics platform company.
Uddin likens image-to-image translation to the human imagination. “You can look at a scene in the daytime and imagine how it will look in the nighttime,” Uddin explains. “You have images in one domain, and you try to map that domain onto the other domain.”
Image-to-image translation uses unsupervised machine learning algorithms called GANs to conduct this process of transformation. Want to know what a busy road filmed in daylight would look like after dark? GANs can translate the footage.
GANs, or generative adversarial networks, pit a generator model and a discriminator model against each other to produce synthetic data that can pass for the real thing. To start the process of image-to-image translation, the input is fed to the generator as training data. In the context of the Tom Cruise deepfakes, this input would have been millions of images of Tom Cruise.
Images, Uddin explains, are made up of pixels—once the generator learns how to accurately distribute these pixels from the input images, it can begin to create synthetic images that mimic the originals. Next, the discriminator classifies the generated images as artificial or authentic.
“The discriminator checks whether the image looks real or not,” Uddin says. “If the discriminator detects that… the image doesn’t look real, it throws it back.” The generator must then improve the image until the discriminator can’t tell whether it is real or fake.
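For readers who prefer notation, this back-and-forth is usually written as a single objective. The formula below is the standard minimax objective from the 2014 GAN paper, not something Uddin spelled out: D(x) is the discriminator's estimate that x is real, G(z) is the image the generator produces from random noise z, and the two expectations average over real images and noise samples respectively.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push this value up by scoring real images near 1 and generated ones near 0, while the generator tries to push it down by producing images the discriminator scores as real.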
The relationship between the generator and the discriminator is both adversarial and symbiotic, explains Springboard mentor Jeff Hevrin, a data scientist and machine learning engineer working in the insurance and financial services space.
The adversarial aspect of the relationship is obvious—the generator repeatedly attempts to fool the discriminator in a zero-sum game. But these competing models learn from each other, too.
The synthetic images that the discriminator catches and returns to the generator can actually inform the next training loop, Hevrin explains. This feedback helps the generator fabricate a more convincing image, and the two models continue to teach one another as the game continues.
“Then all of a sudden, you’re generating images that are pretty realistic,” Hevrin says. “And that’s where the whole deepfake, face swap stuff comes into play.”
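To make that feedback loop concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how the Tom Cruise deepfakes were built: the "images" are single numbers, the generator is a line (`scale * z + shift`), the discriminator is a one-variable logistic classifier, and all names (`Generator`, `Discriminator`, `train_step`) are hypothetical. The generator update uses the common "non-saturating" variant, in which the generator climbs log D(fake).

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


class Generator:
    """Turns random noise z into a synthetic sample: x = scale * z + shift."""

    def __init__(self):
        self.scale = 1.0
        self.shift = 0.0

    def sample(self, z):
        return self.scale * z + self.shift


class Discriminator:
    """Scores a sample with the estimated probability that it is real."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def prob_real(self, x):
        return sigmoid(self.w * x + self.b)


def train_step(gen, disc, real_batch, noise_batch, lr=0.05):
    """One round of the game: update the discriminator, then the generator."""
    fake_batch = [gen.sample(z) for z in noise_batch]

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake)).
    grad_w = grad_b = 0.0
    for x in real_batch:
        p = disc.prob_real(x)
        grad_w += (1.0 - p) * x
        grad_b += 1.0 - p
    for x in fake_batch:
        p = disc.prob_real(x)
        grad_w -= p * x
        grad_b -= p
    n = len(real_batch) + len(fake_batch)
    disc.w += lr * grad_w / n
    disc.b += lr * grad_b / n

    # Generator: gradient ascent on log D(fake). The discriminator's score
    # is the feedback that tells the generator which way to move.
    grad_scale = grad_shift = 0.0
    for z in noise_batch:
        p = disc.prob_real(gen.sample(z))
        feedback = (1.0 - p) * disc.w  # derivative of log D(x) w.r.t. x
        grad_scale += feedback * z
        grad_shift += feedback
    m = len(noise_batch)
    gen.scale += lr * grad_scale / m
    gen.shift += lr * grad_shift / m


if __name__ == "__main__":
    random.seed(0)
    gen, disc = Generator(), Discriminator()
    for _ in range(200):
        real = [random.gauss(4.0, 1.0) for _ in range(32)]   # "real" data
        noise = [random.gauss(0.0, 1.0) for _ in range(32)]  # generator input
        train_step(gen, disc, real, noise)
    print("generator shift after training:", round(gen.shift, 3))
```

Each call to `train_step` is one exchange in the game Hevrin describes: the discriminator sharpens its judgment on the current fakes, and the generator then uses that judgment to nudge its output toward the real data. Real GANs replace both toy models with deep networks, and training them is considerably less stable than this sketch suggests.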
Want to read more about image-to-image translation? Check out our post, From Internet Memes to Scientific Research: Creating a Novel Image-to-Image Translation Model
How Neural Networks Are Designed to Imitate the Human Brain
GANs are deep learning architectures powered by generative and discriminative neural networks, which are designed to imitate the human brain.
“A really simplistic neural network is basically, you have your inputs and then map them to all these individual neurons,” explains Hevrin. “A neural network just learns the right combination of those neurons to fire to give you the right targets.”
Deep learning comes into play, Hevrin continues, when a model consists of multiple layers of neural networks stacked on top of one another. The first layer of a deep learning network starts to learn small things, and learning increases with depth.
“Let’s say you’re trying to do image classification,” he explains. The first layer of the model might start to learn the points that make up an image. Subsequent layers start to learn curves and shapes.
“Then if you’re doing facial recognition, maybe as you go further down, it starts to learn different parts of your face. And so you get to the very end where you’re classifying the individual person in that particular image.”
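The layered structure Hevrin describes can be sketched in a few lines of plain Python. This is an illustration only: the function names (`dense`, `relu`, `forward`) and the hand-picked weights are hypothetical, and a real network would learn its weights from data rather than have them typed in.

```python
def relu(values):
    """A neuron 'fires' only when its input is positive."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through each layer in turn, stacking them into a deep model."""
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index < len(layers) - 1:  # no ReLU on the final output layer
            activation = relu(activation)
    return activation


# Hypothetical two-layer network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.1]),
]
output = forward([2.0, 1.0], layers)
```

Training, which is not shown here, is the process of adjusting those weights and biases until the outputs match the targets.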
Complex deep learning models require more processing power than their single-layer counterparts, Hevrin notes.
“When you start to add those layers, you need a lot more computing resources,” he explains. “It used to be that if you wanted to do this type of analytic work or build a face swap model, you’d have to go buy all this hardware.” But now, many deep neural networks are trained on the cloud.
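A rough way to see why extra layers demand more computing resources is to count the weights and biases a model has to store and update. The sketch below assumes fully connected layers, and the layer sizes are made up for illustration; `count_parameters` is a hypothetical helper, not a library function.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the number of units per layer, input first.
    Each layer needs (inputs + 1 bias) * outputs parameters.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


shallow = count_parameters([784, 10])          # single layer
deeper = count_parameters([784, 128, 64, 10])  # two hidden layers added
```

With these assumed sizes, the single-layer model has 7,850 parameters and the deeper one has 109,386, roughly fourteen times as many. Production image models are far larger still.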
“If you need an environment to start doing these things with a ton of resources, you can just go out to AWS and turn on a big server,” Hevrin elaborates. Vast troves of data can be stored on these infrastructures and retrieved as needed. Elasticity makes cloud-based solutions particularly convenient.
“If you need something big for a while, you turn it on,” Hevrin says. “When you’re done, you turn it off.”
Other options for deep learning have cropped up too. Edge computing, Hevrin explains, allows compressed models to run locally on small devices. As an example, Hevrin points to Snapchat, which offers augmented reality filters powered by neural networks.
“It’s pretty big,” Hevrin says of the Snapchat install, “so they’re probably including a lot of that filter logic on your device.” But that’s no easy feat. “I’m sure they’re battling to make sure it’s compressed and small enough that people don’t have to download a huge app and have it balloon out of control over time,” he continues. “There’s a lot of research going on to try to make these networks smaller.”
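One widely used way to shrink a network is quantization: storing each weight as a small integer instead of a 32-bit float. Whether Snapchat uses this particular technique is not stated here, so treat the sketch below as one illustrative option. The function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error (at most half of `scale` per weight). The engineering challenge Hevrin alludes to is keeping that error from visibly degrading the model's output.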
The Politics of a New Technology
GANs made their public debut in a 2014 paper published by researchers at the University of Montreal. At that time, Hevrin explains, most of the work around deep learning and neural networks focused on simpler tasks like object detection and classification.
“It was a pretty drastic step forward,” recalls Hevrin. “It was just a big change to have two networks that basically compete against each other. That was just a whole new concept.”
Potential applications for GANs were not immediately clear, he remembers.
“I think a lot of folks were trying to get their head wrapped around this concept and then trying to figure out use cases for it.”
Social apps were early adopters: Snapchat released a face swap feature in 2016 that gained enough traction to spur BuzzFeed roundups and Pinterest collections of “face swap fails.” By 2017, even Business Insider was reporting on the trend, which had become a meme of sorts.
In 2018, more nefarious applications of image-to-image translation came to light, first via a slew of deepfakes that merged the faces of female celebrities with the bodies of women performing in pornographic videos. Later that year, filmmaker Jordan Peele collaborated with BuzzFeed to create a deepfake video of former president Barack Obama to warn the public about fake news and deception.
Concern about the potential political implications of deepfakes swelled. In 2018, the U.S. Defense Advanced Research Projects Agency (DARPA) developed deepfake detection tools, and the National Defense Authorization Act of 2020 stipulated reporting on foreign weaponization of deepfakes. The act also established a competition to stimulate the development of deepfake detection technologies.
The National Defense Authorization Act of 2021 and the IOGAN Act have tapped the Pentagon, the Department of Defense, the Department of Homeland Security, and the National Science Foundation to monitor deepfakes. These bills request recommendations that could create predicates for federal regulation of synthetic media.
States including California, Virginia, Maryland, Texas, and New York have all put forth legislation concerning deepfake pornography in recent years, while Facebook pledged in 2020 to remove misleading deepfake videos.
According to Uddin, debate continues within the AI community about where deepfakes are headed.
“Deepfakes are about creating fake videos,” he says. “That doesn’t comply with the ethics of AI.”
Uddin is careful to illustrate the difference between deepfakes and the technology used to create them. Deepfakes have not proved useful or ethical, he explains, but image-to-image translation has other applications to offer.
The Future of Image-to-Image Translation
“It’s a very recent technology,” Uddin says of image-to-image translation. “It’s in the research phase.” Right now, he explains, the focus is on using image-to-image translation to develop other technologies—often in the field of computer vision—as opposed to creating standalone applications. This is a normal course of development for new technology, Uddin clarifies.
“When the research area becomes more mature,” he says, “it comes to the application development phase.”
Although the technology is still young, industrial use cases are emerging.
“Looking at a problem like low resolution to high resolution, that’s an industrial application,” Uddin explains. Image-to-image translation can transform a blurry, 240p YouTube clip into a high-resolution video.
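How such a model gets its training data is not covered here, but one common setup (an assumption in this sketch) is to start from sharp images, shrink them, and teach the model to reverse the shrinking. The helper names below are hypothetical, and the "image" is a tiny grid of grayscale values.

```python
def downsample(image, factor):
    """Shrink a grayscale image by averaging each factor-by-factor block."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    small = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                image[r][c]
                for r in range(top, top + factor)
                for c in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def make_training_pair(high_res, factor=2):
    """Return (low-resolution input, high-resolution target) for training."""
    return downsample(high_res, factor), high_res


sharp = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
blurry, target = make_training_pair(sharp)
```

A model trained on many such pairs learns to map the low-resolution version back toward the original, which is the translation Uddin describes.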
This use case could have broader impacts in fields like space exploration.
“Take the example of the landing of a spacecraft on Mars,” Uddin says. “It’s picking up the pictures of what Mars looks like, and it is quite possible that the pictures will be low resolution. So you use image-to-image translation to make a low resolution image into a high resolution image.”
Self-driving cars also rely on image-to-image translation. Instead of painstakingly gathering images of a road in every season and weather condition, Uddin explains, scientists can use image-to-image translation to map how that road will appear in different contexts. These images are used to help train self-driving cars to operate autonomously.
According to Uddin, industrial applications of image-to-image translation are moving towards enhanced image perception, which has applications not only in space and on highways, but also in doctors’ offices—specifically in the domain of medical imaging.
Medical imaging captures images of visceral organs and soft tissues via ultrasounds, CT scans, PET scans, and MRI scans. Because the images generated by each of these modalities offer different types of information, they may be hybridized to improve the accuracy of a diagnosis. Recent studies have shown that GANs can accurately translate an image from one modality into another, which could reduce the need for multimodal scanning. They have also improved the resolution of MRI scans, which can be tricky to produce accurately the first time around due to environmental and equipment-related limitations.
In these use cases, image-to-image translation has the potential to accelerate and enhance diagnoses in oncology, neurology, physical medicine and rehabilitation, and more. But it has yet to be broadly applied, Uddin explains, because tasks like cancer detection are mission-critical. As with self-driving cars, Uddin points out, new technologies must build trust over time before use is widely accepted in high-stakes scenarios. Nevertheless, he sees strides in the accuracy and efficiency of medical imaging on the near horizon.
“This is where image-to-image translation, I believe, can be a big revolution in the upcoming five to ten years.”
Does image-to-image translation intrigue you? Are you interested in machine learning? Check out Springboard’s Machine Learning Engineering Career Track to build your career in this challenging domain. The six-month program offers a project-driven curriculum with 1:1 mentoring and personal career coaching to help you acquire job-ready skills.