By Parmy Olson
Last week Microsoft Corp said it would stop selling software that guesses a person’s mood by looking at their face. The reason: It could be discriminatory. Computer vision software, which is used in self-driving cars and facial recognition, has long had issues with errors that come at the expense of women and people of color. Microsoft’s decision to halt the system entirely is one way of dealing with the problem. But there’s another, novel approach that tech firms are exploring: training AI on “synthetic” images to make it less biased.
The idea is a bit like training pilots. Instead of practicing in unpredictable, real-world conditions, most will spend hundreds of hours using flight simulators designed to cover a broad array of different scenarios they could experience in the air.
A similar approach is being taken to train AI, which relies on carefully labelled data to work properly. Until recently, the software used to recognise people has been trained on thousands or millions of images of real people, but that can be time-consuming, invasive, and neglectful of large swathes of the population.
Now many AI makers are using fake or “synthetic” images to train computers on a broader array of people, skin tones, ages and other features, essentially flipping the notion that fake data is bad. Used properly, synthetic data could not only make software more trustworthy but also transform the economics of data as the “new oil.”
In 2015, Simi Lindgren came up with the idea for a website called Yuty to sell beauty products for all skin types. She wanted to use AI to recommend skin care products by analysing selfies, but training a system to do that accurately was difficult. A popular database of 70,000 licensed faces from Flickr, for instance, wasn’t diverse or inclusive enough. It showed facial hair on men, but not on women, and she says there weren’t enough melanin-rich — that is, darker-skinned — women for a system trained on it to accurately detect their skin conditions, such as acne or fine lines.
She tried crowdsourcing and got just under 1,000 photos of faces from her network of friends and family. But even that wasn’t enough.
Lindgren’s team then decided to create their own data to plug the gap. The answer was generative adversarial networks, or GANs, a type of neural network designed in 2014 by Ian Goodfellow, an AI researcher now at Alphabet Inc’s DeepMind. A GAN pits two models against each other: one generates new faces while the other tries to spot the fakes, and each round of the contest makes the fakes more convincing. You can test your own ability to tell a fake face from a real one on a website set up by academics at the University of Washington, using a type of GAN.
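As a rough illustration of that adversarial loop, here is a minimal GAN sketch in PyTorch: a generator turns random noise into tiny fake images, a discriminator learns to flag them, and each network trains against the other. The layer sizes, image dimensions and training settings are illustrative assumptions, not a description of Yuty’s production system.

```python
# A minimal GAN training loop, as a sketch of the "fool the other network"
# idea described above. All sizes here are toy values for illustration.
import torch
import torch.nn as nn

LATENT = 64      # size of the random noise vector the generator starts from
IMG = 32 * 32    # a tiny 32x32 grayscale image, flattened

generator = nn.Sequential(
    nn.Linear(LATENT, 256), nn.ReLU(),
    nn.Linear(256, IMG), nn.Tanh(),      # outputs a fake image in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(IMG, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                   # one logit: real vs. fake
)

loss = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images: torch.Tensor):
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, LATENT))

    # 1) The discriminator learns to tell real images from the fakes.
    d_opt.zero_grad()
    d_loss = (loss(discriminator(real_images), torch.ones(batch, 1)) +
              loss(discriminator(fake_images.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    d_opt.step()

    # 2) The generator learns to fool the discriminator into calling
    #    its fakes real.
    g_opt.zero_grad()
    g_loss = loss(discriminator(fake_images), torch.ones(batch, 1))
    g_loss.backward()
    g_opt.step()

# One step on a batch of stand-in "real" images scaled to [-1, 1]:
train_step(torch.rand(16, IMG) * 2 - 1)
```

In a production face generator the two networks would be deep convolutional models trained on large image sets, but the tug-of-war between them is the same.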
Lindgren used the method to create hundreds of thousands of photorealistic images and says she ended up with “a balanced dataset of diverse people, with diverse skin tones and diverse concerns.”
Currently, about 80 per cent of the faces in Yuty’s database aren’t of real people but synthetic images, she says, which are labelled and checked by humans who help assess the platform’s growing accuracy.
Lindgren is not alone in her approach. More than 50 startups currently generate synthetic data as a service, according to StartUs Insights, a market intelligence firm. Microsoft has experimented with it, and Google is working with artificially generated medical histories to help predict insurance fraud. Amazon.com Inc said in January that it was using synthetic data to train Alexa, in part to sidestep privacy concerns.
Remember when Big Tech platforms found themselves in hot water a few years ago for hiring contractors to listen in on random customers to train their AI systems? ‘Fake’ data can help solve that issue. Facebook also acquired New York-based synthetic data startup A.I.Reverie in October last year.
The trend is becoming so pervasive that Gartner estimates 60 per cent of all data used to train AI will be synthetic by 2024, and it will completely overshadow real data for AI training by 2030.
The market for making synthetic images and videos is roughly divided into companies that use GANs, and those that design 3D graphics from scratch. Datagen Technologies, based in Tel Aviv, Israel, does the latter. Its CGI-style animations train car systems to detect sleepiness.
Carmakers have historically trained their sensors by filming actors pretending to fall asleep at the wheel, says Gil Elbaz, Datagen’s co-founder, but that still leads to a limited set of examples. The videos also have to be sent off to contractors in other countries to be labelled, which can take weeks.
Datagen instead creates thousands of animations of different types of people falling asleep at the wheel in different ways. Though the animations don’t look realistic to humans, Elbaz says their greater scale leads to more accurate sensors in the cars.
Fake data isn’t just being used to train vision recognition systems, but also predictive software, like the kinds banks use to decide who should get a loan. Fairgen Ltd, a startup also based in Tel Aviv, generates large tables of artificial identities, including names, genders, ethnicities, income levels and credit scores. “We’re creating artificial populations, making a parallel world where discrimination wouldn’t have happened,” says Samuel Cohen, CEO and co-founder of Fairgen. “From this world we can sample unlimited amounts of artificial individuals and use them as data.”
For example, to help design algorithms that distribute loans more fairly to minority groups, Fairgen makes databases of artificial people from minority groups with average credit scores that are closer to those from other groups. One bank in the UK is currently using Fairgen’s data to hone its loan software. Cohen says manipulating the data that algorithms are trained on can help with positive discrimination and “recalibrating society.”
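A toy sketch of that kind of rebalancing, in Python: synthetic rows are sampled so one group’s credit scores sit closer to another’s before a lending model ever sees the table. The column names, group labels and numbers here are invented for illustration; this is a guess at the flavour of the approach, not Fairgen’s actual method.

```python
# A toy example of rebalancing a biased table with synthetic rows.
# All columns and figures are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in "real" data: group B's scores skew lower because of
# historical bias in the records, not creditworthiness.
real = pd.DataFrame({
    "group": ["A"] * 800 + ["B"] * 200,
    "credit_score": np.concatenate([
        rng.normal(680, 40, 800),   # group A
        rng.normal(620, 40, 200),   # group B, depressed by biased history
    ]),
})

# Generate synthetic group-B rows whose mean is pulled toward group A's,
# then append them so a downstream model trains on a fairer population.
target_mean = real.loc[real.group == "A", "credit_score"].mean()
synthetic_b = pd.DataFrame({
    "group": "B",
    "credit_score": rng.normal(target_mean, 40, 600),
})
balanced = pd.concat([real, synthetic_b], ignore_index=True)
print(balanced.groupby("group")["credit_score"].agg(["mean", "count"]))
```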
Strange as it may sound, the growth of fake data is a step in the right direction, and not just because it avoids using people’s personal data. It could also disrupt the dynamics of selling data. Retailers, for instance, could generate extra revenue by selling synthetic data on customers’ purchasing behavior, according to Fernando Lucini, Accenture Plc’s global lead on data science and machine learning. “Business leaders need to have synthetic data on their radar,” he adds.
One caveat about unintended consequences, though: With so much artificial data driving our future systems, what are the risks that some of it will be used for fraud, or that it’ll be harder to find real identities amid the flood of fake ones?
Synthetic data won’t eliminate bias completely, though, says Julien Cornebise, an honorary associate professor of computer science at University College London.
“Bias is not only in the data. It’s in the people who develop these tools with their own cultural assumptions,” he says. “That’s the case for everything man-made.”