How Can Synthetic Data Solve the AI Bias Problem?

Before explaining how you can use synthetic data generation to resolve AI bias and its classification, we need to learn how it came to be. As artificial intelligence advances, questions and moral dilemmas begin to emerge about data science solutions. Since humans have distanced themselves from the process of decision-making, they need to know for sure that the decisions these systems make are in no way biased or discriminatory. AI has to be supervised.

Since AI generally represents a digital system based on predictive analytics that operates on big data, we can’t say it’s the AI that produces the potential bias. The problem begins much earlier, with the actual unsupervised data “fed” to the system.

Humans have been biased and discriminative throughout history. Our behavior doesn’t appear to be going away any time soon. Bias appeared in programs and algorithms that, unlike humans, seemed to be immune to this problem.

What Is AI Bias?

In data-related industries, bias occurs when your people gather data in a way that your sample doesn’t represent your population of interest accurately. You will be able to find many similar definitions that generally tell the same story. In practice, this means that people of particular backgrounds, religions, skin colors and genders are underrepresented in your data sample. This can lead to discriminatory decisions made by your system. It also brings up questions such as what is data science consulting and if it is important.

Bias in AI doesn’t mean that the AI system you’ve designed is intentionally biased towards certain groups of people. The purpose of AI was to allow people to explain what they want using examples rather than instructions. So, if AI is biased, it can only occur due to biased data! AI decision-making is an idealized process working in the real world, and it can’t hide human faults. It would also be beneficial to include supervised learning.

“Bias doesn’t come from AI algorithms; it comes from people.” - Cassie Kozyrkov

How Does It Occur?

As we’ve already mentioned, the bias problem appears because of data that can include human decisions based on stereotypes, suitable for a positive algorithm outcome. So how to train these algorithms?


Numerous examples of AI bias exist in real life. Google’s hate speech detection algorithm discriminated against people of color and known drag queens. Amazon’s HR algorithms were fed with mostly male employee data for ten years, which later caused that female candidates were more likely to be deemed suitable for a job in Amazon.

Data scientists at MIT found that facial recognition algorithms had higher rates of error when they analyzed faces of minorities (especially minority women), probably due to being predominantly fed with white male faces during the algorithm training. Take a look at the actual results below. Even without the newly formed problem of facial recognition with face masks, algorithms already struggle to recognize faces of particular groups of people, or anyone who isn’t a white male.

The problem is evident, and it needs to be dealt with somehow. If such problems occur to big data companies, the leaders in everything related to neural networks, data science, and AI, imagine other, smaller learning models. Synthetic data for machines represents a potential step forward, so let’s see what it’s all about.

What Is Synthetic Data and How It’s Generated?

Synthetic data is data that is generated partially or completely artificially, rather than being measured or extracted from real-world events or phenomena. Despite being artificial, it can help with plenty of scenarios, such as:

  • real data is scarce and insufficient
  • new data mining is difficult or expensive
  • actual data is biased and unfair towards certain groups of people

You can generate it in three ways:

  • fully synthetic: all data is computer-generated
  • partially synthetic: you can replace real data with generated data
  • corrected: real data is modified to accommodate a specific requirement

The means of synthesized data generation can be using deep learning models, machine learning, data science methods, or any commercial synthetic data generation tools available. Yes, there are synthetic data companies where data scientists work together on generating synthetic data for various businesses that need it.

It’s clear to see how synthetic AI data generation can help reduce AI bias just by generating artificial data or modifying the original data before it enters the AI or machine learning algorithms.

How Can Synthetic Data Solve the AI Bias Problem?

An ideal world should contain no bias, and people would have equal opportunities no matter their race, gender, religion, or sexual orientation. However, the real world is full of it - and people who deviate from the majority in a particular area have more issues getting a job and education, which makes them underrepresented in many datasets.

It can lead to false conclusions that these people are less competent and less suitable for being in these datasets and less suitable for being awarded a positive score, depending on the AI system’s purpose.

“[Machine-learning algorithms] haven’t been optimized for any definition of fairness,” says Deirdre Mulligan, associate professor, UC Berkeley School of Information. “They have been optimized to do a task.”

However, synthetic AI data can represent a huge step forward in the direction of unbiased AI. Here are some ideas behind it:

  • We can analyze the real-world data and observe the bias. Then we can generate data based on real-world data and observed bias. If you want the perfect dummy data generator, you’ll need to provide a fairness definition that will attempt to transform data (biased) into something that can be deemed fair.
  • If the dataset is not diverse or large enough, AI-generated data can fill in the holes and form an unbiased dataset. Even if your sample size is large, certain people might have been excluded or unequally represented compared to others. Synthetic data needs to be able to resolve this issue.

Generating unbiased data can be easier than expensive data mining. Gathering actual data requires measuring, interviewing, a large sample size, and, in any case, plenty of work. AI-generated data is cheap, and it only requires data science/machine learning techniques.

Facial Recognition

Let’s take facial recognition as an example. “If the training sets aren’t that diverse, any face that deviates too much from the established norm will be harder to detect, which is what was happening to me,” says Joy Buolamwini, an MIT researcher. “Training sets don’t just materialize out of nowhere. We can create them. So there’s an opportunity to create full-spectrum training sets that reflect a richer portrait of humanity.”

Feeding an AI algorithm face data that consists of 80% Caucasian faces and 20% faces belonging to people of color, it’s logical to predict that the algorithm will perform better with white people’s faces.

However, if you generate enough synthetic AI data representing artificial faces of people of color and feed it to the AI algorithm along with white faces, you can expect the error rate to be the same for all races. So, if you ever consider developing a face detection program, make sure it works no matter the person’s race or gender.

Also, generating synthetic faces brings the advantage of not bringing any privacy issues into play. In a day and age where the correlation between AI and data privacy concerns many Internet users, generating synthetic data of your own (faces, in this case) is a much safer option.

Final Words

AI bias is a real problem, and it can affect people in various ways. They may not receive an invitation for a job interview, banks will decline their loan application, or even become candidates for surveillance by law forces. Stakes are high, and the whole point of AI was eliminating subjective mistakes made by humans.

However, it’s not the algorithm that’s biased: it’s our data! The human factor can be tough to diminish and, if we feel the need to eliminate it from our AI systems, we would need to remove the bias from our data before it even enters the algorithm. A proposed solution is synthetic data generation.

Generating synthetic data can be complete or partial. It’s a unique way to make sure we treat everyone equally and that everyone’s “weight coefficient” is the same. The algorithm thinks if a male occupied a particular role for the last ten years, it means that males are better at it. It does not know the cruel and harsh reality of the real world. Exterminating this reasoning from an algorithm is difficult in the computer-science world - but it is a must if we want to take a step forward into a more equal and just society.


Michael Yurushkin

Founder of BroutonLab, PhD