Editorial
Synthetic Data
April 8, 2022

Set to take the artificial intelligence and machine learning (AI/ML) industry by storm, synthetic data is touted by some as the solution to data availability we’ve all been waiting for. But does it deliver everything it promises?

What is synthetic data?

Put simply, it’s data that’s created in an artificial world instead of the real one. It’s been heralded as the answer to the large data requirements of AI/ML, and synthetic data startups are attracting huge investments. Machine learning requires vast amounts of data to train models. Companies like Facebook, Google and Apple have great swathes of data at their fingertips. But for small to midsize companies, the lack of large datasets is a barrier to advancing their AI agendas and strategies. An IBM report states that in Egypt, executives say data availability is the biggest obstacle stopping them from making their AI dreams a reality. This is where synthetic data comes in. It promises accuracy for the model and, best of all, it’s accessible, delivering data at a fraction of the cost. Without a doubt, synthetic data offers a lot to companies that are looking to integrate machine learning models into their business.

How to generate synthetic data

Synthetic datasets are created using digital assets. For our initial research we generated everything in Unity, a game and rendering engine we chose for its accessibility and community support. Other great options include Unreal Engine and Blender. We took dashcams as our use case, with the constraint of using only synthetic data to create a person detection model. Using this game development software, we were able to create lifelike scenes representative of the scenarios we needed. We created 3D models of vehicles, animated characters and objects, then set up our scenes using these digital representations. Finally, we captured images of those scenes, which became the training data that teaches the model what to identify.

Why use synthetic data? 

Affordability isn’t the only reason to opt for synthetic data. There’s a myriad of other factors that make it a suitable choice. For instance, it can help reduce human and environmental biases. Take our first research scenario: if a company were to rely solely on real-world data for its dashcam model, its dataset would likely be unbalanced. The demographics of a particular area would skew the data in one direction or another. This means the company might end up with a model that could easily recognise people of one ethnicity but struggle with others. Using a program like Unity, you can ensure an even and diverse distribution of different types of people, so the model is far more likely to perform consistently across them. You won’t be limited by your physical location; instead, you’ll have full control over your scene.
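
As a rough illustration of the "even distribution" idea, the sketch below builds a list of character configurations in which every combination of attribute values appears the same number of times. The attribute names and values are hypothetical placeholders, not the ones used in the research described here; in practice a list like this would decide which characters get placed in each rendered scene.

```python
import itertools
import random
from collections import Counter

# Hypothetical attribute pools; a real project would define its own.
ATTRIBUTES = {
    "skin_tone": ["tone_1", "tone_2", "tone_3", "tone_4"],
    "age_group": ["child", "adult", "senior"],
    "clothing": ["light", "dark"],
}


def balanced_characters(per_combination, seed=0):
    """Return character configs with every attribute combination equally represented."""
    keys = list(ATTRIBUTES)
    combos = itertools.product(*(ATTRIBUTES[k] for k in keys))
    characters = [
        dict(zip(keys, combo))
        for combo in combos
        for _ in range(per_combination)
    ]
    # Shuffle so that scenes built in order do not cluster similar characters.
    random.Random(seed).shuffle(characters)
    return characters


characters = balanced_characters(per_combination=5)
counts = Counter(c["skin_tone"] for c in characters)
print(len(characters), dict(counts))
```

Enumerating every combination, rather than sampling at random, guarantees the balance exactly instead of only on average, which matters when the number of rendered images is small.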

This equally applies to different climates and environments. As I mentioned earlier, the very largest companies that operate worldwide are lucky enough to have access to a huge amount of their own data. Google, for instance, has photographed much of the globe. They might, therefore, be able to create a balanced real-world dataset of all the possible environments and terrains a vehicle could find itself in. Given that almost every other company that wants to integrate AI solutions isn’t in this position, synthetic data offers a unique way to rebalance a biased dataset, or even to create a balanced dataset from scratch.

Labelling is another real draw of synthetic data. In computer vision, all data collected for training must be labelled correctly. For the machine to learn what exactly it sees, every object of interest in the image has to go through this process. When using real-world datasets, this can be time-consuming and expensive. However, one big benefit of rendering from a gaming platform is that everything is already labelled, because the engine knows where every object in the scene is. So there’s no need to send the images to a labelling company to be annotated by hand, a step that adds cost and delay.
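
To show why labels come almost for free in a rendered scene, here is a minimal sketch of the underlying geometry: when an object's 3D position and size are known, its 2D bounding box can be computed by projecting the box corners through a simple pinhole camera. The focal length, image size and object dimensions are assumed values for illustration only, and the function names are hypothetical; this is not the labelling code of any particular engine.

```python
from itertools import product


def project_point(point, focal_px, cx, cy):
    """Project a 3D point in camera coordinates (z forward) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def bounding_box_2d(center, size, focal_px=800.0, image_size=(1280, 720)):
    """Derive a 2D box label from an object's axis-aligned 3D box."""
    width, height = image_size
    cx, cy = width / 2, height / 2
    half = [s / 2 for s in size]
    # The eight corners of the 3D box around the object's centre.
    corners = [
        tuple(c + sign * h for c, sign, h in zip(center, signs, half))
        for signs in product((-1, 1), repeat=3)
    ]
    pixels = [project_point(p, focal_px, cx, cy) for p in corners]
    xs, ys = zip(*pixels)
    # Clamp to the image so partially visible objects still get a valid label.
    x_min, x_max = max(0.0, min(xs)), min(float(width), max(xs))
    y_min, y_max = max(0.0, min(ys)), min(float(height), max(ys))
    return (x_min, y_min, x_max, y_max)


# A pedestrian-sized box 10 m in front of the camera (hypothetical values).
label = {"class": "person", "bbox": bounding_box_2d((0.0, 0.0, 10.0), (0.6, 1.8, 0.4))}
print(label)
```

No human ever has to draw the box: the same scene description that produced the image also produces the annotation.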

The game and rendering engine we used allowed us a lot of flexibility. In post-production we were able to change the camera type used. This meant we could teach the model to recognise our designated objects through a wide variety of lenses, including a fish-eye lens and a grainy camera. Training your model with only one camera type could lead to trouble down the road: if the camera changed, your model would likely lose accuracy. In Unity we were able to play around with many different settings and adapt to a variety of potential lens types. We were even able to mimic the car being underwater and other more unusual scenarios. In the long term this would help your model become camera agnostic.
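
The camera effects mentioned above were applied inside the engine, but the same idea can be approximated on already-rendered images. The sketch below is one hedged way to do that in plain Python on a small grayscale image: Gaussian noise stands in for a grainy sensor, and a one-parameter radial distortion stands in for a fish-eye lens. The function names and parameter values are assumptions for illustration, not settings taken from our Unity project.

```python
import random


def add_grain(image, sigma=12.0, seed=0):
    """Add Gaussian noise to a grayscale image (rows of 0-255 ints)."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


def barrel_distort(image, k=0.3):
    """Approximate a fish-eye look with a one-parameter radial distortion.

    Expects an image of at least 2x2 pixels.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Normalised offset from the image centre, roughly in [-1, 1].
            ny, nx = (y - cy) / cy, (x - cx) / cx
            scale = 1 + k * (nx * nx + ny * ny)
            # Sample the source further out, which squeezes content inwards.
            sy, sx = round(cy + ny * scale * cy), round(cx + nx * scale * cx)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]  # nearest-neighbour lookup
    return out


# A tiny 9x9 horizontal gradient as a stand-in for a rendered frame.
image = [[x * 30 for x in range(9)] for _ in range(9)]
grainy = add_grain(image)
distorted = barrel_distort(image)
```

Pixels whose source falls outside the frame are left black, which mimics the dark corners often seen in fish-eye footage. Training on a mix of such variants is one way to push a model towards being camera agnostic.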

Being able to mimic an underwater setting leads me to what seems to be the most promising aspect of synthetic data. Edge cases kept coming up in our research where we could see that it would be impossible, or highly unethical, to collect real-world data. For instance, let’s say we wanted to teach the model to identify, and send a notification, when a baby has been left inside a car. There are no circumstances in which it would be acceptable to collect that data, except synthetically inside the gaming engine. These edge cases seem to me to be where the biggest value of synthetic datasets lies. In almost every industry, there will be situations in which it is just not safe, possible or feasible to collect certain data. Whether you’re creating a dataset from scratch or complementing existing real-world data, there’s real value to be found here on the edges.

Synthetic Data vs Real Data

Synthetic data has often been posited as being in direct competition with real-world data. We don’t believe that is the case. It occupies its own space within the world of datasets and collection. We would discourage seeing synthetic data as a means to replace real-world data entirely, and focus instead on the mutually beneficial relationship between the two. Some of the unique offerings of synthetic data include increased anonymisation, efficient labelling and filling in the gaps left by edge cases. During our research we learned that combining real-world and synthetic data is key to mitigating the problems that synthetic data alone could introduce. These different types of data are complementary, not competitive. Balancing them and having them work together can achieve incredible results.
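
One simple way to combine the two sources is to fix the share of synthetic samples in each training set. The sketch below assumes the datasets are plain lists of file names; the function name, the 70% ratio and the file names are hypothetical, and the right balance for a given project would have to be found by experiment.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.5, size=None, seed=0):
    """Build a training list with a chosen share of synthetic samples."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    size = size if size is not None else len(real) + len(synthetic)
    n_synth = round(size * synthetic_fraction)
    n_real = size - n_synth
    # Sample with replacement so a small real dataset can still fill its share.
    mixed = rng.choices(synthetic, k=n_synth) + rng.choices(real, k=n_real)
    rng.shuffle(mixed)
    return mixed


# Hypothetical file lists: a small real dataset and a larger synthetic one.
real = [f"real_{i}.png" for i in range(20)]
synthetic = [f"synth_{i}.png" for i in range(200)]
mixed = mix_datasets(real, synthetic, synthetic_fraction=0.7, size=100)
print(sum(name.startswith("synth") for name in mixed), "synthetic of", len(mixed))
```

Keeping the ratio as an explicit parameter makes it easy to compare models trained on different blends of real and synthetic data.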

By the end of our initial research period, the model we had created could easily identify all the objects we wanted it to when fed synthetic scenarios. In the scheme of things, we gave it only a relatively small sample of images and it did really well. We are confident that, with many more synthetic images, the models we train can succeed in the field. At Playground we believe in ethical data collection and we go above and beyond to capture all the data points our clients need. We’re excited to have synthetic data as a tool in our arsenal for this work going forward.