How to Create Synthetic Data to Train Deep Learning Algorithms?

How do you use deep learning if you lack the data? It's a tricky task: training a computer algorithm when you don't have any data to train it on. Some would say it's impossible. But at a time when data is so sensitive, it's a common hurdle for businesses to face. Imagine you needed to monitor your database for identity theft, say, by using personal information that, for legal reasons, you cannot share.

Historically, you would have needed to generate inputs manually for any hope of finding a workable solution. These days, with a little ingenuity, you can automate the task.

You can create synthetic data that behaves just like real data, which allows you to train a deep learning algorithm to solve your business problem while leaving the privacy of your sensitive data intact.

Read on to learn how to use deep learning in the absence of real data.

What is deep learning?

First, let’s (briefly) tackle an important question: What is deep learning?

Deep learning is a form of machine learning. It’s a technique that teaches computers to do what people do – that is, to learn by example.

In deep learning, a computer algorithm uses images, text, or sound to learn to perform a set of classification tasks. And deep learning models can often match, or even exceed, human-level accuracy, which is why the technique is in such high demand.

  • Driverless cars use deep learning to identify road signs, for example, or to tell the difference between a pedestrian and a lamp post;
  • Voice control in phones, TVs, and Alexa devices relies on deep learning to interpret spoken commands and refine its responses.

Given that deep learning enables so many groundbreaking features, it's little wonder the technique has become so popular. However, computer algorithms require a vast set of labeled data to learn any task, which raises the question:

What can you do if you cannot use real information to train your algorithm?

The answer? Use the next best thing.

See also: Everything You Need to Know About Key Differences Between AI, Data Science, Machine Learning and Big Data

Synthetic data

At DLabs.AI, we're working with a client who needs to detect logos in images. Yet they don't have a dataset to train the deep learning algorithm, so we're creating fake – or synthetic – data for them.

To do this, we follow a basic method, sketched in code just after this list:

  • Select a random image without a logo
  • Embed a logo into the image background
  • Repeat to create a synthetic dataset
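
To make that loop concrete, here's a minimal Python sketch using the Pillow library. The file paths, scale range, and (x, y, width, height) label format are placeholders for illustration, not our production pipeline:

```python
import random
from pathlib import Path
from PIL import Image  # third-party: Pillow


def embed_logo(background_path, logo_path, out_path):
    """Paste a logo onto a background at a random position and scale.

    Returns the (x, y, width, height) of the pasted logo, which doubles
    as the training label, so no manual annotation is needed.
    """
    background = Image.open(background_path).convert("RGB")
    logo = Image.open(logo_path).convert("RGBA")

    # Scale the logo to a random fraction of the background width.
    scale = random.uniform(0.1, 0.3)
    new_w = int(background.width * scale)
    new_h = max(1, int(logo.height * new_w / logo.width))
    logo = logo.resize((new_w, new_h))

    # Pick a random position that keeps the logo inside the image.
    x = random.randint(0, max(0, background.width - new_w))
    y = random.randint(0, max(0, background.height - new_h))

    # Use the logo's alpha channel as the mask so transparency is kept.
    background.paste(logo, (x, y), logo)
    background.save(out_path)
    return (x, y, new_w, new_h)


# Repeat over many background images to build the synthetic dataset.
labels = {}
for i, bg in enumerate(Path("backgrounds").glob("*.jpg")):
    name = f"synthetic_{i}.jpg"
    labels[name] = embed_logo(bg, "logo.png", name)
```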

The approach lets us create thousands of separate images, even though we’re only using one logo. And with the image library to hand, we can program a neural network to carry out the object detection task.

That is – we can teach the computer how to recognize the logo in the image.

In essence, we’re building a logo detection model without real data. And while we don’t claim to be the first company in the world to develop a logo detection solution, we are among the first to use synthetic data to train a deep learning algorithm.

Now, we’re exploring how else clients could use the method – one idea we’ve had is for header detection.

Say you want to auto-detect headers in a document. DLabs.AI could generate fake data from standard HTML files, referencing the labels within the HTML structure to create training images with the headers already identified.

Hey, presto – a header detection algorithm in training.
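
As a rough, hypothetical sketch of where those labels could come from, the snippet below uses the BeautifulSoup library to pull header tags (h1 to h6) out of an HTML file. Rendering each page to a training image, for instance with a headless browser, would be a separate step not shown here, and the file name and label format are invented for the example:

```python
from bs4 import BeautifulSoup  # third-party: beautifulsoup4


def extract_header_labels(html_path):
    """Collect header text from an HTML file, labeled by heading level.

    The HTML structure itself supplies the labels, so each training
    image rendered from the page could inherit them for free.
    """
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    labels = []
    for level in range(1, 7):
        for tag in soup.find_all(f"h{level}"):
            labels.append({"level": level, "text": tag.get_text(strip=True)})
    return labels


print(extract_header_labels("sample_page.html"))
```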

See also: Why You Don’t Have As Much Data As You Think. And 3 Ways To Fix It

Pros and cons of synthetic data

There are several reasons beyond privacy why real data may not be an option. The most obvious?

Limited resources. If a company wants to train an algorithm on real images, someone has to manually label the key elements (in our example, the logo), and that quickly gets expensive.

So, by automating the creation of synthetic data, you get three clear benefits.

Benefits of synthetic data

Cheaper

In the DLabs.AI example, as we embedded the logo ourselves, we knew the precise position of the logo on every image – so we could label it automatically.

By generating synthetic data, we instantly saved on labor costs.
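
Continuing the earlier sketch, the positions we already know can simply be written straight to a label file, with no annotator in the loop. The CSV columns below are illustrative; a real pipeline would use whatever annotation format the training framework expects:

```python
import csv


def write_labels(labels, csv_path="synthetic_labels.csv"):
    """Write the auto-generated bounding boxes to a simple CSV label file.

    `labels` maps image file names to the (x, y, width, height) tuples
    recorded when each logo was embedded.
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "x", "y", "width", "height", "class"])
        for filename, (x, y, w, h) in labels.items():
            writer.writerow([filename, x, y, w, h, "logo"])
```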

Quicker

Plus, once we had created our first data point, it didn’t take long to duplicate the record to create a catalog of thousands of correctly-labeled images.

Dynamic

Moreover, when you train a model on synthetic data, then deploy it to production to analyze real data, you can use the production data (in our client's case, real imagery) to continually improve the performance of the deep learning model.

Drawbacks of synthetic data

Synthetic data does have its drawbacks; the most difficult to mitigate being authenticity.

That is – creating synthetic imagery that still looks realistic.

If we had a picture of a room, for example, we had to scale the logo to fit the perspective of its surroundings (the walls, the floor, the table, etc.). Further, we had to check that the logo sat on a single object rather than across the intersection of two items.

We also had to simulate changing light conditions while checking a human could recognize the logo once embedded. The sheer number of variables made it tricky to place the logo naturally within the context – an essential element to train a deep learning algorithm accurately.
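
The lighting part, at least, is straightforward to sketch. The snippet below randomly varies brightness and contrast with Pillow's ImageEnhance module; matching the logo's scale and perspective to the scene needs more involved, scene-specific logic that we don't show here:

```python
import random
from PIL import Image, ImageEnhance  # third-party: Pillow


def simulate_lighting(image_path, out_path):
    """Randomly vary brightness and contrast to mimic changing light."""
    image = Image.open(image_path).convert("RGB")
    image = ImageEnhance.Brightness(image).enhance(random.uniform(0.6, 1.4))
    image = ImageEnhance.Contrast(image).enhance(random.uniform(0.8, 1.2))
    image.save(out_path)
```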

3 steps to know if deep learning can help your business

Clients contact us every week to ask, "Can deep learning help my business?", but then feel overwhelmed by the apparent complexity of the technique.

To keep things as simple as possible, we approach the question in three steps.

1. First, we develop a Research & Development Outline.

We investigate the kinds of products or algorithms that we could use to solve your problem. We review the latest scientific research on the subject to see if we can use any particular findings – or if there is an open-source implementation we can adapt to your case.

2. Then, we confirm if there’s business value.

We outline an integration model to confirm we can deliver the expected value. By this stage, both parties should have a rough idea of what’s to come, so we avoid nasty surprises down the line – like a client with a solution she doesn’t actually want.

It’s an agile approach that gives the client time to think, and us time to uncover any hidden needs before tackling the bigger picture.

3. Only once we’re aligned on the outcome, do we train the algorithm.

This is where it gets technical.

For those interested in our client case study, we used region-based convolutional neural networks and TensorFlow's Object Detection API (a Google-built repository of state-of-the-art object detection networks).

The models were pre-trained on the Microsoft COCO dataset before we fine-tuned them on our own synthetic data.
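
For readers who want a feel for that workflow, here's an illustrative training launch with the TensorFlow Object Detection API; it isn't the client's exact setup. It assumes a pipeline config that sets num_classes to 1 (the logo), points fine_tune_checkpoint at a COCO-pre-trained Faster R-CNN checkpoint from the model zoo, and reads TFRecords built from the synthetic images and their auto-generated boxes. All paths are placeholders:

```python
import subprocess

# Illustrative only: a typical fine-tuning run using the TensorFlow
# Object Detection API's training script. The pipeline config (path is
# a placeholder) defines the model, the COCO checkpoint to start from,
# and the synthetic-data TFRecords to train on.
subprocess.run(
    [
        "python", "object_detection/model_main_tf2.py",
        "--pipeline_config_path=training/faster_rcnn_logo.config",
        "--model_dir=training/checkpoints",
        "--alsologtostderr",
    ],
    check=True,
)
```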

Connect With DLabs.AI

Artificial intelligence is changing the world as we know it, with businesses in every sector achieving the seemingly impossible.

So ask yourself: "Can deep learning solve my problem, too?"

With the development of DLabs’ synthetic approach, data is never the limit. If you’re interested in deep learning – now is the time to get in touch.


Read more on our blog