Machine learning needs a vast amount of data. So the first question we ask clients is: do you have enough? You may answer ‘Yes,’ but you probably don’t have as much as you think. How can we be so sure? And how can you get more and achieve the best results? Find the answers you’re looking for in the following article.
Let’s start with an example.
It’s always easier to grasp a concept through a real-life example, so let’s start there.
Imagine you’re organizing a party. It’s an important event, and you want to hire a photographer to capture it. You ask them to take ‘lots of photos’ because you don’t want to miss a moment: you tell them to ‘photograph it all.’
The photographer follows your instructions. They get paid, and you get a hard drive full of pictures.
One day, you decide to cherry-pick a few to create an album of the event. You sit at your desk, excited to have so many to choose from, until you open the first picture: disappointment strikes.
The quality isn’t as good as you hoped. The picture is blurred and dark. You can’t make anything out at all, but at first, you think, ‘Maybe there’s been a mistake. Has the photographer accidentally uploaded this photo…?’ Unfortunately not.
Each subsequent picture is the same. You continue scrolling, no improvement. Annoyance builds, then you find one gem: the perfect shot. But your happiness is short-lived. Back to scrolling, back to dire imagery — and it’s only getting worse.
You lose hours trawling through the collection and find fewer than a handful of photos worth developing. There will be no album. You’ve wasted thousands on an unprofessional service, and what’s worse, you should never have accepted these photos in the first place.
That’s time and money down the drain.
Now, step back: What do you think caused this problem? And was there anything you could have done to avoid it?
The answer to the second question is, perhaps.
As to the first, well: the photographer was given a poorly defined task at the outset. They were just told to ‘take a lot of pictures’ — nobody said the pictures ‘must be of great quality.’
It’s assumed, yes — but if you don’t adequately define what you need, there’s always a risk of not getting what you want.
Fine… but how does this relate to machine learning?
Well, building machine learning — or any software that relies on data — is not much different from the example above: how you define a task matters, particularly if you want the right quality results.
So what can you do to avoid a repeat? Focus on quality over quantity.
Useful data is high-quality data.
As was the case with your photographer, merely generating a lot of data rarely satisfies anyone’s requirements. In fact, focusing purely on quantity often means most of the data that results is useless.
What’s important is the quality of the dataset, as it’s quality that determines the performance of AI software. If your input is low-quality, your results will never meet expectations.
In the case of machine learning, in particular, quality over quantity is key.
4 steps to get good-quality data for your AI software.
First, let’s look at how you get the right quality data.
There are four steps, and if you follow each one in sequence, your machine learning software will give you the results you want.
1. Specify your business goal
This is the single most important aspect of every AI project. Think about what you want to achieve and why. Then explain it in clear, simple language to the team responsible for the build.
Make life as easy as possible: specify one primary goal — supported by how AI will help your company achieve it.
2. Find out what data you need
Next, be specific about what data you need to create a solution that matches your expectations.
This is key because if you repeat the mistake of ‘asking for lots of photos,’ you’ll get the wrong type of data. Whereas if you carefully study the problem you want to solve, you’ll get a dataset that fits your purpose.
This means looking beyond quantity and focusing squarely on the data that provides the most relevant information.
Remember: collecting every last bit of information is not the same as collecting valuable information. A useful dataset contains the precise details you need to solve your problem.
3. Clean up your data
Now that you know your goal and have identified the data you need, it’s time to eliminate all the ‘rubbish’ that could cloud your dataset.
Clear any incoherent information. Make sure everything is as accurate as possible. And try to avoid general, misleading, or low-quality information. Instead, focus on details that a machine can interpret and analyze.
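To make the idea concrete, here’s a minimal sketch of what ‘cleaning’ can mean in practice, using Python and pandas. The column names, sample values, and age threshold are all hypothetical, purely for illustration:

```python
import pandas as pd

# A tiny, made-up raw dataset with typical quality problems:
# an exact duplicate row, missing values, and an implausible age.
raw = pd.DataFrame({
    "age":    [34, 34, -5, 41, None, 29],
    "income": [52000, 52000, 48000, None, 61000, 39000],
})

clean = (
    raw
    .drop_duplicates()            # remove exact duplicate rows
    .dropna()                     # drop rows with missing values
    .query("0 <= age <= 120")     # discard implausible ages
    .reset_index(drop=True)
)

print(len(raw), "raw rows ->", len(clean), "clean rows")
```

Real projects layer domain-specific rules on top of generic checks like these, which is exactly where step 4 comes in.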
Do not be fooled: this is a very demanding task. It isn’t easy to do without the requisite knowledge and experience — which is why you should always continue to step 4.
4. Work with domain experts
Data scientists can help you clean up your data. Other experts can help you get the rest right.
For example, if you don’t know:
- What data you need to hit your business goal
- How to save or store your data
- How to organize and prepare your datasets for projects
- How to prove whether your data is of suitable quality
If any of these sound familiar, a domain expert can fill the gaps, validate your assumptions, and keep your project on track.
If you don’t have enough data, here’s what to do.
When the four steps above don’t yield a big enough dataset, all is not lost. These next three steps can get you the volume your project needs.
1. Consider if there’s a hidden dataset
If you don’t have enough data, you may have missed a hidden resource. Consult with a team of data scientists and ask them if there could be a relevant source of information that you haven’t yet thought of.
2. Consider simplifying your goal
When you first set out on your mission, you may have set the bar too high. Your goal may be overly ambitious or overly complex, requiring ultra-detailed or highly accurate data that you don’t have.
Still, the data you have could be enough to start something smaller. And if this is your first AI project, starting smaller is often better: you can expand the scope in the future, which improves your chances of long-term success.
3. Consider using synthetic data
There’s more than one way to collect data. An often-ignored route is to generate synthetic data.
The synthetic approach works best when you have a base of good-quality data you can apply to an initial solution, which you can then use to build a real-world dataset. It also lets you create a solution much faster and more economically than collecting real-world data from scratch.
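As a toy illustration of the idea, the sketch below fits a simple distribution to a small base sample and draws new synthetic values from it. The numbers are invented, and real synthetic-data pipelines use far richer generative models, but the principle is the same: a modest amount of good-quality data can seed a much larger dataset.

```python
import random
import statistics

random.seed(42)  # fixed seed for reproducibility

# Hypothetical base sample of a real-world measurement.
base = [52.1, 49.8, 50.5, 51.2, 48.9, 50.0, 51.7, 49.4]

# Fit a simple normal distribution to the base data.
mu = statistics.mean(base)
sigma = statistics.stdev(base)

# Draw 1,000 synthetic values from the fitted distribution.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print("base mean:", round(mu, 2), "synthetic mean:", round(statistics.mean(synthetic), 2))
```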
Learn more about how the approach works in our article on “How to Create Synthetic Data to Train Deep Learning Algorithms.”
You might think having access to a vast dataset is all you need to create an AI-based solution. Unfortunately, this is rarely the case.
You need to analyze a dataset to understand the possibilities that lie within. And if you don’t have the right data, you need to follow one of the other three paths to get the high-quality results you want.
Looking to build Artificial Intelligence, but not sure if you have the right dataset?
Chat with a DLabs AI specialist today for free guidance on the best path forward.