This Is Why Machine Learning Is So Hard
Off-the-shelf models are a sound basis for enhancing custom-built ML solutions, but little more.
It’s rare to build technology from scratch these days. Most new products use an off-the-shelf component at one stage or another.
Machine learning (ML) is no different. But what does off-the-shelf mean when it comes to artificial intelligence in general, and ML in particular? Before we dig into the tricky business of building an ML product, let’s start there.
What does ‘off-the-shelf’ mean in Machine Learning?
Nearly every new project, no matter your field, uses existing solutions to some degree. If you build a house, you might use a timber frame: cut to specification and perhaps customized for a particularly unusual design.
If you’re developing a machine learning model, the approach isn’t that different. You look for existing knowledge and ‘pre-fabricated’ code to use in your solution — and that’s what we mean by ‘off-the-shelf.’
You just might need to modify the code to fit your particular business case, with the scope of the modifications depending on:
- What research is readily available
- Which solutions already exist
- The complexity of the product you’re building
To help you get a thorough understanding of how off-the-shelf solutions can enhance custom-built ML, let’s look at a real-life example, approached in three different ways: starting with (1) the ‘most off-the-shelf’ solution, looking at (2) the customization approach and ending with (3) a ‘do-it-yourself’ strategy.
By the end of this article, you’ll know the range of options available, as well as the pros and cons of each.
The Real-life Example: Finding Your Face In The Video Recording Of An Event
To kick things off, here’s a real-life DLabs project we’ll use for context.
The brief was to “create a system that can take a video recording from a public event as an input, then let a participant upload their face and/or number (which was attached to their shirt) to find themselves in the recording.”
For this project, we chose to use a mix of off-the-shelf services, open-source libraries, and our own custom code.
However, there were alternative strategies available.
Sections (1) and (2) describe an off-the-shelf and customization approach — and how we applied each one to our real-life DLabs project — while section (3) covers how ‘do-it-yourself’ could work.
(1) Most ‘Off-the-shelf’: Using Third-party Services & APIs
The most ‘off-the-shelf’ approach for any ML project is to use a third-party tool, add your data as the input, and then use the results as they stand. Even a non-technical user can adopt such an approach by accessing a basic user interface: typically, via a website; or a tool you download and install on your computer.
You can handle several tasks this way, including uploading images to add labels, uploading data to a spreadsheet to use in forecasting — or, in our case, submitting a video recording to detect a person’s face.
Third-party tools require little more than the click of a button, followed by a short wait, ending with the results. Yet, while they are super simple, they’re also super limited for several reasons:
- Repetitive: First up, if it’s a task you have to repeat many times, it quickly becomes tedious — some tools overcome this with a ‘bulk upload’ or ‘bulk download’ option, but not always
- Manual: If the results you get are just one step in a longer process, it becomes very laborious to keep saving data at this specific stage. Moreover, in instances where you only have terminal-level access (say, on a virtual machine), you can’t use a service that requires manual interaction at all
- Costly: Some of these tools are free, but most come at a price with you either paying a subscription fee, or per-use, meaning costs can vary wildly
You can mitigate most of these limitations to some degree — although, again, mitigation typically comes at the expense of a subscription.
By using APIs, or configuring programmatic access to a service, you can avoid manually clicking a button and instead make a call to the interface: either directly via the terminal, via a REST API, or using client libraries for widely-used programming languages such as Python, PHP, NodeJS, Java, or C# (which languages are available depends on the API provider and the community).
Still, to use an API, you’ll need the support of a skilled backend developer to take care of authentication, sending your data to the API, and retrieving and saving the results in the required format and destination. And while the developer in question doesn’t have to be a machine learning specialist, you’ll need a supervisor with strong general knowledge of the domain — as we recommend for any machine learning project.
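For illustration, here’s a hedged sketch of what programmatic access might look like in Python, using only the standard library and a hypothetical image-labeling endpoint (the URL, key, and payload fields are all invented for the example):

```python
import json
import urllib.request

def build_label_request(api_url: str, api_key: str, image_url: str) -> urllib.request.Request:
    """Assemble an authenticated POST request for a hypothetical image-labeling API."""
    payload = json.dumps({"image_url": image_url, "max_labels": 10}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",  # many APIs use token-based auth
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_label_request(
    "https://api.example.com/v1/labels",  # hypothetical endpoint
    "YOUR_API_KEY",
    "https://example.com/photo.jpg",
)
# The actual call would then be: urllib.request.urlopen(req)
```

The backend work the paragraph above describes lives in exactly these three places: the auth header, the payload, and whatever code retrieves and stores the response.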
However, whether you use an API or not, third-party services rarely offer the full scope of services you need. So, you will nearly always require a level of customization. And the more bespoke your solution, or the more unique your project, the less likely it is you’ll find a service that can help at all.
To make matters more complex, it’s difficult to know how a third-party service works in the background. Not to mention the fact that building with a dependency on a third party has the intrinsic risk of your service suddenly breaking following an update. In the best case, this could mean just a few minutes offline as you quickly update your own code; in the worst case, you may have to pull your product for good. The takeaway?
If your project is broad in scope, or a single component is critical to your service, a more custom approach is likely a better option.
DLabs Project — Step One: Using AWS Rekognition, Off-the-shelf.
Now, let’s see how off-the-shelf works in practice by turning to the DLabs face recognition brief mentioned earlier.
For this project, we used the AWS Rekognition service (Face Detection and Detecting Text modules, in particular): first, to locate a person’s face in the video, then to track the time at which their face appears.
— ‘Why did we use AWS Rekognition straight off-the-shelf?’
Time was the deciding factor. We needed to finish the project quickly, so when we found a reliable, readily-available system, we went for it instead of spending time building a custom solution.
Better still, as it was accurate enough for our use case, there was no additional investment in training a custom neural network, labeling data, or other costly tasks. All in, it made sense to use — but there were still several steps to make the service work.
- First, we had to set up authentication of AWS on our remote machine (also hosted on AWS, as it happens)
- Then, we had to create several support AWS services
- Finally, we had to assign the correct roles and permissions
Even though the documentation was clear, it still required someone familiar with the AWS ecosystem to set it up correctly. We also had to create an AWS S3 bucket for file storage, then store our source videos there.
With the system set up, we could call the AWS Rekognition service using a dedicated Python library (Python is our language-of-choice when it comes to machine learning). Still, to retrieve all necessary outputs (including ID, times, face/body coordinates — ‘bounding-boxes’ — of each detected person), we had to modify the script from the AWS documentation. Plus, we had to save the output in the right place.
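Our modification mostly amounted to flattening the service’s response into the fields we needed. As a rough sketch (the dictionary below mimics the shape of Rekognition’s face detection output, with invented values; the boto3 call itself is omitted):

```python
def extract_face_records(response: dict) -> list:
    """Flatten a Rekognition-style face detection response into
    (timestamp, bounding box) records for each detected face."""
    records = []
    for item in response.get("Faces", []):
        box = item["Face"]["BoundingBox"]  # ratios of frame width/height
        records.append({
            "timestamp_ms": item["Timestamp"],
            "box": (box["Left"], box["Top"], box["Width"], box["Height"]),
        })
    return records

# Example response, shaped like Rekognition's output (values invented):
sample = {
    "Faces": [
        {"Timestamp": 4200,
         "Face": {"BoundingBox": {"Left": 0.31, "Top": 0.22,
                                  "Width": 0.12, "Height": 0.25}}},
    ]
}
records = extract_face_records(sample)
```

Saving `records` in the right place and format was the remaining part of the script change.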
By now, we had the face recognition service at the ready. But what about the number detection service? For this, we used another AWS Rekognition module: Detecting Text. And given we had already configured the system, calling this module was simple.
The method was similar to People Pathing, although instead of videos, it used images as the input: to extract those images (a few frames for each detected face), we had to decode the whole video and run it through custom code, using FFmpeg for video editing.
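The frame extraction step can be sketched as a call to the FFmpeg command line. A minimal, hedged example (the file names and timestamp are placeholders) that builds the command without running it:

```python
import subprocess

def ffmpeg_frame_command(video_path: str, timestamp_s: float, output_path: str) -> list:
    """Build an FFmpeg command that grabs a single frame at the given timestamp."""
    return [
        "ffmpeg",
        "-ss", str(timestamp_s),   # seek to the timestamp (fast when placed before -i)
        "-i", video_path,
        "-frames:v", "1",          # extract exactly one video frame
        "-y",                      # overwrite the output file if it exists
        output_path,
    ]

cmd = ffmpeg_frame_command("event.mp4", 4.2, "face_0001.jpg")
# To actually run it (requires FFmpeg installed):
# subprocess.run(cmd, check=True)
```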
With the inputs prepared, we ran them through AWS. Then, we compared the outputs of the text detection — ‘bounding boxes of detected text’ — with the people detection outputs from the previous step — ‘bounding boxes of detected people’ — to match the text to a given person.
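One common way to make such a match is intersection-over-union: the text box with the strongest overlap against a person’s box belongs to that person. A sketch, assuming boxes arrive as (left, top, width, height) ratios of the frame size:

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, width, height) boxes."""
    ax1, ay1, aw, ah = a
    bx1, by1, bw, bh = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax1 + aw, bx1 + bw), min(ay1 + ah, by1 + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def match_text_to_person(text_box, person_boxes):
    """Return the index of the person box that best overlaps the text box."""
    scores = [iou(text_box, p) for p in person_boxes]
    return max(range(len(scores)), key=scores.__getitem__)

people = [(0.1, 0.1, 0.2, 0.5), (0.6, 0.1, 0.2, 0.5)]
best = match_text_to_person((0.62, 0.3, 0.1, 0.05), people)  # lies inside person 1
```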
Unfortunately, the resulting accuracy wasn’t good enough: even with careful preparation, the model often detected the wrong text (like a brand name or a caption on a shirt).
To zero in on the participants’ numbers only, we still had to perform a significant amount of processing ourselves, which amounted to a fair bit of effort in the end — and that begs the question:
— “Couldn’t you have just custom-built the whole solution yourselves?”
Sure, of course, we could. But it would have required significant time and effort. We would have had to identify key frames in the whole video, run people detection models (of both body and face), compare outputs to detect the movement of each person, and build a text detection module.
Each of these is a separate task; whereas, with AWS Rekognition, we had a single service that handled it all, and handled it well. As for the costs, those stacked up in AWS’s favor too, as you can see below:
- We estimated the time needed to process one image (three stages = face, body, text detection) at ~7 seconds on a machine with a GPU
- Assuming we process 10,000 images, it would take ~19.5hrs
- An instance with a GPU for model inference costs $0.88 per hour
- The total cost to process 10,000 images on a virtual machine = $17.16
- The total cost to process 10,000 images using AWS Rekognition = $10
With AWS Rekognition, we pay per call, but the fee is nominal, while running a custom solution would mean paying for a virtual machine (at a similar cost). Add in the development and other overheads of a custom solution, and AWS Rekognition clearly becomes the more efficient option.
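For anyone who wants to rerun the estimate with their own assumptions, the arithmetic boils down to a few lines:

```python
SECONDS_PER_IMAGE = 7        # face, body, and text detection per image
IMAGES = 10_000
GPU_HOURLY_RATE = 0.88       # price of a GPU instance for model inference

raw_hours = SECONDS_PER_IMAGE * IMAGES / 3600   # ~19.44 hours of compute
hours = 19.5                                    # rounded up to the half hour, as in the estimate
vm_cost = round(hours * GPU_HOURLY_RATE, 2)     # $17.16 on a virtual machine
```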
(2) Customization: Open-source Repos & Pre-trained Models
So, off-the-shelf works well in certain contexts. But what if there isn’t an appropriate third-party service for you to use? Or if your problem is particularly complex, unique, or you simply don’t want to rely on a third-party: does that mean building everything from scratch?
We have good news: of course not. No matter your project, it is highly likely that someone, somewhere, has already solved the problem you’re facing, at least to some degree.
Businesses, scientists, researchers, and hobbyists create all manner of solutions on a daily basis. They collaborate with a global community on GitHub. And so the place to look for innovative machine learning solutions is right there.
GitHub is home to the most current open-source projects around. Many of them are perfectly maintained and come with comprehensive documentation, including the full codebase coupled with step-by-step deployment instructions.
The best ones are battle-tested by thousands of users, so you can treat them as ‘off-the-shelf’ products in everything but name — all that’s left is for you to clone the repository, and you’re good to go.
Better still (and in contrast to proprietary software and services), you get unrestricted access to the codebase, meaning you not only get to inspect how the solution works; you can adjust it as needed, merge it with your codebase, select the components you want to use, and disregard the rest.
Of course, the more you want to customize the code, the more experienced your team needs to be. Still, this can be an incredibly practical approach — which begs the question: what are the cons of using open-source code?
As ever, there are several factors to consider:
- Maintenance: First of all, while some repositories are well maintained with ample documentation, many are the opposite: offering poorly-commented code with scant, even non-existent, documentation — which can lead to issues that range from simple bugs to showstopper problems when running now-obsolete versions of libraries
- Licencing: Usually, GitHub repositories are OK to use in commercial applications, but you may need to acknowledge their use in your product documentation. That said, sometimes, key elements are prohibited from commercial use (for example, code may be free to use, but not the model itself — i.e., a model trained on a specific dataset).
Other factors of open-source repositories to consider in the context of machine learning are datasets and pre-trained models.
It is a challenge in itself to assemble the right dataset to power a machine learning model. It takes a significant investment of time and money to collect and correctly label data to train a model, often requiring resources beyond a company’s means.
Thankfully, the internet is full of publicly-available datasets you can use, often prepared to the highest standards using government data, or collected by professional researchers for scientific use.
The accuracy of the models in various research papers is usually measured using the same known and proven datasets, ensuring everything is equal and objective. In many instances, especially in well-researched areas like computer vision or natural language processing, you can rely on pre-trained models created by someone else.
Such models are usually trained on powerful hardware using finely-tuned parameters, resulting in extremely high accuracy. And though they require effort to embed into your project, it’s nothing compared to the time and money required to prepare a dataset, set up the infrastructure, and train a model — moreover, doing it yourself gives no guarantee that your model will outperform the existing one.
Still, you have to be careful.
If the data you feed the model is significantly different from the data used to train the model, the accuracy will suffer. That said, it’s often worth trialing the pre-trained version — if only as a measure of the accuracy of your own solution.
DLabs Project — Step Two: Customizing Face And Text Comparison
Back to the DLabs case: now that we’ve identified and tracked our event participants, each one has their number, face, and body outlines detected and saved. Yet, that’s only half of the story.
Our main objective — and the core product functionality — is to let users upload their faces and/or numbers in order to find themselves in the video. So, how did we achieve that? Let’s start by looking at matching the numbers.
Comparing numbers is much easier than faces. We used the Levenshtein distance metric, which is widely known and well used. It measures the distance between two words using the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
By taking the number uploaded by the user, then comparing it to all the numbers found in the source video, we could conclude that the number with the shortest distance (i.e., the highest similarity) was most likely the user’s number. Feature one complete. The next challenge: faces — so, how did we match those?
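For illustration, the metric and the matching step can be sketched in a few lines of Python (the candidate numbers below are invented):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def best_match(uploaded: str, detected: list) -> str:
    """Return the detected number closest to the one the user uploaded."""
    return min(detected, key=lambda d: levenshtein(uploaded, d))

match = best_match("1452", ["1462", "3891", "9999"])  # "1462" is one edit away
```

In practice, you might also set a maximum acceptable distance so that a wildly different number is reported as ‘no match’ rather than forced onto the nearest candidate.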
Here, we used the face.evoLVe: High-Performance Face Recognition Library based on PyTorch — a freely-available, open-source GitHub repository. Still, before we could make the comparison, we needed to align and embed the faces in a normalized form, which face.evoLVe happens to handle very well.
The repository uses state-of-the-art, pre-trained models based on facial key-points, which we could obtain via the same AWS Rekognition module. So, with faces now normalized, we could compare them using a custom-built algorithm — based on cosine distance — to find their best available matches.
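The comparison itself boils down to cosine distance between embedding vectors. A minimal sketch in pure Python (real face embeddings typically have hundreds of dimensions; the toy vectors here are for illustration only):

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def best_face_match(query, gallery):
    """Index of the gallery embedding closest to the query embedding."""
    return min(range(len(gallery)), key=lambda i: cosine_distance(query, gallery[i]))

gallery = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.1, 0.9]]
best = best_face_match([0.85, 0.15, 0.05], gallery)  # closest to the first embedding
```

As with the number matching, a distance threshold helps reject queries that genuinely have no counterpart in the video.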
With all the components ready, the final piece of the puzzle was simply assembling the parts into a coherent application, including a proper user interface, databases, logging, and other necessary elements.
(3) What About Do-it-yourself: The 100% Custom Solution?
We didn’t have to use this approach at all for our product.
But if your problem is particularly unique, or you want full control over — and full intellectual rights to — the solution, then developing your own product can be the practical way forward; and very rewarding to boot.
Even here, you won’t necessarily have to start from scratch. You can still base your models on published research papers. Still, doing so won’t make your life easy.
It takes an incredible understanding of the topic and the research to translate a paper into working code, and you may need a domain expert to help: on data preparation, algorithmics, and modeling — as much as on implementing the actual solution, carefully testing the process, and comparing the output to different methods.
And before you ever get to coding, you’ll need to carry out your own research, then prepare data to fit both your business case and the method you decide to follow. After this stage, you can prepare the whole environment, which is, in itself, a lengthy process that takes substantial investment before ever revealing any meaningful results.
Oftentimes, though, the results you get will be superb.
After all, you’ve chosen to use state-of-the-art methods tailored to your specific problem, so you wouldn’t expect anything less. And who wouldn’t want such an outcome, right?
Well, if you have the time and means to pay for skilled developers and top-end equipment, then, by all means, it’s highly-rewarding to go 100% custom-built: but few companies have such resources at the ready — while a customization approach often serves business interests just as well.
DLabs Makes Machine Learning Simpler
In truth, off-the-shelf solutions are an effective way to make machine learning projects simpler.
As you can see from the DLabs project, most machine learning projects end up being a mix of existing solutions and customizations: blending knowledge and code taken from the ‘shelf,’ then enriching it through the work of a dedicated team of machine learning specialists.
If you want to succeed with ML, it’s a matter of adequate planning and smart people: a heady blend that can help you decide when it makes sense to follow the custom-built route — and when it’s better to use a product developed elsewhere.
Looking to solve a business problem with machine learning? Learn if customizing an off-the-shelf product can help you find a solution: get in touch with DLabs.AI for a free 15-minute machine learning strategy consultation.