The Internet is full of datasets covering virtually every area of machine learning (ML): from well-known benchmark datasets to Kaggle competitions and data from academic challenges, the list goes on. The number of datasets grows year on year. However, there are still not enough out there to build a high-quality ML-based solution.
In this article, we consult with Nikolai Zenovkin, Lead Data Scientist at StageZero Technologies, to uncover reasons why you may need more data in particular areas, and to look into what kind of resources are out there to help you to obtain the relevant data.
There is a significant need to build sustainable and reproducible experiments in scientific research, and publicly available datasets play an important role here. The true goal of public domain data is therefore to provide a common benchmark for comparing different kinds of models, rather than to serve as a basis for building high-quality production systems.
A model that achieves state-of-the-art results on a benchmark is very likely to outperform other models trained on the same data in your particular subdomain as well. What it cannot guarantee, however, is that it will work well enough for your own subdomain and individual circumstances.
For example, if you are developing a fitness app that tracks exercises, your input images will look quite different from those in the COCO (Common Objects in Context) dataset: camera angles, body poses, lighting - everything will be different. You should therefore not expect the accuracy to be anywhere near the values reported by the model's authors. To obtain higher accuracy, a good dataset from your own domain is needed.
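This gap between benchmark accuracy and in-domain accuracy is easy to quantify once you have even a small labeled sample from your own subdomain. The sketch below is purely illustrative: the `pretrained_model` stub and both datasets are hypothetical placeholders standing in for, say, a COCO-pretrained pose model and frames from the fitness app.

```python
def accuracy(model, samples):
    """Fraction of samples where the model's prediction matches the label."""
    correct = sum(1 for features, label in samples if model(features) == label)
    return correct / len(samples)

# Hypothetical stand-in for a benchmark-pretrained model: it has learned
# to rely on a cue that happens to be common in the benchmark domain.
def pretrained_model(features):
    return "squat" if features["camera_angle"] == "front" else "unknown"

# Benchmark-like data: frontal views, as in the original training set.
benchmark_samples = [({"camera_angle": "front"}, "squat")] * 10

# In-domain data from the fitness app: side views the model never saw.
in_domain_samples = [({"camera_angle": "side"}, "squat")] * 10

print(accuracy(pretrained_model, benchmark_samples))  # 1.0 on benchmark-like data
print(accuracy(pretrained_model, in_domain_samples))  # 0.0 on the new subdomain
```

Running this kind of check on a few hundred real in-domain samples, before committing to a model, is usually far cheaper than discovering the domain gap in production.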
Usually, when companies kick-start a new ML project, they either already have the data they want to process or already have a plan for collecting data to train a model that fits their needs. In many situations, however, the team will require pre-collected datasets.
Some such situations include:
Even the biggest public speech datasets contain less than 3,000 hours of speech data. That may sound like a lot, but comparing it to how humans learn speech quickly shows why it is not enough: on average, a human hears up to 30,000 hours of speech before the age of 18. Hence, there is no such thing as sufficient speech data; indeed, even an additional 10 hours can improve a speech recognition model considerably.
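The 30,000-hour figure is easy to sanity-check with a back-of-envelope calculation. The 4.5 hours of heard speech per day used below is an illustrative assumption, not a measured value:

```python
# Rough estimate of total speech heard before age 18.
hours_per_day = 4.5   # assumed average daily exposure to speech (illustrative)
days_per_year = 365
years = 18

total_hours = hours_per_day * days_per_year * years
print(total_hours)  # 29565.0 - on the order of 30,000 hours
```

Even under conservative assumptions, the exposure a human gets dwarfs today's public speech corpora by an order of magnitude.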
Depending on the project, some teams might even need to collect specific data to recognize particular intents or activation words, but overall, almost any additional speech data can improve voice recognition performance to an impressive extent.
Custom computer vision applications usually require custom data to work well. But sometimes having pre-collected data is necessary to ramp up projects and allow companies to penetrate new markets.
For example, to extend a license plate recognition system to a new target market in another country, the team might need to use images of license plates from the new target country first.
Extending an existing solution to new cases and markets is probably the most common reason to buy ready-made datasets when it comes to computer vision.
Read more: Enterprise AI adoption: Top challenges and solutions to overcome them
The need for pre-collected data is clear, yet sourcing such data can be a lengthy and expensive process. Companies need to ensure that the quality of the data is high, that bias is kept to a minimum, and that turnaround on data delivery is swift. StageZero has you covered, with 100% human-verified data for a range of specific use cases, available for download here.
To learn more about how we can help you with pre-collected data, contact us. Keep up with our most recent developments by following us on LinkedIn.