Jul 18

Where to get speech recognition data for NLP models?

If you want to build a reliable conversational AI system or a speech recognition device to use in your business, you need a lot of training data. High-quality speech data is crucial to properly test and train NLP models to ensure they will work as well as intended.

Otherwise, the results can be amusing at best and deeply frustrating at worst. Imagine an exhausted client trying to resolve their issue through an unhelpful voice assistant. Your speech recognition (also referred to as ASR or Automatic Speech Recognition) device must be powered by the right data to ensure a smooth service and happy clients.

Your data collection needs and method will depend on the algorithm

Hundreds of hours of audio and millions of words of text need to be fed into NLP algorithms to train them. The input must match how your typical customers actually sound, and a mismatch here is where most ASR issues emerge.

It is possible to tackle speech recognition issues at the root. Start by making sure you pick a good way to collect your data. Your chosen method will depend on your project needs and whether you are building a general or a narrow speech algorithm.


General speech algorithms

  • Are often built and made available as APIs.
  • Require thousands of hours of transcribed audio to work well for just one language.
  • Work well for everyday use but often struggle with domain-specific language. For example, general algorithms regularly fail on medical terms.
  • Are mostly built to transcribe everything into a single text output.

Narrow speech algorithms

  • Are most commonly used by call centers or the financial sector.
  • Usually require between tens and hundreds of hours of speech relevant to the use case.
  • Are fine-tuned from general speech models. Companies often start with a general speech model and tune it for a specific narrow use case.
  • Need training data collected for each use case you want to support.
  • Are often trained one per use case and per language, with some additional software determining which model should be applied to each speech file.
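
The "additional software" that decides which model handles a speech file can be quite small. Below is a minimal sketch in Python, assuming each file already carries a language and a use-case label. The model names, the registry contents and the fallback to a general model are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechFile:
    path: str
    language: str  # e.g. "en", "fi" (assumed to be known in advance)
    use_case: str  # e.g. "call_center", "banking"


# Hypothetical registry: one narrow model per (use case, language) pair.
NARROW_MODELS = {
    ("call_center", "en"): "call-center-en",
    ("banking", "en"): "banking-en",
    ("call_center", "fi"): "call-center-fi",
}

# Hypothetical general models, used when no narrow model exists.
GENERAL_MODELS = {"en": "general-en", "fi": "general-fi"}


def pick_model(speech_file: SpeechFile) -> str:
    """Return the name of the model that should transcribe this file."""
    narrow = NARROW_MODELS.get((speech_file.use_case, speech_file.language))
    if narrow is not None:
        return narrow
    general = GENERAL_MODELS.get(speech_file.language)
    if general is None:
        raise LookupError(f"no model available for language {speech_file.language!r}")
    return general
```

For example, `pick_model(SpeechFile("call_001.wav", "en", "banking"))` selects the narrow English banking model, while a Finnish banking file falls back to the general Finnish model because no narrow model is registered for that pair. Every new use case or language you want to support means one more entry in the registry, and one more data set to collect.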

Where to find speech recognition data?

There are a few ways to collect speech recognition data for your chosen NLP model. Below we discuss the three most common sources to find speech recognition data: proprietary, public and vendor-provided.

Proprietary data: what’s at hand

The easiest way to get speech recognition data to build machine learning models is to look into your own resources. Your company may already have hours of valuable customer data.

Since these data sets already exist, they will not cost you a fortune, and chances are they are already tailored to your use cases. However, if you choose to use your own data, you will have to take care of user consent and comply with legal regulations.

Public data: readily available

A large number of speech recognition data sets can be downloaded online. Some of these data sets are part of open-source research projects, and some are data scraped from sources such as YouTube.

Public data is a good option when you don’t have a big budget and need to quickly collect a lot of speech recognition data. At the same time, these data sets require extensive quality checking and pre-processing before use. They are only suited for generic speech recognition algorithms, will not work as well for specific use cases, and have limited language offerings.
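
As a rough illustration of what a first quality check on downloaded audio could look like, here is a minimal Python sketch that uses only the standard library's `wave` module. It assumes the clips are WAV files. The thresholds (16 kHz minimum sample rate, mono audio, clips between 1 and 30 seconds) are placeholder assumptions, not recommendations, so adjust them to whatever your model expects.

```python
import wave
from dataclasses import dataclass, field


@dataclass
class ClipReport:
    path: str
    duration_s: float
    sample_rate: int
    channels: int
    problems: list = field(default_factory=list)


def inspect_clip(path, min_rate=16_000, min_s=1.0, max_s=30.0):
    """Read a WAV header and flag clips that miss the assumed thresholds."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    # Guard against a zero sample rate in a malformed header.
    duration = frames / rate if rate else 0.0

    report = ClipReport(path, duration, rate, channels)
    if rate < min_rate:
        report.problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if channels != 1:
        report.problems.append(f"expected mono audio, found {channels} channels")
    if not min_s <= duration <= max_s:
        report.problems.append(f"duration {duration:.1f} s is outside {min_s}-{max_s} s")
    return report
```

Running `inspect_clip` over every file and keeping only the clips with an empty `problems` list gives you a first filter. It only looks at file headers, so it says nothing about background noise or transcript accuracy, which still need their own checks.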

Vendor-provided data: pre-packaged or custom

Here you have two options: pre-packaged or custom speech recognition data sets. Pre-packaged data sets are immediately available because vendors collect them for resale as-is. These data sets are affordable and easy to obtain but can't be customized or scaled.

Meanwhile, custom speech recognition data is for when you cannot find an existing data set to fit your needs. A data solutions provider will create custom speech recognition data sets suitable for the required use cases.

Custom data sets provided by a vendor are highly customizable, cost-effective and scalable. You can choose from different types of speech data, whether scripted or conversational. All legal requirements are usually taken care of by the vendor by default.

On the other hand, such data sets are primarily collected remotely from participants’ phones or headsets, so you have little influence over audio or microphone specifications, and the range of acoustic scenarios is limited.

Speech recognition data: pros and cons

Proprietary

Pros:
  • Easy access
  • No additional costs
  • May already fit your use cases

Cons:
  • Need to get user consent
  • Have to comply with various legal regulations
  • Only possible if you have already collected data from users or customers

Public

Pros:
  • Quick data collection
  • Don’t need a big budget
  • Readily available

Cons:
  • Requires extensive quality checking
  • Needs pre-processing
  • Limited languages and often does not fit specific use cases

Data vendor

Pros:
  • Customizable and scalable
  • Cost-effective
  • Legal compliance is taken care of by default
  • Gets you the exact data that you need for your use case

Cons:
  • Fewer choices in terms of audio or microphone specifications
  • Limited acoustic scenarios

Ready to kickstart your speech recognition solution?

The data you collect will either make or break your speech recognition device. Minimize the risks by collaborating with a reliable data vendor. StageZero creates and collects data for you in any language with the help of millions of global contributors.

Learn more about how to produce diverse, well-represented datasets with StageZero.

©2022 StageZero Technologies