Whether you want to develop a brand-new Automated Speech Recognition (ASR) model or fine-tune an existing one, you need speech data. From the type of machine learning model to dataset specifications, there are a few things to consider as you look for speech recognition training datasets. Knowing the kind of data your project requires will help you prepare better and improve the performance of your ASR model.
Assessing what makes one dataset better than another is not clear-cut because it all comes down to your needs. A good place to start would be to evaluate your use case and the design of the model you want to develop, which will determine the necessary type of speech dataset.
Identifying which voice datasets are the best fit depends on the design of your machine learning model. Will you perform actions directly based on speech, or will you first convert it to text? You can train the algorithm to act on the audio itself, or you can act on the transcribed text.
For example, identifying sentiment directly from audio often makes more sense. Nuance is much harder to recognize once speech is reduced to text, because so much of the information lives in how we speak.
For machine learning algorithms to detect sentiment without human input, they need to be trained to go beyond what is said and pick up subtleties such as sarcasm or puns. People also speak with excitement or anger, slowly or quickly; all of that is lost when audio datasets are converted to text.
The same goes for algorithms that try to identify intent: audio may provide more context. It does not have to be text or audio, either; combining both might be the best way to go. The critical question is how you will design the model so that it remains easy to manage in the future.
The design of your ASR model will also influence the specifications of the speech datasets you need. When it comes to audio data, requirements for background noise vary widely from one project to another.
So, should there be some noise or none at all? Again, it depends on your use case. Some teams want background noise that is as natural as possible, be it a police siren or chewing sounds. Others prefer speech datasets with zero background noise, down to breathing. Those who choose noise-free speech recognition datasets usually expect their machine learning model to need less data to train: less noise means less distraction, and therefore less data.
For example, let us say you are building a speech recognition model for customer service: a person reaches out to cancel their account while a motorcycle is heard driving by. The algorithm might learn that a customer's account needs to be closed whenever a customer talks and a motorcycle sound is heard. In other words, it associates motorcycle sounds with intent to close accounts.
However, the more speech data is fed to the algorithm, the more it learns, regardless of how much background noise is included. Some studies show that doubling the amount of data improves output quality.
So in the motorcycle case, the algorithm would eventually learn enough to ignore any noise, including vehicles driving past. You then have an algorithm trained on real-world data, which generally ends up performing better but usually requires more training data to get started.
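A common way to build this kind of robustness without collecting vast amounts of naturally noisy speech is to augment clean recordings with noise at controlled signal-to-noise ratios (SNRs). The sketch below is a minimal illustration of the idea, not a production pipeline: the function name is ours, and the toy sample lists stand in for real decoded audio.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    p_speech = sum(s * s for s in speech) / len(speech)    # mean speech power
    p_noise = sum(n * n for n in noise) / len(noise)       # mean noise power
    target_noise_power = p_speech / (10 ** (snr_db / 10))  # power implied by the SNR
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]

# Demo with toy waveforms: mix "speech" with "motorcycle noise" at 10 dB SNR.
speech = [1000.0, -1000.0] * 100
noise = [300.0, -300.0] * 100
mixed = mix_at_snr(speech, noise, snr_db=10)
```

Sweeping `snr_db` over a range during training exposes the model to the same utterance under many noise conditions, which is usually cheaper than recording each condition separately.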
An ASR model can only be as good as its training data, so aim to get lossless speech data. Your audio datasets should not be compressed in any way that could reduce their quality.
Most audio compression tools discard information from the original file, so avoid compression altogether or use a lossless method that preserves all of the original data. To the human ear, a compressed audio file might sound the same, but even minor changes can significantly impact an algorithm.
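As a quick sanity check, Python's standard `wave` module can tell you whether a WAV file stores plain, uncompressed PCM samples. A minimal sketch (the file path and demo clip are illustrative):

```python
import os
import struct
import tempfile
import wave

def is_uncompressed_pcm(path):
    """Return True if a WAV file stores plain (lossless) PCM samples."""
    with wave.open(path, "rb") as wav:
        # The wave module reports compression type 'NONE' for raw PCM.
        return wav.getcomptype() == "NONE"

# Demo: write a tiny 16 kHz mono PCM file and confirm it is lossless.
path = os.path.join(tempfile.mkdtemp(), "sample.wav")
with wave.open(path, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(16000)  # 16 kHz
    out.writeframes(struct.pack("<160h", *([0] * 160)))  # 10 ms of silence
print(is_uncompressed_pcm(path))  # True
```

For formats the `wave` module cannot read (FLAC, MP3, and so on), a dedicated library would be needed, but the principle is the same: verify the codec before training, not after.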
Another aspect to consider as you browse for speech datasets is the microphone. Check the information on a given voice dataset and see if the microphone matches what your end users are using.
For example, if they use a phone, the speech data should be recorded with a phone. If they use headsets or laptop microphones, this should match as well. Avoid using phone recordings for use cases where a laptop microphone is used and vice versa.
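Sample rate is one simple, checkable proxy for the recording channel: narrowband telephony is typically sampled at 8 kHz, while wideband speech corpora are commonly 16 kHz. The sketch below checks a WAV file against an expected channel; the channel names and rate table are illustrative conventions, so adjust them to your own data.

```python
import os
import struct
import tempfile
import wave

# Typical sample rates per capture channel (common conventions, not a standard).
EXPECTED_RATES = {
    "telephone": 8000,    # narrowband telephony
    "laptop_mic": 16000,  # common wideband rate for speech corpora
}

def matches_channel(path, channel):
    """Check that a WAV file's sample rate fits the expected capture device."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == EXPECTED_RATES[channel]

# Demo: write a short 8 kHz mono clip, as a phone recording would be.
tel_path = os.path.join(tempfile.mkdtemp(), "call.wav")
with wave.open(tel_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(struct.pack("<80h", *([0] * 80)))  # 10 ms of silence
print(matches_channel(tel_path, "telephone"))   # True
print(matches_channel(tel_path, "laptop_mic"))  # False
```

A matching sample rate does not prove the microphone matches, but a mismatch is a cheap early warning that the dataset was captured on a different channel than your end users will use.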
Additionally, listen to the speech recordings and see if they are choppy. Make sure they are continuous and contain no technical issues. Usually, a sample for each dataset is provided for you to listen to and make an informed decision.
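Listening to samples does not scale to thousands of hours, so a crude automated complement is to flag clips containing long runs of exact digital silence, which often indicate dropouts or editing artifacts. A sketch, assuming the audio has already been decoded to a list of 16-bit integer samples (thresholds are illustrative):

```python
def longest_silence_run(samples, threshold=0):
    """Length (in samples) of the longest run of near-zero samples."""
    longest = run = 0
    for s in samples:
        run = run + 1 if abs(s) <= threshold else 0
        longest = max(longest, run)
    return longest

def looks_choppy(samples, sample_rate=16000, max_gap_ms=200):
    """Flag a clip containing a hard dropout longer than max_gap_ms."""
    max_gap = sample_rate * max_gap_ms // 1000
    return longest_silence_run(samples) > max_gap

# Demo: a clip with a 0.5 s hard gap in the middle is flagged.
clean = [100, -100] * 8000                          # 1 s of low-level signal
gapped = clean[:4000] + [0] * 8000 + clean[4000:]   # insert 0.5 s of zeros
print(looks_choppy(clean), looks_choppy(gapped))    # False True
```

Real recordings contain natural pauses, so a check like this only surfaces candidates for a human to listen to; it does not replace the listening itself.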
A few hours or thousands of hours? It will depend on your use case and the technology you are working on. You might need hundreds or thousands of hours of speech recordings if you are developing a new technology, that is, your own speech model. But if you are building on top of existing technology, such as Alexa, then a few hours might be enough.
A lack of data diversity can lead to bias and poor model performance. Your chosen speech dataset should be balanced and represent different genders, races, and classes, among other attributes.
The recordings should reflect the diversity of the general population, or, if your use case is specific to a group of people, match that group. The training data should match the end users as closely as possible. Consider your target age group as well, and ensure a proportionate ratio of male and female voices.
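If the dataset ships with speaker metadata, a simple tally can flag an unbalanced split before any training happens. A sketch with hypothetical metadata records and an illustrative tolerance:

```python
from collections import Counter

def balance_report(metadata, attribute, tolerance=0.1):
    """Share of each attribute value, plus whether the split is within
    `tolerance` of a perfectly even one."""
    counts = Counter(record[attribute] for record in metadata)
    total = sum(counts.values())
    shares = {value: n / total for value, n in counts.items()}
    even = 1 / len(counts)  # an exactly even split across the observed values
    balanced = all(abs(share - even) <= tolerance for share in shares.values())
    return shares, balanced

# Demo with hypothetical speaker metadata: 55/45 is within a 10-point tolerance.
speakers = [{"gender": "female"}] * 55 + [{"gender": "male"}] * 45
shares, balanced = balance_report(speakers, "gender")
print(shares, balanced)
```

The same function can be pointed at age bands, accents, or recording devices; what counts as "balanced" should come from your target population, not from the 10% default used here.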
Speech data can be pre-made and ready to be downloaded as-is or collected based on your custom requirements. Choose your best fit based on your use case, needs, and preferences. In many cases, ready-to-use speech datasets can speed up conversational AI, ASR, or Interactive Voice Response (IVR) projects.
StageZero offers an entire library of pre-collected and custom speech datasets. Our speech data is 100% human-verified and fits many specific use cases.
If you are still trying to figure out what you need, reach out to us, and we will help you assess which datasets would work best for your project.