English training and testing datasets for voice assistants. Ready-made datasets are available for wake words, skill commands, and voice commands for all major voice assistants.
Train, fine-tune, and test your voice assistant using voice data from thousands of people. Use our datasets to improve how well your voice assistant recognizes native and non-native speakers, and to verify that it works across different demographics and regions.
Access any of the following voice assistant activation data or voice commands for testing or training:
- Amazon Alexa dataset.
- Siri dataset.
- Google Assistant dataset.
- Cortana dataset.
The datasets consist of speech from thousands of English speakers using voice assistants.
English language: native US, UK, CA, and non-native.
The dataset contains audio clips of people recording themselves speaking voice assistant commands and wake words, with up to 10 minutes of speech per person. The speech is captured on mobile phones from a diverse crowd of speakers representing all ages and backgrounds, which makes the dataset well suited to use cases involving mobile devices.
Recordings vary in length, averaging about 3 seconds per clip. Each recording is classified by background noise level, age group, gender, and region. Spontaneous recordings are transcribed verbatim, exactly as the person said them.
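
As an illustration of how these classifications might be used to build test subsets, here is a minimal Python sketch. The field names, label values, file paths, and sample rows are all hypothetical and do not reflect the actual delivery format of the datasets; they only show the idea of filtering clips by noise level, age group, gender, and region.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """One recording plus its labels. All field names are assumptions."""
    path: str
    duration_s: float
    noise_level: str
    age_group: str
    gender: str
    region: str
    transcript: str


def select(clips, **criteria):
    """Return the clips whose labels match every field=value pair given."""
    return [
        clip for clip in clips
        if all(getattr(clip, field) == value for field, value in criteria.items())
    ]


def count_by(clips, field):
    """Count clips per label value, e.g. to check demographic coverage."""
    return Counter(getattr(clip, field) for clip in clips)


# Made-up sample rows, for illustration only.
clips = [
    Clip("clips/0001.wav", 2.8, "low", "18-29", "female", "US", "alexa play some jazz"),
    Clip("clips/0002.wav", 3.4, "high", "30-44", "male", "UK", "hey siri set a timer"),
    Clip("clips/0003.wav", 2.9, "low", "45-59", "female", "UK", "ok google what's the weather"),
]

# Build a test subset: quiet recordings from UK speakers.
quiet_uk = select(clips, noise_level="low", region="UK")

# Check how the clips are spread across regions.
per_region = count_by(clips, "region")
```

Grouping by label in this way makes it possible to report recognition accuracy separately for each region, age group, or noise level instead of as a single overall figure.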
Voice activation phrases are sometimes referred to as wake words, and voice commands are sometimes referred to as skill commands.