One of the critical aspects to consider when developing models powered by artificial intelligence (AI) and natural language processing (NLP) is compliance with data regulations. A reliable voice assistant or chatbot requires lots of quality training data, and companies often turn to external data vendors for help. If you are also looking for one, make sure that your chosen data partner utilizes the right tools and methodology to work with sensitive data.
At StageZero, we take regulatory data compliance seriously. We aim to provide our clients with machine learning data that follows the highest security standards and adheres to all necessary international data regulations.
Each time we work with a new data case or segment, we follow these 8 steps to ensure compliance and minimize data vulnerability risks:
We analyze each new use case and note down potential solutions and respective considerations. This includes finding answers to questions such as: How will the data be collected? What will the data processing look like? Where will the data be stored? Does it contain personally identifiable information (PII) or biometric data?
PII is data that can be used to identify a person. It covers a broad spectrum of personal details, including names, birth dates, usernames, passwords, credit card information, and social security numbers. Unlike biometric data, PII can be changed.
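To illustrate how identifiers like these might be flagged automatically, here is a minimal sketch of regex-based PII redaction. The patterns and the `redact_pii` function are our own illustration, not a StageZero tool; production PII detection requires far more robust methods (named-entity recognition, checksum validation, locale-specific formats).

```python
import re

# Illustrative patterns only — real PII comes in many more formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # US social security number
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched piece of PII with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact_pii("Card 4111 1111 1111 1111, SSN 123-45-6789, mail a@b.com"))
# → Card [CREDIT_CARD], SSN [SSN], mail [EMAIL]
```

Pattern matching like this is only a first pass; free-form text (names, addresses) typically needs a trained model rather than regular expressions.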
Biometric data is any data related to human features or characteristics: fingerprints, irises, voice, DNA, or behavioral patterns. One of the most common cases of biometric data usage is iPhone’s fingerprint and facial recognition technology. Biometric data is typically highly sensitive information and can present more challenges in complying with privacy laws and regulations.
After initial analysis, the next step is to create an outline for the case with proposed solutions. For example, take a particular customer service use case. We may suggest collecting data from fifty thousand people, including details such as gender and age group, but excluding any unnecessary PII. Other users then review and validate the data to ensure quality. If the data contains PII, a consent form is needed from the people whose data is collected.
Once the initial outline is drawn and solutions suggested, our legal counsel validates corresponding regulatory data use compliance. If needed, solutions are updated.
Data use compliance concerns laws and standards that regulate how organizations collect, store, and manage data. An organization is data compliant if it handles data following the required regulations.
Having evaluated regulatory compliance, we introduce potential solutions to the client. The advantages and disadvantages of each solution in terms of risk are also presented. Depending on the client's legal strategy, they may be more or less risk averse.
Often, it is impossible to avoid risk entirely, especially when data includes PII or biometric characteristics. In those cases, it is up to the customer to decide what risk tolerance level they are comfortable with.
We create a data protection impact assessment (DPIA) for each new type of case (for example, speech data collection) that, per our assessment, may contain PII. A DPIA is the primary tool for keeping data secure and helps avoid fines should your data be leaked or hacked. It is used to identify and protect against data privacy vulnerabilities that certain scenarios or activities might cause.
Not all data-related risks can be foreseen or eliminated, but a DPIA gives you an excellent basis to prepare for data protection challenges, set out plans for solutions to address those risks, and evaluate project viability from the get-go. Having a DPIA in place also helps communicate better with your stakeholders regarding data security risks.
If you as a company can show that you followed the best data protection practices and have tried to mitigate risks, the chance of running into legal difficulties or getting a fine is significantly reduced.
Find more info about DPIA and download a template.
An additional tool to reduce risk is to maintain a data security policy. This policy indicates how sensitive data should be handled. In other words, it means documenting how data is processed and transferred between collaborating parties and internally.
Each company should have its own data security policy. We also recommend having a general policy for handling data and a section connected to the AI project you are working on.
A data security policy typically includes two categories: people and technology. The people elements of the policy can cover segments such as acceptable use, security incident reporting, passwords, social networking, or emailing. The technology elements of the policy can include encryption, access management, system security, vulnerability scans, backup and recovery, or mobile device management.
See an example of a data security policy template.
Regarding personally identifiable data, we only collect as much as is needed for a specific case. And when possible, we run algorithms to anonymize or pseudonymize this data before processing it in our systems. For example, we have models available for replacing faces in images and can also anonymize identifiable information in text before our users process it. This is data-category specific, as not all algorithms are applicable to all use cases.
While pseudonymization and anonymization both refer to hiding personally identifiable data, their methods differ. Pseudonymization masks the data to the extent that the person can no longer be identified without using additional information. Personal data is replaced with other, non-identifiable data, and additional information is needed to recreate the original data.
Meanwhile, anonymization masks the data so that the person can no longer be identified at all. In this case, no additional information can recreate the original data. Anonymization comes with a degree of complexity, as it eliminates the connection between the data and the individual. For this reason, it is more commonly used for statistical or research purposes.
For example, even if some identifying information was removed, you can still recognize a person from a voice recording if you have a database of voice recordings to compare it to. But if the voice recording was anonymized (transformed, distorted), it would no longer be possible to identify the person.
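The distinction can be sketched in a few lines of code. This is a simplified illustration under our own naming (the `Pseudonymizer` class and `anonymize` function are hypothetical, not StageZero's implementation): pseudonymization keeps a separate mapping that allows re-identification, while anonymization discards any link to the original.

```python
import secrets

class Pseudonymizer:
    """Pseudonymization: PII is replaced with tokens, and the token-to-original
    mapping is stored separately (under strict access control). With that
    additional information, the original data can be recovered."""
    def __init__(self):
        self._mapping = {}  # token -> original value

    def pseudonymize(self, value: str) -> str:
        token = f"person_{secrets.token_hex(4)}"
        self._mapping[token] = value
        return token

    def reidentify(self, token: str) -> str:
        return self._mapping[token]

def anonymize(value: str) -> str:
    """Anonymization: the value is replaced and no mapping is kept,
    so nothing can recreate the original data from the output."""
    return "[REDACTED]"

p = Pseudonymizer()
token = p.pseudonymize("Jane Doe")
print(token)                  # e.g. person_3f9a1c02 — reversible via the stored mapping
print(p.reidentify(token))    # Jane Doe
print(anonymize("Jane Doe"))  # [REDACTED] — irreversible
```

In practice the mapping table itself is sensitive and must be protected as strictly as the original PII; if it leaks, the pseudonymization offers no protection.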
If we handle data that comes from the EU and contains PII or biometric data, we sign a data processing agreement (DPA) or a controller agreement with the client to comply with the General Data Protection Regulation (GDPR). These agreements specify all the necessary guidelines and procedures regarding data handling.
See a data processing agreement template.
The final step: after the case is evaluated, solutions are suggested, risks are assessed, and all the relevant data agreements are signed, it is time to sign the contract and then collect and deliver the training data.
When developing your AI project, you may be faced with challenges that are unique to your business. We are prepared for that. StageZero is fully equipped to provide data of various types and languages.
Once we’re sure we can collect the right data for you, we can engage our global crowd for rapid scaling. We’re 100% GDPR-compliant, and your data is secured while it is transmitted, stored, and processed.
Learn more about how we can help you with training data and regulatory compliance.