Artificial Intelligence and the Future of TPRM


One of the confidentiality concerns associated with AI is that third parties will use your data inputs to train their models. When companies use inputs for training, those inputs may later become or inform outputs, potentially exposing confidential information to other people. To allay those fears, companies that use AI, whether through third-party providers or models they have trained themselves, disclose when user inputs are used for training and, in some cases, offer options to exclude data from training.

Privacy notices are one place where companies using AI disclose information about their practices. To continue our research into the AI supply chain, we analyzed 176 privacy notices from the 250 most monitored vendors in the UpGuard platform to discover how many companies use AI and how many train their models on user data. Similar to our prior research, we found that about 40% of companies use AI in a capacity that may involve personal data. In this study, we were further able to identify that about 20% of those companies train their models on user input by default.

Methodology

Analyzing privacy notices at scale would be a time-consuming task for any one person. However, because these documents follow a relatively predictable structure and the disclosures use standard legal language, large language models are excellent at scanning them to identify relevant information. After collecting the privacy notice page from each of our target companies and cleaning the data to preserve only the text of the policy, we developed prompts for the OpenAI gpt-4o model to classify the text based on three conditions:

  • Is artificial intelligence or related concepts mentioned in the policy?
  • If so, is user data used to train the model?
  • If so, is there an option to opt out of data being used for training?

With this information, we can understand overall AI usage amongst third parties: whether it is significant enough to be part of the privacy notice and, when it is, what the privacy practices concerning user data are.
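
As a rough illustration of what that classification step can look like, the Python sketch below sends one cleaned privacy notice to gpt-4o and asks for a JSON answer to the three questions. It is a minimal sketch rather than the pipeline used for this study: the prompt wording, field names, and the classify_notice helper are illustrative assumptions.

import json
from openai import OpenAI  # assumes the openai Python package (v1.x) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative instructions; the prompts used in the actual study were more detailed.
SYSTEM_PROMPT = (
    "You classify privacy notices. Respond in JSON with three fields that are true, "
    "false, or null: mentions_ai, trains_on_user_data, opt_out_available. "
    "Use null whenever the notice does not address the question."
)

def classify_notice(notice_text: str) -> dict:
    """Ask gpt-4o to classify one cleaned privacy notice against the three conditions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # constrain the reply to valid JSON
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": notice_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Placeholder standing in for one scraped, cleaned privacy notice.
result = classify_notice("...full text of one vendor's privacy notice...")
print(result)  # e.g. {"mentions_ai": true, "trains_on_user_data": null, "opt_out_available": null}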

Over 40% of companies mention AI in privacy notices

We began with the set of the 250 most frequently monitored vendors in the UpGuard platform and scraped their sites for privacy notice pages. Of those 250 initial targets, we were able to collect 176 clean privacy notices to use in our analysis. Amongst the 176 privacy notices, 76 mentioned AI in some capacity. Because we were using AI for our analysis, we reviewed the results to test that the prompting worked as desired. Not surprisingly, current-generation LLMs are very good at understanding that “AI,” “artificial intelligence,” and “machine learning” are closely related concepts and at identifying them in text.

Perhaps more interesting than validating that LLMs are good at text analysis was our finding that amongst these companies, 43% were using AI in a capacity that warranted mention in the privacy notice. This percentage is similar to what we found when studying data subprocessor lists, another way that companies disclose their privacy practices for handling personal data. 

Distribution of privacy notices mentioning AI versus not mentioning AI

Almost 20% use user data for training AI models

Out of those 76 companies, 15 (19.7%) stated that user data is used to improve their models. Only a small number, 3 (3.9%), affirmatively stated that data is not used for training. Far more commonly, companies did not state whether they use data for training models, which presumably means user data is not used for training.

Distribution of companies disclosing that they train their models on user inputs

The following excerpt from one of the companies in our study illustrates a policy that mentions AI but does not specify whether data is or is not used for training. 

We’ve set out more information about the specific purposes for which we use your personal data below.

To deliver our services. For example, to sign you up to our services, manage your trial or subscription, facilitate purchases of services, and provide, maintain and deliver our services (including using AI/ML) in accordance with our terms of use. This includes monitoring, troubleshooting, data analysis, testing, system maintenance, reporting and hosting of data.

Our AI classifier has done a good job of identifying that this privacy notice states that the company uses personal data to deliver services that include AI/ML. Given that the purpose of a privacy notice is to inform users of the ways in which the third party is using their personal data, we can interpret the absence of a statement as evidence that data is not being used for training. It is worth noting, though, that some companies explicitly state in their privacy notices that they do not use inputs for training, and other companies that want to make a clearer commitment could do the same.

Opt-out options are unclear

In reviewing the AI classifications for whether companies allow users to opt out of having their data used for training, the complexity of the actual privacy notices became a barrier to getting a useful answer. For example, here is how one policy describes its stance on opting out:

A Creator has some controls over how we use responses and may have opted-out of applying machine learning to responses where it is linked to a specific product feature in some cases.

Our AI classifier has correctly identified that this statement relates to using user data to improve AI models, but beyond that, there is not a clear Boolean answer to whether users can opt out. The product user has “some controls” to opt out, but are there also controls over training they don’t have? If the opt-out is linked to “a specific product feature,” are there other product features using machine learning where there is no opt-out? If it is linked “in some cases,” are there other cases? All of this is from the perspective of the product user—can the end user who is providing the data in the responses choose to have their data excluded from training? 

For the question of whether this policy allows a user to opt out of having their data used for training, the answer seems to be a resounding “sometimes.” In reviewing the results of our AI analyzer, we saw many such cases where the problem was the inherent ambiguity of the data itself, where even a human reviewer might not be sure of the correct classification.

Another common situation that cannot be cleanly classified is a notice with no AI-specific opt-out but a general process for contacting the company to exercise privacy rights. Determining whether a user can opt out would seemingly require going through that process and may depend on other local or national laws. Cases like these, where the answer is ambiguous, contextual, and important, should be referred to qualified professionals instead of LLMs.

Vendor risk time savings through AI 

LLMs are proving to be an invaluable time-saving tool for vendor risk management. Because risk assessments are performed against a framework, evidence is structured to answer specific questions. Known controls can naturally be used as prompts to an LLM to answer questions from evidence, as in UpGuard’s AI-powered vendor risk assessment. That approach can be taken further to analyze evidence available in the wild, such as privacy notices, which point to emerging risks.
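
As a rough sketch of that general pattern (not UpGuard’s implementation), the Python snippet below treats each known control as a question and asks gpt-4o to answer it strictly from a piece of collected evidence. The control questions, system prompt, and the answer_from_evidence helper are illustrative assumptions.

from openai import OpenAI  # same client setup as the earlier sketch

client = OpenAI()

# Illustrative control questions; a real assessment framework supplies these.
CONTROL_QUESTIONS = [
    "Does the vendor use customer data to train AI or machine learning models?",
    "Can customers opt out of having their data used for model training?",
]

def answer_from_evidence(question: str, evidence: str) -> str:
    """Answer one control question strictly from the supplied evidence text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer only from the evidence provided. If the evidence does not "
                "address the question, reply 'Not addressed'."
            )},
            {"role": "user", "content": f"Question: {question}\n\nEvidence:\n{evidence}"},
        ],
    )
    return response.choices[0].message.content

# evidence_text stands in for a privacy notice, questionnaire answer, or other collected evidence.
evidence_text = "...text of a vendor privacy notice or security documentation..."
for q in CONTROL_QUESTIONS:
    print(q, "->", answer_from_evidence(q, evidence_text))

In practice, answers like these would feed a human review queue rather than being accepted automatically, consistent with the ambiguity we saw in the opt-out analysis above.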

In the case of assessing privacy notices for AI usage, fewer than 10% of the companies we analyzed (15 of 176) stated that they use user data for training models. AI can effectively identify the policies whose opt-out provisions need to be reviewed by a human. AI usage is already above 40% amongst these companies; as it continues to increase, identifying the cases that require human intervention will become even more important for maintaining effective privacy controls. AI-enabled vendor assessment tools will be a requirement for keeping pace with this evolving dimension of the third-party risk environment.
