Open Chroma Databases: A New Attack Surface for AI Apps

Written by

Greg Pollock

Director of Research and Insights

Greg is a CISA-certified cybersecurity researcher who holds multiple patents for data leak detection. His findings have been featured in The New York Times, Forbes, and Wired.

Reviewed by

Phil Ross

Chief Information Security Officer

Phil is a Forrester Zero Trust Strategist leveraging decades of experience in enterprise cybersecurity architectures.

Chroma is an open-source vector store, a database designed to let LLM chatbots search for relevant information when answering a user’s question, and one of many technologies whose adoption has grown with the recent AI boom. Like many databases, Chroma can be configured by end users to lack authentication and authorization mechanisms. When databases without authentication are exposed to the internet, anonymous users can read and even update the data in the database, potentially compromising the confidentiality, availability, and integrity of the data.

While Chroma databases exposed to the internet are much less common than older kinds of databases, their numbers are growing, and they could become a source of significant data exposures in the near future. In surveying 1170 Chroma databases exposed to the internet, we found 406, or about one third, exposing some form of data. The most notable of those was leaking some PII from Canva Creators, which we have written about separately.

What is a Chroma database?

Say you're setting up a chatbot for the website of your hotel or restaurant. You'd use an LLM to complete the prompt, but you would also need a repository of data unique to your business for things like operating hours, amenities, your address, and other information necessary for a website visitor.

Within Chroma, this type of information takes the form of "documents," which are typically simple strings containing pertinent information for the chatbot. One of the strings may be something like "Our operating hours are from 9 AM to 10 PM, 7 days a week." Then, when someone asks your chatbot what your hours are, ChromaDB would find that document since it most closely matches the query, and then run it back through the LLM to make the reply sound conversational. The end user may receive something like, "We're open every day from 9 AM to 10 PM, and look forward to your visit!"
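
To make the "closest matching document" step concrete, here is a minimal sketch in plain Python. It is not how Chroma works internally: Chroma compares embedding vectors produced by a model, while this toy stand-in compares simple word counts. The second and third documents, and the function names, are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents):
    """Return the document whose word counts are most similar to the query."""
    q = Counter(tokenize(query))
    return max(documents, key=lambda d: cosine(q, Counter(tokenize(d))))


documents = [
    "Our operating hours are from 9 AM to 10 PM, 7 days a week.",
    "We are located at 12 Example Street.",              # hypothetical document
    "Amenities include free Wi-Fi and a rooftop pool.",  # hypothetical document
]

best = retrieve("What are your operating hours?", documents)
print(best)
```

In a real deployment, the retrieved string would then be passed to the LLM along with the user's question so the reply can be phrased conversationally.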

Distribution of Chroma databases without authentication

When we surveyed the internet in April 2025, there were 1170 internet-accessible Chroma databases. To determine whether any data was exposed, we called the .list_collections method against each IP address and then used .get_collection to retrieve the data from each collection. Of these, 406 databases returned some form of data, and the remaining two-thirds or so returned an authentication error or contained no data.
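
The check described above can be sketched roughly as follows. This is an illustration rather than our actual survey tooling: the function names `summarize_instance` and `connect` are our own, the default port of 8000 is an assumption, and the sketch only counts documents instead of retrieving them. It should only be pointed at servers you own or are authorized to test.

```python
def summarize_instance(client):
    """Return {collection_name: document_count} for one Chroma server.

    `client` is any object exposing list_collections() and
    get_collection(name), such as a chromadb client.
    """
    summary = {}
    for entry in client.list_collections():
        # Depending on the chromadb version, entries may be plain names
        # or collection objects with a .name attribute.
        name = entry if isinstance(entry, str) else entry.name
        summary[name] = client.get_collection(name).count()
    return summary


def connect(host, port=8000):
    """Build a client for a remote server (requires the chromadb package)."""
    import chromadb  # third-party; imported lazily so the sketch loads without it

    return chromadb.HttpClient(host=host, port=port)
```

Calling `summarize_instance(connect("chroma.internal.example"))` (a hypothetical hostname) against a server that requires authentication would be expected to raise an error instead of returning a summary, which is the distinction the survey relied on.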

*About one-third of internet-exposed Chroma databases allow anonymous access.*

Each database can have multiple “collections,” which are logically separate, well, collections of documents. The number of collections per database provides a heuristic for whether they are being actively used and to what extent. The most heavily used database had 4,315 collections, which certainly constitutes heavy usage of some kind. Many of the databases configured to allow anonymous access had no meaningful data, but 60% of databases had more than one collection, indicating some modification beyond the default collection, and 32% had five or more collections.

*Distribution of collections per database, showing a half-normal distribution with more than 4,000 collections in one database.*

Geographic distribution of Chroma databases

The geolocation of the IP addresses hosting internet-exposed Chroma databases gives us some sense of which regions are most at risk from the consequences of misconfigurations. Whereas some AI technologies show outsized usage in China, Chroma is mostly used in the US and Europe, with a notable presence in India as well. To better represent the long tail of European nations beyond the top 20 most common countries, we have included an aggregate count for the EU as a whole.

*Distribution of Chroma databases by geolocation of IP address showing most in US and EU countries.*

Risks associated with unauthenticated Chroma databases

Data leakage

Of the 406 open Chroma instances we surveyed, we found most were being used for rudimentary exploration and did not contain much in the way of unique data. However, as with any networked technology, we also found Chroma servers that appeared to contain real data powering chatbot LLMs somewhere on the Internet.

One common use for ChromaDB appears to be serving data relating to apartment and hotel rentals in and around India. A number of servers contained information about properties and their amenities, which are things a website visitor would likely ask about. This use case makes sense for Chroma and does not leak sensitive data, but the databases should have some security preventing attackers from accessing the data directly.

Another server appeared to belong to an e-commerce SEO service. The database owner had populated it with customer support chatlogs, seemingly as a way to increase the knowledge of the LLM chatbot. By adding someone's prior conversation about common questions, the bot would now have that prior experience to draw on when responding to future questions. This, of course, raises the concern that any sensitive customer data added to Chroma could be seen by future users of the chatbot. Indeed, we have seen this exact case (attempting to improve a support bot by feeding it real user tickets) result in a leak of PII for LlamaIndex, another AI technology.

Writability

According to Chroma’s documentation on security, authentication is disabled by default: "By default, Chroma does not require authentication. You must enable it manually. If you are deploying Chroma in a public-facing environment, it is highly recommended to enable authentication."

Simply accessing the available data is just one concern. Another would be that a malicious user could alter or poison the data available to a chatbot. It's easy to imagine a number of situations in which a production chatbot with an unauthed and open ChromaDB instance could deliver incorrect or even dangerous information to a chatbot user.

To illustrate how an attacker with unrestricted access to a Chroma database might abuse it, we’ve created a demonstration where we upload misleading documents, remove correct documents, and replace documents with those directing users to attacker-controlled resources.
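
The three manipulations can be sketched against an in-memory stand-in. `ToyCollection` below is a hypothetical class written for this illustration, not part of Chroma; it mimics only a small subset of the add, upsert, delete, and get operations that a Chroma collection offers, and the documents and the attacker.example URL are invented.

```python
class ToyCollection:
    """In-memory stand-in for a document collection (not the real Chroma class)."""

    def __init__(self):
        self._docs = {}

    def add(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def upsert(self, ids, documents):
        # Insert new IDs and overwrite existing ones.
        self._docs.update(zip(ids, documents))

    def delete(self, ids):
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    def get(self):
        return {"ids": list(self._docs), "documents": list(self._docs.values())}


# The legitimate owner seeds the store.
store = ToyCollection()
store.add(
    ids=["hours", "directions"],
    documents=[
        "Our operating hours are from 9 AM to 10 PM, 7 days a week.",
        "Directions are available on our website.",  # hypothetical document
    ],
)

# 1. Upload a misleading document.
store.add(ids=["hours-fake"], documents=["We are permanently closed."])

# 2. Remove a correct document.
store.delete(ids=["hours"])

# 3. Replace a document with one pointing to an attacker-controlled resource.
store.upsert(
    ids=["directions"],
    documents=["For directions, visit http://attacker.example/map"],
)

print(store.get())
```

After these three calls, a chatbot retrieving from the store would have no correct answer about opening hours left to find, and would pass the attacker's link to anyone asking for directions.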

Conclusion

As we learned while using Chroma’s demo notebook, it really is a cool technology for retrieving documents to use in AI-powered apps. With over a thousand internet-accessible instances, it also seems to have healthy adoption and growth. But users must be aware of how to configure their databases securely, particularly given that it lacks authentication by default. (As an aside, Elasticsearch once made the same decision, omitting authentication on the principle that responsibility belongs to the web application layer rather than the database, and later changed it due to the frequency of Elasticsearch data leaks.) Beyond ensuring that some mechanism prevents anonymous users from accessing the database, users should also consider sanitizing data of PII or other confidential information to minimize impact in the event of a leak or breach.
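
As one illustration of the sanitization step, the sketch below redacts email addresses and phone-like number sequences before text is stored. The patterns and the `redact` function are our own assumptions, and they are deliberately crude: regular expressions will miss names, addresses, and many other kinds of PII, so treat this as a first pass rather than a complete solution.

```python
import re

# Rough patterns for illustration only; real PII detection needs more than regexes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


ticket = "Contact jane@example.com or +1 (555) 010-9999 about order 42."
print(redact(ticket))
```

Running support tickets or chat logs through a step like this before adding them to a collection limits what an anonymous reader could learn if the database is ever exposed.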