Similar to Ollama and llama.cpp, LlamaIndex provides an application layer for connecting your data to LLMs and interacting with it through a chat interface. While LlamaIndex is an open source project like other LLM application frameworks, LlamaIndex is also a company, with a recent Series A, a commercial offering, and a more polished aesthetic than its strictly DIY counterparts. Perhaps because LlamaIndex offers a managed alternative, there are fewer self-hosted LlamaIndex instances than comparable open source projects, though some may be hosted in ways that are harder to detect with internet search engines.
LlamaIndex’s product positioning as a solution for businesses also explains why the instances that are active often contain data from identifiable corporate entities. One case in particular illustrated the risk of data leaks from shadow IT: an unauthenticated LlamaIndex instance leaked a CSV with 9,900 lines of support ticket data from a French HR software company, including names, email addresses, employer names, and the details of each person’s complaint.
About LlamaIndex
The LlamaIndex company distinguishes between two components: LlamaCloud, their commercial offering that hosts the product and connects it to customers’ data sources, and LlamaIndex, the open source software that users can host if they want to handle the data ingestion on their own.

The data integration capabilities of LlamaCloud point to why users would choose LlamaIndex over something like Ollama. While LlamaIndex can be used strictly as a frontend for prompting a model, the real benefit is connecting it to an organization’s data sources for retrieval-augmented generation (RAG), a generative AI strategy that can be useful both for public, end user-facing applications and for internal applications operating on confidential data.
As an open source project, LlamaIndex is very popular, with 41k stars and 5.9k forks on GitHub. Compared to the leading strictly open source projects, however, it has less adoption among that community than Ollama (141k stars and 11.8k forks) and the original llama.cpp (80.3k stars and 11.8k forks).
Finding accessible LlamaIndex apps
To find LlamaIndex instances we used identifying features like page title and favicon to query internet search engines. These search methods are not perfect (users can change the page title and favicon) but are effective for gaining broad visibility.
For threat intelligence teams looking to research their own potential exposure via LlamaIndex instances, the Shodan dork for LlamaIndex’s favicon is “http.favicon.hash:-1404538293”, which currently returns 34 results. Instances can also be identified by the default page title “Create Llama App,” which returns 24 results.
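Teams that want to run these checks on a schedule can script them against the Shodan API. The sketch below uses the official `shodan` Python package (assumed installed separately) with the two dorks above; the `http.title` filter syntax is Shodan's standard way of expressing the title search.

```python
# Sketch: count hosts matching the LlamaIndex fingerprints via the Shodan API.
# Assumes the third-party `shodan` package and an API key in SHODAN_API_KEY.
import os

FAVICON_DORK = "http.favicon.hash:-1404538293"
TITLE_DORK = 'http.title:"Create Llama App"'


def count_exposed(api_key: str) -> dict:
    """Return the Shodan result count for each LlamaIndex fingerprint."""
    import shodan  # imported lazily so the dork strings are usable without it

    api = shodan.Shodan(api_key)
    counts = {}
    for dork in (FAVICON_DORK, TITLE_DORK):
        # search() returns a dict with "total" and "matches" keys
        counts[dork] = api.search(dork)["total"]
    return counts


if __name__ == "__main__":
    key = os.environ.get("SHODAN_API_KEY")
    if key:
        print(count_exposed(key))
```

Tracking these counts over time, rather than checking once, is what turns a one-off dork into exposure monitoring.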
Censys offers a few options that ultimately amount to quite similar results.
- 41 results for web.endpoints.http.html_title = "Create Llama App"
- 48 results for web.endpoints.http.html_tags = "<meta name=\"description\" content=\"Generated by create-llama\"/>"
- 96 results for web.endpoints.http.favicons.hash_md5 = "18bd095298992dd12d863c10cbe0857c"
The Censys favicon search appears to find far more hosts, but this is because Censys returns separate search results for an instance’s IP address and hostname. That said, there are differences in which hosts Shodan and Censys identify, making it valuable to use both when possible.
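The same fingerprints can also be checked locally, without a search engine, against hosts you already know about. The helpers below are a minimal sketch assuming the Censys `favicons.hash_md5` field is an MD5 over the raw favicon bytes (the Shodan favicon hash, by contrast, is a MurmurHash3 of the base64-encoded icon); the title and meta strings come from the queries above.

```python
# Local fingerprinting helpers for create-llama defaults.
# Assumption: Censys favicons.hash_md5 is an MD5 of the raw favicon bytes.
import hashlib

DEFAULT_TITLE = "Create Llama App"
DEFAULT_META = 'content="Generated by create-llama"'
DEFAULT_FAVICON_MD5 = "18bd095298992dd12d863c10cbe0857c"  # from the Censys query


def favicon_md5(favicon_bytes: bytes) -> str:
    """MD5 hex digest of a favicon, for comparison with DEFAULT_FAVICON_MD5."""
    return hashlib.md5(favicon_bytes).hexdigest()


def looks_like_default_llamaindex(html: str) -> bool:
    """True if a page still carries create-llama's default title or meta tag."""
    return DEFAULT_TITLE in html or DEFAULT_META in html
```

A check like this is useful for auditing an internal asset inventory, where scanning by search engine would miss hosts that are not internet-facing.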
Another discovery mechanism is Google dorking for the same features. This method returns few results but illustrates how instances can exist on subdomains of cloud service providers that are not necessarily discoverable when scanning by IP address. LlamaIndex’s own www.secinsights.ai is an example of a LlamaIndex-powered site where default features like the favicon and title have been changed.
In comparison to alternative technologies, around 50 instances is not many: there are 460 llama.cpp servers and 18,000 Ollama servers. Based on the project's GitHub activity, it seems reasonable to think there are more LlamaIndex instances out there, possibly deployed to hosting services in the blind spot of IP-based internet search engines. For organizations looking to fully understand the exposure in their supply chain, subdomain enumeration and other DNS discovery tools could potentially find more apps using LlamaIndex. The examples we did find, though, were sufficient to illustrate the risks associated with misconfigured LlamaIndex instances.
Risks of misconfigured LlamaIndex apps
Leaking Personally Identifiable Information
In the most serious case of data leakage, a LlamaIndex bot had been provided with a CSV file of support tickets that included the names and in some cases email addresses of the people submitting the tickets. Prompting the bot with “hello” was sufficient for it to return the full downloadable document.

The file included 10,102 lines of data, with about 100 lines of obvious test data at the end. The rest included messages almost exclusively in French describing support cases for HR time tracking software. For example, one message in English reads in part: “Dear support team,|||On the 16 of May I had a day off under the category [redacted]. However, the system instead of taking into account the leave for the whole day, it took it only for 3 hours and the rest 5 hours it deducted them from my excess hours. Please find a screenshot of that week in [redacted] attached.”

Partial examples of support ticket data with identifiable person’s name redacted.
Amongst the messages there were around 175 unique email addresses, though some seem to be variations where one business uses both a .com and a .fr mail domain. Those addresses were distributed across the mail domains of 90 different companies whose employees were affected by this leak. All messages we reviewed included the names of individuals, and in random sampling we were able to identify the natural persons matching the names and employers. Providing attackers with knowledge of chats between customers and a support desk creates risk for both parties, as the attackers would have the context of a private conversation to impersonate either party in future communications.
UpGuard contacted the company responsible via the email address provided on their website on May 16. After no response or change in the exposure’s status, we contacted their CEO and other senior leaders via LinkedIn on May 19-20. When that also proved ineffective, we contacted CNIL, the French data protection agency, on May 20 and again on May 27. On May 29 we used an email address found in the data leak to contact the company again, and by June 2 the data was no longer accessible.
Other data leakage risks
One third of the LlamaIndex instances were operational and allowed unauthenticated access. Simple prompts resulted in them divulging whatever information they had been provided. For example, asking “what are you?” would cause the bot to respond with contextual information to guide further queries.

In some cases, the bots would provide their reference documents directly; in others, they would state they could not provide the documents but would answer any questions using the information in them.
In fact, the bots were quite adept at using all available information to answer questions. For example, when we asked what year a document was from, the bot responded that the document contained no specific date, but that the filename contained a number matching a DDMMYYYY date format, likely indicating the file was from April 2025. That chain of reasoning is exactly what we would have followed if we had access to the file itself.
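For teams verifying whether their own instances behave this way, the probe described above can be scripted. This is a hypothetical sketch using only the standard library: the `/api/chat` path and the message payload shape are assumptions based on create-llama's default scaffolding, and a real deployment may differ.

```python
# Hypothetical probe for an organization's own LlamaIndex chat endpoint.
# The /api/chat route and payload shape are assumed create-llama defaults.
import json
import urllib.request


def build_probe(prompt: str = "what are you?") -> bytes:
    """JSON body for a single-message chat request."""
    return json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode()


def send_probe(base_url: str, prompt: str = "what are you?") -> str:
    """POST the prompt and return the raw response body."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",  # assumed default route
        data=build_probe(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode()
```

If an instance answers a probe like this without credentials, any data it was loaded with should be treated as exposed.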


Expanded attack surface risks
A less severe risk, but one that can contribute to a loss of data confidentiality, is exposing LlamaIndex apps intended for internal use to the internet at all, thereby allowing attackers to attempt to authenticate or exploit vulnerabilities. Ideally, internal chatbots would only be available from a company’s network.
Running chatbots on their own public IP increases a company’s attack surface and the overhead to maintain it. If user account or API credentials were compromised, then attackers would be able to access the data. With identity-based attacks among the biggest contributors to data breaches in 2024, that’s not far-fetched.
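One mitigation is to keep the chatbot off the public internet entirely by fronting it with a reverse proxy that only answers internal ranges. The fragment below is a minimal sketch, assuming the LlamaIndex app listens locally on port 3000 and that 10.0.0.0/8 is the internal network; both are placeholders to adapt.

```nginx
# Sketch: serve the chatbot only to internal address ranges.
# Assumes the LlamaIndex app listens locally on port 3000 (placeholder).
server {
    listen 443 ssl;
    server_name chat.internal.example.com;

    allow 10.0.0.0/8;   # internal network only
    deny  all;

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}
```

Network-level restriction like this complements, rather than replaces, authentication on the app itself.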

Conclusion
LlamaIndex is fundamentally similar to other LLM chat interfaces, with some differences in its market position that lead to differences in its internet footprint. When exposed to the internet, those instances are available for attackers to attempt to access; when they are configured without authentication, they are directly accessible to prompting. Depending on the particulars of its configuration, data loaded into an exposed LlamaIndex instance may be directly accessible or extractable through prompting. The concerns of AI data leakage are observable in the wild in these instances and worth the attention of security teams.
The observable leaks in self-hosted LlamaIndex instances should not be seen as a vulnerability in the software itself; in fact, they might be one of the best selling points for the LlamaCloud service. Self-hosted software is a persistent problem in vulnerability management, as each organization running self-hosted software is responsible for applying their own updates. When new vulnerabilities are discovered, the companies running managed instances patch them quickly, while those choosing to host it themselves often lag behind. We have seen this pattern for years with Atlassian products, where accounts running in Atlassian Cloud are secure while vulnerabilities are actively exploited in self-hosted instances. Similarly, the recent Next.js vulnerability was quickly patched by Vercel for all instances under their management, while exploitation continued for those outside their reach. One of the things you pay for with a managed service is security, which is typically money well spent.
Securing LlamaIndex apps under your organization's control is an important part of avoiding data loss, but the challenge extends beyond your own attack surface. As observed in the leak of HR data, the risk of data leakage is also a supply chain risk to be addressed by monitoring vendors' footprint and auditing their security controls. And beyond your known supply chain are the shadow IT apps where the allure of AI may tempt employees into sharing confidential data. Addressing the challenges presented by AI leaks requires both continued research into these new sources of risk, and a coordinated approach to detection and response.