Dark Cloud: Inside The Pentagon's Leaked Internet Surveillance Archive

While this blog post provides a description of a data exposure discovery involving the Department of Defense, this is no longer an active data breach. As soon as the UpGuard Cyber Risk Team notified the Defense Department of this publicly exposed information, immediate action was taken, securing the open buckets and preventing further access.

The UpGuard Cyber Risk Team can now disclose that three publicly downloadable cloud-based storage servers exposed a massive amount of data collected in apparent Department of Defense intelligence-gathering operations. The repositories appear to contain billions of public internet posts and news commentary scraped from the writings of many individuals from a broad array of countries, including the United States, by CENTCOM and PACOM, two Pentagon unified combatant commands charged with US military operations across the Middle East, Asia, and the South Pacific.

The data exposed in one of the three buckets is estimated to contain at least 1.8 billion posts of scraped internet content over the past 8 years, including content captured from news sites, comment sections, web forums, and social media sites like Facebook, featuring multiple languages and originating from countries around the world. Among those are many apparently benign public internet and social media posts by Americans, collected in an apparent Pentagon intelligence-gathering operation, raising serious questions of privacy and civil liberties.

While a cursory examination of the data reveals loose correlations of some of the scraped data to regional US security concerns, such as with posts concerning Iraqi and Pakistani politics, the apparently benign nature of the vast number of captured global posts, as well as the origination of many of them from within the US, raises serious concerns about the extent and legality of known Pentagon surveillance against US citizens. In addition, it remains unclear why the data was accumulated, and it is overwhelmingly likely that the majority of posts captured originate from law-abiding civilians across the world.

With evidence that the software employed to create these data stores was built and operated by an apparently defunct private-sector government contractor named VendorX, this cloud leak is a striking illustration of just how damaging third-party vendor risk can be, capable of affecting even the highest echelons of the Pentagon. The poor CSTAR cyber risk scores of CENTCOM and PACOM - 542 and 409, respectively, out of a maximum of 950 - are a further indication that even the most sensitive intelligence organizations are not immune to sizable cyber risk. Finally, the collection of billions of internet posts in several unsecured data repositories raises further questions about online privacy, as well as about the right to freely express your beliefs online.

The Discovery

On September 6th, 2017, UpGuard Director of Cyber Risk Research Chris Vickery discovered three Amazon Web Services S3 cloud storage buckets configured to allow any AWS global authenticated user to browse and download the contents; AWS accounts of this type can be acquired with a free sign-up. The buckets’ AWS subdomain names - “centcom-backup,” “centcom-archive,” and “pacom-archive” - provide an immediate indication of the data repositories’ significance. CENTCOM refers to the US Central Command, based in Tampa, Fla. and responsible for US military operations from East Africa to Central Asia, including the Iraq and Afghan Wars. PACOM is the US Pacific Command, headquartered in Aiea, HI and covering East, South, and Southeast Asia, as well as Australia and Pacific Oceania.
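A grant of this kind is visible in a bucket's access control list. As a minimal sketch, assuming the ACL has already been fetched into a dictionary shaped like the response of an S3 get-bucket-ACL call (for example via boto3's `get_bucket_acl`), the following Python flags grants made to the predefined AuthenticatedUsers and AllUsers groups. The function name, the audience labels, and the sample ACL are illustrative assumptions, not data from the exposed buckets.

```python
# URIs of the predefined S3 grantee groups that reach beyond a single account.
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
BROAD_GROUPS = {AUTHENTICATED_USERS: "any AWS account", ALL_USERS: "anyone"}


def broad_grants(acl):
    """Return (audience, permission) pairs for grants made to broad groups."""
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        audience = BROAD_GROUPS.get(grantee.get("URI"))
        if grantee.get("Type") == "Group" and audience:
            findings.append((audience, grant.get("Permission")))
    return findings


# Hypothetical ACL for illustration only: the owner keeps full control,
# and READ (list the bucket's contents) is granted to every AWS account.
example_acl = {
    "Owner": {"ID": "owner-id-placeholder"},
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
         "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "Group", "URI": AUTHENTICATED_USERS},
         "Permission": "READ"},
    ],
}

for audience, permission in broad_grants(example_acl):
    print(f"{permission} granted to {audience}")
```

Any non-empty result from a check like this means the bucket is reachable by people outside the owning account, which is the condition described above.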

There are further clues as to the provenance of these data stores. A “Settings” table in the bucket “centcom-backup” indicates the software was operated by employees of a company called VendorX, complete with a listing of the details of a number of developers with access. While public information about this firm is scant, an internet search reveals multiple individuals who worked for VendorX describing work building Outpost for CENTCOM and the Defense Department:

Descriptions of VendorX’s work on Outpost, via employee LinkedIn pages.

This external reference to “Outpost” as a Pentagon social engineering effort built by VendorX appears to be corroborated by the contents of “centcom-backup,” which, besides the references to VendorX in the “Settings” table, contains a folder titled “outpost.” Within this folder are the development configurations and API for Outpost, and while this content’s exact relationship to the “Outpost” program described on former employees' profiles remains unclear, some indication of its purpose may be provided by a number of very large compressed files also within the bucket. Decompressed, these files are revealed to contain indexes built with Lucene, a search engine library used to look for search terms throughout massive amounts of data, including keywords, partial words, and combinations of words, in a number of different languages. These Lucene indexes, which are optimized to interact with Elasticsearch, seem to parse internet content similar to that contained in the other buckets.
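To make the idea of a search index concrete: engines such as Lucene are built around an inverted index, a mapping from each term to the documents that contain it. The toy Python sketch below illustrates keyword, partial-word, and combined-word lookups over a few invented posts. It is a simplified illustration of the general technique, not Lucene's implementation and not the contractor's code; the function names and sample posts are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def search_prefix(index, prefix):
    """Return ids of documents containing any term that starts with prefix."""
    prefix = prefix.lower()
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


# Invented sample posts, for illustration only.
posts = {
    1: "Match report from the weekend football league",
    2: "New patch notes for the football video game",
    3: "Video game forum rules",
}
index = build_index(posts)

print(sorted(search_all(index, "football game")))  # combination of words
print(sorted(search_prefix(index, "foot")))        # partial word
```

Because lookups go through the term-to-document mapping rather than scanning every post, this structure stays fast as the collection grows, which is why it suits repositories of the size described here.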

Taken together, this disparate collection of data appears to constitute an ingestion engine for the bulk collection of internet posts - organizing a mass quantity of data into a searchable form. The former employee's reference to “high-risk youth in unstable regions of the world” is further corroborated by an examination of another folder within “centcom-backup.”

This folder, titled “scraped,” contains an enormous number of XML files consisting of internet content “scraped” from the public internet from 2009 to 2015; the other CENTCOM bucket, “centcom-archive,” would be found to contain more such data, collected from 2009 to the present day. With a number of information fields describing the origins, nature, contents, and web address of the post, thousands of examples of such scraped content are listed in plaintext - a smaller example of the massive stores of such data contained in the other two buckets.

Screencaps of sample posts, with information fields visible.

Also contained in “scraped,” however, is a folder titled “Coral,” which likely refers to the US Army’s “Coral Reef” intelligence software. This folder contains a directory named “INGEST” that contains all the posts scraped and held in the “centcom-backup” bucket. The Coral Reef program “allows users of intelligence to better understand relationships between persons of interest” as a component of the Distributed Common Ground System-Army (DCGS-A) intelligence suite, “the Army's primary system for the posting of data, processing of information, and dissemination to all components and echelons of intelligence, surveillance and reconnaissance information about the threats, weather, and terrain.” Such a focus on gathering intelligence about “persons of interest” would be even more clear-cut in the other two buckets, starting with “centcom-archive.”

The bucket “centcom-archive” contains more scraped internet posts stored in the same XML text file format as seen in “centcom-backup,” only on a much larger scale: conservatively, at least 1.8 billion such posts are stored here. This vast repository ingested content from a broad array of webpages; while Facebook is a popular, recurring host, everything from soccer discussion groups to video game forums serves as a source for scraped web posts. The posts themselves are in many different languages, but with an emphasis on Arabic, Farsi (spoken in Iran and Afghanistan), and a number of Central and South Asian dialects spoken in Afghanistan and Pakistan. The most recent indexed files were created in August 2017, right before UpGuard’s discovery, consisting of posts collected in February 2017. Not present are any Lucene index files of the sort seen in “centcom-backup” - the contents of this bucket are purely the input (or, perhaps, also the output) of an internet-scouring machine. There are few indications as to the level of importance afforded to these posts.

Given the CENTCOM buckets’ focus on the collection and organization of millions of internet posts, largely from the Middle East and South Asia - a focus that would certainly also be of interest to a program like Coral Reef - it is perhaps unsurprising to see hints at why some of these posts would be of significance. Arabic posts criticizing or mocking ISIS, posted to Facebook pages for Iraqi anti-jihadi groups, or Pashto-language comments made on the official Facebook page of Pakistani politician Imran Khan, who has drawn scrutiny from both the Taliban and the US government, give some indication of content that might be of interest to CENTCOM in its prosecution of regional wars and its fight against Islamic extremists.

The bucket “pacom-archive” is very similar in contents and structure to “centcom-archive,” but skews toward Southeast and East Asian posts, as well as some by Australians. Taken together, the buckets “centcom-archive” and “pacom-archive” appear to store raw ingested (or even possibly raw egested) internet content on a massive scale, perhaps to be run through text extraction programming. This data’s relationship to the searchable Lucene indexes discovered in “centcom-backup” remains unclear. Taken together, however, the data suggests that there is well-crafted interplay between the “Coral” social media and commentary scraping project, an ingestion engine dubbed “Thor,” and a public-influence initiative referred to as “Outpost.”

The Significance

The collection methods used to build these data stores remain somewhat murky, even as the general purpose of the mass collection seems clear, mirroring known US defense efforts to monitor the internet for violent radicalism. Why, for instance, was each of these posts collected? What triggered their inclusion in these repositories?

Given the massive scale of these data stores, it is difficult to state exactly how or why these particular posts were collected over the course of almost a decade. A cursory search reveals a number of foreign-sourced posts that appear entirely benign, with no apparent ties to areas of concern for US intelligence agencies, as well as posts that originate from American citizens, including a vast quantity of Facebook and Twitter posts, some stating political opinions. Among the details collected are the web addresses of targeted posts, as well as other background details on the authors that provide further confirmation that the posts originate from American citizens.

It remains unclear on what basis this data was collected.

What is more clear is the significance of these data repositories’ contents. The collection of public internet posts in massive repositories by the Defense Department for unclear reasons is one matter; the lack of care taken to secure them is another. The CENTCOM and PACOM CSTAR cyber risk scores of 542 and 409 provide some indication of gaps in the armor of two major military organizations’ digital defenses. The possible misuse or exploitation of this data, perhaps against internet users in foreign countries wracked by civil violence, is a troubling possibility, as is the presence of US citizens’ internet content in buckets associated with US military intelligence operations. The Posse Comitatus Act restricts the military from “being used as a tool for law enforcement, except in situations of explicit national emergency based on express authorization from Congress,” but as seen in recent years, this separation has been eroded.

Despite all of this, the same issues of cyber risk driving insecurity across the landscape are present here, too. A simple permission settings change would have meant the difference between these data repositories being revealed to the wider internet and remaining secured. If critical information of a highly sensitive nature cannot be secured by the government - or by third-party vendors entrusted with the information - the consequences will affect not only whichever government organizations and contractors are responsible, but anybody whose information or internet posts were targeted through this program, potentially resulting in unfair bias or unwarranted actions against the post creator.

How UpGuard can help detect and prevent data breaches and data leaks

UpGuard helps security teams proactively detect and shut down data breach risks that impact their internal security posture and the security postures of all third-party relationships.

UpGuard can also continuously monitor the open, deep, and dark web, discovering stolen credentials and leaked data before they're weaponized. Its AI Threat Analyst acts as a virtual Tier 1 analyst, filtering out noise and elevating only high-confidence threats from sources like malware logs, ransomware leak sites, and encrypted messaging platforms.

The resulting significant reduction in false positives equips security teams to execute fast and targeted responses on risks that actually matter.
