The UpGuard Research team can now disclose that a collection of data sets detailing the purchasing habits and consumer behavior profiles of virtually every American household has been secured. The publicly exposed data comes from market analysis company Tetrad but includes data blended from many sources, including Experian Mosaic, Claritas/Nielsen’s PRIZM, and what appear to be Tetrad clients and prospects. Within three very large files (titled Mosaic01-03.txt) are details corresponding to the full name, gender, address, and “type” for over 120 million individuals.
While the source of every data point is not clear, the end result is a collection of data that provides detailed information about Americans based on where they live, what they buy, how much they spend, how long their commute is, and their opinions on a range of topics. Some of the data sets, grouped by census tracts or zip codes, stop just short of being personally identifiable, while still describing virtually every aspect of the economic behavior of cohorts that can be as small as dozens of people.
On February 3, an UpGuard analyst downloaded the contents of an Amazon S3 bucket, identified potentially sensitive information, and determined that the files likely came from Tetrad. The analyst sent a notification email to Tetrad the same day. On February 5, the analyst followed up with a phone call, during which he spoke to a person and provided contact information. A second phone call was made to Tetrad on February 7, which resulted in an employee at Tetrad with knowledge of their S3 storage calling back that day to confirm the information and his intent to secure the data. When the bucket was still not secured, the UpGuard analyst called Tetrad again on February 10. On that call the parties were able to identify the configuration which had caused the data to be public and Tetrad was able to remove public access.
The contents of the bucket analyzed by UpGuard totaled 747 gigabytes on initial download, with 678 GB of those files stored in .zip and .tar formats that expand when decompressed. About half of the 747 GB were in a directory named "clientfiles." This directory contained what appeared to be data exchanged between Tetrad and its clients. Data those clients collected about their end consumers (customers, patients, workers) went to Tetrad, where it could then be joined with Tetrad’s data to understand more about the characteristics of those consumers or the likely customer base in proximity to future planned construction.
The data, which appears to have gone from clients to Tetrad, varies by the type of business and its methods for data collection.
For Chipotle, that data included a spreadsheet listing over 4,000 actual and planned locations relevant to IBM Tririga deployments. According to IBM, Tririga allows users to "combine data, IoT and AI...to make the most of your real estate portfolio and create more engaging workplace experiences." The data exposed here indicated physical locations for devices used in the process of identifying the presence and movements of particular individuals based on cell phone location data providers, resold and shared data from phone apps, and other collection methods. That data is then fed through Tririga for profiling and tracking individuals in and around those locations.
For Kate Spade, exposed data included a spreadsheet of over 700,000 accounts making online purchases. The spreadsheet used the customer account number as its unique identifier and thus avoided names or email addresses, but it included each customer's shipping address, number of purchases, and total dollar value of those purchases.
3.8 million loyalty card accounts for beverage retailer Bevmo were also present, documenting the physical address tied to the account, number of transactions, and total dollar amount spent during 2018.
Another spreadsheet had over 16 million rows reflecting purchases from “TSC” in the data set. This spreadsheet documented how much each customer household had spent at each TSC store, as well as the address, Mosaic code, and latitude and longitude tied to that account.
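The enrichment described above, in which a client's account records are matched to a household segment code, amounts to a keyed join on address. The following is a minimal sketch under stated assumptions, not Tetrad's actual method: the field names, the `normalize_address` and `enrich_accounts` helpers, the segment code "A01", and every record are invented for illustration.

```python
def normalize_address(address: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(address.lower().split())


def enrich_accounts(accounts, segments_by_address):
    """Attach a segment code to each account; None when no address matches."""
    enriched = []
    for account in accounts:
        key = normalize_address(account["address"])
        enriched.append({**account, "segment": segments_by_address.get(key)})
    return enriched


# Hypothetical household segment lookup, keyed by normalized address.
segments = {normalize_address("1 Example St, Springfield"): "A01"}

# Hypothetical client records: no names, but an address and spending total.
accounts = [
    {"account_id": "1001", "address": "1 Example  St, SPRINGFIELD", "total_spent": 250.0},
    {"account_id": "1002", "address": "2 Sample Ave, Springfield", "total_spent": 40.0},
]

enriched = enrich_accounts(accounts, segments)
for row in enriched:
    print(row["account_id"], row["segment"], row["total_spent"])
```

The sketch also shows why leaving out names offers limited protection: once an address is present, a record can be joined to any other data set keyed on the same address.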
120 Million Households
Alongside the data collected by retail companies and enriched through Tetrad were other data collections, most notably files labeled as being from the Experian Mosaic product. While Experian is best known to consumers for its credit rating service, Mosaic is a separate product that describes consumer behaviors but does not include credit ratings or social security numbers.
Three text files with Mosaic data, each over ten gigabytes, contained a total of 130 million rows of data on US households. These files identified the address of the household and the name or names of the heads of the household, their gender, and the code identifying which Mosaic group they belonged to.
Marketers and vendors collate this data to continuously refresh and refine a taxonomy of consumers similar to that in the Experian Mosaic model. Based on thousands of data points, Mosaic uses the buying patterns of households to detect clustered features and bucket the underlying complexity of millions of individuals into nameable social groups. From the wealthy "American Royalty" to the struggling "Fragile Families," the buying behaviors and demographic categories of Americans are used to categorize the sub-groups of the American class system (documentation of which is publicly available on the internet). Services like Tetrad combine data sets to more accurately plot the geographic location and densities of people in the Mosaic categories down to the household level. The value of this complex mapping process lies in another turn of the wheel: ensuring that when businesses allocate resources for future development, they locate stores and facilities near the kind of people that are good for their business model.
Kate Spade sells luxury handbags; Bevmo sells alcohol. The publicly exposed data here reveals which households spent a few dollars on their respective offerings and which spent tens of thousands. While businesses use data on those populations to maximize profit, exposing it publicly raises the possibility of it being used maliciously to target individuals.
These examples are just a few of the files contributing to the Tetrad model. Other files about a client's particular interests contain high level statistics about consumer activity relative to particular brands. Alongside percentages for racial categories and income levels are statistics for what percentage of the selected population purchased from each brand.
Detailed Spending Patterns
Other data provides more thinly sliced information on spending patterns. According to Claritas' website, their data contains "2,300 digital audiences and 8,000 demographic variables" for over 120 million households. The 2018 Claritas database included in the exposed data collection covers 10,361,869 census blocks; as of this writing in 2020, there are a little over 11 million census blocks. Claritas’ public marketing material on what data they offer matches the data found in spreadsheets here detailing how much certain zip codes spent on thousands of different product categories.
This kind of intensive data mining usually occurs far enough in the background of the business landscape that the millions of people tracked by it are unaware, and its effects are too subtly woven into changes in the built landscape to seem like anything other than the activity of the invisible hand. This data trade’s existence is not a secret: there are many large businesses whose product is data like this and who advertise it plainly. Indeed, credit scoring companies like Experian are well known to anyone who has ever tried to take out a loan. The consequences of this data trade, however, have received more scrutiny in the wake of the Cambridge Analytica case. Experian’s Mosaic-style consumer data is one example of a data set that Cambridge Analytica admittedly relied upon to create psychographic profiles for predicting and influencing votes. Deep Root Analytics, another political analysis firm working for the GOP, used similar data, also from Experian, and exposed it in 2017 in a public storage bucket that was likewise discovered by the UpGuard Research team.
Over two hundred years ago Jeremy Bentham proposed the idea of the “panopticon,” an arrangement of institutional space so that the governing entities can efficiently observe the activity of all individuals in those spaces. That logic, as Foucault somewhat famously wrote, became “the formula for the whole of government” as we have known it since. Today, the means of observing behavior are not only exercised by governmental entities, as in the national census, but are accessible to businesses of any scale. Between the digitization of retail and the introduction of IoT sensors into physical locations, observational technologies are widespread and decentralized, constantly accumulating signals in their disparate data stores.
Digital technology does not just enable the accumulation of behavioral data; it also makes possible the unintentional exposure of that data en masse. In this case, multiple data sources, from other companies’ data products like Experian Mosaic to retailers’ customer loyalty programs, were combined in one storage bucket that was misconfigured for public access. As a result, data that was collected by multiple entities, and that affects every household in the U.S. with varying degrees of intensity, was made available not just to businesses and other intended audiences, but to anyone at all.
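The report does not say which setting made this bucket public. One common cause, offered here only as an assumption, is an access control list (ACL) that grants permissions to one of Amazon's predefined public groups. The sketch below checks an ACL for such grants. The dictionary shape mirrors what the S3 GetBucketAcl operation returns; the `public_grants` helper and the sample ACL are hypothetical.

```python
# Predefined S3 groups: everyone on the internet, and any signed-in AWS account.
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def public_grants(acl: dict) -> list:
    """Return the permissions that an ACL grants to a public group."""
    found = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in PUBLIC_GROUP_URIS:
            found.append(grant.get("Permission"))
    return found


# Hypothetical ACL: the owner has full control, and everyone can read.
sample_acl = {
    "Grants": [
        {
            "Grantee": {"Type": "CanonicalUser", "ID": "example-owner-id"},
            "Permission": "FULL_CONTROL",
        },
        {
            "Grantee": {
                "Type": "Group",
                "URI": "http://acs.amazonaws.com/groups/global/AllUsers",
            },
            "Permission": "READ",
        },
    ]
}

exposed = public_grants(sample_acl)
print("public permissions:", exposed)
```

An empty result here would not prove a bucket is private, since a bucket policy can also allow anonymous access; a fuller audit would review both the policy and the account's Block Public Access settings.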
To learn more, read coverage at Bloomberg Law.