Home Economics: How Life in 123 Million American Households Was Exposed Online

Last updated by Dan O'Sullivan on September 5, 2018

In another blow to consumer privacy, the UpGuard Cyber Risk Team can now reveal that a cloud-based data repository containing data from Alteryx, a California-based data analytics firm, was left publicly exposed, revealing massive amounts of sensitive personal information for 123 million American households. Exposed within the repository are massive data sets belonging to Alteryx partner Experian, the consumer credit reporting agency, as well as the US Census Bureau, providing data sets from both Experian and the 2010 US Census. While the Census data consists entirely of publicly accessible statistics and information, Experian’s ConsumerView marketing database, a product sold to other enterprises, contains a mix of public details and more sensitive data. Taken together, the exposed data reveals billions of personally identifying details and data points about virtually every American household.

From home addresses and contact information, to mortgage ownership and financial histories, to very specific analysis of purchasing behavior, the exposed data constitutes a remarkably invasive glimpse into the lives of American consumers. While, in the words of Experian, “protecting consumers is our top priority,” the accumulation of this data in “compliance with legal guidelines,” only to then see it left downloadable on the public internet, exposes affected consumers to large-scale misuse of their information - whether through spamming and unwanted direct marketing, organized fraud techniques like “phantom debt collection,” or through the use of personal details for identity theft and security verification.

While many consumers will likely be troubled by the ability of private corporations to legally collect and sell this data, ranging from publicly available information to sensitive financial details, this exposure highlights a number of growing forms of cyber risk with systemic implications. The continuing concentration of data by a number of large enterprises, now wielding powerful technology of the sort provided by Alteryx, has not been accompanied by the greater prudence and process improvement necessary to ensure that the data will remain securely stored. The result, in the same way that warming waters increase the power of hurricanes, is that a data exposure such as this can expose the vast majority of American households to compromise through a single error.

Finally, this incident reveals just how thoroughly third-party vendor risk is corroding the integrity of any public and private functions relying upon information technology. The exposure of massive amounts of data about many millions of American households gathered by a credit reporting agency reveals how the consequences of cyber insecurity can, in an increasingly interdependent technological environment, quickly afflict partners and expose their data as well.

The Discovery

On October 6, 2017, UpGuard Director of Cyber Risk Research Chris Vickery discovered an Amazon Web Services S3 cloud storage bucket located at the subdomain “alteryxdownload” containing sensitive consumer information. While the default security setting for S3 buckets would allow only specifically authorized users to access the contents, this bucket was configured via permission settings to allow any AWS “Authenticated Users” to download its stored data. In practical terms, an AWS “authenticated user” is “any user that has an Amazon AWS account,” a base that already numbers over a million users; registration for such an account is free. Simply put, one dummy sign-up for an AWS account, using a freshly created email address, is all that was necessary to gain access to this bucket’s contents.
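
The permission setting described above corresponds to a grant, in a bucket's access control list (ACL), to one of the predefined global groups in S3. As a minimal sketch, assuming an ACL document shaped like the response of the S3 GetBucketAcl operation, the following Python flags grants to those groups. The function name find_broad_grants and the sample ACL are hypothetical, invented for illustration; they are not taken from the exposed bucket.

```python
# URIs that S3 uses for its two predefined global grantee groups.
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
BROAD_GROUPS = {AUTHENTICATED_USERS: "AuthenticatedUsers", ALL_USERS: "AllUsers"}


def find_broad_grants(acl):
    """Return (group_name, permission) pairs for grants to global groups."""
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in BROAD_GROUPS:
            findings.append((BROAD_GROUPS[grantee["URI"]], grant.get("Permission")))
    return findings


# Hypothetical ACL, for illustration only: the owner has full control, and
# any AWS account holder is granted READ on the bucket.
sample_acl = {
    "Owner": {"ID": "example-owner-id"},
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "example-owner-id"},
         "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "Group", "URI": AUTHENTICATED_USERS},
         "Permission": "READ"},
    ],
}

print(find_broad_grants(sample_acl))
```

In practice the dictionary could come from a call such as boto3's get_bucket_acl. An empty result only means that the bucket ACL contains no grant to either global group; object-level ACLs and bucket policies would need separate checks.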


The main file repository's contents; note the many Alteryx release versions.

Befitting the subdomain name, the bucket contains a large number of Alteryx software releases and development files for applications produced by the data firm for its analytics customers; Alteryx would later confirm ownership of the bucket after securing it. Of greater significance are two files within the repository appearing to originate from beyond Alteryx.


The "ConsumerView" file in question.

The first, a 36 GB data file titled “ConsumerView_10_2013,” is stored with the extension .yxdb. This extension, an Alteryx database file format used for large data set analytics, had been seen in a previous data exposure discovered by UpGuard: that of the personal details of 198 million American voters, compiled in a data set by a data firm used by the Republican National Committee. The “ConsumerView” file contains a similarly vast amount of data compiled about Americans: over 123 million rows, each one signifying a different American household - a number close to competing estimates of the total number of households at the time of the file’s likely creation in 2013.

While each of the more than 123 million rows represents a different US household, the 248 columns compile each household’s known or modeled personal details, preferences, and behavior across a wide array of categories. With a total of over 30 billion fields (123 million rows across 248 columns) to be filled with such data points, the index’s incredibly detailed level of insight is, ultimately, precisely what Experian claims to offer with its ConsumerView product, as described in a 2016 marketing brochure:

“ConsumerViewSM is the largest and most comprehensive resource for traditional and digital marketing campaigns. With thousands of attributes on more than 300 million consumers and 126 million households, ConsumerView data provides a deeper understanding of your customers, resulting in more actionable insights across channels…”
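
The size of the exposed table follows from multiplying the row and column counts reported for the file. A minimal sketch of that arithmetic in Python, treating “over 123 million rows” as a lower bound of exactly 123 million (an assumption made only for this illustration):

```python
ROWS = 123_000_000   # "over 123 million rows", treated here as a lower bound
COLUMNS = 248        # columns recorded for each household

# Every row has one field per column, so the table holds ROWS * COLUMNS fields.
total_fields = ROWS * COLUMNS
print(f"{total_fields:,} fields (about {total_fields / 1e9:.1f} billion)")
```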

While the spreadsheet uses anonymized record IDs to identify households, the other information in the fields - as well as another spreadsheet in the bucket, to be discussed shortly - is sufficiently detailed to be not merely identifying in many cases, but identifying with a high degree of specificity. The “deeper understanding” advertised by Experian is evident from the 248 category types that were discovered.

This data spans a wide variety of specific personal information, starting with what Experian calls “the bread and butter of marketing data,” demographics. Beyond analyzing household occupants “in terms of age, gender, education, occupation and marital status,” Experian’s promotional copy also highlights its use of mortgage and financial information, “lifestyle and interest data” from “consumers who have completed self-reported surveys,” and “financial indicators, including card usage and creditworthiness.”


A number of data fields listing specific gathered data for each household, with personally identifying information redacted.

As confirmed in Experian marketing material, as well as in the exposed column names, this research delves deeper into household finances, analyzing investment behavior, car buying, and even retail purchasing histories, segmented into categories like “Book Buyer” and “Cat Enthusiast.” Census Area Projection Estimate (CAPE) data, drawn from the US Census, is also employed to “help marketers understand everything from consumer spending habits on hundreds of products to commuter and daytime populations,” while Mosaic, “a household segmentation system that classifies U.S. consumers into 19 overarching groups and 71 underlying types,” is used for a number of the categories applied to the listed households.

The use of “household” as the primary unit of measurement may seem odd, but this is in keeping with the methods used by the US Census Bureau. The Bureau’s 2010 census results are also revealed in the bucket, contained in a self-extracting .exe file. However, unlike the information contained in the Experian ConsumerView data set, the Census information available here is entirely publicly available - statistics that can be found and viewed by any interested person on the Bureau's website.

Finally, as confirmed through further research, Alteryx is a partner of both Experian and the US Census Bureau, highlighting the dangers presented by third-party vendor risk. While Experian marketing copy highlights their work “combining the data blending and advanced analytics of Alteryx with the demographic and behavioral data from Experian,” providing detailed data at the household and individual level about millions of Americans, Alteryx’s “Designer with Data” license offering comes packaged with “analytics-ready demographic, segmentation, and firmographic data from Experian, D&B, the US Census Bureau, and more.”

Alteryx’s 2012 advertisement of itself as “the sole provider of software and analytic content used by the U.S. Census Bureau” for over a decade, “including more than 3,000 population characteristics, such as racial and ethnic information as well as family, household, and housing unit details,” further illustrates the close business relationships between all three of these exposed enterprises. Fortunately, no non-public data from the Census Bureau was exposed in this bucket.

The Significance

Taken together, this exposed data provides a highly detailed database of tens of millions of Americans’ personal, financial, and private lives. While Experian argues they “[provide] consumers with notice and choice when it comes to how their data is being used,” using “careful consideration of consumer privacy” and “values-based practices that govern the acquisition, compilation and sale of our consumer data,” these efforts are for naught if the same data is left exposed on the public-facing internet.

This exposure is a prime example of the way in which third-party vendor risk can result in sensitive data leaking from multiple entities. Given the close partnerships of Alteryx, Experian, and, to a lesser degree, the US Census Bureau, and the intermingling of data from all three across multiple internal platforms, it would only follow that the three entities would need to share large amounts of data with one another. While the Census Bureau's data is publicly available, Experian's ConsumerView information is proprietary, sold only to other enterprises; how do you ensure that an external partner or vendor to whom you are entrusting your data in this way keeps it secure? While Experian rates 728 and the US Census Bureau 872 on the CSTAR cyber risk score, out of a maximum of 950, Alteryx, which owned the bucket, had a lower score of 692 - showing perhaps how a weaker link can be fatal throughout the chain.

This is an enormous problem facing the IT landscape today. As has been seen in many previous data exposures, most enterprises lack the ability to even assess the security postures of external vendors. Even if the primary enterprise maintains high standards of change validation and management, it is inviting risk if it cannot be sure of similarly stringent maintenance within the operations of partners handling its data. In the case of Experian in particular, this is but the latest case of a credit reporting agency finding its data exposed in a cloud leak. With the disaster of the Equifax breach still fresh for many, it is a reminder of how integral credit reporting is to the wider financial system, and how, if exposed, it can, like a tracing thread, reveal the entire outline of an individual’s or household’s financial and personal details.

Finally, the concentration of publicly and commercially-gleaned data about tens of millions of American households, and the exposure of this data to anyone with a free AWS account entering a URL, shows just how devastating an exposure can be at an enormous scale. The data exposed in this bucket would be invaluable for unscrupulous marketers, spammers, and identity thieves, for whom this data would be largely reliable and, more importantly, varied. With a large database of potential victims to survey - with such details as “mortgage ownership” revealed, a common security verification question - the price could be far higher than merely bad publicity.
