The UpGuard Data Breach Research Team can now disclose that approximately 6.2 million email addresses were exposed by the Democratic Senatorial Campaign Committee in a misconfigured Amazon S3 storage bucket. The comma-separated list of addresses was uploaded to the bucket in 2010 by a DSCC employee. The bucket and file name both reference “Clinton,” presumably having to do with one of Hillary Clinton’s earlier runs for Senator of New York. The list contained email addresses from major email providers, along with universities, government agencies, and the military.
Political campaigns now rely more than ever on data-driven decision making to maximize the effectiveness of their electioneering efforts. This bucket shows the reach and longevity of such data, and how operational errors in the handling of that data can leave it exposed to the public.
At approximately 4PM on Thursday, July 25th, 2019, UpGuard researchers discovered an Amazon S3 storage bucket named “toclinton.” This bucket was available to globally authenticated AWS users, one of the two public groups available in S3 permissions. This means that anyone with a free AWS account could access the bucket and its contents. The bucket contained a single file, EmailExcludeClinton.zip. The unprotected zip file contained a .csv file with over 6 million email addresses.
Upon examining the permission set of the S3 bucket, a user was found with the prefix “DSCC.” This acronym represents the Democratic Senatorial Campaign Committee, a Democratic electioneering group. According to their website, the DSCC “is the only organization solely dedicated to electing a Democratic Senate. From grassroots organizing to candidate recruitment to providing campaign funds for tight races, the DSCC is working hard all year, every year to elect Democrats to move our country forward.” The username matched an individual who worked for the DSCC at the time the zip file was uploaded, and whose job would be relevant to the data present in the bucket.
UpGuard contacted the DSCC the next morning, Friday, July 26th, and notified them of the exposure. By 2PM the same day, the bucket had been secured, preventing future malicious use of the data.
Over 6 Million Email Addresses
The 145MB .csv file contained 6,235,397 lines, each of which was an email address. The filename, “EmailExcludeClinton.csv,” seems to indicate that this was a list of people who had opted out or should otherwise be excluded from DSCC marketing emails. From 2001 to 2009 Hillary Clinton served as Senator for New York. In 2008 she unsuccessfully sought the nomination of the Democratic Party as a candidate for President, and in 2009 began serving as Secretary of State under Barack Obama. The file “EmailExcludeClinton.csv” was last modified on September 17, 2010. How the contents of the file fit into the timeline of Clinton’s career in politics cannot be determined from what is in this bucket, but it is certain that the file predates her 2016 presidential bid by several years.
Email Domain Analysis
In viewing the contents of the file, the vast majority of entries looked like plausible email addresses from real people. Counting the addresses per email domain supports the hypothesis that these are real email addresses from ordinary citizens. The chart below shows the number of email addresses per provider for the top ten most common domains. As far as consumer email addresses go, this is not a surprise: it looks like a list of commonly used email providers because that is most likely what it is.
Analysis also showed a long tail of thousands of other, less commonly used email domains, including email domains associated with businesses and 492 distinct .edu email domains. The most frequently used .edu domains were those belonging to large universities, which again is not surprising: large universities provide email addresses to tens of thousands of people, and in a sample of six million email addresses, those common providers will show up frequently. The list of email addresses also included 7,766 .gov addresses and 3,457 .mil addresses, as one would expect in any sufficiently large sample of Americans’ email addresses.
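The kind of per-domain tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the method UpGuard used: the function names (domain_of, summarize_domains) and the sample addresses are hypothetical, and the validity check is deliberately crude.

```python
from collections import Counter


def domain_of(address):
    """Return the lowercased domain of an address, or None if it looks malformed."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or "." not in domain:
        return None
    return domain.lower()


def summarize_domains(addresses, top_n=10):
    """Count addresses per domain and summarize the .edu, .gov, and .mil tail."""
    counts = Counter()
    for address in addresses:
        domain = domain_of(address)
        if domain is not None:
            counts[domain] += 1

    def addresses_under(suffix):
        # Total number of addresses whose domain ends with the given suffix.
        return sum(n for d, n in counts.items() if d.endswith(suffix))

    return {
        "top_domains": counts.most_common(top_n),
        "distinct_edu_domains": sum(1 for d in counts if d.endswith(".edu")),
        "gov_addresses": addresses_under(".gov"),
        "mil_addresses": addresses_under(".mil"),
    }


# Hypothetical sample data; none of these addresses come from the exposed file.
sample = [
    "alice@example.com",
    "bob@Example.com",
    "carol@example.com",
    "dave@state.edu",
    "erin@state.edu",
    "frank@college.edu",
    "grace@agency.gov",
    "not-an-address",
]
summary = summarize_domains(sample, top_n=2)
print(summary)
```

For a file of several million lines, the same function can be fed an open file object directly, since it only iterates over its input once and keeps one counter per distinct domain in memory.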
The contents of Amazon S3 buckets are public when they are configured to allow at least read access to all users or globally authenticated users (anyone logged into their free AWS account). In some cases, however, those global user groups have more extensive permissions, allowing them to modify the contents or permissions of the bucket or its content. In this case, both the owner of the bucket and the global authenticated user group had “FULL_CONTROL” permissions, allowing anyone to download or modify the contents of the bucket, as well as the permission set itself.
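As a rough illustration of the permission check described above, the following Python sketch scans a bucket ACL for grants to either of the two public S3 groups. It assumes the ACL is a dictionary shaped like the response of boto3's get_bucket_acl call; the function name public_grants and the example ACL are hypothetical and do not reproduce the actual ACL of the bucket discussed here.

```python
# URIs of the two predefined public groups in S3 access control lists.
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers": "all users",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers": "any authenticated AWS user",
}


def public_grants(acl):
    """Return (audience, permission) pairs for grants made to a public group."""
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") != "Group":
            continue  # Grants to individual accounts are not public.
        audience = PUBLIC_GROUPS.get(grantee.get("URI"))
        if audience is not None:
            findings.append((audience, grant.get("Permission")))
    return findings


# Hypothetical ACL: the owner and the authenticated-users group both
# hold FULL_CONTROL, mirroring the situation described in the text.
example_acl = {
    "Owner": {"ID": "owner-id-placeholder"},
    "Grants": [
        {
            "Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
            "Permission": "FULL_CONTROL",
        },
        {
            "Grantee": {
                "Type": "Group",
                "URI": "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
            },
            "Permission": "FULL_CONTROL",
        },
    ],
}

for audience, permission in public_grants(example_acl):
    print(f"Public grant: {permission} for {audience}")
```

Any result from this check deserves attention, and a FULL_CONTROL or WRITE_ACP grant to a public group is the most serious case because it lets outsiders change the permission set itself. Bucket policies and account-level public access settings can also affect exposure and are outside the scope of this sketch.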
Data collection and analysis have grown rapidly into core capabilities for a political campaign, but the nature of those campaigns (short-lived exercises that quickly raise and spend large amounts of money with third-party, revolving-door consultants in a winner-take-all competition) is antagonistic to the conditions of good data management. Both Republican and Democratic campaigners benefit from having easy access to huge amounts of personal data on American citizens; those citizens, whose data is at stake, do not. It is a situation that predictably and consistently results in data exposures.
UpGuard has previously reported on two significantly larger exposures related to the political data economy. In one case, a data analytics provider exposed the Republican National Committee’s enriched voter database, which included both personal and psychographic information for every registered American voter. In another, a software provider for that kind of analysis exposed their code base, revealing the mechanisms for how voter data is gathered, tracked, and enriched across platforms.
The list of six million email addresses, with some link to Clinton and the DSCC, is a much smaller exposure than one containing data for the entire U.S. electorate. But it is still a large number of potential targets for a malicious actor, with enough context to make reasonable guesses about how to craft an attack against them. In sum, these exposures highlight the problem of passing large amounts of personal data through the modern political campaign, where the need for mass marketing and data sharing contributes to the risk of exposures.
The Longevity of Data
The most obvious interpretation of the evidence here is that this file was uploaded in 2010, meaning it has been publicly available for almost a decade. Whether it was accessed by any parties other than UpGuard is not knowable with the information we have available.
Data was important in 2010. The same tactics and strategies deployed in the 2016 election were created and honed long before that. But the scale of political data has grown significantly along with its importance. Consideration should be paid to what artifacts of our current political data system will be unearthed, and who they will affect. This list contained only email addresses, but other political data sets contain far more information on individuals, down to psychographic information such as their habits, behaviors, and likely beliefs. The same things that make this data valuable to political campaigns make it valuable to malicious actors: intel on individuals that can be used to contact and influence them. If political data can be exposed for ten years, the risk created by that data has an unknown half-life.
The digitization of every sphere of life has created a myriad of consequences that are just now coming to light. Healthcare, finance, and politics are among the major convergences of personal data being collected and used every day. Interactions are tracked, behavior is modeled by analytics that compile huge data sources, and information is microtargeted to audiences that are known better than they know themselves. The crumbs of data that fall from these operations and end up in misconfigured storage locations or are otherwise unintentionally exposed are but a fraction of the total data circulating in a vicious and competitive economy of knowledge. Unless steps are taken to better control the way in which data is gathered, concentrated, and processed, exposures of this kind will continue, and their scope and scale will increase. Organizations should treat their data with the same respect they give to the success it allows them to achieve.