Updated on April 30, 2018 by UpGuard
In Part One of this series, “The AggregateIQ Files,” we explained how the UpGuard Cyber Risk Team’s discovery of a publicly downloadable data repository operated by British Columbia-based data firm AggregateIQ (AIQ) exposed technical tools used for political operations around the world, including the presidential campaign of Senator Ted Cruz (R-TX). In Part Two, we explored how the exposed repository shed light on AIQ’s work in the United Kingdom involving a number of organizations, including a Northern Irish political party crucial to Prime Minister Theresa May’s government and the official campaign in favor of the UK’s exit from the European Union.
In this, the third installment of “The AggregateIQ Files,” we travel back to the United States to take a closer look at some of the highly sophisticated technical tools revealed in this data exposure. Previously undisclosed, these tools raise concerns about data privacy today.
Cambridge Analytica (CA), the London-based “psychographic profiling” firm employed by such political heavyweights as President Donald Trump and National Security Advisor-designate John Bolton in the 2016 US electoral cycle, continues to dominate the news with fresh revelations about its questionable behavior. With social media giant Facebook now admitting that as many as 87 million user accounts may have been scraped by CA for use as psychographic data in the 2016 election, questions about how individuals are tracked and targeted by private enterprises in possession of specific personal data are more relevant than ever.
This report examines, for the first time ever, some of the tools possessed by a firm associated with Cambridge Analytica that were left exposed online - revealing an array of instruments for further targeting of individuals on behalf of political campaigns.
The contents of once-obscure Canadian firm AggregateIQ’s leaked data repository, discovered by the UpGuard Cyber Risk Team and first disclosed last week, only raise more questions on this front. In Part One of “The AggregateIQ Files,” we described exposed code for a campaign app named Ripon, containing customized settings for the 2016 presidential campaign of Senator Ted Cruz. Beyond raising more questions about AIQ’s relationship to Cambridge Analytica - which in 2015 sold Ripon to Team Cruz - such an application is but one tool in the larger toolkit present in the repository.
In this report, we look more closely at these other tools in the AIQ repository. The best way to understand this suite of tools, and how they are built to function, is to think of a cluster of roots. Just as the many roots of a tree seek out water, growing in any necessary direction to suck up nutrients and funnel them up to the rest of the plant, these tools are designed to gather and use data across a number of platforms through a variety of means. The question of whether these tools were all actively deployed is an open one. What is clear is the utility of these mechanisms for use by AIQ on behalf of the kind of political actors mentioned as clients throughout the repository - clients for whom rich, detail-laden data about individual voters would be a valuable asset.
Beyond these are a number of tools which constitute enormously powerful mechanisms with indications of being production-ready. Two AIQ project families in particular, titled “Monarch” and “Saga,” appear to have the capability to track individuals’ preferences and habits on Facebook and other websites, combining those data points. Once compiled in this manner by Monarch and Saga, this information could be combined with other datasets to maximize the accuracy of outreach campaigns - whether via canvassing, direct mail, or simply through strengthened online ad targeting.
Developing capabilities for improving ad performance is not unique to AggregateIQ. Rather, the significance of Monarch, Saga and the other tools present lies in their being part of the larger toolkit of AggregateIQ and, perhaps at one step removed, Cambridge Analytica. The data available to AggregateIQ includes its "Database of Truth," for which the RNC's voter database provides one significant input.
The extent to which AIQ and its associated entities engaged in these practices has not been determined, as the working data generated by these applications were not present in the code repositories - although keys, passwords and other credentials were present for multiple servers and Amazon Web Services S3 storage buckets used by AIQ. The ultimate endpoint to which all of this captured data might have arrived is also unclear, and could be changed arbitrarily, but the use of matching tools to collate data suggests a grander project.
Nevertheless, examining the scripts and applications being built makes the functions and purposes of Monarch and Saga clear. The exposure of these influencing tools on the public internet, downloadable for anyone who encountered them, is deeply troubling. At a time of grave public concern about how personal data is used, particularly from sites like Facebook, exposed assets with names like “TargetingFriendsOfConnectionsList” should invite scrutiny. Let us now turn to examine some of these tools, with a particular emphasis on those previously unknown and potent mechanisms capable of advanced tracking of individuals across the internet and collating their behavior within larger trends.
From what we have seen, the scripts in Saga appear to automate enough advertisement creation, analysis and targeting tasks to allow a large number of Facebook Ad accounts - each with its own content, goals and targets - to be operated by a small number of people.
Based in part on the open source Facebook Ads SDK provided by the company on GitHub, Saga contains many scripts which have been built (or rebuilt) to suit the needs of AggregateIQ and its associates. Many of the files presented in this list are either basic data housekeeping or otherwise not notable, but a few files in particular give insight into the workings of AIQ’s tracking and targeting capabilities:
Server_run.py: This script clearly shows functionality to gather statistics from multiple Facebook Ads accounts. In particular, advertisement performance, money spent, and creatives are gathered and stored in a database.
Monitoring.py: A periodic script which checks how long it’s been since each ad account was “scraped” for data. If any ad accounts had not been scraped in over 24 hours, an email would be sent to three individuals at AIQ, including its president, Zack Massingham.
Assets_backup_to_s3.py: A rather simple script to upload Facebook ad assets to a storage bucket on Amazon S3 and delete them from the local machine.
Bottle-server.py: Bottle is a simple web framework for Python in wide use, and here it provides a basic browser-based interface for administrators of AIQ’s Facebook ad accounts. The root url (‘/’) when accessed would contain only two options-- “Start Scrape Run” which would kick off the aforementioned server_run.py script, and “Check Last Scrape Run” which seems to return the time the last scrape would have completed. This is probably only slightly more convenient than running the script on the command line.
Conversions_experiment.py: Retrieves performance data for Facebook pages.
Amount_spent_wtf.py: This script uses repurposed code from other scripts to collect spend amounts for AIQ-operated ad accounts. A comment at the top of the script explains its purpose-- “## WTF - we have a different number in ad_accounts.amount_spent than the calculated number from ad_sinsights!!?!”
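To make the monitoring behavior described above concrete, here is a minimal sketch of a staleness check of the kind Monitoring.py appears to perform. It is not AIQ's code: the function names, the account identifiers and the recipient address are all hypothetical, and the email is only built, not sent.

```python
from datetime import datetime, timedelta
from email.message import EmailMessage

# Threshold taken from the article's description: "over 24 hours".
MAX_AGE = timedelta(hours=24)


def stale_accounts(last_scraped, now, max_age=MAX_AGE):
    """Return the ad account ids whose last scrape is older than max_age.

    last_scraped maps a (hypothetical) account id to the datetime of its
    most recent scrape.
    """
    return sorted(acct for acct, ts in last_scraped.items() if now - ts > max_age)


def build_alert(stale, recipients):
    """Build (but do not send) an alert email; returns None if nothing is stale."""
    if not stale:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"{len(stale)} ad account(s) not scraped in over 24 hours"
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(stale))
    return msg
```

A periodic job could call `stale_accounts` with the timestamps stored in its database and hand any resulting message to a mail server.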
The subfolder ‘aatrax’ contains configuration files used by many of the aforementioned scripts:
Models.py: Describes many of the fields being tracked and utilized by AIQ. Of particular note are the geotargeting functions that allow ad administrators to target ads down to the city level-- or potentially even more precisely by defining custom locations. Another interesting bit is TargetingFriendsOfConnectionsList-- suggesting that AIQ and its clients could have used that capability to target specific users with political ads based on who their friends are.
Helpers.py is a script containing often-reused functions to gather ad spend data and engagement results for ad campaigns including user comments, reactions and shares.
Within the ‘meta’ subfolder, a SQL schema file called saga-fb-schema.sql describes the layout of the database used by these scripts. The table “targeting_custom_locations” supports the idea that extremely specific geotargeting is doable with this software. By properly utilizing fields like address_string, radius, latitude and longitude, the possibility arises to target ads down to specific neighborhoods, or potentially even the individual household.
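To illustrate why fields like these allow such fine-grained targeting, the sketch below checks whether a point falls inside a circular custom location. The field names come from the schema described above; everything else is an assumption, including the class name, the choice of kilometers as the radius unit, and the use of the haversine formula, none of which is confirmed by the repository.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class CustomLocation:
    address_string: str
    latitude: float
    longitude: float
    radius: float  # ASSUMPTION: kilometers; the schema's unit is not stated


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(location, lat, lon):
    """True if (lat, lon) lies inside the location's targeting radius."""
    return haversine_km(location.latitude, location.longitude, lat, lon) <= location.radius
```

With a radius of a kilometer or less around a single address, the set of households inside the circle can become very small, which is the concern raised above.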
If Saga is a tool capable of tracking what happens when someone clicks a Facebook ad, Monarch seems designed to track what happens afterward, giving the controlling entity a more complete picture of their targets’ behavior. The Monarch project family consists of several sub-projects, the most interesting being “Jewel,” “Peon,” and “Peasant.”
Jewel is a fairly uncomplicated pixel tracking application. It consists of a simple web server that serves a 1x1 pixel .PNG file. When this practically invisible image is embedded in a web page and loaded by a visitor’s browser, the server behind it sends a request to Amazon SQS (Simple Queue Service) with a small data payload indicating the event. Those events are then ingested by Peon.
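The general technique can be sketched in a few lines. This is an illustration of pixel tracking in general, not a reconstruction of Jewel: the handler name, the payload fields and the URL parameters are hypothetical, and an in-memory `queue.Queue` stands in for Amazon SQS so the example runs without network access.

```python
import json
import queue
import struct
import time
import zlib
from urllib.parse import parse_qs, urlsplit


def _chunk(kind, data):
    """Encode one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def transparent_pixel_png():
    """Build a valid 1x1 fully transparent RGBA PNG."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 6, 0, 0, 0)  # 1x1, 8-bit RGBA
    raw = b"\x00" + b"\x00\x00\x00\x00"  # filter byte + one transparent pixel
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


PIXEL = transparent_pixel_png()


def handle_pixel_request(url, headers, events):
    """Record the request as an event, then answer with the pixel.

    `events` is a stand-in for a message queue such as SQS.
    """
    parts = urlsplit(url)
    event = {
        "path": parts.path,
        "params": {k: v[0] for k, v in parse_qs(parts.query).items()},
        "referer": headers.get("Referer", ""),  # page where the pixel fired
        "user_agent": headers.get("User-Agent", ""),
        "ts": time.time(),
    }
    events.put(json.dumps(event))
    return 200, {"Content-Type": "image/png", "Cache-Control": "no-store"}, PIXEL
```

The visitor sees nothing, while the request itself (query string, referring page, browser details) becomes the data point.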
Peasant appears to be a distinct implementation of Jewel which allows for more specific tracking of behaviors and events on participating websites. Actions such as form submissions, watching video content, hitting the bottom of a page (indicating the viewer has read it) and donation confirmations are all test cases within this repository.
Consisting of several smaller parts, Peon is a small collection of tools made to ingest and utilize data being gathered.
The readme file explains the project’s functions:
# Monarch - PEON
The Monarch Peon Service. At one point this functionality was built into Jewel (pixel serving). It has been split out so the functionality can be extended for the following purposes:
## 0. Index
* 1. Features & Purpose
* 2. Guide to building peon services
* 3. Set-up / Run Peon
* 4. Pushing / Pulling new version
* 5. General tips & Tricks
* 6. Using Internal Peon Caching
* 7. Relevant Monarch info
* 8. To-do
## 1. Features & Purpose
* Dockerized Swarms of processing servers.
* All services use a shared code-base. Each one simply runs a different command at launch (defined in docker-compose.yml)
* Each docker image will have a specific purpose. These are defined below in the following format: `Code Name` (Swarm count?): Description.
- [x] `muncher` (3-5): Queue Processor. This is a service that grabs items from AWS SQS and processes them into the MongoDB.
- [x] `domino` (1): Person Finder. This will process MongoDB raw request form submits to find people.
- [x] `mailbox` (1): Email Finder. This will process MongoDB raw request form submits to find emails
- [x] `origins` (1): Source Url Extraction. This will process raw requests to find and tag appropriate source urls (where px was fired).
- [x] `columbus` (1): Location Discovery. This will process IP & Postal codes to find Location information on people / sessions.
- [x] `synergy` (1): Saga Sync Service. Will grab relevant information from Saga and drop it into MongoDB.
- [x] `crayon` (1-3): Report & Stat Generation. Global report manager.
- [ ] `funnel` (1-2): Import / Export Manager. This service will generate exports, and (DISABLE) process imports.
- [x] `kindergarten` (1-2): Interaction Scoring. This service processes raw requests to assign engagement levels to people, sessions, urls, etc.
- [ ] `saga` (1): Facebook Ad Collection and Storage. This service will replace saga as a 3rd party scraping service.
The functions of the accompanying code appear to be built as described. Muncher is on the receiving end of Jewel’s SQS requests. The others seem designed for tasks such as locating and joining known people to their tracking events. Synergy is designed to sync this tracking data with the Saga ad tracking platform, to “close the loop” and potentially refine Facebook ad targeting with knowledge about specific users’ behavior on Jewel-equipped websites.
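As a rough illustration of the queue-processing and interaction-scoring roles the readme assigns to `muncher` and `kindergarten`, consider the sketch below. The event names, the `session` field and the weights are invented for the example, a Python list stands in for MongoDB, and an in-memory queue stands in for SQS; the repository's actual scoring rules are not described here.

```python
import json
import queue
from collections import defaultdict

# HYPOTHETICAL weights, chosen only to show the idea of engagement levels.
EVENT_WEIGHTS = {
    "page_view": 1,
    "scroll_bottom": 2,
    "video_watch": 3,
    "form_submit": 5,
    "donation": 10,
}


def drain(events):
    """Pull every pending message off the queue and decode it (queue-processor role)."""
    raw_requests = []
    while True:
        try:
            raw_requests.append(json.loads(events.get_nowait()))
        except queue.Empty:
            return raw_requests


def score_sessions(raw_requests):
    """Sum event weights per session (interaction-scoring role)."""
    scores = defaultdict(int)
    for req in raw_requests:
        scores[req["session"]] += EVENT_WEIGHTS.get(req["event"], 0)
    return dict(scores)
```

Scores like these, once joined to an identified person, are the kind of signal that could be fed back into ad targeting.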
Did a voter read just one page and leave? Did they view several pages? Did they donate to a candidate? Given the ability Saga grants to target individuals and specific locations with ads, it is not out of the question for ad campaign administrators to use this technology to attempt to sway views on issues or to push susceptible voters deeper toward a given ideology based on their behaviors online and, perhaps, in the real world.
What is clear thus far is that the software in these projects represents a sophisticated effort to run highly targeted political ad campaigns addressed toward very specific groups of users and even individuals.
While stories reported by outlets like the New York Times have covered what data Facebook collects on users, Monarch and Saga do not contain such user data. Rather, both function something like homing missiles: they need input in order to work. In this repository, we have the tools exposed, but no knowledge of what targets they may have been deployed against, or what they might have found. What is clear is that these tools can use data to target ads, collecting information on user engagement with those ads to further refine ad targeting and improve performance.
Misconfigurations are an internal problem that emanates from within the IT infrastructure of any enterprise; no hacker is necessary for massive damage to occur to digital systems and stored data. And the problem is pervasive, with Gartner estimating anywhere from 70% to 99% of data breaches result not from external, concerted attacks, but from internal misconfiguration of the affected IT systems.