It seems like every day there’s a new incident of customer data exposure. Credit card and bank account numbers; medical records; personally identifiable information (PII) such as address, phone number, or SSN— just about every aspect of social interaction has an informational counterpart, and the social access this information provides to third parties gives many people the feeling that their privacy has been severely violated when it’s exposed.
There are several avenues by which this exposure occurs, but the names used to describe them are often interchanged arbitrarily, muddying the waters of a large, but nuanced, problem. Breaches, hacks, leaks, attacks— these are just a few of the terms used to describe data exposure incidents. But what do they mean?
- Attack - An attack is a purposeful attempt to cause damage or loss through technical or social means. Attacks do not necessarily lead to data breaches. For example, denial of service attacks disrupt normal operations but do not expose information.
- Breach - A breach is a successful attack that secured sensitive information.
- Hack - An attack that exploits technical vulnerabilities to secure access that is otherwise unauthorized. Can lead to a breach, but may also be used for ransomware (like WannaCry), establishing botnets, or misusing computing resources.
- Leak - A leak does not require an external actor, but is caused by some action or inaction of the party who owns the data.
There is something of a spectrum here. For example, an internet-accessible database server with a default administrative password is almost an open door, but it still requires an informed attempt to be used. Of these four terms, leak stands apart as the one that is not initiated by a third party. And one specific kind of leak accounts for many of the largest and most dangerous data exposures to date. We call these “cloud leaks,” because they occur when cloud storage is not appropriately partitioned from the internet at large.
What is a Cloud Leak?
A cloud leak occurs when sensitive business data stored in a private cloud instance is accidentally exposed to the internet. The cloud is part of the internet. The difference is that “the cloud” offers pockets of privatized space that can be used to carry out enterprise-scale IT operations. Enterprise data sets, frequently handled by third-party information analytics companies, are often stored unencrypted in the cloud, on the expectation that the data lives within one of these private pockets.
But cloud storage options like Amazon’s S3 allow users to open their storage to the internet at large. It should be stressed here that S3 buckets are private by default. This means that every S3 bucket involved in a cloud leak has had its permissions altered at some point by someone administering the data. When anonymous public permissions are allowed, the boundary between “the cloud” and the internet dissolves. The data then becomes accessible to anyone, the same as your favorite website.
But whether it’s an Amazon S3 bucket, an Azure file share, a misconfigured GitHub repository, or a vulnerable server set up in the cloud, the failure to guarantee the privacy of the cloud instance puts the data at risk. Once the error is made, it becomes very difficult for organizations to prove that the data was not accessed at some point. Most people wouldn’t know where to start looking, and people with enough technical knowledge might casually browse out of curiosity, but really two main groups of people are scanning for cloud leaks: security researchers and people seeking to exploit the information found for leverage, personal gain, or power.
Examples of Cloud Leaks
Verizon Partner Nice Systems Exposes 6 Million Customer Records to Internet
This information, leaked from an exposed Amazon S3 bucket, included customer details such as name, address, and phone number, as well as some account PINs, used to verify identity with Verizon. Aside from the obvious privacy concerns, the exposed data could have been used maliciously to impersonate customers and even to bypass two-factor authentication.
RNC Vendor Deep Root Analytics Exposes 198 Million Voter Records
One of the largest leaks of all time was discovered when an exposed cloud system was found containing both collected and modeled voter data from Deep Root Analytics, a firm contracted by the RNC for data-driven political strategy. Over 198 million unique individuals were represented in the data set, with personal details, voter information, and modeled attributes including probable race and religion.
Government Contractor Booz Allen Hamilton Leaves Geospatial Data and Credentials Exposed
Another exposed cloud storage instance revealed data from government contractor Booz Allen Hamilton. In addition to data exposure, the bucket also contained encryption keys for a BAH engineer and “credentials granting administrative access to at least one data center’s operating system.” In this case, not only was information compromised, but further systems could have been as well, allowing malicious actors to hop from a publicly exposed cloud server to an authorized server on the network.
An Operational Problem
Cloud leaks are a unique risk for businesses. The simplicity of the error which causes them stands in stark contrast to the magnitude of the consequences that can result from it. Taking advantage of cloud, or employing vendors who do, offers much in the way of functional value. But without accounting for the potential problems native to cloud technology, that functionality will be undermined by an inability to trust it, which in turn will lead to an inability to trust companies who employ it. Cloud leaks are an operational problem, and must be addressed within the IT processes that govern data handling in the cloud to be effectively mitigated.
To understand how cloud leaks happen and why they are so common, we need to step back and look at the way that leaked information is first generated, manipulated, and used. It’s almost taken as a foregone conclusion that these huge sets of sensitive data exist and that companies are doing something with them, but when you examine the practice of information handling, it becomes clear that organizing a resilient process becomes quite difficult at scale; operational gaps and process errors lead to vulnerable assets, which in turn lead to cloud leaks.
One of the great things about digital data is that it can be reproduced cheaply and without degradation. Organizations typically have several copies of production data aside from the one being used in their business processes. Backups, warehousing, disaster recovery, development and testing environments, in-house analytics, outsourced analytics, a laptop someone copied the data to so they could work from home– the list goes on. The point is that data is at once many and one, like how the word data itself is confused between plural and singular. Many copies of a dataset can exist, but there is only one breach possible, where the data set is exposed. The more copies of a data set exist, the higher the risk that the breach will occur.
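To put a rough number on that last point, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each of n copies has the same independent probability p of being exposed over some period; real copies are neither identical nor independent, so treat this as intuition rather than measurement.

```latex
P(\text{at least one copy exposed}) = 1 - (1 - p)^{n}
% Illustrative values only: p = 0.01, n = 10
% 1 - 0.99^{10} \approx 1 - 0.904 = 0.096
```

Under those assumed values, ten copies carry close to ten times the exposure risk of one, even though each individual copy looks safe.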
The Information Economy
Think of the data as passing through a chain of custody. Every copy of the data lives on a server or disk array, passes over network devices and through firewalls, and occasionally ends up on a workstation or laptop. Often it ends up in the cloud, in an internet hosted storage instance, and like any of the other links in the chain, is at potential risk of exposure if not properly configured.
These practices occur in nearly every industry, increasing with scale. It’s not just political campaign analytics, or telecommunications, or defense contractors– it’s everyone. Digitization is a fundamental change in the abstract concept of business itself, and its repercussions affect companies of all stripes. As such, an entire industry dedicated solely to data processing has sprung up to support the informational needs of other companies. Most companies don’t deal in data directly. They deal with cars, or finances, or healthcare, or gadgets. So, like many tertiary business requirements, they outsource data processing to a third party who ostensibly has far more capability to manage it.
But the ease of data replication and the large amount of value companies can extract from data have sped along digitization, increased behavioral reporting and other types of metrics, and pushed infrastructure to faster speeds and larger capacities whenever possible. Not as much attention was given to the business risk posed by these critical, centralized data stores, easily copied and distributed.
The Value of Cloud
In a way, there’s a parallel between analytics vendors and cloud computing– one outsources the labor (and risk) of information, while the other does so for technology. The advantages with cloud are well-known: the overall infrastructure is superior to a small data center; large quantities of servers can be created quickly and programmatically, allowing for better process automation; cloud computing power is elastic and can be scoped closely to needs, and altered without much hassle. But as processes speed up and scope of management increases, operational gaps open in which small, but critical misconfigurations leave production data exposed to the public.
Cloud leaks are possible in any cloud environment (Amazon Web Services, Microsoft Azure, IBM Bluemix) and in any internet-hosted infrastructure where sensitive enterprise information can be made public through accident, oversight, or error. This point is critical: almost all cloud storage offerings, including those on the platforms listed above, are private by default. This means that at some point, the default permissions were altered to allow public access. Cloud leaks do not result from a platform-specific software vulnerability, but from processes that lack the necessary controls to guarantee a secure result.
Process Determines Security
The way in which information is handled differs from place to place, and often from person to person. There are general guidelines– for instance, if you are in a regulated industry and must follow standards such as PCI DSS, HIPAA, or FERPA. But the everyday work that ultimately determines whether a cloud leak will occur varies greatly between and within organizations.
Consider this scenario: a sysadmin is managing storage for an analytics company using Amazon’s S3 cloud service. The sysadmin needs to move some files around, so production data is copied to an unused S3 instance temporarily while the other instances are modified. Once the sysadmin has finally fixed all the buckets, the production data is in place and everything is ready to go. Except the sysadmin forgot to delete the copies of the data from the temporary S3 bucket, and that S3 bucket happens to be configured with full public access.
This might seem like human error. Someone simply forgot to do something. Happens to us all, right? Not exactly. The real problem here isn’t that a person made a mistake– that’s a guaranteed eventuality– the problem is that nothing was in place structurally to prevent the mistake, or at least catch it immediately so it could be fixed.
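One way to picture a structural safeguard for the scenario above is a staging helper that refuses public buckets and always cleans up after itself. This is a minimal sketch, not a production tool: `temporary_staging`, `PublicBucketError`, and the `InMemoryStore` stand-in are all hypothetical names invented for illustration, and a real version would wrap an actual storage API.

```python
from contextlib import contextmanager


class PublicBucketError(Exception):
    """Raised when sensitive data is about to be staged in a public bucket."""


@contextmanager
def temporary_staging(store, bucket, keys):
    """Stage copies of `keys` in `bucket`, then always remove them.

    `store` is a hypothetical storage wrapper exposing is_public(bucket),
    copy_in(bucket, key), and delete(bucket, key).
    """
    if store.is_public(bucket):
        # Catch the mistake before any data moves.
        raise PublicBucketError(f"{bucket} allows public access")
    staged = []
    try:
        for key in keys:
            store.copy_in(bucket, key)
            staged.append(key)
        yield staged
    finally:
        # Runs whether the maintenance work succeeded or failed.
        for key in staged:
            store.delete(bucket, key)


class InMemoryStore:
    """Toy stand-in for real cloud storage, for illustration only."""

    def __init__(self, public_buckets=()):
        self.public_buckets = set(public_buckets)
        self.contents = {}

    def is_public(self, bucket):
        return bucket in self.public_buckets

    def copy_in(self, bucket, key):
        self.contents.setdefault(bucket, set()).add(key)

    def delete(self, bucket, key):
        self.contents.get(bucket, set()).discard(key)
```

The sysadmin would do the bucket maintenance inside a `with temporary_staging(...)` block. Forgetting the cleanup is no longer possible, because the cleanup is part of the structure rather than a step to remember.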
Cloud leaks are not the result of hackers, and they aren’t the fault of individual IT employees. They are the result of fragile business processes incapable of handling the complexity and scale of cloud operations and relying on luck to make up the difference. Security through obscurity is dead. For example, the idea that a cloud instance is secure because nobody else knows the URL is both untrue and a product of wrong thinking. When information is as valuable as it is today and global enterprise operations are fragile enough to just open the data to anonymous internet browsers, techniques have been and will continue to be developed to exploit the situation. This is why resilience must be built into the procedural work that creates, manages, and maintains information technology.
What gets leaked?
The most common type of information exposed in a cloud leak is customer data. This data differs from company to company, but there are usually some common factors involved:
- Identity information - name, address, phone number, email address, username, password
- Activity information - order and payment history, browsing habits, usage details
- Credit card information - card numbers, CVV codes, expiration dates, billing zip codes
Additionally, whatever information is specific to the company is also usually exposed. This can be financials for banks and investment groups, medical records for hospitals and insurers, and for government entities, customer information can include any number of sensitive documents and forms.
But customer information isn’t the end of the story. There are several other types of data that pose significant risk when exposed in a cloud leak. Corporate information can also be leaked. This can include:
- Internal communications - memos, emails, and documents detailing company operations
- Metrics - performance statistics, projections, and other collected data about the company
- Strategy - messaging details, roadmaps, rolodexes and other critical business information
The exposure of this type of information can hamstring company projects, give competitors insight into business operations, and reveal internal culture and personalities. The bigger the company, the more interest there is in this type of data.
But the most dangerous kind of company data to be exposed in a cloud leak is trade secrets. This is information crucial to the business itself: its secrecy is what gives the business the ability to compete, and it is often the target of industrial espionage. Trade secrets can include:
- Plans, formulas, designs - Information about existing or upcoming products and services
- Code and software - Proprietary technology the business sells or built for in-house use
- Commercial methods - Market strategies and contacts
Obviously this data is only advantageous when it is kept secret from competitors. Exposure of this type can be disastrous for a company, undoing years of research and work, devaluing the products and services the business provides.
Finally, analytics rely on large data sets composed of multiple information sources for the purposes of revealing big picture trends, patterns, and trajectory. As powerful as analytics can be for informing business decisions, the data necessary to perform such analyses is also a vector of risk when exposed. Some data of this type includes:
- Psychographic data - Preferences, personality attributes, demographics, messaging
- Behavioral data - Detailed information about how someone uses a website, for example
- Modeled data - Predicted attributes based on other information gathered
This information is extremely powerful in its ability to understand individuals as a set of data points, and then predict with a high degree of accuracy other data points in relation. As abstract as this might sound, consider that this is the type of information gathered on voters that helps political campaigns persuade more effectively at scale.
Business is digitized. Anything that’s anything is some type of information, and that information can be leaked in the cloud without proper process controls in place. The examples above list some types of sensitive and dangerous information, but really everything is at stake. Any aspect of enterprise business (and personal life, for that matter) has its information set. When we say that cloud leaks are a business problem, we mean that information and technology are so critical to the modern enterprise, that their risks have become existential.
How Cloud Leaks Have Been Exploited
The Devil You Know
When examining the types of information above, some obvious threats probably present themselves right away. These vectors have been around for some time, and whether information is leaked through the cloud or obtained through a hack or phishing email, many of the consequences are familiar. We’ll take a brief look at how this type of information is commonly exploited, then move on to more sophisticated uses and what the future holds.
Credit Card Fraud
The first and most obvious example of a crime that exploits leaked data is credit card fraud. By ordering online, people pass their credit card details through the internet, and through the systems of whatever business they are patronizing. Much attention has been paid to the transaction process, and protecting the information as it crosses the internet, but far less has been given to what happens after the data is stored. As we saw earlier, data is copied repeatedly for various purposes, and credit card data is no exception. Skimmers might get a hundred credit cards a day, but a database involved with a cloud leak could contain millions, and requires none of the effort of installing and configuring a malware skimmer.
Black Market Sales
Often those who get the data and those who use it are separate parties. An economy has grown around black market information trading, where data that has been breached is offered up for sale to the highest bidder on the darknet, using difficult to trace cryptocurrency such as Bitcoin. In this case, the information retrievers specialize in getting the data-- from an unsecured cloud instance, from a vulnerable database, through social engineering-- while the information purchasers specialize in using that information-- credit card schemes, SSN and identity fraud, spam and phishing operations.
Sometimes the exposed information is held over the company’s head, either for ransom, or simply to humiliate that company in public. Ransomware attacks like WannaCry lock assets and data, but cloud leaks give that data away to anyone who’s looking. Activists looking to undermine a company with which they disagree might find it advantageous to share that company’s salary structure, internal communications, or business plans. Whatever the specifics, the fact is that cloud leaks give potential adversaries an incredible amount of leverage.
Aside from criminal actors, competitors could easily take advantage of information exposed in a cloud leak. Everything from customer lists to trade secrets gives other companies access to resources and strategy. Competitive insight is something every company works on as part of its marketing operations, and information exposed in cloud leaks makes that insight much sharper.
How Cloud Leaks Could Be Exploited
We mentioned phishing attacks earlier on. The most effective phishing attack is known as spearphishing, because the fake email is laser focused on its recipient, using known information to better impersonate an authority figure or executive. The efficacy of phishing attacks depends on how convincing they are to the person reading them. Many, littered with typos and using strange formatting or HTML, are quite easy to spot, but the best of them look real at first glance. Information exposed in cloud leaks, especially psychographic data and other behavioral analytics, is exactly the kind of information that sharpens social engineering attacks, giving the attacker the ability to use information about a target that they otherwise shouldn’t know.
Personally identifiable information (PII) can be used for more than credit card fraud. Obtaining someone’s personal information and publishing it against their will is a practice known as “doxxing,” and it is done for various reasons, but always leaves the person being doxxed vulnerable to the predations of an unknown number of anonymous people. In cases of political extremism, vendetta, harassment, or stalking, the exposure of PII can lead to actual harm against people. What makes cloud leaks especially dangerous is that they do not require any special technical skills to exploit-- simply go and get the data. This significantly widens the pool of potential discoverers.
Surveillance and Influence
When it comes to psychographic data, the potential uses are limitless. The very purpose of obtaining psychographic data is to better predict how people will react and to shape their reactions. Political campaigns use it to win votes. Businesses use it to win customers. Just given the normal use cases for these kinds of analytics, it’s easy to see how a motivated third party, another state or private business, could obtain this data through a cloud leak, and use the information to their ends. When an RNC contractor leaked the voter data of nearly every registered American voter, it wasn’t just a violation of privacy, but a new vector of possible manipulation.
Finally, cloud leaks throw a wrench into business operations by exposing unvetted information directly to the public. As we’ve seen, the information exposed in a cloud leak can have drastic consequences for companies and even governments. Disruptive attacks are not new; a denial-of-service attack seeks as its goal to simply prevent a resource from being used. Informational disruption is a new class of this type. When company information is exposed in the cloud, it presents an opportunity for that company to be derailed. The motives behind these attacks can be anything from activism to profit, but as with an extortion scenario, exposing this data to the world gives unknown third parties drastic leverage over business interests.
Asking why cloud leaks matter is a lot like asking why data matters. The answer is the same-- information gives people and organizations power. The inadvertent exposure of that data to the public gives others the opportunity to wield that power against the data’s owner. We outlined some of the known and possible manifestations of this; but however data is exploited, as long as our economy, our communications, and our society are digitized, data will be valuable. When it comes to customer data, the main reason cloud leaks matter is that data represents real individuals who often have to bear the repercussions of the leak. Companies who collect data and use it to improve their business have a responsibility to those individuals to handle it carefully.
How Cloud Leaks can be Prevented
When we examined the differences between breaches, attacks, hacks, and leaks, it wasn’t just an academic exercise. The way we think about this phenomenon affects the way we react to it. Put plainly: cloud leaks are an operational problem, not a security problem. Cloud leaks are not caused by external actors, but by operational gaps in the day-to-day work of the data handler. The processes by which companies create and maintain cloud storage must account for the risk of public exposure.
Cloud storage provides speed, scalability, and automation for IT operations. Companies move production datasets in and out of cloud storage as needed, often reusing the same bucket for multiple tasks. Without proper care, it’s easy for a sensitive dataset to be moved into an unsecured bucket. This is why cloud storage configurations should be validated at deployment and throughout their time hosting production data. Continuous validation keeps the risk visible and can even proactively notify administrators if public access becomes allowed.
An example of why process validation is the key to preventing cloud leaks is the fact that Amazon’s S3 storage is private by default. This means that a change to the permission set must occur at some point for the bucket to be exposed to the world. That change– adding access to the All Users or Authenticated Users groups– could only happen inadvertently if there is no control in place to validate that the permissions are accurate. Likewise, if the sensitive data is moved into a bucket that’s already public, no process control around data handling exists to check the permissions as part of the move.
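The permission check described above can be sketched in a few lines. The two group URIs are the ones S3 uses in ACL grants for the All Users and Authenticated Users groups, and the input is assumed to be a dict shaped like the S3 GetBucketAcl response (for example, what boto3's `get_bucket_acl` returns). The function names and the move gate are hypothetical, shown only to illustrate where such a control could sit.

```python
# Group URIs that S3 uses in ACL grants for its two global groups.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
PUBLIC_GROUPS = {ALL_USERS, AUTHENTICATED_USERS}


def public_grants(acl):
    """Return (group_uri, permission) pairs that open the bucket to outsiders.

    `acl` is a dict shaped like the S3 GetBucketAcl response.
    """
    found = []
    for grant in acl.get("Grants", []):
        uri = grant.get("Grantee", {}).get("URI")
        if uri in PUBLIC_GROUPS:
            found.append((uri, grant.get("Permission")))
    return found


def check_before_move(acl, dataset_is_sensitive):
    """Hypothetical gate: refuse to move sensitive data into a public bucket."""
    grants = public_grants(acl)
    if dataset_is_sensitive and grants:
        raise PermissionError(f"destination bucket has public grants: {grants}")
    return True
```

Note that this sketch only inspects ACL grants. Bucket policies can also allow public access, so a fuller control would check those as well.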
It is not human error that leads to cloud leaks. It’s process error. People make mistakes in everything. Enterprise IT is extremely complicated, exacerbating our natural tendency to mess up occasionally. Which is exactly why controls at the process level– structural, automated validation– must be in place to check the work being done. Operations that must be done repeatedly, and that when done incorrectly risk jeopardizing the company, must be controlled to limit that risk as much as possible.
At the enterprise scale, validation can only be achieved if it can fit inside a high speed workflow. When configuration validation becomes a bottleneck to a process, it’s far less likely to be dutifully enforced. If it relies on someone manually checking each configuration, not only can it not be accomplished quickly enough, but it suffers from the same capacity for human error as the original set up. Computers are far better at maintaining uniformity among a series than we are. Automated process controls should act as executable documentation, where important standards such as ensuring all cloud storage is private can be detailed and then checked against the actual state of any cloud storage instances to be sure that they comply.
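As a rough illustration of executable documentation, the standard can be written down as data and then compared against what is actually deployed. The setting names below are illustrative placeholders rather than any platform's real schema, and gathering the observed state is assumed to happen elsewhere.

```python
# The standard, written as data so it can be both read and executed.
# Setting names are illustrative, not a real platform schema.
STORAGE_POLICY = {
    "public_access": False,
    "encryption_at_rest": True,
    "access_logging": True,
}


def violations(policy, actual_state):
    """Compare each bucket's observed settings against the policy.

    `actual_state` maps bucket name -> dict of observed settings.
    Returns a list of (bucket, setting, expected, observed) tuples,
    ordered by bucket name and then setting name.
    """
    problems = []
    for bucket, settings in actual_state.items():
        for setting, expected in policy.items():
            observed = settings.get(setting)  # None means "not reported"
            if observed != expected:
                problems.append((bucket, setting, expected, observed))
    return sorted(problems, key=lambda p: (p[0], p[1]))
```

An empty result means every instance complies. Anything else is a concrete, reviewable list of where reality has drifted from the documented standard. A setting that is not reported at all counts as a violation here, on the assumption that unknown state should not pass silently.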
For example, if we are provisioning S3 buckets in the enterprise, rather than manually creating a bucket in the AWS console and walking through a checklist, we should automate the programmatic creation of buckets using Amazon’s API, and roll a validation step into the process after the bucket is created to check for critical settings like internet exposure. This way, when a cloud storage instance needs to be created, an admin can just kick off a script and be sure that the newly created bucket is up to snuff.
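A provisioning script along those lines might look like the sketch below. It assumes the `s3` argument behaves like a boto3 S3 client (`boto3.client("s3")`) and uses S3's Block Public Access settings as the "critical setting" to verify. Which settings matter for a given organization is a judgment call, and `provision_private_bucket` is a hypothetical name.

```python
REQUIRED_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def provision_private_bucket(s3, name):
    """Create a bucket, lock out public access, then verify the result.

    `s3` is expected to behave like boto3.client("s3"). Outside us-east-1,
    create_bucket also needs a CreateBucketConfiguration argument.
    """
    s3.create_bucket(Bucket=name)
    s3.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration=dict(REQUIRED_BLOCK),
    )
    # Validation step: read the setting back instead of trusting the write.
    observed = s3.get_public_access_block(Bucket=name)[
        "PublicAccessBlockConfiguration"
    ]
    if observed != REQUIRED_BLOCK:
        raise RuntimeError(f"{name} failed validation: {observed}")
    return name
```

The important design choice is the read-back: the script fails loudly if the bucket does not end up in the expected state, rather than assuming the configuration call did what was intended.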
Automation also allows configuration validation to be performed continuously, throughout the asset’s lifetime. This ensures visibility into assets at all times. Change within an enterprise data center is constant; a good process validates that changes do not violate basic standards, and alerts people immediately when they do.
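Continuous validation can be as simple as a recurring pass that compares the current state with the last known state and raises an alert on any new exposure. The sketch below assumes some other component supplies each bucket's public/private status; the `scan` function and its parameters are hypothetical.

```python
def scan(bucket_states, previous_public, alert):
    """One pass of a recurring check.

    bucket_states: dict of bucket name -> True if currently public.
    previous_public: set of buckets that were public on the last pass.
    alert: callable invoked once per bucket that has newly become public.
    Returns the current set of public buckets, to feed into the next pass.
    """
    current_public = {
        name for name, is_public in bucket_states.items() if is_public
    }
    # Only buckets that changed since the last pass trigger an alert.
    for name in sorted(current_public - previous_public):
        alert(name)
    return current_public
```

Run on a schedule, each call receives the set returned by the previous one, so administrators hear about a bucket when it becomes public rather than being reminded on every pass.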
Further obscuring the problem is the distance at which cloud leaks often occur: a third-party vendor doing information processing accidentally exposes the information in the cloud. Because customers associate the dataset with the primary company, that company will be held just as accountable for the leak as if it had been its own servers. This makes assessing and optimizing third-party cyber risk just as important as in-house resilience.
Partnering with another company to handle sensitive information should always entail an assessment of that company’s practices, so the risk they pose by handling that data can be understood. Spending millions on internal cybersecurity only to outsource the same data to someone who leaves it exposed in the cloud doesn’t make sense. Vendors should be selected and appraised with the same care a company takes in protecting their in-house assets and information.
How UpGuard Helps
UpGuard tackles cloud leaks by automating cloud storage validation. Public access is the most dangerous misconfiguration, but as with any digital surface, the total configuration state of cloud storage determines its resilience. UpGuard not only scans storage instances for public exposure, but also checks cloud platforms and servers themselves for misconfigurations that can lead to data exposure. UpGuard validates other internet-facing sources of leaks as well, like misconfigured GitHub repositories and vulnerable rsync servers.
With UpGuard’s visual policies, admins can know in a glance which of their cloud storage instances are public and which are private. New buckets can be validated during the provisioning process automatically with UpGuard’s API and integration with tools like Puppet, Chef, and Ansible.
With UpGuard Procedures, cloud provisioning and maintenance processes can be automated and validated from end to end, reducing operational risk. Executable documentation works best when arranged by process, so that procedural steps can be chained together logically and validated in turn. This produces trustworthy assets and drastically reduces the risk of misconfigurations, such as accidental public exposure.
For example, a procedure automating the creation of a new Linux web server on AWS could:
- Validate that S3 buckets are private and properly configured.
- Validate AWS settings for each cloud server, such as instance type and location.
- Validate server settings against company policy.
- Test the server against CIS security benchmarks.
- Validate specific web server configurations, such as httpd.conf and SSL.
Cloud servers and storage deployed in this manner have a significantly lower risk of data exposure than those lacking these controls.
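The idea of chaining validation steps into a procedure can be sketched generically. This is not UpGuard's implementation or API: `run_procedure` and the placeholder checks below are hypothetical, and real checks would query AWS, the server itself, and benchmark tooling.

```python
def run_procedure(steps, context):
    """Run named validation steps in order, stopping at the first failure.

    steps: list of (name, check) pairs; each check takes `context` and
    returns a truthy value when the asset passes.
    Returns (passed, results), where results lists (name, outcome)
    for every step that actually ran.
    """
    results = []
    for name, check in steps:
        ok = bool(check(context))
        results.append((name, ok))
        if not ok:
            return False, results
    return True, results


# Placeholder checks standing in for real queries against live systems.
WEB_SERVER_PROCEDURE = [
    ("s3 buckets private", lambda ctx: not ctx["bucket_public"]),
    ("instance type approved",
     lambda ctx: ctx["instance_type"] in ctx["approved_types"]),
    ("tls enabled", lambda ctx: ctx["tls_enabled"]),
]
```

Stopping at the first failed step keeps a half-validated server from moving further down the pipeline, and the results list records exactly which step blocked it.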
UpGuard also provides external vendor assessment, analyzing and visualizing the relative risk posed by third parties charged with handling your data. Compare vendors and partners to similar companies to see how they measure up within their field. Our external assessment aggregates every relevant security practice visible from the internet into a single risk score.
This includes website details for all of a vendor’s URLs; email and domain safety, such as protocols against phishing; open ports, like Microsoft’s SMB, which has been exploited by ransomware attacks like WannaCry and Petya; and business details including employee satisfaction and CEO approval ratings.
Cloud leaks are the result of operational error, not human error. A process is missing the necessary controls to reliably produce good results over time. The way to prevent cloud leaks is to shore up those operational gaps by instituting automated validation across all critical assets. Cloud leaks happen because nobody knows that sensitive data is exposed to the internet. Process controls, like those outlined above, ensure that such knowledge is surfaced immediately, so that the exposure can be fixed before it becomes a bigger problem. UpGuard helps prevent data breaches in the cloud and on-premises by automating and visualizing these process controls.