Those interested in how data breaches occur should be familiar with the general topography of the Internet. In our previous piece, we discussed the difference between the surface web, deep web, and dark web. Most estimates of the Internet's topography conclude that the deep web makes up between 95% and 99% of all websites. The dark web likely comprises less than 1%, while the surface web accounts for only a few percentage points itself. Nearly the entire Internet is the deep web. In our history of examining data breaches, we have found that while much focus has been put on the criminal activity occurring in the dark web, the biggest cause of data exposure is a system or dataset ostensibly in the deep web inadvertently bubbling up to the surface.

Data Leaks on the Dark Web

Controversy and intrigue surround the dark web. Rightly so, as it is regularly used by criminals to disseminate stolen information. However, despite the presence of actual criminal activity on the dark web, the process of a stolen corporate dataset ending up for sale there is long and complex, requiring malicious human actors with well-above-average technical know-how and access to a closed network of trusted individuals. Thus the bar for fencing data by this method is comparatively high, meaning the payout would have to be substantial to make the labor worthwhile.

Furthermore, data sold on the dark web has to first be obtained somewhere else. Aside from a small number of insider attacks, where someone who was purposefully granted access misuses it for personal gain or malicious intent, much of that data is gleaned through the surface and deep web. By the time it ends up for sale on the dark web, it has long since been exposed elsewhere. The mystery of the dark web, conjured simply by its name, can lure organizations into misprioritizing efforts there over places where they would be better spent: namely, the deep web.

Indexing the Deep Web

But the deep web isn't indexed by search engines, so how are deep web data leaks discoverable? While the deep web is defined as everything that isn't indexed by the major search engines, in practice Google can be used to find documents, pages, and even entire systems thought to be restricted by their owners. Technically, perhaps, anything that can be found this way no longer qualifies as the deep web. But such a semantic argument is irrelevant to an organization facing a major data exposure because it misunderstood how obscured its resources actually were.

Various Google search techniques, often called "dorks," have been developed to make finding such assets much easier, surfacing them and exposing their contents. For example, storing environment files inside a web-accessible directory can lead to those files being indexed, making common secrets like database credentials easy to find on websites that are misconfigured in this way.
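
To make this concrete, here is a minimal sketch (in Python, using the requests library) that checks whether a few common environment-file paths on a given host are served to anonymous visitors. The hostname and path list are purely illustrative, and a check like this should only ever be run against systems you own or are authorized to test.

```python
# Minimal sketch: check whether common environment files are publicly
# reachable on a web server. Hostnames and paths here are illustrative.
import requests

COMMON_PATHS = ["/.env", "/config/.env", "/app/.env"]

def find_exposed_env_files(host: str) -> list:
    """Return URLs on `host` that serve an environment file anonymously."""
    exposed = []
    for path in COMMON_PATHS:
        url = f"https://{host}{path}"
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # unreachable or refused; nothing exposed here
        # A 200 response containing KEY=value pairs suggests an exposed file.
        if resp.status_code == 200 and "=" in resp.text:
            exposed.append(url)
    return exposed

if __name__ == "__main__":
    print(find_exposed_env_files("example.com"))  # placeholder hostname
```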

But even for sites that can't be reached via the major search engines, other methods exist to find vulnerable resources on the deep web. Just as Google and Bing were built to facilitate finding surface web resources, other indexes have been built to facilitate finding deep web resources. Some of these are as simple as a port scanner, randomly polling networks for active connections and documenting them in a database; powerful scanners like ZMap can poll nearly the entire internet in hours. Other mechanisms, such as certificate transparency logs, catalog registered certificates by domain, capturing hostnames of systems otherwise unknown to the public. Many of these types of indices exist, and more are being built every day.
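
Certificate transparency logs, for example, can be queried directly. The sketch below assumes the public crt.sh service and its JSON output format (both of which may change over time) to collect hostnames seen in certificates issued for a domain, showing how much of an organization's "hidden" footprint is already cataloged.

```python
# Minimal sketch: enumerate hostnames for a domain from a public
# certificate transparency index. crt.sh is used as an example; its
# JSON response format is an assumption and may change.
import requests

def hostnames_from_ct_logs(domain: str) -> set:
    """Collect hostnames seen in certificates issued for `domain`."""
    resp = requests.get(
        "https://crt.sh/",
        params={"q": f"%.{domain}", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    names = set()
    for entry in resp.json():
        # name_value may contain several newline-separated hostnames.
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lower())
    return names

if __name__ == "__main__":
    for host in sorted(hostnames_from_ct_logs("example.com")):  # placeholder domain
        print(host)
```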

Unauthorized Access vs. Unauthenticated Access

It’s important to point out the difference between anonymous access and unauthorized access. Authentication is the process of establishing one's identity. Authorization is the process of determining whether one has permission to access a resource. All of the exposures that the UpGuard research team reports (that is, all data leaks) are open to anonymous access, meaning anyone, regardless of their identity, can access that information. In cases where anonymous access is permitted, everyone is "authorized," in that the rules governing permission allow them access; or, more to the point, if there is no authentication layer, there can be no sorting of authorized from unauthorized.
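
A small sketch may make the distinction concrete. The token table, roles, and resource names below are entirely hypothetical; the point is simply that without an authentication step there is no identity against which authorization rules can be applied.

```python
# Minimal sketch of authentication vs. authorization; the token table
# and role model are hypothetical.
from dataclasses import dataclass
from typing import Optional, Set

TOKENS = {"abc123": "alice"}        # hypothetical token -> username store
ROLES = {"alice": {"analyst"}}      # hypothetical username -> roles store

@dataclass
class User:
    username: str
    roles: Set[str]

def authenticate(token: Optional[str]) -> Optional[User]:
    """Authentication: establish WHO is making the request."""
    username = TOKENS.get(token) if token else None
    if username is None:
        return None  # anonymous: no identity established
    return User(username, ROLES.get(username, set()))

def authorize(user: Optional[User], resource: str) -> bool:
    """Authorization: decide whether that identity may access the resource."""
    if user is None:
        # With no authentication layer there is no identity to evaluate,
        # so every request is effectively "authorized" by default.
        return True
    return "analyst" in user.roles

# Both the anonymous and the authenticated request succeed here,
# which is exactly the condition behind a data leak.
print(authorize(authenticate(None), "customer_records"))      # True
print(authorize(authenticate("abc123"), "customer_records"))  # True
```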

Examples of deep web breaches include UpGuard’s discovery of Facebook user data on Amazon’s S3 cloud storage platform. We've previously written about this design flaw in S3's security model, but in summary, cloud storage is typically considered part of the deep web because it should require authentication to access; however, because cloud storage can be configured to be publicly accessible, many misconfigured storage instances expose sensitive data believed to be secure in the deep web.
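
A rough way to test for this condition is to ask whether a bucket answers unsigned (anonymous) requests at all. The sketch below uses boto3 with request signing disabled; the bucket name is a placeholder, and a check like this should only be run against storage you control.

```python
# Minimal sketch: test whether an S3 bucket answers anonymous (unsigned)
# listing requests. The bucket name is a placeholder.
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from botocore.exceptions import ClientError

def bucket_is_publicly_listable(bucket: str) -> bool:
    """True if the bucket returns object listings without any credentials."""
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    try:
        s3.list_objects_v2(Bucket=bucket, MaxKeys=1)
        return True
    except ClientError:
        # AccessDenied (or similar) means anonymous listing is refused.
        return False

if __name__ == "__main__":
    print(bucket_is_publicly_listable("example-bucket"))
```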

Another example is the exposure of 57 million records of personally identifiable information (PII) for American citizens in a misconfigured Elasticsearch system. Elasticsearch is a powerful data analysis platform, but it can allow anonymous access if not configured properly. Such applications, including databases like MongoDB and file transfer utilities like rsync, are typically considered part of the protected deep web infrastructure, but in many cases that protection is absent, and these systems are being indexed, scanned, and ultimately compromised.
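
Checking whether such a system accepts anonymous requests can be as simple as one unauthenticated HTTP call. The sketch below assumes Elasticsearch's default port and its standard _cat/indices endpoint; the host address is a placeholder, and again this should only be pointed at infrastructure you own.

```python
# Minimal sketch: check whether an Elasticsearch instance responds to
# unauthenticated requests. Host and port are placeholders.
import requests

def elasticsearch_is_open(host: str, port: int = 9200) -> bool:
    """True if the cluster answers basic queries with no credentials."""
    try:
        # _cat/indices lists every index; an open cluster returns it to anyone.
        resp = requests.get(f"http://{host}:{port}/_cat/indices", timeout=5)
    except requests.RequestException:
        return False
    return resp.status_code == 200

if __name__ == "__main__":
    print(elasticsearch_is_open("198.51.100.10"))  # placeholder address
```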

While data traded on the dark web may be gathered using unauthorized (that is, illegal) access, data leaks from the deep web occur due to a lack of authentication. The barrier to accessing this information is much, much lower. While unauthorized access requires breaking into a system, accessing unauthenticated data merely requires waiting for it to be opened to the world. And with the vast majority of all digital resources residing in the deep web, it is inevitable that some of them will be.

How To Prevent Deep Web Data Exposures

Treat cybersecurity as business risk management

Where authentication falls short is in its administration. Creating and maintaining secured online resources, especially at large scale, is a complex task. The business realities of most companies mean that IT departments are fighting for resources like everyone else, and that things that "would be nice" are deferred in favor of items more relevant to the bottom line. Security has, problematically, become one of the first things put into this "would be nice" bucket, since it doesn't provide direct functionality and is thus seen as an additional cost of the service. Because of this, the authentication security of many online resources is inconsistent, leading to the data breaches we're all familiar with. These exposures don't occur because of the actions of malicious outsiders, but because of the internal processes of the organizations themselves.

A multilayered approach to authentication and authorization can greatly reduce the risk of exposure. For example, requiring someone to be on a corporate VPN before they can reach a server they must then log in to minimizes the footprint of the server itself, preventing the system from being detected and indexed from the wider internet while still controlling authorization through user-based credentials, which can be held to best practices to stay more secure.

Rather than being seen as an additional cost to the service provided by technology, security should instead be seen as a risk management technique, mitigating potential damage to the organization. Like all risk, it can seem intangible and far away when things are running smoothly, and larger than life when an incident has already occurred.

Evaluate and control third-party risk

But business technology is an ecosystem, and even if primary operations are squared away, outsourcing data processing to a third party carries the risk of exposure over to their systems and processes as well. Understanding and quantifying this third-party risk is key to preventing third-party data breaches from undermining the security efforts of in-house IT staff and exposing the organization to just as much scrutiny and consequence as if the breach had occurred on a primary system. It is the data that gets exposed; where it lived at the time of exposure is immaterial to most people who trust an organization with that information.

Proactively monitor cybersecurity risk

Finally, the only way for an organization to be sure what its deep web profile looks like is to check continuously. Without proactive efforts to understand which assets are obscured, which require authentication, and whether sensitive data is actually available on the deep (or surface) web, there is no way to prepare for or defend against the potential issues that arise there. Furthermore, proactive efforts go a long way toward demonstrating good faith in the event a breach does occur, paying off both by preventing breaches and by showing customers the organization actually cares. As with all risk, the risk taken on by adopting technology can never be completely mitigated. But those who act proactively to minimize it stand a far better chance of navigating their business through the deep and the dark.
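
In miniature, continuous checking might look like the sketch below: walk a hypothetical asset inventory and report which hosts accept connections on ports commonly associated with services that are often left unauthenticated. A production tool would go much further (verifying whether each service actually responds without credentials, tracking changes over time), but the loop is the essence of proactive monitoring.

```python
# Minimal sketch of a recurring check over an asset inventory: which hosts
# accept connections on ports for commonly exposed services. The inventory
# and port list are illustrative.
import socket

INVENTORY = ["app.example.com", "db.example.com"]  # hypothetical asset list
RISKY_PORTS = {9200: "Elasticsearch", 27017: "MongoDB", 873: "rsync"}

def exposed_services(host: str) -> list:
    """Return services on `host` that accept connections from this vantage point."""
    findings = []
    for port, name in RISKY_PORTS.items():
        try:
            with socket.create_connection((host, port), timeout=3):
                findings.append(f"{name} ({port}) reachable on {host}")
        except OSError:
            continue  # closed, filtered, or unreachable; nothing to report
    return findings

if __name__ == "__main__":
    for host in INVENTORY:
        for finding in exposed_services(host):
            print(finding)
```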

The scale of the problem, as well as the ease of exposure, is why the UpGuard BreachSight team has chosen to help companies proactively defend against deep web data leaks.

Final Thoughts

Ever since digital transformation began, the image of the hoodie-wearing hacker, skateboarding up to an access terminal in the middle of the night and running some code on a laptop reflected in their sunglasses, has dominated the imagination of people considering cybersecurity. This analogy to the material world, to crime and property, has some merit, or data wouldn't be valuable. But it can be overstated to the point of obscuring the real cyber threat to an online presence: the misunderstanding and mismanagement of deep web assets and data.

Understanding the concepts behind each layer of the internet, if not the technical details, can help remove some of the mystery from what are ultimately very practical cybersecurity issues surrounding online data.

