Most descriptions of the Internet divide it into three layers, or levels: the surface web, the deep web, and the dark web. These categorizations can be both useful and misleading. The words “deep” and “dark” carry connotations that often obscure the technical and logical reasons for each designation. We’ll define these concepts, and review the three internet layers mentioned above, with an eye to the difference between what they sound like they might be, and what they really are.
What is the Surface Web?
As with many things in life, the things we believe ourselves to be most familiar with are the least known to us. We, the general public, are immersed in the surface web. It’s where our daily online activities take place, and what most people consider to be the Internet. This overfamiliarity with it gives us a sense that we understand what it is and how it works, because, well, it works.
But even beyond the technical complexities of the internet, the essence of the surface web can be difficult to grasp. Most definitions of the surface web reference conventional search engines' ability to index a site: if I can Google something, it’s on the surface web. As we’ll see later when discussing the deep web, that’s not quite true, but for most normal online processes that definition will suffice.
Web Indexing Explained
Search indexing helps make the Internet usable by most people. How do most people actually find things online? The answer is obvious: by using one of a handful of powerful search engines like Google or Bing. Increasingly, people don’t remember, or have never experienced, what it was like to navigate the internet before search engines such as Yahoo and, later, Google took over. Websites used to serve as indexes for their particular topics, and structures like webrings helped like-minded people find each other. Search engine indexing relies on a process called “crawling”: the search engine acts as a sort of virtual spider, crawling between web pages by following hyperlinks. Google's crawler reads each indexable page on the Internet, turning it into structured data that can be processed. Algorithms then organize that structured data so it can be returned to a Google user instantly when they perform a search.
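The crawl-and-index loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's actual crawler: the pages, URLs, and link structure below are all hypothetical, and a real crawler would fetch documents over HTTP rather than from a dictionary.

```python
from html.parser import HTMLParser

# A toy "web": URL -> HTML content. All pages and links here are hypothetical.
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/blog/post-1">Post 1</a>',
    "/blog/post-1": '<a href="/home">Home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, mimicking how a crawler follows hyperlinks."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start):
    """Breadth-first crawl: fetch a page, extract its links, queue any unseen ones."""
    index = {}            # the crawler's "index": URL -> links discovered on that page
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in index or url not in PAGES:
            continue      # skip pages already indexed or unreachable
        parser = LinkExtractor()
        parser.feed(PAGES[url])
        index[url] = parser.links
        queue.extend(parser.links)
    return index

print(sorted(crawl("/home")))  # every page reachable by links ends up in the index
# → ['/about', '/blog', '/blog/post-1', '/home']
```

Note what this implies for the deep web: a page that no other page links to never enters the queue, so the crawler never sees it.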
To most people, it seems like Google is simply the homepage of the World Wide Web, and that its search functionality is an attribute of the Internet itself. Of course, we know this isn’t true, and the surface web comprises only a small percentage of the entire internet. Most of the internet is divided up into restricted pockets known as the deep web.
What is the Deep Web?
So if the surface web is the part of the Internet we can easily see, then the deep web by definition is the part of the web that is less visible to the naked eye. The deep web is massive. It's estimated that deep web content amounts to anywhere from 500 to 5,000 times as much as what is readily accessible on the surface web.
The ways in which the deep web is “deep” can be put into two categories: obscurity and authentication. Obscurity simply refers to the inability to find an internet resource through one of the major search engines. This can be achieved easily by adding a suitable robots.txt file to a website, which asks search engine crawlers not to index the site, and therefore prevents those pages from displaying in SERPs (Search Engine Results Pages). As the old, and as we will see outdated, saying goes: security by obscurity. If nobody can find my stuff, how can they access it?
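A robots.txt file that keeps an entire site out of search results can be as short as this (a generic example, not any particular site's configuration):

```
# robots.txt, served at the site root, e.g. https://example.com/robots.txt
User-agent: *      # applies to all crawlers
Disallow: /        # ask them not to crawl any page on the site
```

It's worth noting that compliance is voluntary: well-behaved crawlers like Googlebot honor these directives, but nothing technically stops a malicious scanner from ignoring them, which is one reason obscurity is a weak defense.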
The other portion of the deep web is authenticated. Authentication refers to a requirement of credentials establishing one's identity in order to access the systems and information. Whether or not the resources are indexed on a search engine, a visitor must possess valid credentials, such as a username and password, to go “deeper” into the site than the login wall. When a customer accesses a bank's internet banking service or a private social media account, the portion behind the login page is considered part of the deep web, because a search engine's crawler is typically prevented from accessing the resources behind the login page.
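The login-wall behavior described above can be sketched as a tiny request handler. This is an illustrative toy using HTTP Basic authentication, not any real bank's system; the paths, usernames, and password are invented, and a real system would store salted password hashes rather than plaintext.

```python
import base64

# Hypothetical credential store. Real systems store salted hashes, never plaintext.
USERS = {"alice": "correct horse battery staple"}

def handle_request(path, authorization=None):
    """A minimal login wall: anything under /account/ requires valid credentials."""
    if not path.startswith("/account/"):
        return 200, "public page"      # surface-web style: open to anyone, crawlers included
    if authorization is None or not authorization.startswith("Basic "):
        return 401, "login required"   # crawlers stop here, so this content stays "deep"
    decoded = base64.b64decode(authorization[len("Basic "):]).decode()
    user, _, password = decoded.partition(":")
    if USERS.get(user) == password:
        return 200, f"private dashboard for {user}"
    return 401, "login required"

token = base64.b64encode(b"alice:correct horse battery staple").decode()
print(handle_request("/account/dashboard"))                    # → (401, 'login required')
print(handle_request("/account/dashboard", "Basic " + token))  # → (200, 'private dashboard for alice')
```

A search engine's crawler presents no credentials, so it only ever sees the 401 response; everything behind the wall remains unindexed, i.e. part of the deep web.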
Both of these methods have their advantages and disadvantages. Obscurity is cheap to implement, both financially and in administrative overhead. The unfortunate disadvantage is that the obscurity itself is illusory, and growing more so every day, giving those who depend on it a false sense of security about their assets. Authentication is much better at preventing unauthorized access to resources. However, depending on the systems and software involved, authentication can be very complex to implement, and errors in implementation can undermine its purpose entirely.
What is the Dark Web?
The dark web has more popular recognition than the deep web, especially in its association with illicit online activities and depiction on mainstream TV shows. It’s illegal to sell corporate data, for example, which seems to imply that the dark web would be the place to find the greatest treasure troves of maliciously obtained datasets. But what is the dark web? The dark web refers to any number of self-contained, encrypted overlay networks that live on top of the internet, inaccessible save for special tools and protocols, hence dark.
TOR (The Onion Router) Network
The most popular dark web network is the TOR (The Onion Router) network. There are several others, but none with as large a user base as TOR. Accessing the TOR network requires a special web browser called a TOR browser. The TOR network allows people to access private, specially encrypted pages ending in “.onion”. TOR also lets users connect to websites on the surface and deep web anonymously, preventing internet service providers from seeing what they are browsing. Similarly, the websites themselves cannot easily track these users, as their browsing traffic (including IP address) is encrypted and routed through a series of volunteer-operated servers called nodes (or relays).
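The "onion" metaphor can be sketched as layered encryption: the client wraps a message in one layer per relay, and each relay peels off exactly its own layer. The sketch below uses a toy XOR cipher purely for illustration; TOR's real circuits use proper cryptography (TLS and per-hop symmetric keys), and the relay names and keys here are invented.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher (XOR) standing in for real per-hop encryption. NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Three volunteer relays, each sharing a distinct key with the client (hypothetical keys).
RELAY_KEYS = {"entry": b"key-entry", "middle": b"key-middle", "exit": b"key-exit"}
ROUTE = ["entry", "middle", "exit"]

def wrap(message: bytes) -> bytes:
    """Client side: apply one encryption layer per relay, innermost layer for the exit."""
    packet = message
    for relay in reversed(ROUTE):
        packet = xor_cipher(packet, RELAY_KEYS[relay])
    return packet

def route(packet: bytes) -> bytes:
    """Network side: each relay peels exactly one layer with its own key. Only the
    exit relay sees the plaintext, and no single relay knows both sender and destination."""
    for relay in ROUTE:
        packet = xor_cipher(packet, RELAY_KEYS[relay])
    return packet

onion = wrap(b"GET /page")
print(route(onion))  # → b'GET /page'
```

The key property is the split of knowledge: the entry relay knows who sent the packet but not where it's going, the exit relay knows the destination but not the sender, and the middle relay knows neither.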
History of the Dark Web
We can't discuss the history of the dark web without understanding the history of TOR, the most popular dark web network. The principle of ‘onion routing’, which underpins TOR, was developed by the researchers Paul Syverson, Michael G. Reed and David Goldschlag at the United States Naval Research Laboratory in the 1990s. The very first version of TOR, named ‘The Onion Routing Project’ or simply the TOR Project, was developed by Roger Dingledine and Nick Mathewson. It was launched on September 20, 2002. The development of TOR was then carried out under the EFF (Electronic Frontier Foundation).
Ultimately, TOR was taken over by a non-profit organization, The Tor Project, Inc., which was founded by Dingledine, Mathewson and five others. The history of the organization has been a chequered one, with allegations of bullying and sexual harassment against Jacob Appelbaum, a core developer and public face of TOR, which were determined to be accurate after a private investigation in 2016.
The Tor Project Inc. reconstituted its board in 2016. Since this turmoil, management and oversight of the software powering the largest dark web network has been fairly stable and consistent.
Who uses the Dark Web?
Like the deep web, the dark web is used for positive as well as nefarious purposes. On the positive side, dissidents, journalists, whistleblowers and advocates for freedom of speech are able to express themselves on the dark web with less risk of exposing their real-world identities. For example, in countries with restrictions on freedom of speech such as China, users have successfully bypassed internet controls using TOR.
However, the darker side of the dark web is more infamous. A case in point is the Silk Road dark web site. Founded in 2011 and shut down by the FBI in 2013, the Silk Road was a black market for all kinds of illegal activity. From illegal drugs to stolen credit cards and hijacked subscription accounts for services such as Netflix and Spotify, the Silk Road was an online portal for people to engage anonymously in illicit trade. Since then, the Silk Road has reappeared and been shut down again, but has never regained its former popularity.
Monitoring the dark web for illegal activity is something that law enforcement agencies all over the world are grappling with. Due to encryption and other privacy features, catching criminals on the dark web is difficult, and often requires agencies such as the FBI to go deep undercover to surface criminal activity.
The Final Word: Dark Web vs. Deep Web
Curiosity is natural, and the general public is rightfully curious about the difference between these parts of the Internet. Despite listings of dark web sites on platforms like Reddit and Twitter, the average internet user will not encounter anything on the dark web without installing special software such as a TOR browser. The deep web, however, is another story. Most of our usage of the "private" internet is through authenticated portions of the deep web. Correspondingly, most of the data we deem private and important is contained within the deep web, and poses a far greater threat if compromised.
In our next piece, we go further into data exposures on the deep web, and what specific risks companies need to be worried about when dealing with this emerging threat.
99% of the Internet is on the deep web. Learn how UpGuard helps companies proactively monitor the deep web for data leaks, and close them.