When we think about cyber attacks, we usually think about the malicious actors behind the attacks, the people who profit or gain from exploiting digital vulnerabilities and trafficking sensitive data. In doing so, we can make the mistake of ascribing the same humanity to their methods, thinking of people sitting in front of laptops, typing code into a terminal window. But the reality is both more banal and more dangerous. Just like businesses, governments, and other organizations, hackers have begun to index data and automate their processes, so the work of finding and exploiting internet-connected systems is now largely performed by computers. There’s no security in obscurity if there’s no obscurity.
If the first stage of the internet focused on building up an information superhighway, the next phase is about finding ways to effectively parse that volume of information within a human context. Big data has only begun to be explored in its predictive and revelatory capacities, because only now do we have the processing power and applications to attempt it. Cybercrime is no less affected by this than any other aspect of digitized society, and understanding how indexing contributes to the discovery and exploitation of vulnerabilities across the internet will help make legitimate systems and data more resilient.
“Let me Google that for you” is a snarky retort to perceived laziness on the part of question askers in a world where the answer to nearly every question, with enough legwork, can be found on a single website. The implication behind this sentiment is that anyone can know anything without having to personally employ another person to pass along the knowledge. Google doesn’t really let you search for vulnerable systems and exposed data. But several specialty search engines do, and in doing so, reveal the power of creating huge data sets that can be searched and organized according to specific needs.
By default, most web servers advertise information about themselves, such as what web server software is powering the site and which version of that software is running. Security best practices have long recommended obscuring these headers, because they significantly narrow down the attack path for a malicious actor, who can easily research vulnerabilities in specific software versions. But a determined person, specifically targeting an organization, will likely discover these details anyway, so why bother obscuring them?
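To make the point about advertised headers concrete, here is a minimal sketch of how little effort it takes to turn a Server header value into a product name and version. The function names and the sample header strings are assumptions made for illustration, and the pattern only handles the common "product/version" form rather than every header a server might send.

```python
import re
from typing import NamedTuple, Optional


class Banner(NamedTuple):
    product: str
    version: Optional[str]


# Matches the leading "product/version" token of a Server header value,
# for example "nginx/1.18.0 (Ubuntu)". The version part is optional.
_BANNER_RE = re.compile(r"^\s*([A-Za-z][\w.-]*)(?:/(\d[\w.]*))?")


def parse_server_header(value: str) -> Optional[Banner]:
    """Split a Server header value into product and version, if present."""
    match = _BANNER_RE.match(value)
    if match is None:
        return None
    return Banner(product=match.group(1), version=match.group(2))


def leaks_version(value: str) -> bool:
    """Return True when the header reveals a specific software version."""
    banner = parse_server_header(value)
    return banner is not None and banner.version is not None
```

A defender could run something like leaks_version over the headers their own sites return to see which ones still reveal a version number.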
Consider a structured data set of all the advertised headers for every site on the internet. It is now possible to search for sites running specific versions of software, those for which an exploit is readily at hand, perhaps. This reduces the time investment of finding vulnerable sites, increasing the payoff for any data that can be exfiltrated or system resources that can be redirected.
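A toy version of such a data set shows why the lookup becomes cheap. The sketch below assumes a handful of made-up records (hostnames under the reserved .example domain, illustrative version strings) and builds a simple inverted index from "product/version" to hosts. It is not how any particular search engine works, only an illustration of the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical sample of collected banners: (host, Server header value).
OBSERVED: List[Tuple[str, str]] = [
    ("a.example", "nginx/1.18.0"),
    ("b.example", "Apache/2.4.41 (Ubuntu)"),
    ("c.example", "nginx/1.18.0"),
    ("d.example", "Apache"),  # version hidden by the operator
]


def build_index(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each "product/version" token to the hosts advertising it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for host, header in records:
        parts = header.split()
        token = parts[0].lower() if parts else ""
        if "/" in token:  # only banners that reveal a version are indexed
            index[token].add(host)
    return index


def hosts_running(index: Dict[str, Set[str]], product: str, version: str) -> List[str]:
    """Look up every host that advertised this exact product and version."""
    return sorted(index.get(f"{product.lower()}/{version}", set()))
```

Once the index exists, a single dictionary lookup answers "who runs this version?", and the host that hides its version (d.example here) never appears in the results.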
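It’s a difference of process. In a non-indexed environment, an attacker typically starts with a target, works out what software it runs, and then looks for a vulnerability that applies.
In an indexed environment, however, the process can be reversed: the attacker starts with a known exploit and searches the index for every system advertising the affected software.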
The way the work of hacking is performed has changed, which in turn has changed the scope and properties of many threat vectors. Understanding the process by which cyber attacks occur helps inform how operational and security processes should be undertaken to defend against them.
Shodan.io is a search engine for finding specific types of devices on the internet. It collects over a billion banners a month from devices on the internet, indexes them, and provides a frontend search utility. For almost five years, people have been using Shodan to find internet-connected devices ranging from traffic light controls, to vulnerable TRENDnet cameras, to everything IoT.
Not to be outdone, a newer service called Censys has provided even more functionality, with internet-wide search capabilities based on constantly updated IP maps. Here, searches can be made based on IP, software type and version, keyword, and port number. Just recently, Censys was used to confirm 220,000 vulnerable Arris modems on the internet.
Another interesting index is PublicWWW, which lets users search “source code” and find who is advertising where, what kind of scripts are running, what software is powering web applications, what kind of analytics are being gathered, and other details buried among the billions of lines of code across the internet. While this can be used for marketing and advertising metrics and research, it can also be used to glean useful information about web application operations and security practices.
The sites listed above are but a sampling of what can be done with big data collection and analysis across the internet. Other programs, such as the now-defunct PunkSpider, sought to provide direct search for exploitable vulnerabilities in applications across the web. We don’t know exactly what kind of potentially malicious indices will be created in the future, but we do know that everything from business websites, to search engines, to smartphone apps is collecting and recording as much data as possible from its users. Hackers and cybercriminals are no different and will be driving their future efforts with all the data at their disposal.
The indexing of the internet has taken us to a strange place where massive data stores have been created and given searchable and intuitive frontends, creating efficient channels for finding needles in haystacks. The less information about systems that is openly available on the internet, the fewer indices they will end up on, reducing the likelihood of being targeted among lower-hanging fruit. Advertising server headers was once perhaps beneficial for compatibility and troubleshooting, or at worst it was innocuous because the method of exploiting them was impractical. But those same practices in today’s environment become much more dangerous when indexed among a searchable, internet-wide data set. Practicing cyber resilience and shoring up assets and data as they are deployed and handled will prevent things like unnecessary open ports, which can be found easily through services like Shodan and used to access sensitive data across the internet.
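One way to act on the point about unnecessary open ports is to check your own hosts before an index does. The following is a minimal sketch, assuming a host you own or are authorized to test and an allowlist of ports you expect to be open. The function names and the allowlist are hypothetical, and a real audit would rely on purpose-built tooling.

```python
import socket
from typing import Iterable, List, Set


def open_ports(host: str, ports: Iterable[int], timeout: float = 0.5) -> List[int]:
    """Return the ports on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


def unexpected_ports(found: Iterable[int], allowed: Set[int]) -> List[int]:
    """Return open ports that are not on the expected allowlist."""
    return sorted(set(found) - allowed)
```

For example, unexpected_ports(open_ports("127.0.0.1", range(1, 1025)), {22, 443}) would list any low-numbered port listening on the local machine other than the two that were expected.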