According to a 2021 study by UpGuard, over 51% of analyzed Fortune 500 companies were unknowingly leaking sensitive metadata in public documents - data leaks that could be valuable in a reconnaissance campaign preceding a major data breach. Without timely detection, every corporate (and personal) account impacted by a data leak is at critical risk of compromise, which in turn places any associated private internal networks at high risk of unauthorized access and sensitive data theft.

Given the volatile nature of digital identifiers, false positives are an unavoidable by-product of every data leak detection initiative. Thankfully, they can be significantly reduced by implementing specific strategies.

To learn how to reduce data leak false positives and improve the efficiency of your data leak remediation efforts, read on.

What is a False Positive in Data Leak Detection?

In the context of data leaks, a false positive is an incorrect detection that triggers an alert for exposed sensitive data when no genuine exposure has occurred. Such false alerts further pollute the already noisy stream of potentially malicious activity that security teams continuously monitor.

False positives are more than just an inconvenience. They force cybersecurity teams to deploy incident response plans for events that aren’t detrimental to an organization’s security posture. If false-positive rates don’t decrease over time, security personnel become desensitized to their triggers, resulting in real threats being mistaken for false alarms. Cybercriminals aware of this phenomenon could leverage alert fatigue to progress their malicious activity without disruption.

Other clever cybercriminals purposely create false positives to divert attention away from their cyberattack lifecycle.

Why Do False Positives in Data Leak Detection Happen?

There are two primary reasons false positives occur during data leak detection.

1. Data Leak Detection Mechanisms That Depend on Pattern Matching Are Too Binary

Data leak detection tools trigger alerts by comparing surface and dark web scans against a database of monitored keywords. These keywords could include business email addresses, passwords, or secret keys.

The problem with this approach is that it’s too myopic and fails to consider the broader context of the data set containing the potential cyber threat.

For example, suppose an organization's secret key begins with the string “UpuQ.” If this string were set as a monitored keyword for a data leak detection campaign and the following source code dump were detected on GitHub, a data leak alert would be triggered.

Example of a data leak false positive.

This, however, would be a false alarm since a broader contextual evaluation reveals that this keyword sits within a random string list.
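
To make the mechanics concrete, here's a minimal Python sketch of such a purely pattern-based detector. The keywords, scanned text, and function names are hypothetical illustrations, not any vendor's actual implementation.

    # Minimal sketch of a purely pattern-based leak detector (hypothetical).
    MONITORED_KEYWORDS = [
        "jane.doe@example.com",  # a business email address
        "UpuQ",                  # the first characters of a secret key
    ]

    def scan_for_leaks(document: str) -> list[str]:
        """Return every monitored keyword found anywhere in the document."""
        return [kw for kw in MONITORED_KEYWORDS if kw in document]

    # Any substring match triggers an alert, regardless of context, so a random
    # token that happens to contain "UpuQ" raises a false positive.
    scraped_dump = 'tokens = ["x9UpuQk2", "m3fJw8Qa", "r7TtYv1z"]'
    print(scan_for_leaks(scraped_dump))  # ['UpuQ'] -> false-positive alert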

A better keyword to monitor would contain more characters, increasing contextualization and the likelihood of a true positive. By slightly expanding the monitored keyword to include a character commonly used in the organization's secret keys (for example, “UpuQ-”), the accuracy of notifications could increase.
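
As a rough illustration of that refinement (the sample strings below, including the fully visible key, are invented for the example):

    # Hypothetical refinement: the monitored keyword now includes the "-"
    # delimiter the organization actually uses in its secret keys.
    EXPANDED_KEYWORD = "UpuQ-"

    random_string_dump = 'tokens = ["x9UpuQk2", "m3fJw8Qa"]'  # the earlier false positive
    genuine_exposure = "APP_API_SECRET=UpuQ-9f2c8a55d"        # invented key for illustration

    print(EXPANDED_KEYWORD in random_string_dump)  # False -> no alert
    print(EXPANDED_KEYWORD in genuine_exposure)    # True  -> alert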

This solution, however, doesn't account for all threat detection scenarios. Some data leaks only reveal portions of compromised credentials or other sensitive data, which could trigger a critical alert when the exposure doesn't warrant one.

For example, a data leak detection mechanism that primarily depends on a match for the keyword “UpuQ-” would trigger an alert for the following find:

APP_API_SECRET=UpuQ-***55d

While remediation efforts for such a detection are necessary, they aren’t critical. Besides reducing the number of false positives, an ideal data leak detection solution should support the optimization of remediation efforts, allowing cyber teams to focus on the most critical threats to an organization’s security posture.
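
One way to support that kind of prioritization is to grade each finding instead of treating every keyword match identically. The sketch below assumes a hypothetical key format and severity rules; it isn't a description of any particular product's logic.

    import re

    # Assumed key format and severity rules, for illustration only: a fully
    # visible secret is critical, a partially masked match is high severity.
    FULL_KEY_PATTERN = re.compile(r"UpuQ-[A-Za-z0-9]{8,}")
    REDACTED_KEY_PATTERN = re.compile(r"UpuQ-\*+[A-Za-z0-9]*")

    def classify(finding: str) -> str:
        """Grade a keyword match so remediation can be prioritized."""
        if FULL_KEY_PATTERN.search(finding):
            return "critical"  # the full secret is exposed: rotate it immediately
        if REDACTED_KEY_PATTERN.search(finding):
            return "high"      # partial exposure: remediate, but not a crisis
        return "informational"

    print(classify("APP_API_SECRET=UpuQ-***55d"))  # high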

2. Cybercriminals Purposely Create False Positives to Obfuscate Their Trails and Detect Investigation Efforts

Advanced threat actors, aware of how prone pattern-recognition mechanisms are to false positives, purposely publish fraudulent data dumps designed to trigger false data leak alerts. They do this either to detect whether threat intelligence and security operations teams are tracking a gang's activities or to divert real-time monitoring efforts away from legitimate cybercriminal movements.

This is a tactic advanced ransomware gangs are likely to implement. Following a ransomware infection of a high-profile target, law enforcement agencies will attempt to track down the responsible parties, either by monitoring activity across cryptocurrency exchanges or by watching for compromised data deposits on dark web marketplaces and Telegram groups.

A strategically designed trail of false data leaks could divert law enforcement away from a cybercriminal gang's real activity, allowing its members to avoid apprehension. Ransomware gangs often post false data breach announcements on their ransomware blogs to mislead and disrupt security investigations.


How to Reduce False Positives in Data Leak Detection

Given the rapidly expanding domain of data leak dumps (dark web, deep web, and surface web marketplaces, cybercriminal forums, Telegram groups, etc.), rapid data leak detection is only possible with the support of solutions powered by artificial intelligence and machine learning.

However, since even the most advanced machine learning algorithms struggle to contextualize data leak finds correctly, such security tools should be coupled with a manual evaluation by security analysts.

Venn diagram: AI mechanisms and the human factor overlap at optimal data leak detection.

Data leaks confirmed by security analysts are less likely to be false positives, and the combined approach also reduces false negatives. By responding to these filtered results, organizations are less likely to waste their limited security resources on innocuous events.

Here’s an example of a data leak detection that's likely to be a true positive after contextual evaluation.

Example of a data leak true positive.

Since the monitored keyword is detected within a visible string and the leak was included in a data set containing administrator account information, this data leak should be classified as critical. An automated detection solution is unlikely to accurately measure the significance of such a find or the urgency of the response it warrants. Such a find can only be reliably prioritized when it's flagged by a security expert who understands the broader contextual significance of the exposure.

Here’s an example of an optimal data leak management model that includes a human factor in its workflow (a minimal code sketch of the same flow follows the list).

  • Step 1: Data leaks that match monitored keywords are detected with an AI-assisted platform.
  • Step 2: Detected data leaks are reviewed by security analysts, who manually evaluate their context and filter out false positives.
  • Step 3: Security analysts pass all concerning data leaks and their corresponding prioritization and remediation suggestions to internal security teams.
  • Step 4: Internal security teams address the shortlist of detected data leaks based on the prioritization suggestions of the security analysts.
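
The following bare-bones Python sketch strings these four steps together. All class, function, and variable names are hypothetical, and the analyst verdict is simulated; in practice, Steps 2 and 3 are human judgments.

    from dataclasses import dataclass

    @dataclass
    class Detection:
        """A keyword match surfaced by the automated (AI-assisted) scan."""
        source: str
        snippet: str
        keyword: str
        analyst_verdict: str | None = None  # set during manual review
        priority: str | None = None

    def automated_scan(documents: list[tuple[str, str]], keywords: list[str]) -> list[Detection]:
        """Step 1: flag every document that contains a monitored keyword."""
        return [
            Detection(source=src, snippet=text, keyword=kw)
            for src, text in documents
            for kw in keywords
            if kw in text
        ]

    def analyst_review(detections: list[Detection]) -> list[Detection]:
        """Steps 2-3: analysts confirm context, drop false positives, assign priority."""
        confirmed = []
        for d in detections:
            d.analyst_verdict = "confirmed"  # placeholder for a manual triage decision
            d.priority = "critical" if "admin" in d.snippet.lower() else "high"
            confirmed.append(d)
        return confirmed

    def remediate(confirmed: list[Detection]) -> None:
        """Step 4: internal teams work the shortlist in priority order."""
        for d in sorted(confirmed, key=lambda d: d.priority != "critical"):
            print(f"[{d.priority}] keyword '{d.keyword}' exposed in {d.source}")

    remediate(analyst_review(automated_scan(
        [("paste-site-dump-01", "admin_user=ops APP_API_SECRET=UpuQ-***55d")],
        ["UpuQ-"],
    )))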

How UpGuard Can Help You Reduce False Positive Data Leaks

UpGuard offers a data leak detection managed service combining an AI-assisted attack surface search engine with manual analysis and verification from security analysts to reduce false positives and unnecessary response efforts.

By also advising internal security teams on how best to respond to detected data leaks, UpGuard can help your organization implement a highly efficient data leak detection program.

Watch the video below for an overview of some of UpGuard's data leak detection features.
