Is having all your data in one place a good idea? The answer is yes, but only if you can adopt what we’ve learned from collecting and storing security and non-security data over the past decade.

In 2019, companies faced persistent threats that kept a tight grip on them for months at a time. We have learned that the data needed to identify these threats can be anywhere in the organization, from DNS data to mail conversations and even banking transactions. To make use of it, you need a way to correlate and analyze data over a longer period of time. A data lake might just be the answer.

What Is a Security-Driven Data Lake?

While a security data lake is meant to store mainly security data, a security-driven data lake is meant to store big data and events in a secure way, giving valuable insights into events beyond traditional security events. This wider intent requires a specific data lake architecture and, just as importantly, buy-in from other key stakeholders within the organization.

The Importance of Non-Security Data

If you take a look at a threat report, you will find many indicators marked as IDS, meaning you can find them in your traditional security events from firewalls, intrusion detection and prevention systems (IDPS), and anti-malware systems. But there are other indicators, such as bank accounts and phone numbers, that are not available in traditional security systems. These can be found in non-security data stored elsewhere, such as bank account transactions or the call registers of your telecommunication systems.

When you are tracking and tracing a security incident, this additional data could very well make a difference.

Building Data Lake Architecture

Building such a security-driven data lake is rather simple. The tools and methods have been widely adopted in the field of big data for years and are just waiting to be used for and by security. However, there are some topics you may want to consider to ensure a successful project.

First, the data lake itself. There are two main projects, Hadoop and Elasticsearch, that can give a solid foundation for your data lake. Either one lets you benefit from fast searching and access to data, combined with a robust foundation for future growth.

  • Hadoop is a framework of tools and software for running big data stores. It is a flagship project of the Apache Software Foundation and has been stable for several years. Its high availability and the separation of data into catalogs make Hadoop a good fit for corporate data lakes with data views for security.
  • Elasticsearch, in my opinion, is the only true data lake when it comes to unstructured data. Especially when you are just starting out, Elasticsearch can be a good starting point for a data lake strategy, as it gives you all the features needed to start investigating your data and to apply basic analytics tools right on the spot.

Data ingestion is the next thing you need to think about, and with ingestion comes the need to normalize your data, meaning that you apply a common structure to the different types of data you want to ingest. The data lake itself can handle unstructured data, so you do not need to worry if you miss a data type or two. But if you want your analysts to profit from the data and you want real-time processing, a common structure for data and alerting is an important step.
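To make the idea of a common structure concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names of the common schema, the shape of the incoming DNS and bank transaction records, and the function names are hypothetical and not taken from any particular product.

```python
from datetime import datetime, timezone

def normalize_dns(record: dict) -> dict:
    """Map a hypothetical DNS log record onto the (assumed) common schema."""
    return {
        "timestamp": datetime.fromtimestamp(record["ts"], tz=timezone.utc).isoformat(),
        "source_type": "dns",
        "entity": record["client_ip"],
        "indicator_type": "domain",
        # Lowercase and drop the trailing dot so lookups compare like with like.
        "indicator": record["query"].lower().rstrip("."),
        "raw": record,
    }

def normalize_bank_transaction(record: dict) -> dict:
    """Map a hypothetical bank transaction onto the same schema."""
    return {
        "timestamp": record["booked_at"],  # assumed to be ISO 8601 already
        "source_type": "bank_transaction",
        "entity": record["customer_id"],
        "indicator_type": "bank_account",
        "indicator": record["counterparty_account"].replace(" ", "").upper(),
        "raw": record,
    }

NORMALIZERS = {
    "dns": normalize_dns,
    "bank_transaction": normalize_bank_transaction,
}

def normalize(source_type: str, record: dict) -> dict:
    """Normalize a record, or wrap it unchanged if the type is unknown."""
    normalizer = NORMALIZERS.get(source_type)
    if normalizer is None:
        # A missed data type is still stored; it just lacks the common fields.
        return {
            "timestamp": None,
            "source_type": source_type,
            "entity": None,
            "indicator_type": None,
            "indicator": None,
            "raw": record,
        }
    return normalizer(record)
```

Keeping the original record under a `raw` key is one possible design choice: analysts get uniform fields to search on, while nothing from the source is lost.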

Data Pipelining and Advanced Analytics

If your security team runs the data lake, it also has to serve the wider audience of consumers of the lake. This makes data pipelining a key concern. You need to ensure that data flows to the right consumers, such as your security information and event management (SIEM) system, which needs security events to be fully operational. Keep in mind that a security-driven data lake is not a replacement for a SIEM. Real-time alerting and correlation of security events is what a SIEM does best, and it may well be needed for that for another decade.
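As a rough sketch of what routing data to the right consumers can look like, the Python below sends security events to a SIEM consumer and every event to an analytics consumer. The `Pipeline` class, the `source_type` field, and the list of security sources are hypothetical names chosen for this example. In a real deployment the consumers would be connectors to actual systems rather than in-memory lists.

```python
from typing import Callable, List, Tuple

Event = dict
Predicate = Callable[[Event], bool]
Consumer = Callable[[Event], None]

class Pipeline:
    """Deliver each published event to every consumer whose predicate matches."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Predicate, Consumer]] = []

    def add_route(self, predicate: Predicate, consumer: Consumer) -> None:
        self._routes.append((predicate, consumer))

    def publish(self, event: Event) -> int:
        """Route one event and return how many consumers received it."""
        delivered = 0
        for predicate, consumer in self._routes:
            if predicate(event):
                consumer(event)
                delivered += 1
        return delivered

# Assumed set of sources that count as traditional security events.
SECURITY_SOURCES = {"firewall", "idps", "anti_malware"}

# In-memory stand-ins for the SIEM and the analytics platform.
siem_inbox: List[Event] = []
analytics_inbox: List[Event] = []

pipeline = Pipeline()
# The SIEM only receives security events.
pipeline.add_route(lambda e: e.get("source_type") in SECURITY_SOURCES, siem_inbox.append)
# The analytics platform receives everything, including non-security data.
pipeline.add_route(lambda e: True, analytics_inbox.append)
```

With this setup, publishing a firewall event reaches both consumers, while a bank transaction reaches only the analytics consumer.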

Advanced analytics can also enable more informative security events. One of the data pipelines you want to create from the start is toward an analytics platform so data can be processed and better insight can be created. Even if you miss security events being forwarded to your SIEM, with the right analytics you can create security events directly from the data stream in the pipeline or by retrospective hunting on the lake. The simplest example would be a blacklist search of non-IDS data in the lake and reporting the findings to the SIEM.
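The blacklist search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are taken to be dictionaries with hypothetical `indicator_type`, `indicator`, `entity`, `timestamp` and `source_type` fields, the blacklist is a set of (type, value) pairs, and the finding format is invented for the example rather than being any SIEM's actual event format.

```python
from typing import Iterable, Iterator, Set, Tuple

def find_blacklist_hits(
    records: Iterable[dict],
    blacklist: Set[Tuple[str, str]],
) -> Iterator[dict]:
    """Yield one finding per lake record whose indicator is on the blacklist."""
    for record in records:
        key = (record.get("indicator_type"), record.get("indicator"))
        if key in blacklist:
            # Hypothetical finding structure that could be forwarded to a SIEM.
            yield {
                "event_type": "blacklist_match",
                "indicator_type": key[0],
                "indicator": key[1],
                "entity": record.get("entity"),
                "observed_at": record.get("timestamp"),
                "source_type": record.get("source_type"),
            }
```

Because the function is a generator, the same code can run over a live stream in the pipeline or over a batch of historical records pulled from the lake for retrospective hunting.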

Final Considerations

Here are some final considerations to keep in mind as you begin building your security-driven data lake:

  • What is the right technology to use? Think about Hadoop and Kafka.
  • Who are my consumers and stakeholders within the organization? Running a security-driven data lake with security and non-security data will need buy-in from many departments. Start the discussion with the chief information security officer (CISO) and other leaders, who may already have their own data lake.
  • What data and volume do I expect? You will need to ingest data and normalize it, so knowing what is coming your way helps you prepare. One way is adopting knowledge from your SIEM, which has been dealing with this data already and is capable of ingesting hundreds of thousands of events per second.
  • What value can I add with advanced security analytics, machine learning and artificial intelligence? The possibilities are endless, so encouraging your data and security analysts to play with the data will show what is possible.
