Is having all your data in one place a good idea? The answer is yes, but only if you can adopt what we’ve learned from collecting and storing security and non-security data over the past decade.

In 2019, companies faced persistent threats that kept a tight grip on them for months at a time. We have learned that the data needed to identify these threats can sit anywhere in the organization, from DNS data to mail conversations and even banking transactions. To make use of it, you need a way to correlate and analyze data over a longer period of time. A data lake might just be the answer.

What Is a Security-Driven Data Lake?

While a security data lake is meant to store mainly security data, a security-driven data lake is meant to store big data and events in a secure way, giving valuable insights into events beyond traditional security events. This wider intent requires a specific data lake architecture and, just as importantly, buy-in from other key stakeholders within the organization.

The Importance of Non-Security Data

If you look at a threat report, you will find many indicators marked as IDS, meaning you can expect to find them in traditional security events from firewalls, intrusion detection and prevention systems (IDPS), and anti-malware systems. But other indicators, such as bank accounts and phone numbers, are not available in traditional security systems. They can be found in non-security data stored elsewhere, such as bank transaction records or the call registers of your telecommunication systems.

When you are tracking and tracing a security incident, this additional data could very well make a difference.

Building Data Lake Architecture

Building such a security-driven data lake is rather simple. The tools and methods have been widely adopted in the field of big data for years and are just waiting to be used for and by security. However, there are some topics you may want to consider to ensure a successful project.

First, the data lake itself. Two main projects, Hadoop and Elasticsearch, can give your data lake a solid foundation. Both offer fast search and access to data, combined with room for future growth.

  • Hadoop is a framework of tools and software for running big data stores. It is a flagship project of the Apache Software Foundation and has been stable for several years. Its high availability and its separation of data into catalogs make Hadoop a good fit for corporate data lakes with data views for security.
  • Elasticsearch, in my opinion, is the only true data lake when it comes to unstructured data. Especially when you are just starting out with a data lake strategy, Elasticsearch can be a good starting point, as it gives you all the features you need to investigate your data and apply basic analytics tools right on the spot.

Data ingestion is the next thing to think about. With ingestion comes the need to normalize your data, meaning that you apply a common structure to the different types of data you want to ingest. The data lake itself can handle unstructured data, so you do not need to worry if you miss a data type or two. But if you want your analysts to profit from the data and you want real-time processing, a common structure for data and alerting is an important step.
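As a minimal sketch of what such a common structure could look like, the Python below maps a DNS log entry and a bank transaction onto the same four fields. The schema and the raw field names (ts, qname, client, booked_at, to_account) are assumptions made for illustration, not a standard or any product's format.

```python
from datetime import datetime, timezone

# Hypothetical common schema: every normalized event carries these keys.
COMMON_FIELDS = ("timestamp", "source", "event_type", "indicators")


def to_iso(epoch_seconds):
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def normalize_dns(raw):
    """Map a raw DNS log entry (assumed field names) to the common schema."""
    return {
        "timestamp": to_iso(raw["ts"]),
        "source": "dns",
        "event_type": "query",
        "indicators": {
            # Lowercase and drop the trailing dot so lookups match reliably.
            "domain": raw["qname"].rstrip(".").lower(),
            "ip": raw["client"],
        },
    }


def normalize_bank_transaction(raw):
    """Map a raw bank transaction (assumed field names) to the common schema."""
    return {
        "timestamp": to_iso(raw["booked_at"]),
        "source": "banking",
        "event_type": "transaction",
        "indicators": {
            # Strip spaces and uppercase so account numbers compare equal.
            "account": raw["to_account"].replace(" ", "").upper(),
        },
    }
```

Keeping all searchable values under one "indicators" key is a design choice in this sketch: it lets a later search treat security and non-security records the same way.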

Data Pipelining and Advanced Analytics

If your security team runs the data lake, you also have to deal with the wider audience and consumers of the lake. This makes data pipelining a key concern. You need to ensure that data flows to the right consumers, such as your security information and event management (SIEM) system, which needs security events to be fully operational. Keep in mind that a security-driven data lake is not a replacement for a SIEM. Real-time alerting and correlation of security events is what a SIEM does best, and it may well be needed for that for another decade.
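The routing idea can be sketched in a few lines of Python. This is an in-memory illustration only: the destination names, the set of security sources and the "source" field are assumptions, and a real pipeline would hand events to a message broker or forwarder rather than to lists.

```python
from collections import defaultdict

# Hypothetical set of sources whose events the SIEM needs.
SECURITY_SOURCES = {"firewall", "idps", "anti-malware"}


def route(event):
    """Return the destinations an event should be delivered to."""
    # Every event is stored in the lake and offered to the analytics platform.
    destinations = ["lake", "analytics"]
    # Only traditional security events are forwarded to the SIEM.
    if event.get("source") in SECURITY_SOURCES:
        destinations.append("siem")
    return destinations


def dispatch(events):
    """Group events per destination; stands in for real delivery."""
    queues = defaultdict(list)
    for event in events:
        for destination in route(event):
            queues[destination].append(event)
    return queues
```

For example, dispatch([{"source": "firewall"}, {"source": "banking"}]) places both events in the lake and analytics queues, and only the firewall event in the SIEM queue.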

Advanced analytics can also enable more informative security events. One of the data pipelines you want to create from the start is toward an analytics platform so data can be processed and better insight can be created. Even if you miss security events being forwarded to your SIEM, with the right analytics you can create security events directly from the data stream in the pipeline or by retrospective hunting on the lake. The simplest example would be a blacklist search of non-IDS data in the lake and reporting the findings to the SIEM.
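A blacklist search of this kind could look roughly like the following Python sketch. It assumes lake records are dictionaries with an "indicators" mapping of indicator type to value, and that the watchlist comes from a threat report. Both shapes, and the finding format, are hypothetical. How the message reaches the SIEM (syslog, an API, a queue) is left out because it depends on the product.

```python
import json


def hunt(records, watchlist):
    """Return one finding per watchlist hit in the given lake records."""
    findings = []
    for record in records:
        for kind, value in record.get("indicators", {}).items():
            # watchlist maps an indicator type to a set of known-bad values.
            if value in watchlist.get(kind, set()):
                findings.append({
                    "event_type": "watchlist_hit",
                    "indicator_type": kind,
                    "indicator": value,
                    "seen_at": record.get("timestamp"),
                    "seen_in": record.get("source"),
                })
    return findings


def to_siem_message(finding):
    """Serialize a finding as JSON, ready to hand to a SIEM forwarder."""
    return json.dumps(finding, sort_keys=True)
```

Run retrospectively over stored records, the same function doubles as a simple hunting query. Run on each record as it arrives, it produces security events straight from the data stream.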

Final Considerations

Here are some final considerations to keep in mind as you begin building your security-driven data lake:

  • What is the right technology to use? Think about Hadoop or Elasticsearch for the lake itself and Kafka for the data pipelines.
  • Who are my consumers and stakeholders within the organization? Running a security-driven data lake with security and non-security data will need buy-in from many departments. Start discussing with the chief information security officer (CISO) and other leadership who may have their own data lake already.
  • What data and volume do I expect? You will need to ingest data and normalize it, so knowing what is coming your way helps you prepare. One way is adopting knowledge from your SIEM, which has been dealing with this data already and is capable of ingesting hundreds of thousands of events per second.
  • What value can I add with advanced security analytics, machine learning and artificial intelligence? The possibilities are endless, so encouraging your data and security analysts to play with the data will show what is possible.
