The Charm of Security-Driven Data Lake Architecture

January 20, 2020

Is having all your data in one place a good idea? The answer is yes, but only if you can adopt what we’ve learned from collecting and storing security and non-security data over the past decade.

In 2019, companies experienced persistent threats that kept a tight grip on them for months at a time. We have learned that the data needed to identify these threats can sit anywhere within the organization, from DNS data to mail conversations and even banking transactions. With this knowledge, you need a way to correlate and analyze data over a longer period of time. A data lake might just be the answer.

What Is a Security-Driven Data Lake?

While a security data lake is meant to store mainly security data, a security-driven data lake is meant to store big data and events in a secure way, giving valuable insights into events beyond traditional security events. This wider intent requires a specific data lake architecture and, just as importantly, buy-in from other key stakeholders within the organization.

The Importance of Non-Security Data

If you take a look at a threat report, you will find many indicators marked as IDS, meaning you can find them in your traditional security events from firewalls, intrusion detection and prevention systems (IDPS), and anti-malware systems. But there are other indicators, such as bank accounts and phone numbers, that are not available in traditional security systems. These can be found in non-security data stored elsewhere, such as bank account transactions or the call registers of your telecommunication systems.

When you are tracking and tracing a security incident, this additional data could very well make a difference.

Building Data Lake Architecture

Building such a security-driven data lake is rather simple. The tools and methods have been widely adopted in the field of big data for years and are just waiting to be used for and by security. However, there are some topics you may want to consider to ensure a successful project.

First, the data lake itself. Two main projects, Hadoop and Elasticsearch, can give you a solid foundation for your data lake. Either one provides fast search and access to data while leaving room for future growth.

  • Hadoop is a framework of tools and software for running big data stores and a flagship project of the Apache Software Foundation, and it has been stable for several years. Its high availability and separation of data into catalogs make Hadoop a good fit for corporate data lakes with data views for security.
  • Elasticsearch, in my opinion, is the only true data lake when it comes to unstructured data. Especially when you are just starting out, Elasticsearch can be a good starting point for a data lake strategy, as it gives you all the features needed to start investigating your data and to apply basic analytics tools right on the spot.

Data ingestion is the next thing you need to think about, and with ingestion comes the need to normalize your data, meaning that you need to apply a common structure to the different types of data you want to ingest. The data lake itself can handle unstructured data, so you do not need to worry if you miss a data type or two. But if you want your analysts to profit from the data and you want real-time processing, a common structure for data and alerting is an important step.
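To make the idea of a common structure concrete, here is a minimal Python sketch. The field names (`timestamp`, `source`, `event_type`, `indicator`, `raw`) and the input record layouts are assumptions made up for illustration, not a schema prescribed by Hadoop, Elasticsearch or any SIEM.

```python
from datetime import datetime, timezone


def normalize_dns(record: dict) -> dict:
    """Map a hypothetical DNS log record onto the common structure."""
    return {
        # Assumed input: "ts" is a Unix epoch timestamp in seconds.
        "timestamp": datetime.fromtimestamp(record["ts"], tz=timezone.utc).isoformat(),
        "source": "dns",
        "event_type": "dns_query",
        # Lowercase the name and drop the trailing root dot so lookups match.
        "indicator": record["query"].lower().rstrip("."),
        "raw": record,
    }


def normalize_transaction(record: dict) -> dict:
    """Map a hypothetical banking transaction onto the common structure."""
    return {
        # Assumed input: "booked_at" is already an ISO 8601 string.
        "timestamp": record["booked_at"],
        "source": "banking",
        "event_type": "transaction",
        # Strip spaces and uppercase so account numbers compare consistently.
        "indicator": record["counterparty_account"].replace(" ", "").upper(),
        "raw": record,
    }


NORMALIZERS = {"dns": normalize_dns, "banking": normalize_transaction}


def normalize(source: str, record: dict) -> dict:
    """Normalize a record, keeping unknown data types instead of dropping them."""
    normalizer = NORMALIZERS.get(source)
    if normalizer is None:
        # The lake can hold unstructured data, so a missed type is stored as-is.
        return {
            "timestamp": None,
            "source": source,
            "event_type": "unknown",
            "indicator": None,
            "raw": record,
        }
    return normalizer(record)
```

Keeping the original record under `raw` is a deliberate choice in this sketch: analysts can still reach fields the common structure does not cover.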

Data Pipelining and Advanced Analytics

If your security team runs the data lake, you also have to deal with the wider audience and the other consumers of the lake. This makes data pipelining a key concern. You need to ensure that data flows to the right consumers, such as your security information and event management (SIEM) system, which needs security events to be fully operational. Keep in mind that a security-driven data lake is not a replacement for a SIEM. Real-time alerting and correlation of security events is what a SIEM does best, and it may well be needed for that purpose for another decade.
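The routing idea can be sketched in a few lines of Python. This is an in-memory illustration only: the `Pipeline` class, the consumer names and the list of security sources are hypothetical, and a real deployment would more likely route events through a message broker than through Python lists.

```python
from collections import defaultdict
from typing import Callable

Event = dict
Predicate = Callable[[Event], bool]

# Assumption for this sketch: sources whose events count as security events.
SECURITY_SOURCES = {"firewall", "idps", "anti-malware"}


class Pipeline:
    """Deliver each event to every consumer whose predicate accepts it."""

    def __init__(self) -> None:
        self._routes: list[tuple[str, Predicate]] = []
        self.delivered: dict[str, list[Event]] = defaultdict(list)

    def add_route(self, consumer: str, predicate: Predicate) -> None:
        self._routes.append((consumer, predicate))

    def publish(self, event: Event) -> list[str]:
        """Route one event and return the names of the consumers that got it."""
        targets = [name for name, accepts in self._routes if accepts(event)]
        for name in targets:
            self.delivered[name].append(event)
        return targets


def build_pipeline() -> Pipeline:
    pipeline = Pipeline()
    # The SIEM only needs security events to stay fully operational.
    pipeline.add_route("siem", lambda e: e.get("source") in SECURITY_SOURCES)
    # The analytics platform receives everything, security or not.
    pipeline.add_route("analytics", lambda e: True)
    return pipeline
```

With these routes, a firewall event reaches both the SIEM and the analytics platform, while a banking transaction reaches only the analytics platform.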

Advanced analytics can also enable more informative security events. One of the data pipelines you want to create from the start is toward an analytics platform so data can be processed and better insight can be created. Even if you miss security events being forwarded to your SIEM, with the right analytics you can create security events directly from the data stream in the pipeline or by retrospective hunting on the lake. The simplest example would be a blacklist search of non-IDS data in the lake and reporting the findings to the SIEM.
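As a rough illustration of that blacklist search, the Python sketch below scans records from the lake and builds a finding for each match. The record fields, the `blacklist_match` event type and the sample indicator values are invented for this example, and forwarding the findings to a SIEM is left out because it depends on the product in use.

```python
from typing import Iterable, Iterator


def hunt(records: Iterable[dict], blacklist: set) -> Iterator[dict]:
    """Yield one finding for every lake record whose indicator is blacklisted."""
    for record in records:
        indicator = record.get("indicator")
        if indicator in blacklist:
            yield {
                "event_type": "blacklist_match",
                "indicator": indicator,
                "source": record.get("source"),
                "timestamp": record.get("timestamp"),
            }


# Made-up sample data: non-IDS indicators such as accounts and phone numbers.
SAMPLE_BLACKLIST = {"ACCT-0001", "+10000000000"}
SAMPLE_RECORDS = [
    {"source": "banking", "indicator": "ACCT-0001", "timestamp": "2020-01-20T10:00:00+00:00"},
    {"source": "banking", "indicator": "ACCT-0002", "timestamp": "2020-01-20T10:05:00+00:00"},
    {"source": "telephony", "indicator": "+10000000000", "timestamp": "2020-01-20T11:00:00+00:00"},
]

if __name__ == "__main__":
    for finding in hunt(SAMPLE_RECORDS, SAMPLE_BLACKLIST):
        print(finding)
```

The same function works for retrospective hunting over stored records and for a live stream, since it only needs an iterable of records.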

Final Considerations

Here are some final considerations to keep in mind as you begin building your security-driven data lake:

  • What is the right technology to use? Think about Hadoop or Elasticsearch for the lake itself, and about Kafka for moving data through your pipelines.
  • Who are my consumers and stakeholders within the organization? Running a security-driven data lake with security and non-security data will need buy-in from many departments. Start discussing with the chief information security officer (CISO) and other leadership who may have their own data lake already.
  • What data and volume do I expect? You will need to ingest data and normalize it, so knowing what is coming your way helps you prepare. One way is adopting knowledge from your SIEM, which has been dealing with this data already and is capable of ingesting hundreds of thousands of events per second.
  • What value can I add with advanced security analytics, machine learning and artificial intelligence? The possibilities are endless, so encouraging your data and security analysts to play with the data will show what is possible.

Joerg Stephan
Security Consultant, IBM

Joerg Stephan is a security consultant with IBM and a former member of an international IBM team of dedicated customer analysts. Before joining IBM in 2014, ...