Chaos Engineering: Building the Next Generation of Cyber Ranges

November 30, 2020 | co-authored by Itzik Kotler | 6 min read

In an earlier post on this subject, we discussed, in broad brush strokes, how to apply chaos engineering principles to cyber war-games and team simulation exercises.

In short, ‘chaos engineering’ is the discipline of experimenting with new features and changes on a system that is already in live production. One of its main purposes is to test the system’s ability to absorb changes and remain resilient.

Using the applicable principles, IBM Security X-Force is building out next-generation cyber ranges that encompass security concerns and business imperatives, mapped to likely attack scenarios. In a sense, it’s chaos engineering with intent. We want to inject enough chaos and unexpected elements into the exercises to force security and other types of organizational teams to think on their feet and learn how to react to new threats.

In the range, we also want teams to encounter scenarios and exploits that are relevant in their domains, geographies and technology environments. A key element in our simulations is to include not just the infosec teams but also all other affected units — marketing and public relations, legal, finance and even human resources.

The goal is to leverage the best parts of chaos engineering to create an immersive and highly relevant experience that can better prepare an organization for the jolting and uncomfortable experience of a successful cyberattack and resulting damage — to live out the worst-case scenario. If they can handle the worst case, then all other cases can be more manageable.

So, let’s consider a hypothetical case as an example: A large chain of hospitals faces a broad array of cyber threats. What types of attacks may come? Ready, set, go.

A Big Hospital Chain Gets Attacked

This hypothetical example is rooted in an unfortunate real-life case. On Sept. 28, 2020, a large chain of over 400 U.S. hospitals suffered a devastating ransomware attack. The attack reportedly forced the postponement of patient procedures and, by some accounts, a total manual shutdown of the organization’s IT functions. Early reports indicated that the ransomware used in the attack was the Ryuk variant. Let’s start from there.

Health care institutions increasingly find themselves squarely in the crosshairs of ransomware gangs, likely because they tend to pay quickly and pay a lot when human lives are on the line. When helping organizations in this sector plan their response to that sort of threat, we start our cyber range war-game by mounting ransomware attacks against the infrastructure of a typical health care organization. To make this attack realistic, we use the actual Ryuk ransomware. Ryuk tends to enter systems through spam emails or rogue attachments and is often ferried in by the Emotet malware.

Red Alert: Ryuk Ransomware TTPs

The basic playbook is known but still hard to stop: Ryuk typically uses obfuscated PowerShell scripts to connect to remote IP addresses and download a reverse shell. Ryuk executes anti-logging scripts to hide its tracks. Once Ryuk is installed, the malicious operator uses it to scan the internal network and identify vulnerable hosts that have the right level of privileges to execute an attack. Ryuk then shuts down backup services and launches the ransomware attack. The whole sequence can unfold over a matter of weeks or within a single hour.
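For exercise planning, the playbook above can be written down as an ordered list of stages that a range scenario steps through. The following is a minimal sketch in Python. The `Stage` class, the stage names and the `next_stage` helper are hypothetical illustrations, not part of SafeBreach or any IBM tooling, and the code models only the order of events, not any attack behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str         # hypothetical short identifier for the stage
    description: str  # what happens at this point, per the playbook above


# Ordered as the playbook describes it; the names are illustrative only.
RYUK_PLAYBOOK = [
    Stage("initial_access", "Spam email or rogue attachment, often via Emotet"),
    Stage("download_shell", "Obfuscated PowerShell downloads a reverse shell"),
    Stage("hide_tracks", "Anti-logging scripts are executed"),
    Stage("internal_scan", "Operator scans for vulnerable, privileged hosts"),
    Stage("stop_backups", "Backup services are shut down"),
    Stage("encrypt", "The ransomware attack is launched"),
]


def next_stage(current: str) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last one."""
    names = [stage.name for stage in RYUK_PLAYBOOK]
    idx = names.index(current)  # raises ValueError for an unknown stage name
    if idx + 1 < len(RYUK_PLAYBOOK):
        return RYUK_PLAYBOOK[idx + 1]
    return None
```

Keeping the stages as data rather than logic would let facilitators reorder or drop steps between runs without touching the rest of the exercise.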

With these initial technical details in hand, we set up the attack tactics and techniques in a specialized simulation platform, SafeBreach, and run the simulation against an infrastructure and application environment very close to what the client is likely running inside its own firewall.

A key part here is that, using modern cloud-native technologies like containers and Kubernetes, we can generate and tear down virtual environments on our cyber range quickly and easily. This allows us to adapt the cyber range to more closely fit three key client parameters: industry, geography and technology footprint.
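One way to picture this is a small helper that turns the three client parameters into a Kubernetes Namespace manifest for a single exercise run. This is a sketch under stated assumptions: the `range_namespace` function, the naming scheme and the label keys are hypothetical, and the article does not describe how the actual range is provisioned.

```python
import json
import re


def slug(value: str) -> str:
    """Lowercase the value and collapse anything outside a-z0-9 into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def range_namespace(industry: str, geography: str, tech: str, run_id: int) -> dict:
    """Build a Namespace manifest for one exercise run (hypothetical naming)."""
    name = f"range-{slug(industry)}-{slug(geography)}-{run_id}"
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            # Namespace names are limited to 63 characters.
            "name": name[:63].rstrip("-"),
            "labels": {
                "industry": slug(industry),
                "geography": slug(geography),
                "technology": slug(tech),
            },
        },
    }


if __name__ == "__main__":
    manifest = range_namespace("Health Care", "US", "Windows/AD", 7)
    print(json.dumps(manifest, indent=2))
```

Because deleting a namespace also removes the resources created inside it, one namespace per run is a simple way to get the quick teardown described above.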

All attack information and system responses are then piped into a SIEM, in this case a special version of IBM’s QRadar, to aggregate and view the log files and other telemetry that describe what happens as a Ryuk attack progresses.

Participants also have access to a full suite of security tools for in-depth investigation, like endpoint detection and response, forensics and more. To make this realistic, we can factor in Ryuk’s ability to hide its tracks in logs and limit indicators of compromise to those that would most likely appear in a real attack. This also can help the security operations teams learn how to spot a Ryuk attack as it would actually happen in the wild. All of this lives in an integrated system that can be visualized, monitored and analyzed after the attack to see what damage Ryuk might have caused.
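Limiting indicators of compromise to those a defender would realistically see can be thought of as a filter applied before events reach the SIEM. The sketch below assumes a hypothetical event format (a `step` number and a `source` name) and a hypothetical set of log sources silenced by the anti-logging step. None of these names come from QRadar or SafeBreach.

```python
from typing import Dict, List, Union

Event = Dict[str, Union[int, str]]

# Hypothetical log sources that the simulated anti-logging step silences.
SUPPRESSED_SOURCES = {"script_block_log", "host_event_log"}


def visible_events(events: List[Event], anti_logging_step: int) -> List[Event]:
    """Keep events a defender could still see once anti-logging has run.

    Events recorded before `anti_logging_step` are always kept. From that
    step onward, events from suppressed sources are dropped.
    """
    return [
        event
        for event in events
        if event["step"] < anti_logging_step
        or event["source"] not in SUPPRESSED_SOURCES
    ]


if __name__ == "__main__":
    sample = [
        {"step": 1, "source": "script_block_log"},
        {"step": 3, "source": "script_block_log"},
        {"step": 3, "source": "network_flow"},
    ]
    print(visible_events(sample, anti_logging_step=2))
```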

During the attack simulation, the health care organization’s team can make adjustments to security controls and configurations in their simulated environment. Then, with SafeBreach modules, we continue to run attack scenarios to see the potential impact of the team’s changes. This provides a stronger rapid feedback loop to help security operations teams understand the impact of their actions — what works and what doesn’t — and learn how to apply the lessons in real time.
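The feedback loop amounts to comparing the outcome of the same attack scenarios before and after a configuration change. The following is a minimal sketch, assuming each run is summarized as a mapping from a technique name to whether the controls stopped it. The `compare_runs` helper and the technique names are hypothetical and do not reflect SafeBreach's actual output format.

```python
from typing import Dict, List

# Outcome per simulated technique: True means the controls blocked or detected it.
Results = Dict[str, bool]


def compare_runs(before: Results, after: Results) -> Dict[str, List[str]]:
    """Classify techniques by how their outcome changed between two runs."""
    shared = sorted(set(before) & set(after))
    return {
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
        "still_open": [t for t in shared if not before[t] and not after[t]],
    }


if __name__ == "__main__":
    before = {"powershell_download": False, "backup_shutdown": False, "internal_scan": True}
    after = {"powershell_download": True, "backup_shutdown": False, "internal_scan": False}
    print(compare_runs(before, after))
```

Reporting regressions alongside fixes matters here, since a change that closes one gap can open another.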

Keeping close tabs on new threats, we develop and run attack scenarios in a short period of time, and even focus the attacks on recently released exploits that few security teams have experienced in the wild. This effectively compresses the learning curve compared with previous cyber range exercises: teams learn iteratively by facing more scenarios in an environment that resembles their daily workspace and tooling.

Making Cyber Range Attack Scenarios Relevant to Non-Technical Folks

To make our new cyber range realistic for non-technical participants, we include business risks and scenarios. For example, we like to include public relations and marketing teams so they can experience and react to the attacks. These teams are crucial in helping an impacted organization communicate the attack to the outside world and to clients, and in responding to negative press coverage that will likely kick off after the attack is revealed. We even simulate a positive or negative news cycle based upon participant actions or inactions. The marketing and PR leads need to communicate closely with the CISO team and the CEO to plot an appropriate communications strategy.

For legal teams, the attack kicks off a discussion of regulatory and legal compliance. This means planning out whom to notify, how quickly, and what to say in a notification. Many states and countries have strict notification regulations for breaches that compromise customer information. Health care organizations are governed by additional regulatory requirements, such as HIPAA’s rules for managing protected health information (PHI).

The notification rules do not always mandate public disclosure, but enterprising reporters often check for mandated company breach notification filings with government agencies and use those as a source for story ideas. In addition, legal teams have to determine what is the best path forward to minimize legal risk. We include legal teams in this part of the exercise as well and have game rules where a breach triggers adverse effects, such as threats of a lawsuit from angry customers. Legal teams need to think through the ramifications of their actions and come up with a plan.
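Game rules of this kind, where participant actions or inactions trigger consequences, can be expressed as a small rule table. The sketch below is illustrative only: the action names and the `pending_injects` helper are hypothetical, and the two consequences are the ones mentioned in this article (a negative news cycle and a lawsuit threat).

```python
from typing import List, Set, Tuple

# Hypothetical rules: (expected action, adverse event injected if it is missing).
RULES: List[Tuple[str, str]] = [
    ("issue_press_statement", "Negative news cycle"),
    ("notify_customers", "Threat of a lawsuit from angry customers"),
]


def pending_injects(actions_taken: Set[str]) -> List[str]:
    """Return the adverse events triggered by actions the team has not taken."""
    return [inject for action, inject in RULES if action not in actions_taken]


if __name__ == "__main__":
    # The team has issued a statement but has not yet notified customers.
    print(pending_injects({"issue_press_statement"}))
```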

Since this is a health care organization, we would need to include chief medical officers and other professional leads to craft a breach response strategy that can be quickly communicated inside the organization.

The crucial part here is that the people working for the organization play a critical role in defending it. Something as simple as learning whether to immediately shut down your systems when a ransomware attack is underway can prevent mass infections in a matter of minutes. We witnessed IT security teams in the NotPetya attack frantically notifying everyone they could to turn off their laptops to avoid spreading the attack. A very good internal response strategy may have been in place, but drilling the necessary steps and identifying potential barriers or bottlenecks in advance can yield an even better outcome.

There are other wrinkles to iron out for non-technical personnel, and their roles and responsibilities are dependent on the type of organization they work for. Creating these types of attack scenarios for non-technical players does not require a specialized technology stack: existing SOAR, email and collaboration tools are what teams would use in a crisis and are the best tools to use.

Better Virtual Cyber Range Experience, Better Technology, Better Outcomes

Learning to deal with chaos requires chaos.

The ability to quickly modify and run complex attack scenarios enables the type of directed chaos engineering for security that we discussed above.

Generating chaotic, unpredictable environments and scenarios requires variability. That said, you need enough structure and rigor to allow teams to function and to learn from repeating simulations, improve their responses, and solidify systematic and practiced approaches to dealing with dangerous situations.
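One common way to balance the two is seeded randomness: each run draws a scenario at random, but recording the seed lets facilitators replay exactly the same scenario later. The following is a minimal sketch, assuming hypothetical scenario dimensions and option names. It does not describe how the range actually selects scenarios.

```python
import random
from typing import Dict, List

# Hypothetical dimensions along which a scenario can vary.
VARIANTS: Dict[str, List[str]] = {
    "entry_point": ["spam_email", "rogue_attachment"],
    "dwell_time": ["one_hour", "several_days", "several_weeks"],
    "first_target": ["file_server", "backup_server", "workstation"],
}


def draw_scenario(seed: int) -> Dict[str, str]:
    """Pick one option per dimension; the same seed always gives the same pick."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return {dim: rng.choice(options) for dim, options in VARIANTS.items()}


if __name__ == "__main__":
    print(draw_scenario(2020))
    print(draw_scenario(2020) == draw_scenario(2020))
```

A new seed gives teams a surprise, and reusing an old one gives them a fair rematch against a scenario they handled poorly.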

The cyber range of the future will deliver that but, above all, be fun and engaging. It will stretch the minds and response muscles of participants and their organizations when it comes to chaos engineering, helping prepare them for a world of cybersecurity that grows more chaotic by the month.

Learn more about IBM Security’s Command Centers here. If you wish to learn more about our overall SOAR capabilities, check out IBM X-Force Incident Response and Threat Intelligence Services.

Matthew Dobbs
Chief Integration Architect, IBM Security Cyber Ranges

Matt has been with IBM Security for over 10 years, coming over as part of the Internet Security Systems acquisition by IBM in 2006. As the Chief Integratio...