Machine learning algorithms are essential for finding patterns in data. Data mining and predictive analytics can identify patterns in extremely high-dimensional data — using millions and millions of data points for millions and millions of users — that no human could possibly detect. However, machine learning algorithms still struggle with visual patterns. Object recognition is a hot area of research for everything from robotics to autonomous cars. Hongxing Wang, Gangqiang Zhao and Junsong Yuan have written an excellent paper, “Visual Pattern Discovery in Image and Video Data: A Brief Survey,” that surveys the current techniques and the challenges of the task.
This is where humans come into play. The human brain is amazingly good at finding visual patterns. If data can be visualized, a human can often discover the correlations that machine object recognition cannot. While it is true that humans can sometimes discover false positives or patterns that are actually meaningless, they are still better at visual pattern parsing than any machine learning algorithm.
On March 24, 2015, at the Digital Forensics Research Workshop for Europe (DFRWS EU), researchers presented a fascinating paper, “Hviz: HTTP(S) Traffic Aggregation and Visualization for Network Forensics,” which went on to win paper of the year honors. In it, they discuss Hviz, a new visualization approach to HTTP and HTTPS traffic that has the potential to simplify how cybersecurity experts investigate malicious events.
Hviz allows investigators to visualize the metadata of HTTP and HTTPS traffic and classify it as originating from a user or a process, such as malware. It helps to visualize the event timeline of Web traffic and allows a forensic investigator to “see” the relationships of Web requests in a hierarchical manner. This allows investigators to identify patterns that they might never have found looking through the raw data.
The authors investigated trace data from a college campus and found that the average Web page visit resulted in 110 follow-up requests. More disturbing is the observation that those requests, on average, went to 20 different domains. This is a daunting volume for an investigator, and without some form of aggregation and filtering, it is nearly impossible to summarize and analyze.
How Does Hviz Work?
To get the full description, including specifics such as building the request graphs and the preprocessing and visualization steps, I urge you to read the paper. To summarize, however, Hviz uses referrer and host headers to trace paths through the data. For example, a request to a website from a search engine sends that page’s server a referrer header telling it that the link came from that engine. The referrer headers form a stream of HTTP or HTTPS traffic, originating at a head request generated by a user or process. This allows a path to be drawn from each node of the request graph to the next, forming a chain of links that the user followed from page to page. Note that in Hviz’s current incarnation, these are the only headers examined.
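As a rough illustration of the referrer chaining described above, the following Python sketch links each request to the page named in its referrer header. The record layout, the `normalize` helper and the example URLs are assumptions made for illustration; they are not taken from the Hviz paper.

```python
from urllib.parse import urlsplit

# Hypothetical request records: (requested URL, Referer header or None).
requests_seen = [
    ("http://search.example/results?q=hviz", None),
    ("http://blog.example/post", "http://search.example/results?q=hviz"),
    ("http://cdn.example/style.css", "http://blog.example/post"),
]


def normalize(url):
    """Reduce a URL to host + path so requests and referrers can be matched."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path


def build_referrer_edges(records):
    """Return directed edges (referrer -> requested page) and the chain roots."""
    edges, roots = [], []
    for url, referrer in records:
        node = normalize(url)
        if referrer is None:
            roots.append(node)  # no referrer: possible start of a chain
        else:
            edges.append((normalize(referrer), node))
    return edges, roots


edges, roots = build_referrer_edges(requests_seen)
```

Following the edges from a root reproduces the chain of pages: the search results lead to the blog post, which in turn pulls in the stylesheet.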
It also looks for traffic that originates from multiple workstations and allows the user to fade it out so it does not “clutter” the visualization. This would generally allow malware to be spotted, since traffic seen from only one workstation stays in view, unless the malware happens to be a worm. If the malware is a worm, then potentially the same or similar traffic would originate from each workstation. Hviz would consider this traffic “popular” and mark it as such, fading it out of the display by default. The investigator can elect to expand these nodes, however, to get a visualization of the sites that multiple workstations visited.
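The popularity idea can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the `(client_ip, domain)` record shape and the `min_clients` threshold are hypothetical, and the paper should be consulted for how Hviz actually decides what counts as popular.

```python
from collections import defaultdict


def popular_domains(records, min_clients=2):
    """Return domains requested by at least `min_clients` distinct clients.

    `records` is an iterable of (client_ip, domain) pairs.
    """
    clients_by_domain = defaultdict(set)
    for client_ip, domain in records:
        clients_by_domain[domain].add(client_ip)
    return {
        domain
        for domain, clients in clients_by_domain.items()
        if len(clients) >= min_clients
    }


# Hypothetical traffic: two workstations visit the same news site,
# while only one of them contacts odd.example.
traffic = [
    ("10.0.0.1", "news.example"),
    ("10.0.0.2", "news.example"),
    ("10.0.0.1", "odd.example"),
]
faded = popular_domains(traffic)
```

Here `news.example` would be faded by default, while `odd.example` stays visible. A worm contacting the same host from every workstation would land in the faded set, which is exactly the blind spot described above.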
Understanding the Specifics of Data Collection
The visualization process uses several strategies to manage data and eliminate excess information. For example, Hviz collapses domains. That means a.example.com and b.example.com are reduced into one visual node. This is based on the intuition that pages from the same domain are probably related to each other and may be an embedded object, so they probably originated from one request.
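A minimal sketch of domain collapsing follows. It naively keeps the last two labels of a host name, which is an assumption made for brevity: it mishandles suffixes such as `co.uk`, and a real tool would more likely consult the Public Suffix List. The function names are hypothetical.

```python
def collapse_domain(host):
    """Naively reduce a host name to its last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def group_by_domain(hosts):
    """Map each collapsed domain to the hosts that fold into it."""
    groups = {}
    for host in hosts:
        groups.setdefault(collapse_domain(host), []).append(host)
    return groups


groups = group_by_domain(["a.example.com", "b.example.com", "www.other.org"])
```

Each key of `groups` corresponds to one visual node, so `a.example.com` and `b.example.com` end up drawn together.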
Hviz takes advantage of heuristic analysis when it comes to third-party requests. Requests to those 20 or so domains occur on multiple pages of the primary website and can be collapsed, as well. This is because they most likely all originated from a single user’s request.
Uploads are noted as special since they are rare in normal traffic. If Hviz sees an upload, it is flagged as unique. These uploads may mark malware attempting to exfiltrate data.
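How Hviz recognizes an upload is not spelled out here, so the following Python sketch simply assumes that an upload is a POST or PUT request carrying a non-empty body. The method set, the `min_bytes` threshold and the record layout are all illustrative assumptions.

```python
UPLOAD_METHODS = {"POST", "PUT"}  # assumed definition of an "upload"


def is_upload(method, body_bytes, min_bytes=1):
    """Treat a request as an upload if it sends at least `min_bytes` of body."""
    return method.upper() in UPLOAD_METHODS and body_bytes >= min_bytes


def flag_uploads(records):
    """Return the URLs of requests that look like uploads.

    `records` is an iterable of (method, url, body_bytes) tuples.
    """
    return [url for method, url, size in records if is_upload(method, size)]


flagged = flag_uploads([
    ("GET", "http://news.example/", 0),
    ("POST", "http://drop.example/in", 52_000),
    ("POST", "http://news.example/ping", 0),
])
```

Only the request that actually sends data is flagged, which keeps empty POSTs from drawing attention.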
Moving Beyond Referrers to Graphing
The referrer links are only part of the process. For example, Hviz must be able to distinguish head nodes, or those nodes requested directly by a client, from embedded requests. This is critical in order to collapse the visualization. To do this, the authors use a heuristic approach outlined in the paper “ReSurf: Reconstructing Web-Surfing Activity From Network Traffic.”
See the paper for a more complete description of the ReSurf heuristic used by Hviz. A “referrer graph” is a directed graph generated from the Web requests of a single client IP address. ReSurf uses the referrer and host headers to construct this graph and to capture timing relationships between the requests. Essentially, the heuristic compares how many incoming edges of the referrer graph connect to a node with how many outgoing edges it has. Head requests are generally those that come with no referrer, or perhaps one referrer that is itself a head request. Timing matters as well: the heuristic is that closer timing indicates an embedded request.
Section 3.1 of the ReSurf paper goes into a thorough description of the rules and heuristics used to identify head requests and separate out embedded requests from the head page. It uses several thresholds that can be tuned to make the referrer graph more accurate. Hviz uses these methods to identify head requests.
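To make the idea concrete, here is a heavily simplified Python sketch of head-request classification. It is not ReSurf's actual rule set: it uses only the two cues mentioned above (missing or head referrer, and timing), and the `embed_window` threshold of two seconds is an arbitrary assumption.

```python
def classify_requests(records, embed_window=2.0):
    """Label each URL as "head" or "embedded".

    `records` is a time-sorted list of (timestamp_seconds, url, referrer),
    where referrer is None when the header is absent.
    """
    labels = {}
    seen_at = {}
    for ts, url, referrer in records:
        if referrer is None:
            labels[url] = "head"  # no referrer: the user typed or clicked it
        elif labels.get(referrer) == "head" and ts - seen_at[referrer] > embed_window:
            labels[url] = "head"  # follows a head page after a pause
        else:
            labels[url] = "embedded"  # close in time to its referrer
        seen_at[url] = ts
    return labels


# Hypothetical trace for a single client.
trace = [
    (0.0, "search", None),
    (0.3, "search/logo.png", "search"),
    (8.0, "blog", "search"),
    (8.2, "blog/app.js", "blog"),
]
labels = classify_requests(trace)
```

Tuning `embed_window` trades missed clicks against embedded objects mistaken for page visits, which is the kind of threshold adjustment the ReSurf paper discusses.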
A directed graph is a graph in which each edge has a direction. That is, an edge runs from node A to node B, and there is no edge from node B back to node A unless that edge itself represents a forward path in the graph. A standard directed graph may loop back on itself, but in Hviz’s implementation, there is no reciprocation. Hviz also has no notion of weighted edges, which many directed graphs use.
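For readers less familiar with the terminology, a directed graph can be sketched in Python as an adjacency mapping, along with a check for reciprocal edges. The helper names are hypothetical and the example is generic rather than drawn from Hviz.

```python
def build_adjacency(edges):
    """Build {node: set of successor nodes} from (source, target) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # make sure sink nodes appear too
    return graph


def reciprocal_pairs(graph):
    """Return node pairs that are connected in both directions."""
    return {
        tuple(sorted((a, b)))
        for a, targets in graph.items()
        for b in targets
        if a != b and a in graph[b]
    }


one_way = build_adjacency([("A", "B"), ("B", "C")])
two_way = build_adjacency([("A", "B"), ("B", "C"), ("B", "A")])
```

In the first graph no pair is reciprocal, which matches the Hviz behavior described above; the second graph contains the A and B pair in both directions.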
Much of the processing in Hviz is done by the construction of a “request graph.” This graph is also a directed graph and is similar to a referrer graph. It is designed to visually demonstrate the relationships between head nodes and those nodes triggered by head nodes, such as embedded nodes or pages referred to by that top-level domain. This allows the investigator to move from head node to head node and get a graphical understanding of how a user’s Web traffic proceeded chronologically.
For a picture of what the typical interface looks like, see Figure II of the original Hviz paper, or go to an interactive interface demonstration.
There are a few problems with this approach as it applies to cybersecurity investigations. Malware could start using a popular website for communications so there is a valid referrer header, for example. In this scenario, investigators would see a potentially larger-than-normal number of referrals from a popular site. One solution to this problem might be to add the notion of weighted data to the graph. This way, one could collapse referrer edges from one workstation from page to page and show them as a different color based on their weight. Weights increase the more that particular edge is created, so it would be easy to see which chain of sites is most often visited from each workstation. This should visually reveal malware that is forging referrer headers.
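The weighting proposal could look something like the following Python sketch, which counts how often each referrer edge is created per workstation. This is a suggestion layered on top of Hviz, not part of it, and the tuple layout and host names are assumptions.

```python
from collections import Counter


def weight_edges(observed):
    """Count how many times each (workstation, referrer, target) edge occurs."""
    return Counter(observed)


# Hypothetical edges seen from one workstation.
observed = [
    ("ws1", "search.example", "blog.example"),
    ("ws1", "search.example", "blog.example"),
    ("ws1", "search.example", "blog.example"),
    ("ws1", "blog.example", "shop.example"),
]
weights = weight_edges(observed)
heaviest = weights.most_common(1)[0]
```

A renderer could then map each count to a color or line thickness, so an edge that fires far more often than ordinary browsing would stand out.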
If the graph included the notion of edges connecting back to other points, as many directed graphs do, the visualization would include users who “double back” on themselves — for example, a user who performs a search, clicks a link, then goes back to the search page and clicks the link again. This would also help to visualize malware. Many malware samples will forge several referrer headers and visit a few sites for command-and-control and exfiltration.
The Challenges of Hviz
One could track head nodes that are used as referrers but never load their associated embedded objects. For example, even a search from the address bar loads the search results page, which comes with either no referrers or very few. This would identify the search page as a head node, as described in the ReSurf heuristic. The search results page is loaded along with its embedded objects. If a head node appears as a referrer without its embedded objects ever being loaded, it could be called out as a node that does not collapse.
Some extra data checking might be needed to clear up the visualization. If there’s a worm present, for instance, it will make requests from multiple computers, and those requests will be faded out due to popularity unless the investigator specifically asks to see them. However, if the embedded objects of an identified head node are never loaded, calling out that node allows the investigator to find it in the visualization and expand it, which is what identifying a worm requires.
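One way to express this check, assuming each request has already been labeled as head or embedded by some heuristic, is the Python sketch below. The function name, the input shapes and the example hosts are hypothetical.

```python
def heads_missing_embedded(edges, labels):
    """Return head nodes used as referrers that never loaded an embedded object.

    `edges` holds (referrer, url) pairs; `labels` maps url -> "head"/"embedded".
    """
    used_as_referrer = set()
    has_embedded_child = set()
    for referrer, url in edges:
        if labels.get(referrer) != "head":
            continue
        used_as_referrer.add(referrer)
        if labels.get(url) == "embedded":
            has_embedded_child.add(referrer)
    return used_as_referrer - has_embedded_child


# Hypothetical data: "search" loads its logo, while "popular" is named as a
# referrer but none of its embedded objects were ever fetched.
edges = [
    ("search", "search/logo.png"),
    ("search", "blog"),
    ("popular", "c2"),
]
labels = {
    "search": "head",
    "search/logo.png": "embedded",
    "blog": "head",
    "popular": "head",
    "c2": "head",
}
suspicious = heads_missing_embedded(edges, labels)
```

Nodes returned by this check are candidates to keep expanded in the display rather than collapsed or faded.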
Filling a Need
Real-world investigative processes influence how researchers can apply core forensics tools in the world of cybersecurity. Whether these resources include timelines, categorization of artifacts, visualization of entity relationships or similar strategies, they can assist investigators in organizing data in a way that streamlines forensics and helps them locate malicious activity.
The sheer volume of HTTP(S) traffic makes it incredibly difficult for an investigator to find malware communication. There is a need for a tool that helps the investigator use the human brain’s remarkable ability to find visual patterns to separate normal traffic from malicious traffic. Hviz fills this exact need. It has the potential to be one of the most fascinating network forensics tools to come along in a good while.