Reducing malware analysis with code comparison techniques

This is another topic that I file under "someone must have certainly done this already"...

We're struggling with the influx of custom malware that has exploded since 2006. The skills necessary to reverse engineer code are hard to find, and expensive when they surface. As a result, bandwidth is always limited for an organization that needs to understand the inner workings of malware to assess the damage, scope, and impact of a system compromised by custom code.

There have been a few discussions within my team recently about how these valuable skills can be focused. For years we've worked to reduce the set of malware that necessitates deep analysis by identifying techniques that enable us to make inferences about the unknown code by comparing it to similar known code, or making assumptions based on its context. Discussion has heated up on this topic of late, especially since a colleague began using an intriguing, if unproven, statistical technique to group malware.

The first question that should come to the reader's mind is, "haven't the anti-virus companies already solved this problem?" They should have. But we've seen first-hand that if they know how to solve this problem, it is either ineffectively implemented or not implemented at all in their code. I could tell stories, but that's not the point of this entry.

The technique that keeps coming to mind as most promising is to represent a program's control flow as a graph, then search the control-flow graphs of other code for isomorphisms to identify identical or similar executables. Identifying complete isomorphisms between graphs is a well-studied problem; for one example, this paper discusses its utility with VLSI hardware, comparing circuit diagrams to chip layouts. It stands to reason that a similar technique could be used with what I'll call the identical software flow problem.
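To make the idea concrete, here is a minimal sketch of exact graph-isomorphism testing on two tiny control-flow graphs. Everything here is hypothetical: the graphs, labels, and function names are invented for illustration, and the brute-force permutation search is exponential, so it's only plausible for the small basic-block graphs of a single function, not whole executables.

```python
from itertools import permutations

def relabel_edges(mapping, edges):
    """Push a set of directed edges through a node-relabeling map."""
    return {(mapping[a], mapping[b]) for (a, b) in edges}

def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force check: do the two directed graphs match under some
    relabeling of nodes? Exponential in node count, so this is only a
    toy for small per-function flow graphs."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if relabel_edges(mapping, edges_a) == set(edges_b):
            return True
    return False

# Two hypothetical control-flow graphs with the same branch/join shape
# (entry -> compare -> then/else -> exit) but different block labels,
# as two different disassemblies might emit them.
g1 = (["entry", "cmp", "then", "else", "exit"],
      [("entry", "cmp"), ("cmp", "then"), ("cmp", "else"),
       ("then", "exit"), ("else", "exit")])
g2 = (["b0", "b1", "b2", "b3", "b4"],
      [("b0", "b1"), ("b1", "b2"), ("b1", "b3"),
       ("b2", "b4"), ("b3", "b4")])

print(are_isomorphic(g1[0], g1[1], g2[0], g2[1]))  # → True
```

A real implementation would of course use a serious matcher (VF2-style backtracking with degree pruning) rather than raw permutations, but the interface is the same: same flow shape, different labels, still a match.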

Those with an interest in computational complexity theory will find the following both relevant and intriguing: the graph isomorphism problem has not been proven to be NP-complete, nor is it known to be solvable in polynomial time; it lies in NP, but its exact status remains open. Special thanks to Wikipedia for this link (huge PDF), which discusses solving the graph isomorphism problem efficiently in practice despite its unresolved complexity.

The problem of identifying similar pieces of code, which I'll call the software flow similarity problem, is much more involved and, from what I can tell, much less studied. In this case, subsets of the control-flow graphs would be compared between pieces of code. Some key questions here are:
  1. How big or complex must the subset be, as compared to the complete flow graph, to be meaningful?
  2. How many matches of graph subsets must be identified to confidently call code segments similar?
This is but one technique, and determining software similarity is likely to involve a number of other techniques - computed, observed, statistical, or what have you. I feel this approach would be a very strong indicator on its own, although it would be far more difficult to implement and study than some other heuristic approaches. I'm going to continue searching for papers which discuss these techniques; it seems hard to believe no one has done this before.
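One cheap heuristic in this family, sketched below purely as an assumption on my part (not any technique from the discussions above): instead of matching subgraphs directly, summarize each flow graph as a multiset of local shapes (each node's in-degree/out-degree pair) and score the overlap between two summaries. The function names and the idea of using degree pairs as the "feature" are invented for illustration; it trades the precision of true subgraph matching for speed.

```python
from collections import Counter

def flow_signature(edges):
    """Summarize a directed control-flow graph as a multiset of local
    shapes: the (in-degree, out-degree) pair of every node. This
    sidesteps full subgraph isomorphism at the cost of precision."""
    indeg, outdeg = Counter(), Counter()
    for a, b in edges:
        outdeg[a] += 1
        indeg[b] += 1
    nodes = set(indeg) | set(outdeg)
    return Counter((indeg[n], outdeg[n]) for n in nodes)

def similarity(edges_a, edges_b):
    """Overlap score in [0, 1]: shared local shapes over the larger
    graph's node count. 1.0 means identical signatures, not identical
    graphs -- different graphs can share a signature."""
    sa, sb = flow_signature(edges_a), flow_signature(edges_b)
    shared = sum((sa & sb).values())   # multiset intersection
    total = max(sum(sa.values()), sum(sb.values()))
    return shared / total if total else 1.0

# A diamond (branch/join) versus a straight-line chain.
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
chain = [("x", "y"), ("y", "z")]

print(similarity(diamond, diamond))  # → 1.0
print(similarity(diamond, chain))    # → 0.25
```

Answering the two questions above would then amount to tuning what counts as a "shape" (single nodes here, but it could be k-node neighborhoods) and where to set the score threshold before calling two samples related.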
