Security Academia: Stop Using Worthless Data

I have a new litmus test that I use to help me vet the many intrusion detection-related academic papers that come across my desk. I call it the "relevant data test." If your approach does not study relevant data, I will not read it. You may indeed have found a new way to leverage Hidden Markov Models in some neat heuristic, layered approach. I do not care. Novel or precise as your approach may be, its applicability is predicated upon the relevance of your data. You may as well have found a new way to model the spotting of a banana as it ripens, if your data has nothing to do with intrusions in 2010.

It's time to wake up, folks. A 10-year-old data set for intrusion detection is utterly worthless, as your conclusions will be if you use it. I will never again read further than "benchmark KDD '99 intrusion data set." There is no faster way to communicate to an informed audience that you just don't understand intrusions than by analyzing data that is this old. Such attacks are generations behind those that modern network defenders face today. Understand this: you are solving the problems exemplified by your data set. If your data is 11 years old, so is your problem, and your solution is only as effective as that problem is relevant. Few, if any, attacks from 1999 are relevant today.

Make no mistake about it, I understand the researcher's lament! There is no modern pre-classified data set like those relics of careers gone by. Finding a good corpus is excruciatingly difficult. But in legitimate, scientific, empirical studies, this is absolutely no excuse for using irrelevant data. In fact, without first establishing the relevance of ANY data set, even those used in the past, one's findings fall apart.

To pick but one example, in the last two issues of IEEE Transactions on Dependable and Secure Computing, two of the three IDS-related articles based their findings on data sets that are 7 or more years old. This is emblematic of why so much research is ignored by industry, and why the research that isn't ignored often falls flat in practice. If I were an editor of that periodical, which I have been reading for quite some time, I would have rejected nearly every intrusion detection paper submitted in the last 3 years outright on this basis alone.

The data commonly considered the "gold standard" by academics has not been relevant for at least half a decade. Research done in that period whose findings relied on data from 2001 or earlier is not in any way conclusive, in my professional opinion.


Jason Trost said...

Totally agree... Do you have any practical advice for researchers who want to acquire realistic data sets for their IDS prototypes?

Michael Cloppert said...

Thanks for the question; I should have covered this in my post.

If you're having trouble collecting a good corpus of data through analysis of your university's perimeter, or if you're looking to detect attacks typically directed at victims outside academic settings, find a partner in industry. Many large companies are desperate to find new methods for detecting sophisticated attacks and will be willing to work with you to get access to data under an NDA or in an obfuscated manner. Your university or professor(s) should have contacts to help grease the skids here.

rma said...

Yet another paper published in 2011 using 1999 DARPA data. Sigh.

Perhaps reviewers, not just the authors, are also at fault.