2010-06-06

Security Academia: Stop Using Worthless Data

I have a new litmus test that I use to vet the many intrusion detection related academic papers that come across my desk. I call it the "relevant data test": if your approach does not study relevant data, I will not read it. You may indeed have found a new way to leverage Hidden Markov Models in some neat heuristic, layered approach. I do not care. Novel or precise as your approach may be, its applicability is predicated on the relevance of your data. You may as well have found a new way to model the spotting of a ripening banana if your data has nothing to do with intrusions in 2010.

It's time to wake up, folks. A decade-old data set for intrusion detection is utterly worthless, and so will your conclusions be if you use it. I will never again read past the phrase "benchmark KDD '99 intrusion data set." There is no faster way to tell an informed audience that you just don't understand intrusions than to analyze data that old. The attacks it contains are generations behind those modern network defenders face. Understand this: you are solving the problems exemplified by your data set. If your data is 11 years old, so is your problem, and your solution is only as effective as that problem is relevant. Few, if any, attacks from 1999 are relevant today.
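
Don't take my word for how stale this data is. Here is a minimal Python sketch, assuming you have a local copy of the "kddcup.data" CSV distributed for the 1999 KDD Cup, that simply tallies the attack labels in the file. The point is not the code but what it prints: a label set frozen in the late 1990s.

    # Minimal sketch: tally attack labels in KDD '99.
    # Assumes "kddcup.data" (the 1999 KDD Cup CSV) sits in the
    # working directory; each line is 41 features plus a label.
    from collections import Counter

    labels = Counter()
    with open("kddcup.data") as f:
        for line in f:
            if not line.strip():
                continue
            # The label is the last field and carries a trailing
            # dot, e.g. "smurf." or "neptune."
            labels[line.rsplit(",", 1)[-1].strip().rstrip(".")] += 1

    for attack, count in labels.most_common(10):
        print(f"{attack:>15}  {count}")

The output is dominated by smurf and neptune floods, alongside teardrop, land, and the like: attacks that were patched or obsolete years before this post was written. That is the threat model you inherit when you benchmark against this corpus.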

Make no mistake about it, I understand the researcher's lament! There is no modern pre-classified data set like those relics of careers gone by, and finding a good corpus is excruciatingly difficult. But in a legitimate, scientific, empirical study, that is no excuse for using irrelevant data. In fact, without first establishing the relevance of ANY data set, including those used in the past, one's findings fall apart.

To pick but one example: in the last two issues of IEEE Transactions on Dependable and Secure Computing, two of the three IDS-related articles based their findings on data sets that are 7 or more years old. This is emblematic of why so much research is ignored by industry, and why the research that isn't ignored so often falls flat in practice. If I were an editor of that periodical, which I have read for quite some time, I would have rejected nearly every intrusion detection paper submitted in the last 3 years outright, on this basis alone.

The data commonly considered the "gold standard" by academics has not been relevant for at least half a decade. In my professional opinion, research from that period whose findings rely on data from 2001 or earlier is in no way conclusive.