For a field that loves statistics, computer security sure treats them casually. In order to get my humble BA in Psychology, I absorbed my share of course hours in statistics and testing methods, including a set of lectures based upon Darrell Huff's brilliant book, "How to Lie with Statistics" - which I highly recommend. It's fun, easy-reading satire. Those lectures had the effect of making me hyper-skeptical about any large, round number that's thrown my way. Sometimes, I get the urge to play and this is one of those times. Please don't take anything from this point forward very seriously, OK? I'm going to cheerfully lie and throw bogus, useless figures at you, but (unlike the throwers of most of the bogus figures you see) I've given you the courtesy of warning you, first.
Verizon recently published their annual Data Breach Investigation Report (henceforth: DBIR) and one of the charts, Figure 13, that really jumped out at me was the money shot:
Well, that's pretty clear. And, it's a statistic. So, I'm convinced!
Immediately my hyper-skeptical subconscious started nagging me, wondering about sampling bias, and nodding in admiration at the sciency-sounding vagueness of "external actors", and so forth. For example, I wondered how on earth one could accurately categorize what an attack is, given that an attack might consist of anything from grandma's home computer getting amoeba'd into a botnet to a massive trojan-based corporate penetration with data exfiltration and horizontal spread. Then I started wondering whether, perhaps, Verizon gets a disproportionate view of the landscape because of who they are, and I decided to simply ignore the chart and turn the page.
Obviously, the case is being made that China is a problem. I'm not arguing with that... In fact, I think it probably is, but I don't think that the computer security community is close to understanding what's going on or why, and simplistic statistics don't serve us very well. I found myself wishing that it were possible to make a really good stab at the problem, but realized there's no way I could do that in the remaining time allotted for me to walk the Earth. So I decided to experiment with a few bad statistics of my own, instead.
The Verizon statistic allegedly tells us something about overall hacking by volume (percentage of hacks) and type (target: espionage, financial, other) and encourages us to conclude that China is the problem. Since China represents 1/4 of the world's population, does it have 1/4 of the world's hackers? Of course we don't have any data that would help us conclude that hackers are evenly distributed across the world's population, but what if we did?
I read a figure a couple of years ago that China's Internet-using population (i.e., total headcount) had surpassed the US Internet-using population. Let's hypothesize that, for the sake of simplicity, the US and China are approximately neck and neck, which would mean that the US and China together represent substantially more than half of the world's Internet users, probably some 70%. Well, that would tell us that proportionally, the Chinese are about as badly-behaved as the rest of us, but looking at the Verizon chart - the Romanians are exceedingly naughty.
I didn't have a credible way of reasonably estimating population of Internet users by country, so I just normalized against global population and used Verizon's statistics on hacking as a percentage. China's 30% of the hacking was caused by 25% of the world's population and suddenly China's hacking rate seems pretty reasonable. I'll note that Americans appear to be behaving much better than I'd expect, but perhaps that's a result of bias in Verizon's sampling methods. We really need to fear the Romanians, Bulgarians and Armenians.
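The "normalization" here is nothing more than one ratio: a country's share of the attributed hacking divided by its share of the world's population. A minimal Python sketch, using only the two figures quoted in this post (30% and 25%, both exactly as trustworthy as everything else here); the function name hacking_index is made up for illustration:

```python
def hacking_index(hack_share_pct, population_share_pct):
    """Ratio of attributed-hacking share to population share.

    1.0 would mean "exactly as hacky as headcount predicts". This is the
    deliberately bogus statistic from the text, not a real measure.
    """
    if population_share_pct <= 0:
        raise ValueError("population share must be positive")
    return hack_share_pct / population_share_pct


# Figures as quoted in this post: 30% of the hacking, 25% of the population.
china = hacking_index(30.0, 25.0)
print(f"China: {china:.2f}x its population share")
```

Anything near 1.0 reads as "about average", which is the whole trick: pick a different denominator and the same 30% tells a different story.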
Of course it's not fair to assume that a farmer in Northern China is a hacker, but this is a posting about making unfair assumptions so, why not? Lacking data to clarify our assumptions, my statistical lie is as good as anyone else's - what it represents is a measure of the hacking energy of populations. That sounds mighty sciency. However, it might be possible to filter out the farmer in Northern China by using some other measure of a society's overall Internet-connectedness. How about IPv4 addresses? That will give us a statistic that might be a measure of how much hacking a country does (according to Verizon) per IP address: how relatively hack-efficient they are in terms of their infrastructure's footprint.
Scaling the Verizon percentages by IP address allocation per country, we see that the typical US IP address is vastly less hacky than even a Dutch IP address. The Chinese are pretty feisty, but the Armenians, IP address for IP address, are the hackiest hackers on Earth! Wow! I admit that I look at that chart and wonder what significant percentage of Armenia's 500,000 or so IP addresses are being used by hackers, and I am in awe. Do Armenia's hackers ever sleep?
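Swapping the denominator from population to IP addresses is the same arithmetic. The sketch below is an illustration under loud assumptions: it divides by the entire 32-bit IPv4 space rather than by actually allocated addresses, the 500,000 is the "or so" figure from the paragraph above, and the 1.0% hacking share is a placeholder invented for the example, not a number taken from Verizon's chart.

```python
TOTAL_IPV4 = 2 ** 32  # whole IPv4 address space; a crude stand-in for "allocated"


def hacks_per_address_index(hack_share_pct, ip_addresses,
                            total_addresses=TOTAL_IPV4):
    """Share of attributed hacking divided by share of IPv4 addresses."""
    if ip_addresses <= 0 or total_addresses <= 0:
        raise ValueError("address counts must be positive")
    ip_share_pct = 100.0 * ip_addresses / total_addresses
    return hack_share_pct / ip_share_pct


# ASSUMPTION: 1.0 is a placeholder hacking share, made up for this example.
armenia = hacks_per_address_index(hack_share_pct=1.0, ip_addresses=500_000)
print(f"Armenia: {armenia:.0f}x its share of the address space")
```

A tiny address footprint in the denominator inflates the index enormously, which is how a small country ends up looking like the hackiest place on Earth.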
I'm joking, remember?
Now, to be serious, perhaps this serves to illustrate a bit about how easy it is to flip bogus statistics in people's faces and make them look credible even if the underlying data and assumptions are extremely sketchy. This is not the kind of data that public policy decisions should be based on. If we wanted to do this right, we'd need a good definition of what constituted an attack, and that's a huge problem in and of itself.
If my network is compromised with a trojan, is that one unit of attack (object: my network) or ten units (object: my home servers, desktop, laptop, and the machine I play World of Warcraft on) or two units (object: my log server and my file server)? And, if someone was attempting to steal intellectual property such as my recipe for salted mint lemonade (link: http://www.ranum.com/fun/recipes/mint_lemonade.html ), how do I know if that was one attempt at espionage (object: the recipe) or hundreds (object: the encrypted volume where I keep private reports I've written for customers over the years) or tens of thousands (object: my Sisters of Mercy MP3s)? When you're looking at a well-crafted statistic, if the resolution of the data at the top level chart appears to be very broad, it's probably been massaged to a fare-thee-well. Or it was just made up out of thin air.
Another thing to pay attention to is the Y-axis. You can manipulate the data by deciding what goes in which bucket, thereby controlling which bucket is bigger. In the Verizon chart at the top, does "other" include home users whose systems were part of a botnet? What if that botnet was used for an extortion attack? Is the unit of measure the number of targets in the attack, the number of attackers in the botnet, or the number of times the extortionist asked for money? It gets complicated, quickly, and that's one of the reasons why computer security statistics have traditionally been, to put it generously, absolutely horrible.
Back when I was on the SANS Newsbites editorial board, every year when someone would publish the results of a survey ("hacking is on the rise!") I'd point out the problem of self-selection bias (link: http://en.wikipedia.org/wiki/Self-selection_bias ) in surveys. I'd suggest that the title should be something like: "9/10 of people who were bored enough to fill in a survey at a conference, and who claimed their job title was CSO, CTO, or CEO checked the box that says 'hacking is on the rise'." A common response I'd get to such sallies was "hey, even questionable statistics are better than nothing." Except that they're not, because if you don't know what you're measuring, you don't know that the results have anything to do with what you wish you were measuring. I was kidding with my statistics above, because I don't think "hacking activity as a percentage of Verizon's global measure of hacking activity per percentage of IPv4 address global footprint" means anything. Anything other than that: we need to be really, really worried about the Armenians.
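The survey complaint above is easy to demonstrate with a toy simulation. Every number below is invented for illustration: a made-up population in which 30% believe hacking is on the rise, and made-up response rates in which believers are ten times more likely to bother filling in the form. It is a sketch of how self-selection skews a result, not a model of any real survey.

```python
import random


def survey(population_size, true_rate, p_respond_if_yes, p_respond_if_no, seed=0):
    """Return the fraction of *respondents* who say 'hacking is on the rise'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    yes = no = 0
    for _ in range(population_size):
        believes = rng.random() < true_rate
        p_respond = p_respond_if_yes if believes else p_respond_if_no
        if rng.random() < p_respond:  # only some people fill in the form
            if believes:
                yes += 1
            else:
                no += 1
    if yes + no == 0:
        raise ValueError("nobody responded")
    return yes / (yes + no)


# ASSUMPTION: all rates here are hypothetical, chosen only to show the effect.
self_selected = survey(100_000, true_rate=0.30,
                       p_respond_if_yes=0.50, p_respond_if_no=0.05)
even_handed = survey(100_000, true_rate=0.30,
                     p_respond_if_yes=0.20, p_respond_if_no=0.20)
print(f"true rate 30%, self-selected survey says {self_selected:.0%}")
print(f"true rate 30%, even-handed survey says {even_handed:.0%}")
```

Under these invented rates the self-selected survey reports a large majority where the underlying population has a minority, and nothing in the survey's own output tells you that happened.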
I think it's only fair to Verizon to point out that they've been appropriately cautious with the statistic. For example, one Verizon spokesperson told ZDNet that the high number of data breaches attributed to China should not be taken to mean it is the most active perpetrator of cyber espionage activities. And, the high number could be because Internet regulations in the country are not as strict as in other countries, and it may be easier for criminals to conduct their hacking activities from there. Further, "We are not going with the 'China is bad and scary' message." Oh, really?