Tuesday, September 25, 2012

Cyberspace Polluting the Real World

The New York Times ran an impressive article last week about how the data centers of the Cyberspace giants pollute the air that we real people breathe. Much of what you do in Cyberspace requires computer processing. That processing is done remotely in data centers, where servers (large, though nothing compared to ENIAC) run around the clock to take your orders. Data centers are huge buildings packed with servers, which means 1) they require a lot of energy, and 2) they emit a lot of energy, in the form of heat. You might remember news about Google or Facebook worrying about cooling systems in their new data centers.

On energy use, the problem is not just how prodigal these data centers are in their consumption. They do eat up a whole lot of energy, for sure. But they are doubly polluting because they all have backup electricity generators, mostly run on diesel. The following picture shows a backup diesel generator at a large computer data center. (Picture courtesy of Richard Perry, New York Times, copied from the news article.) According to the annotation on the picture, there are six generators in total in the data center, enough to power 7,000 homes. And yes, they are polluting. Amazon was once fined by the environmental authorities in Northern Virginia for running diesel generators without permits.




In addition to generators, large data centers also stockpile flywheels and lead-acid batteries for additional backup. It is a super-conservative insurance measure indeed. Why do they do this, then? Because data center operators lose their jobs if there is any failure in the system. These people are not paid for energy efficiency. They are paid to keep the system running 24/7.

The amount of electricity used in data centers is also massive. In 2010, data centers used about 2% of all the electricity used in the US, or 76 billion kilowatt-hours. The whole paper industry used 67 billion kilowatt-hours that year. Doing things in Cyberspace does not save as much energy as we might expect.

According to research by a consulting company, the utilization rate of facilities in data centers is around 6 to 12 percent. In other words, up to 94% of the computing resources at data centers sit idle, in a standby mode, waiting for a surge in usage.

Pooling risks reduces risks. Suppose a data center has a peak-time utilization of 90 units, but runs at 10 for most of the day. Since running at 100% utilization is not realistic, this data center keeps its capacity at 100, giving a safety margin of 10. Now suppose there is another data center with exactly the same specs. If we can combine the two data centers into one, we do not need a capacity of 200, because the peak times of the two data centers will most probably differ. We might end up needing only, say, 150. The capacity actually needed depends on the covariance between the utilization rates of the two facilities. This is the basic concept of risk pooling.
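To make the two-center example concrete, here is a minimal sketch in Python. The hourly load profiles and the peak hours are hypothetical, made up for illustration only (they are not data from the article). It compares the combined peak when the two centers peak at different hours with the combined peak when they peak at the same hours.

```python
# Hypothetical hourly load profiles in arbitrary units (not data from the article).
def daily_profile(peak_hours, base=10, peak=90):
    """Return 24 hourly loads: `peak` during peak_hours, `base` otherwise."""
    return [peak if hour in peak_hours else base for hour in range(24)]

def combined_peak(profile_a, profile_b):
    """Highest hourly load when two centers share one pool of servers."""
    return max(a + b for a, b in zip(profile_a, profile_b))

center_a = daily_profile(peak_hours={9, 10, 11})
center_b_shifted = daily_profile(peak_hours={20, 21, 22})  # peaks never overlap
center_b_aligned = daily_profile(peak_hours={9, 10, 11})   # peaks fully overlap

peak_shifted = combined_peak(center_a, center_b_shifted)   # 90 + 10
peak_aligned = combined_peak(center_a, center_b_aligned)   # 90 + 90

print("combined peak, shifted peaks:", peak_shifted)
print("combined peak, aligned peaks:", peak_aligned)
```

These are the two extremes. When the loads move together perfectly, pooling saves nothing; when the peaks never coincide, the pooled center needs far less than 200. Real centers fall somewhere in between, which is why the covariance matters.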

So, the answer can be found in risk pooling. According to the article, there were 2,094 data centers in the US in 2010. Using a simple binomial probability function, I got some interesting numbers:

1) Assuming a simple binomial probability distribution (approximated by a normal distribution), a data center with an average utilization rate of 12% can keep its probability of failure at 1/10^50 with only 19.3% of its current resources. This shows the hyper-conservatism of the data center industry. They are amassing more than 5 times the resources they need in order to prevent any outage. Perhaps you think 10^50 processing orders (such as clicks of a mouse) can happen in your lifetime? No way. It takes 1.06*10^33 years if 3 billion people click at the speed of once per second, 24/7, nonstop.
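The 1.06*10^33 figure is easy to check. A quick sketch, assuming a 365-day year:

```python
# How long would 3 billion people need to produce 10^50 clicks
# at one click per second each, nonstop?
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year

people = 3_000_000_000
events = 10**50

years = events / (people * SECONDS_PER_YEAR)
print(f"{years:.2e} years")
```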

2) Pooling two such data centers requires only 34.2% of one data center’s resources, not 19.3% * 2 = 38.6%. Pooling ten such data centers requires the resources of only 1.43 data centers, not 1.93. That is a huge saving in resources at the same probability of failure.
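The pooling numbers can be roughly reproduced with a short sketch. This is my reconstruction under stated assumptions, not necessarily the exact calculation behind the figures above: the loads of the centers are independent, the normal approximation holds, and the 19.3% single-center figure is taken as given. Under those assumptions the average load grows in proportion to the number of centers n, while the safety margin on top of it grows only with the square root of n.

```python
import math

MEAN_UTILIZATION = 0.12      # average utilization of one center (from the text)
SINGLE_CENTER_NEED = 0.193   # resources one center needs (from the text)

# Margin above the mean that one center needs to hit the target failure probability.
SAFETY_MARGIN = SINGLE_CENTER_NEED - MEAN_UTILIZATION

def pooled_need(n):
    """Resources needed by n pooled centers, in units of one center's current capacity.

    Assumes independent loads: the mean scales with n, the margin with sqrt(n).
    """
    return n * MEAN_UTILIZATION + math.sqrt(n) * SAFETY_MARGIN

for n in (1, 2, 10):
    print(n, "pooled:", round(pooled_need(n), 3),
          "separate:", round(n * SINGLE_CENTER_NEED, 3))
```

This gives about 0.343 for two centers and about 1.431 for ten, close to the 34.2% and 1.43 quoted above; the small gap for two centers presumably comes from rounding in the inputs.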

The calculation makes the pooling solution look very attractive. Then why don’t they do it? Because the data they are dealing with are sensitive, such as personal information. There are also business reasons, political and legal complications, and, to some extent, questions of technical feasibility. So what? Cyberspace polluting the real world is a real phenomenon. We must start addressing the issue by first recognizing it. Until people realized its significance, climate change was a kind of urban legend to many.