TECHNOLOGY

Data Analysis Overload?

How to use Hadoop to analyze big chunks of data

Ken Orvidas


It's easier than ever for companies to collect all kinds of data about their customers. Often, the hard part is figuring out how to analyze it all. What starts as a useful database of customer information can become a slow or unresponsive monster when it grows to more than a terabyte -- or about 1,000 gigabytes -- of data. In some cases, a database can be so enormous that no single computer is capable of processing the information.

That was an issue for ImageShack, a photo and video hosting site in Los Gatos, California. ImageShack serves up images and other content billions of times each day. It records information about its visitors, such as their locations and which websites led them to ImageShack, and then uses that data to deliver other relevant images, which keeps users on the site longer, clicking links and generating more ad revenue. Making those calculations isn't simple. "Imagine writing a database that gets refreshed and processes three billion records every day," says Jack Levin, ImageShack's founder and CEO.

To analyze all this data, ImageShack decided to tap the same technology developed by search-engine companies to index the Web: Hadoop, a program designed to process massive amounts of data. Inspired by Google's MapReduce technology, Hadoop is an open-source project developed primarily by engineers at Yahoo. Hadoop takes massive amounts of data, breaks it into smaller chunks, and distributes the pieces across a cluster of computers. In other words, instead of using one computer to analyze data, Hadoop lets you spread the task over several machines, with each one analyzing a portion of the information. Generally, the more computers in the Hadoop cluster, the faster it works.
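To make the split-and-distribute idea concrete, here is a minimal sketch in Python. It is not Hadoop code and does not use Hadoop's API. It only imitates the pattern on one machine, with threads standing in for the computers in a cluster. The record layout (a "referrer" field) and all function names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Break the full record list into roughly equal pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """'Map' step: a worker counts referrers in its own chunk only."""
    return Counter(record["referrer"] for record in chunk)


def reduce_counts(partial_counts):
    """'Reduce' step: merge the per-chunk counts into one total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def count_referrers(records, n_workers=3):
    """Split the records, count each chunk in parallel, then merge."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for cluster machines (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


# Hypothetical visitor log: which site led each visitor here.
visits = [
    {"referrer": "forum.example"},
    {"referrer": "blog.example"},
    {"referrer": "forum.example"},
    {"referrer": "search.example"},
    {"referrer": "forum.example"},
    {"referrer": "blog.example"},
]

totals = count_referrers(visits)
```

Each worker sees only its own slice of the log, and the final merge is cheap. That is why adding machines generally speeds up this kind of job.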

The professional networking site LinkedIn found it could use Hadoop to speed up an important feature by a factor of 10. LinkedIn now employs Hadoop for its "people you may know" feature, which uses complex formulas to suggest possible acquaintances who aren't yet in users' networks. Hadoop's efficiency allowed LinkedIn to use a more sophisticated algorithm that required more computing power but improved results, says engineer Jay Kreps. "We saw a dramatic increase in the number of LinkedIn connections," he says.

Hadoop's usefulness isn't limited to Internet companies. Cloudera, a software start-up, aims to bring Hadoop's processing power to a variety of industries. Cloudera's CEO, Mike Olson, says even old-line businesses can amass terabytes of useful data. For instance, some retailers ask for customers' phone numbers at the register as a way of identifying them and tracking their purchases. Using these customer logs, stores could, say, look for shoppers who bought diapers six years ago and target them with back-to-school promotions.
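The diaper example boils down to a filter over a purchase log. The sketch below shows that logic in plain Python on a tiny, made-up log. The field names ("phone", "item", "date") and the five-to-seven-year window are assumptions for illustration, not details from any retailer's system. On terabytes of data, the same filter would run on each machine's chunk of the log.

```python
from datetime import date


def back_to_school_targets(purchases, today, min_years=5, max_years=7):
    """Return phone numbers of shoppers who bought diapers roughly six years ago."""
    targets = set()
    for purchase in purchases:
        years_ago = (today - purchase["date"]).days / 365.25
        if purchase["item"] == "diapers" and min_years <= years_ago <= max_years:
            targets.add(purchase["phone"])
    return targets


# Hypothetical register log keyed by customer phone number.
purchase_log = [
    {"phone": "555-0101", "item": "diapers", "date": date(2003, 9, 15)},
    {"phone": "555-0102", "item": "diapers", "date": date(2008, 5, 1)},
    {"phone": "555-0103", "item": "batteries", "date": date(2003, 9, 20)},
]

targets = back_to_school_targets(purchase_log, today=date(2009, 10, 1))
```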

Setting up Hadoop does require technical prowess. LinkedIn and ImageShack were able to turn to their own engineers, but other businesses may find they need to hire a programmer or consultant familiar with Hadoop. Some businesses may also need more hardware. Cloudera's rule of thumb is at least one server per terabyte of data, but the required computing power varies depending on the complexity of the analysis. ImageShack ended up buying 10 top-of-the-line servers for its Hadoop cluster. Companies can avoid buying equipment by using cloud computing services, like Amazon EC2. Amazon offers an additional service, Amazon Elastic MapReduce, specifically to run Hadoop using its cloud computing services.
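Cloudera's rule of thumb translates into simple arithmetic. The helper below is a rough sketch of that calculation. The function name and the adjustable servers-per-terabyte parameter are assumptions for illustration, and the result is only a floor, since more complex analysis may call for more machines.

```python
import math


def min_servers(data_terabytes, servers_per_terabyte=1.0):
    """Estimate the minimum cluster size: at least one server per terabyte."""
    return max(1, math.ceil(data_terabytes * servers_per_terabyte))


# Hypothetical data sizes, in terabytes.
small_cluster = min_servers(0.5)
larger_cluster = min_servers(9.2)
```

Under this rule, a company with 9.2 terabytes of data would start with at least 10 servers.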

Whether you run Hadoop on your servers or in the cloud, it can be a powerful tool. One of Cloudera's clients, Rackspace, an Internet hosting provider based in San Antonio, uses Hadoop and its own servers to run a technical support application that simplifies customer service. The company's servers collect information about every e-mail that passes through its system, such as who sent the message and where it's going. Hadoop crunches and indexes that data, so that when a customer calls to report an e-mail that didn't go through, the representative answering the phone can quickly pull up the details and pinpoint the problem. "We like Hadoop," says Olson, "because we think it's in position to solve a problem that's just about to happen."

For more information about cloud computing services as well as tips on securely managing customer information, visit www.inc.com/keyword/oct09.

Last updated: Oct 1, 2009



