What's Next: Data Disasters
When it comes to the Internet, nothing is ever really forgotten and everything leaves a trail. This can be good or bad for business, depending on where you stand in relation to the law. These data trails can be used to find who has been stealing your trade secrets--or to bust you if you are the thief. They can show who is working and who is goofing off. They can tell you a heck of a lot about who your online customers are, allowing you to make better decisions and more money. This information is extraordinarily valuable, and there are laws that require companies to produce it, and do it right now. But it hasn't been easy to do until a San Francisco start-up called Addamark Technologies figured it out.
In the pre-Enron, pre-WorldCom, pre-Tyco, pre-you-name-the-crooked-company days, the legal rules for retaining communication records said only that a company had to be consistent. You couldn't, for example, keep all e-mails except those having to do with a hostile takeover or a case under litigation. If it was your company's policy to erase all old e-mails once a year or once a month, that was okay, as long as the policy was in writing and was strictly followed. Enron, for example, wiped clean its e-mail slate every 72 hours, which is hardly a surprise. Today the rules have changed. Public and many private companies have to keep a copy of written communication of every type (letters, e-mails, even Internet instant messages) for up to seven years. You have to keep the copies in a form that allows their authenticity to be verified, whatever that means. Not only that, but you must keep a second copy of every message in a different location in case of fire or natural disaster. The second copies must be on nonerasable storage media, such as optical disks. And if the SEC asks you to provide a copy of any given document, or of every given document, you have until close of business today to do it. Almost no company can do this.
If you are a health care organization, an insurance company, or even a human resources department, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) requires, as of this year, that if a client asks you for a list of every person or organization with whom you have shared his or her medical records, you have to provide that list...on the spot. Almost no organization can do this.
And if you aren't a public company, don't engage in health care, or have no human resources department, you still aren't off the hook, because these are becoming the accepted standards for all companies. If you still dump e-mail every 72 hours and end up in court, you are effectively guilty as charged.
Penalties for noncompliance right now are mild, but they are sure to get stronger in the future, right up to sending people to jail. The new SEC regulations, for example, hold the CEO personally responsible for record retention, meaning he or she, not some nerd in the computer room, will be doing time. Then there are the civil penalties that will come from the inevitable lawsuits. It is possible that every customer of a hospital or clinic could walk into small claims court tomorrow and walk out with $1,000 or more because the paper trail of who got their records couldn't be produced or was incomplete. Every hospital and clinic in America is vulnerable, for they are all in violation. And while HIPAA doesn't specifically provide for private legal actions, neither does it prohibit them if other laws are being violated too. So we're in a whole lot of legal trouble and most companies don't have the technology to comply with laws already on the books, much less the even stricter ones likely to follow.
And then there is data theft. Electronic documents are stolen all the time, and it usually isn't through some high-tech cracking scheme but an inside job. The bad guy is often a disgruntled employee, or someone who appears to be an employee but is really a competitor using an employee's login name and password obtained through a process called "social engineering." "This is Mitch in IT; we're working on the network and need your login and password to check something out." Only Mitch is calling from your top competitor. This really happens. There is an evidence trail of all this in your phone system and on your servers, if only it could be found.
The problem here isn't generating the information, which is done automatically by every e-mail, database, or Web server application. The problem isn't storage, because data storage is cheap and always getting cheaper. The problem is finding what you need--a problem that until recently looked insurmountable. Log data, which is what we are talking about, is huge. Just the e-mail system for a large company can generate terabytes of log data per day (a terabyte is one thousand billion bytes) concerning who said what to whom and what path the message followed. That's for one day. The new SEC regulations say a company has to hold those records for approximately 2,000 days, and most companies are deciding just to keep them forever.
Finding what you want in this pile of data would seem to be an easy problem for computers to solve, given that they are so good at fetching and carrying. Servers generate log files indicating what happened to every file or message, log files go into a huge database, and you run queries against this database, right? Unfortunately, these log files are bigger than any database ever. They are bigger than database designers ever expected files to be. They are almost too big to even function in a database application. That's because when data is inserted into an Oracle or IBM DB2 database application, the data gets bigger. It grows by about 30% as metadata (data describing the data) is added. The result is a pile of data petabytes in size (a petabyte is one thousand terabytes). That's not too much data to store, but it's too much data to search. It could take days, weeks, even months to find what you need.
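To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It uses the round figures above: one terabyte of log data per day (the low end of "terabytes"), about 2,000 days of retention, and 30% growth from metadata. The scan rate of 100 megabytes per second is purely an assumption for illustration, and the function names are made up for this example.

```python
TB = 10**12  # one terabyte: one thousand billion bytes
PB = 10**15  # one petabyte: one thousand terabytes


def stored_bytes(daily_bytes, days, overhead_pct=30):
    """Total size after loading `days` of logs into a database
    that adds `overhead_pct` percent of metadata."""
    raw = daily_bytes * days
    return raw + raw * overhead_pct // 100


def full_scan_days(total_bytes, bytes_per_second):
    """Days needed to read every byte once at the given scan rate."""
    return total_bytes / bytes_per_second / 86_400


total = stored_bytes(1 * TB, 2_000)
scan_rate = 100 * 10**6  # assumed: 100 MB/s on a single machine

print(f"{total / PB:.1f} petabytes stored")
print(f"{full_scan_days(total, scan_rate):.0f} days for one full scan")
```

With these inputs the sketch comes out to 2.6 petabytes and roughly 300 days for a single pass on one machine, which is the "too much data to search" problem in miniature.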
Until very recently the only searchable logging databases of such size I had heard of were at Amazon.com, eBay, and Google--each developed privately over a period of years and costing, in the case of Amazon at least, hundreds of millions of dollars. Amazon.com says it has so far spent more than $900 million on computer technology for its business and continues to invest at a rate of $200 million per year, a lot of it going to massaging log data. Faced with spending $200 million to avoid a $25,000 fine from the SEC, most companies would pay the fine--except for that little part about the CEO going to jail.
It could have been argued that these legal requirements are unreasonable, even unenforceable. But then along came Addamark Technologies, which changed everything. Addamark makes the storage and searching of petabyte logging databases not simple but easy, and easy is what counts. What couldn't be done at all can now be done in seconds and for around 1% of what Amazon.com paid for the same capability.
Addamark began as an idea in the mind of Adam Sah, who was at that time head techie at Internet Pictures, or iPix, which owns the servers that hold all those pictures of goods for sale on eBay and throws them onto your screen. With an average of 16 million items for sale each day on eBay, most of them having one or more pictures, that's a lot of images. It is also a lot of surfing, since iPix had to transmit those pictures over and over again as required by 50 million potential bidders. Because iPix was paid every time a picture was transmitted, its log files were essentially its billing system and Sah wanted to find a way to generate a detailed bill every day.
Rather than just throw the log data into Oracle or DB2, Sah thought about log data and how it is different from other kinds of database entries. It doesn't change, for one thing, since logs are entirely retrospective and are supposed to tell the truth. Sah found that you can strip log data down to its barest form, then compress it at least 10-to-1 (something you can't do in a regular database), then actually search the compressed data for what you need.
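The general idea can be sketched in a few lines of Python. This is not Addamark's patented method, whose details are not described here; it is a simplified stand-in that strips each log line down to a few fields, compresses the result in write-once blocks with the standard zlib module, and searches by decompressing one block at a time rather than querying the compressed bytes directly. The log format, field names, and sample addresses are all hypothetical.

```python
import zlib


def strip_line(line):
    """Keep only timestamp, sender, and recipient from a hypothetical
    'timestamp sender recipient subject...' log line."""
    timestamp, sender, recipient, *_ = line.split()
    return f"{timestamp} {sender} {recipient}"


def compress_blocks(lines, block_size=1000):
    """Strip each line, then compress the result in fixed-size blocks.
    Logs never change, so each block is written once and never updated."""
    blocks = []
    for start in range(0, len(lines), block_size):
        chunk = "\n".join(
            strip_line(line) for line in lines[start:start + block_size]
        )
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def search(blocks, term):
    """Yield matching records, decompressing one block at a time."""
    for block in blocks:
        for record in zlib.decompress(block).decode("utf-8").split("\n"):
            if term in record:
                yield record


# Made-up sample data: 28 routine messages and one suspicious one.
lines = [
    f"2003-01-{day:02d}T09:00 alice@example.com bob@example.com quarterly report"
    for day in range(1, 29)
]
lines.append("2003-01-29T09:00 mitch@example.com carol@example.com login check")

blocks = compress_blocks(lines, block_size=10)
print(list(search(blocks, "mitch@example.com")))
```

Because the records are never updated, nothing in this layout needs the row-level bookkeeping a general-purpose database carries, which is the property the approach leans on.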
The result is a new type of specialized database that can be of almost limitless size yet can be searched in seconds. Addamark can be filled with any kind of log data from any logging application, and if you want to see every e-mail that mentions Microsoft, or at which times and by whom a confidential document was transmitted, Addamark produces the goods almost instantly. All this and it runs not on mainframes or even big servers but on clusters of commodity PCs. Expanding your Addamark system can mean a trip to Best Buy.
Addamark is shipping today, to customers that include Agilent Technologies, Blue Cross-Blue Shield of North Dakota, Lehman Brothers, and Yahoo. In a high-tech depression this is a company that turned away venture capitalists. It is a 30-person firm in which 12 of the 30 are former CEOs or founders. Addamark, with its patented technology, could be the next Oracle. Remember the name; you might need it.
Robert X. Cringely is a writer, broadcaster, and entrepreneur specializing in technology. Contact him at firstname.lastname@example.org.