Mention the term "Big Data," and most people's eyes glaze over. One can imagine from their vacant stares that they have been transported to "The Matrix," and that a series of ones and zeros has begun to scroll by in an infinite number of rows and columns.
If you think about data as it is laid out on a spreadsheet, then Big Data is not just about lots of rows. It's not just about having more data; it's about knowing enough about each data point so we can understand what actually matters. In fact, all the excitement is about the columns. It's a subtle point, but machine learning is making an impact here because it can look across thousands of columns (or, to use machine learning lingo, "features") and identify which ones matter or have predictive power.
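To make the "columns" idea concrete, here is a minimal sketch of scoring each feature by how strongly it tracks an outcome. It uses plain Pearson correlation as a stand-in for "predictive power," which is a simplification; real machine learning systems use richer methods. The column names and numbers are invented toy data, and `pearson` and `rank_features` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (name, |correlation|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented toy data: each key is a column ("feature"), each index is a row.
columns = {
    "on_base_pct": [0.30, 0.32, 0.35, 0.38, 0.40],
    "jersey_number": [7, 42, 13, 7, 21],
    "constant_flag": [1, 1, 1, 1, 1],
}
runs_scored = [60, 66, 75, 84, 90]

ranking = rank_features(columns, runs_scored)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Adding more rows to this table would only sharpen the scores of the three columns already there; only a new column can surface a signal nobody was measuring before.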
In "Moneyball," for example, this involved learning that the feature "on-base percentage" mattered disproportionately. For statisticians to realize that it mattered, "on-base percentage" first had to be something that was measured. Similarly, the earliest work on facial recognition, "eigenfaces," started with a dataset of 300 "features" and "learned" that only 12 features were needed to reconstruct a face recognizable by a person.
Collecting data on as many features as possible is crucial to broadening the scope of potential inputs and independent variables that drive results. Organizations often collect data, but they overemphasize the collection of rows rather than columns. More data is not big data. They collect occurrences of data rather than potential causal inputs. They aren't testing hypotheses; they're collecting data exhaust.
The web is our largest repository of digital information, making it one of the best places to look for more features on almost anything. There are fundamentally two ways to extract data from the web and add it to your "spreadsheet." One is manual: this scales up by building a team of data entry professionals, often in India or the Philippines, to visit websites and type relevant data into a database. The other method involves hiring programmers to write web scrapers and web crawlers (or, in a small number of cases, to use APIs where they exist) to get the data automatically. Neither method works particularly well, and both have limits on the scale of data collection that is feasible.
According to Pratap Ranade, CEO of Kimono Labs, the problem is this binary choice between machine automation and more people. Ranade believes in cyborgs: not implants, but systems and interfaces that let humans and computers each do what they are best at and achieve a goal together, just as in a symbiotic biological system. Palantir Director of Engineering Shyam Sankar makes this point in his TED talk, pointing to the strength of systems that combine human and computing power. He argues that a weak human working with an average machine can in fact beat a superior machine alone.
Kimono Labs is a platform that puts data science into the hands of individuals. The premise is that non-programmers are adept at looking at a website and recognizing what data matters to them, what it is, and where it is. They can identify, point, click and label relevant data. Based on their clicks, Kimono learns rules to automatically identify the data marked by the person, and it then builds a "robot" to return to the website and pull the desired content on a schedule. What's counterintuitive is that it is the highly paid programmer whose skill is commoditized, and it's the hypothesis-wielding non-programmer who can accurately identify classes of information and use Kimono at scale to build a uniquely valuable set of structured data feeds.
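The general idea of turning one click into a reusable extraction rule can be sketched in a few lines. This is an illustration only, not Kimono's actual algorithm, which the article does not describe: it assumes the "rule" is simply the tag and first CSS class of the clicked element, and the sample page, `learn_rule`, and `extract` are all hypothetical. It also assumes simple, well-formed HTML without void elements such as `<br>`.

```python
from html.parser import HTMLParser

class RuleLearner(HTMLParser):
    """Find the (tag, class) of the element whose text equals the clicked example."""
    def __init__(self, example):
        super().__init__()
        self.example = example
        self.stack = []   # currently open elements as (tag, classes)
        self.rule = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.rule is None and self.stack and data.strip() == self.example:
            tag, classes = self.stack[-1]
            if classes:
                self.rule = (tag, classes[0])

class TextByRule(HTMLParser):
    """Collect the text of every element matching a (tag, class) rule."""
    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.results.append("")
        elif self.depth and tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

def learn_rule(html, clicked_text):
    learner = RuleLearner(clicked_text)
    learner.feed(html)
    return learner.rule

def extract(html, rule):
    parser = TextByRule(*rule)
    parser.feed(html)
    return [text.strip() for text in parser.results]

# Hypothetical page: the person "clicks" one price, the rule finds the rest.
PAGE = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$19.50</span></li>
  <li><span class="name">Doohickey</span> <span class="price">$4.25</span></li>
</ul>
"""

rule = learn_rule(PAGE, "$9.99")
prices = extract(PAGE, rule)
print(rule, prices)
```

The division of labor mirrors the argument: the person supplies the judgment about which value matters, and the machine generalizes that single example into a rule it can rerun on a schedule.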
Because retrieving data is a perennial challenge for large companies, particularly in data-hungry fields like finance and marketing, many have begun turning to tools like Kimono. Such tools offer a competitive advantage: companies can draw on the domain knowledge of non-technical employees to frame questions and data in the right way, and then automate the process of data collection.
In example cases with Fortune 500 companies and top consulting firms, Kimono has been able to cut the cost of obtaining datasets by a factor of 100, putting within reach data that was previously inaccessible to even the most sophisticated players. Big data is coming, and as tools like Kimono infiltrate Fortune 500 skyscrapers, the nature of work will continue to change. Non-technical domain experts will become even more valuable than their technical counterparts, as the technology itself becomes the commoditized skill.