HICSS-45 DISTINGUISHED LECTURE
Friday, January 6
12:45 Haleakala Ballroom
Grand Wailea, Maui

Apache Hadoop and the Big Data Revolution


Dr. Doug Cutting, Cloudera Architect
Co-founder, Apache Hadoop

Doug Cutting is an advocate and creator of open-source search technology. He originated Lucene and Nutch, the latter with Mike Cafarella; both are open-source search technology projects now managed through the Apache Software Foundation, of which he is currently chairman. He is also Architect at Cloudera.

Mr. Cutting holds a bachelor's degree from Stanford University. Prior to developing Lucene, he held search technology positions at Excite, Apple Inc., and Xerox PARC. Lucene, a search indexer, and Nutch, a web crawler, are two key components of an open-source search platform, which first crawls the Web for content and then structures it into a searchable index.
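As a rough illustration of the indexing-and-search half of that pipeline, the sketch below builds a tiny in-memory Lucene index and queries it. It assumes a recent Lucene release (8.x or later), and the field names, URL, and sample text are hypothetical, chosen only for illustration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
      public static void main(String[] args) throws Exception {
        // Index: add one "crawled" page to an in-memory index.
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
          Document doc = new Document();
          doc.add(new TextField("url", "http://example.org/hadoop", Field.Store.YES));
          doc.add(new TextField("body", "Apache Hadoop stores and processes big data",
                                Field.Store.YES));
          writer.addDocument(doc);
        }

        // Search: parse a query against the "body" field and print matching URLs.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          Query query = new QueryParser("body", analyzer).parse("hadoop");
          for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("url"));
          }
        }
      }
    }

In a full platform, a crawler such as Nutch would supply the documents being indexed; here the single document is supplied inline to keep the sketch self-contained.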

Cutting's contributions to these two projects extended the concepts and capabilities of general open-source software projects such as Linux and MySQL into the important vertical domain of search. While it is difficult to track the total number of installations of these platforms, public announcements of the use of Lucene, and of Solr, a search server built on it, by various venture-backed startups indicate a significant level of adoption. Perhaps the most significant deployment of Lucene is Wikipedia, where it powers search for the entire site.

In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large-scale computations to be trivially parallelized across large clusters of servers. Cutting recognized the paper's importance for extending Lucene to extremely large (web-scale) search problems, and created the open-source Hadoop framework, which allows applications based on the MapReduce paradigm to run on large clusters of commodity hardware. Cutting was an employee of Yahoo! when he started the Hadoop project.
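As a rough sketch of that paradigm, the example below is the canonical word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce): the map phase emits a count of one per word, and the reduce phase sums the counts per word. The class names and the command-line input and output paths are illustrative, not part of the lecture material.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: for every word in the input split, emit (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same job runs unchanged on a single machine or on a cluster of thousands of commodity servers; the framework handles splitting the input, scheduling the map and reduce tasks, and recovering from node failures.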

Abstract

For decades, as computing hardware has improved, the affordable amount of data processing and storage has increased exponentially. Software that takes advantage of this increase has lagged, creating a large gap between the raw capacity that folks can afford and what they can actually use. In late 2003, Google began describing its approach to filling this gap. Soon after, a colleague and I began commoditizing Google's approach through an open-source implementation called Apache Hadoop. Now a data revolution is in full bloom as institutions are able to save and productively use vastly larger amounts of data.