
Conference Keynote Speaker
Tuesday, January 8, 2008
12:45pm, Monarchy Ballroom, Hilton Waikoloa Village
Old ideas from classical statistics, as well as new ideas from machine learning and
data mining, can be used to extract useful information from data archives.
Abstract: From Gauss to Google: Data Analysis in the Digital Age
Dr. Padhraic Smyth is a Professor in the Department of Computer Science at the University of
California, Irvine. He is also Director of the Center for Machine Learning and
Intelligent Systems, and a member of the Institute for Mathematical Behavioral
Sciences and the Institute for Genomics and Bioinformatics (all at UC Irvine).
Dr. Smyth's research interests include machine learning, data mining,
statistical pattern recognition, and applied statistics. He has published over
100 papers in these areas.
He received best paper awards at the 1997 and 2002 ACM SIGKDD Conferences, a National Science Foundation Faculty CAREER award in 1997, and an IBM Faculty Partnership Award in 2001. He is co-author of Modeling the Internet and the Web: Probabilistic Methods and Algorithms (with Pierre Baldi and Paolo Frasconi), published by Wiley in 2003. He is also co-author of a graduate text in data mining, Principles of Data Mining (with David Hand and Heikki Mannila), published by MIT Press in August 2001.
He has served as an associate editor for the Journal of the American Statistical Association, the IEEE Transactions on Knowledge and Data Engineering, and the Machine Learning Journal. He is a founding editor for the Journal of Data Mining and Knowledge Discovery, and a founding editorial board member of the Journal of Machine Learning Research and the journal Bayesian Analysis.
He received a first class honors degree in Electronic Engineering from University College Galway (National University of Ireland) in 1984, and the MSEE and PhD degrees from the Electrical Engineering Department at the California Institute of Technology in 1985 and 1988, respectively. From 1988 to 1996 he conducted research at NASA's Jet Propulsion Laboratory, and he has been on the faculty at UC Irvine since 1996.
Dr. Smyth also has extensive experience working in the private sector in various capacities. He is a founding partner in TopicSeek LLC (www.topicseek.com), a startup company in Irvine that specializes in the application of statistical text mining algorithms to large text data sets. He has consulted extensively on the development of novel data mining and statistical techniques for the analysis of large data sets with companies such as Nokia, Yahoo!, Microsoft, GlaxoSmithKline, and AT&T. A number of his algorithms have been commercially developed; for example, Dr. Smyth was the inventor of the sequence-mining algorithm in the 2005 release of Microsoft's SQL Server product.
From Gauss to Google: Data Analysis in the Digital Age
The growth of the Web is producing unprecedented volumes of data in digital form, including
massive databases of search queries, vast logs of user navigation patterns, huge
online text archives, and more.
This talk will discuss how old ideas from classical statistics, as well as new
ideas from machine learning and data mining, can be used to extract useful
information from such data archives. The talk will begin with an overview of
general concepts in Web data analysis, illustrated via a number of real-world
examples. The remainder of the presentation will focus on recent advances in the
area of statistical text mining, illustrating how useful information can be
automatically extracted from large text data archives.
A number of real-world data sets will be used as examples to illustrate the
underlying ideas, including archives of New York Times articles, historical
records of the Pennsylvania Gazette from the 18th century, large databases of
scientific publications such as PubMed and CiteSeer, and emails
from Enron obtained by the US Department of Justice.