Information Extraction at Columbia
Computer Science Department
Columbia University


Project Summary

The text available on the Web and beyond embeds unprecedented volumes of valuable structured data, "hidden" in natural language. For example, a news article might discuss an outbreak of an infectious disease, reporting the name of the disease, the number of people affected, and the geographical regions involved. Traditional keyword search, the prevalent query paradigm for text, is often insufficiently expressive for complex information needs that require structured data embedded in text. For such needs, users (e.g., an epidemiologist compiling statistics, as reported in the media, on recent food-borne disease outbreaks in a remote country) are forced to embark on labor-intensive cycles of keyword-based document retrieval and manual document filtering until they locate the appropriate (structured) information.

To move beyond traditional keyword search, this project exploits information extraction technology, which identifies structured data in text, to enable structured querying. To capture diverse user information needs and depart from a "one-size-fits-all" querying approach, which is inappropriate for this extraction-based scenario, this project explores a wealth of structured query paradigms: sometimes users (e.g., a high-school student in need of some quick examples and statistics for a report on recent salmonella outbreaks in developing countries) want a few exploratory results, which should be returned quickly; at other times, users (e.g., the above epidemiologist investigating food-borne diseases) want comprehensive results, for which a longer wait is acceptable. The project develops specialized cost-based query optimizers for each query paradigm, accounting for the efficiency and, critically, the result quality of the query execution plans. The technology produced will assist a vast range of users and information needs by enabling efficient, diverse interactions with text databases --for sophisticated searching and data mining-- that are cumbersome or impossible with today's technology.
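
As a rough illustration of what "accounting for efficiency and result quality" can mean, the sketch below picks the fastest execution plan that still meets a caller-supplied quality target. It is not the project's optimizer: the plan names, the time, recall, and precision estimates, and the thresholds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with hypothetical, pre-computed estimates."""
    name: str
    est_seconds: float    # estimated execution time
    est_recall: float     # estimated fraction of the correct tuples returned
    est_precision: float  # estimated fraction of returned tuples that are correct


def choose_plan(plans, min_recall, min_precision):
    """Return the fastest plan meeting both quality thresholds, or None."""
    feasible = [
        p for p in plans
        if p.est_recall >= min_recall and p.est_precision >= min_precision
    ]
    return min(feasible, key=lambda p: p.est_seconds, default=None)


# Hypothetical plans and numbers, for illustration only.
PLANS = [
    Plan("keyword-filter-then-extract", 30.0, 0.20, 0.90),
    Plan("classifier-filter-then-extract", 300.0, 0.60, 0.85),
    Plan("scan-and-extract-everything", 3600.0, 0.95, 0.80),
]

# Exploratory need: a few good results, returned quickly.
exploratory = choose_plan(PLANS, min_recall=0.10, min_precision=0.80)

# Comprehensive need: nearly all results, a longer wait is acceptable.
comprehensive = choose_plan(PLANS, min_recall=0.90, min_precision=0.80)

print(exploratory.name)
print(comprehensive.name)
```

With these made-up estimates, the exploratory request selects the cheap keyword-filtered plan, while the comprehensive request can only be satisfied by the exhaustive plan. The same optimizer logic therefore yields different plans for different information needs.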

The research and educational components of the project rely on --and encourage-- a tight integration of three complementary Computer Science disciplines, namely, natural language processing, information retrieval, and databases. The project also provides source code, for experimentation and evaluation, to the community at large over the Web, on the website at

Acknowledgments: This research is supported by the National Science Foundation under Grant IIS-0811038, as well as by two Yahoo! Faculty Research and Engagement Gifts. Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the National Science Foundation or of Yahoo!


At Columbia:

Current and Former External Collaborators:


Luis Gravano