The text available on the Web and beyond embeds unprecedented volumes of valuable structured data, "hidden" in natural language. For example, a news article might discuss an outbreak of an infectious disease, reporting the name of the disease, the number of people affected, and the geographical regions involved. Traditional keyword search, the prevalent query paradigm for text, is often insufficiently expressive for complex information needs that require structured data embedded in text. For such needs, users (e.g., an epidemiologist compiling statistics, as reported in the media, on recent food-borne disease outbreaks in a remote country) are forced to embark on labor-intensive cycles of keyword-based document retrieval and manual document filtering, until they locate the appropriate (structured) information.
To move beyond traditional keyword search, this project exploits information extraction technology, which identifies structured data in text, to enable structured querying. A "one-size-fits-all" querying approach is inappropriate for this extraction-based scenario, so the project explores a range of structured query paradigms that capture diverse user information needs. Sometimes users (e.g., a high-school student in need of some quick examples and statistics for a report on recent salmonella outbreaks in developing countries) are after a few exploratory results, which should be returned fast; at other times, users (e.g., the above epidemiologist investigating food-borne diseases) are after comprehensive results, for which a longer wait is acceptable. The project develops specialized cost-based query optimizers for each query paradigm, accounting for the efficiency and, critically, the result quality of the query execution plans. The technology produced will assist a vast range of users and information needs by enabling efficient, diverse interactions with text databases --for sophisticated searching and data mining-- that are cumbersome or impossible with today's technology.
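As a rough illustration of what "accounting for efficiency and result quality" can mean, the sketch below picks the fastest execution plan whose estimated quality meets the user's requirements. It is a minimal sketch under stated assumptions, not the project's optimizer: the `Plan` fields, the plan names, the `choose_plan` function, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Plan:
    """A hypothetical execution plan with made-up cost and quality estimates."""
    name: str
    est_seconds: float    # estimated execution time
    est_recall: float     # estimated fraction of correct tuples that are returned
    est_precision: float  # estimated fraction of returned tuples that are correct


def choose_plan(plans: Iterable[Plan], min_recall: float,
                min_precision: float) -> Optional[Plan]:
    """Return the fastest plan meeting both quality thresholds, or None."""
    feasible = [p for p in plans
                if p.est_recall >= min_recall and p.est_precision >= min_precision]
    return min(feasible, key=lambda p: p.est_seconds, default=None)


# Illustrative plans only; the estimates are invented for this example.
PLANS = [
    Plan("keyword_filtered_extraction", 40.0, 0.30, 0.90),
    Plan("classifier_filtered_scan", 900.0, 0.70, 0.85),
    Plan("full_scan_extraction", 3600.0, 0.95, 0.80),
]

# Exploratory need: a few precise results, returned fast.
exploratory = choose_plan(PLANS, min_recall=0.2, min_precision=0.85)

# Comprehensive need: high recall, with a longer wait acceptable.
comprehensive = choose_plan(PLANS, min_recall=0.9, min_precision=0.75)

print(exploratory.name, comprehensive.name)
```

With these invented estimates, the exploratory request selects the cheap keyword-filtered plan, while the comprehensive request can only be satisfied by the slow full scan. The same collection and extraction system can therefore call for different plans depending on the query paradigm.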
The research and educational components of the project rely on --and encourage-- a tight integration of three complementary Computer Science disciplines, namely, natural language processing, information retrieval, and databases. The project also provides source code, for experimentation and evaluation, to the community at large over the Web, at http://reel.cs.columbia.edu/.
Acknowledgments: This research is supported by the National Science Foundation under Grant IIS-0811038, as well as by two Yahoo! Faculty Research and Engagement Gifts. Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the National Science Foundation or of Yahoo!
At Columbia:
- Pablo Barrio (graduated)
- Luis Gravano (contact)
- Alpa Jain (graduated)
- Matthew Solomon (graduated)
Current and Former External Collaborators:
- Eugene Agichtein (Emory University)
- Chris Develder (Ghent University)
- AnHai Doan (University of Wisconsin-Madison)
- Helena Galhardas (University of Lisbon)
- Panagiotis Ipeirotis (NYU)
- Gonçalo Simões (University of Lisbon)
- Cong Yu (Google Research)
Publications:
- Sampling Strategies for Information Extraction over the Deep Web, P. Barrio and L. Gravano, in Information Processing & Management, vol. 53, no. 2, pages 309–331, Mar. 2017.
- Ranking Deep Web Text Collections for Scalable Information Extraction, P. Barrio, L. Gravano, and C. Develder, in Proc. of the 24th ACM Conference on Information and Knowledge Management (CIKM 2015), 2015.
- Learning to Rank Adaptively for Scalable Information Extraction, P. Barrio, G. Simões, H. Galhardas, and L. Gravano, in Proc. of the 18th International Conference on Extending Database Technology (EDBT 2015), 2015.
- REEL: A Relation Extraction Learning Framework (poster), P. Barrio, G. Simões, H. Galhardas, and L. Gravano, in Proc. of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2014), 2014.
- When Speed Has a Price: Fast Information Extraction Using Approximate Algorithms, G. Simões, H. Galhardas, and L. Gravano, in Proc. of the VLDB Endowment, vol. 6, no. 13, 2013.
- Quality Impact of Value Matching and Scoring in Top-k Entity Attribute Extraction, M. Solomon, L. Gravano, and C. Yu, in Proc. of the 5th International Workshop on Ranking in Databases (DBRank 2011), 2011.
- Popularity-Guided Top-k Extraction of Entity Attributes, M. Solomon, C. Yu, and L. Gravano, in Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB 2010), 2010.
- Join Optimization of Information Extraction Output: Quality Matters!, A. Jain, P. Ipeirotis, A. Doan, and L. Gravano, in Proc. of the 25th IEEE International Conference on Data Engineering (ICDE 2009), 2009.
- Building Query Optimizers for Information Extraction: The SQoUT Project, A. Jain, P. Ipeirotis, and L. Gravano, in SIGMOD Record, Special Issue on "Managing Information Extraction," vol. 37, no. 4, Dec. 2008.
- Optimizing SQL Queries over Text Databases, A. Jain, A. Doan, and L. Gravano, in Proc. of the 24th IEEE International Conference on Data Engineering (ICDE 2008), 2008.
- Towards a Query Optimizer for Text-Centric Tasks, P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, in ACM Transactions on Database Systems, vol. 32, no. 4, Nov. 2007.
- SQL Queries Over Unstructured Text Databases, A. Jain, A. Doan, and L. Gravano, in Proc. of the 23rd IEEE International Conference on Data Engineering (ICDE 2007), 2007 (short 3-page "poster" paper).
- To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks ("Best Paper" Award), P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, in Proc. of the 2006 ACM SIGMOD International Conference on Management of Data, 2006.