"Table Extraction Using Conditional Random Fields" by David Pinto

Unpublished Paper

Table Extraction Using Conditional Random Fields

(2003)

David Pinto
Andrew McCallum, University of Massachusetts - Amherst
Xing Wei
W. Bruce Croft

Abstract

The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content, however, presents difficulties for traditional language modeling techniques. This paper presents the use of conditional random fields (CRFs) for table extraction, and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout and language features, and as a result, they perform significantly better. We show experimental results on plain-text government statistical reports in which tables are located with 92% F1, and their constituent lines are classified into 12 table-related categories with 94% accuracy. We also discuss future work on undirected graphical models for segmenting columns, finding cells, and classifying them as data cells or label cells.
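
As an illustration of what "rich and overlapping layout and language features" can look like for plain-text lines, the following is a minimal sketch and not the authors' implementation. The feature names, the three labels (the paper uses 12 categories), and all weights are hypothetical and set by hand; a real CRF would learn its weights from labeled data. The sketch only shows the decoding step of a linear-chain model: each line receives several features that may fire at once, and Viterbi search picks the highest-scoring label sequence.

```python
import re

# Hypothetical label set; the paper classifies lines into 12 categories.
LABELS = ["NONTABLE", "SEPARATOR", "DATAROW"]

def line_features(line):
    """Overlapping binary features for one plain-text line (assumed feature set)."""
    stripped = line.strip()
    digits = sum(c.isdigit() for c in line)
    alpha = sum(c.isalpha() for c in line)
    # Runs of two or more spaces between tokens suggest column alignment.
    gaps = re.findall(r"\S {2,}(?=\S)", line)
    return {
        "blank": stripped == "",
        "separator": stripped != "" and set(stripped) <= set("-=_ "),
        "mostly_digits": digits > alpha,
        "has_column_gaps": len(gaps) >= 2,
        "has_alpha": alpha > 0,
    }

# Hand-set weights for illustration only; not learned, not from the paper.
STATE_W = {
    ("separator", "SEPARATOR"): 3.0,
    ("mostly_digits", "DATAROW"): 1.5,
    ("has_column_gaps", "DATAROW"): 2.0,
    ("has_alpha", "NONTABLE"): 1.0,
    ("blank", "NONTABLE"): 1.0,
}
TRANS_W = {
    ("SEPARATOR", "DATAROW"): 1.0,
    ("DATAROW", "DATAROW"): 1.0,
    ("NONTABLE", "NONTABLE"): 0.5,
}

def state_score(feats, label):
    """Sum the weights of every active feature paired with this label."""
    return sum(STATE_W.get((name, label), 0.0) for name, on in feats.items() if on)

def viterbi(lines):
    """Return the highest-scoring label sequence under the weights above."""
    feats = [line_features(l) for l in lines]
    if not feats:
        return []
    best = [{y: state_score(feats[0], y) for y in LABELS}]
    back = []
    for f in feats[1:]:
        scores, ptrs = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS_W.get((p, y), 0.0))
            scores[y] = best[-1][prev] + TRANS_W.get((prev, y), 0.0) + state_score(f, y)
            ptrs[y] = prev
        best.append(scores)
        back.append(ptrs)
    y = max(LABELS, key=lambda l: best[-1][l])
    path = [y]
    for ptrs in reversed(back):
        y = ptrs[y]
        path.append(y)
    return path[::-1]

sample = [
    "Crop production by state",
    "--------------------------",
    "Corn      120     135",
    "Wheat      80      95",
]
print(viterbi(sample))
# ['NONTABLE', 'SEPARATOR', 'DATAROW', 'DATAROW']
```

Note that the last sample line has more letters than digits, so `mostly_digits` does not fire; the column-gap feature and the row-to-row transition weight still carry it to `DATAROW`. Combining evidence from several overlapping features and from neighboring lines is the property the abstract attributes to CRFs.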

Keywords

Information Systems Applications, Software Engineering, Tables, conditional random fields, hidden Markov models, information extraction, metadata, question answering

Disciplines

Computer Sciences

Publication Date

2003

Comments

This is the pre-published version harvested from CIIR.

Citation Information

David Pinto, Andrew McCallum, Xing Wei and W. Bruce Croft. "Table Extraction Using Conditional Random Fields" (2003)
Available at: http://works.bepress.com/andrew_mccallum/37/