 |
NRI scientist develops
a search engine that ranks tables
by title, document content and text references
NRI scientist Prasenjit Mitra, Assistant Professor in the School of
Information Sciences and Technology at The Pennsylvania State University,
has developed a search engine that identifies and extracts tables
from PDF documents, indexes them, and ranks the search results
using factors that include the table's title, text references to
the table and the date of its publication.
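The idea of ranking by several factors can be sketched in a few lines of Python. This is only an illustration, not the TableRank function itself: the `TableRecord` fields, the term-overlap measure and the weights 0.5, 0.3 and 0.2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TableRecord:
    """Hypothetical metadata for one extracted table."""
    title: str
    reference_text: str  # sentences in the document that refer to the table
    year: int            # publication year of the source document


def term_overlap(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text (case-insensitive)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def score(query: str, table: TableRecord, newest_year: int) -> float:
    """Weighted sum of three factors; the weights are illustrative only."""
    title_part = term_overlap(query, table.title)
    reference_part = term_overlap(query, table.reference_text)
    # Newer documents score higher; the newest one gets 1.0.
    recency_part = 1.0 / (1 + max(0, newest_year - table.year))
    return 0.5 * title_part + 0.3 * reference_part + 0.2 * recency_part


def rank(query: str, tables: list[TableRecord]) -> list[TableRecord]:
    """Return the tables ordered from highest to lowest score."""
    if not tables:
        return []
    newest_year = max(table.year for table in tables)
    return sorted(tables, key=lambda t: score(query, t, newest_year), reverse=True)


if __name__ == "__main__":
    sample = [
        TableRecord("Sample preparation conditions",
                    "Table 1 summarizes the conditions", 2007),
        TableRecord("Rate constants for ozone reactions",
                    "Table 2 lists the rate constants measured at 298 K", 2005),
    ]
    for table in rank("rate constants", sample):
        print(table.title)
```

In this sketch a strong match on the title and reference text outweighs the recency bonus, so the older table about rate constants is listed first for the query "rate constants".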
ChemXSeer: In this project, he was involved in constructing an
integrated database and digital library for chemical kinetics data.
He said:
- We have developed a chemical name and formula search engine.
We are investigating novel information extraction, document segmentation,
and indexing schemes.
- We have also developed a table search engine, TableSeer, that
uses a novel ranking function, TableRank, to rank tables extracted
automatically from digital documents.
- Other topics of interest are web crawling (especially focused
crawling), query expansion, and analysis of blogs and social networks.
- In tests with documents from the Royal Society of Chemistry,
TableSeer correctly identified and retrieved 93.5 percent of tables
created in text-based formats.
In a search of 10,000 documents from conferences, Prof. Mitra and
his team found that more than 70 percent of papers in chemistry,
biology and computer science included tables. Furthermore, most
of those documents had multiple tables.
TableSeer automates the process of finding and extracting tables and
captures data not only within the table but also in the tables' titles
and footnotes. In addition, it enables column-name-based search so that
a user can search for a particular column in a table.
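A column-name-based search can be pictured as a lookup in an inverted index from column names to tables. The following Python sketch shows that general technique under stated assumptions; the table identifiers, the column names and both function names are hypothetical and are not taken from TableSeer.

```python
from collections import defaultdict


def build_column_index(tables: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized column name to the ids of the tables that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for table_id, columns in tables.items():
        for name in columns:
            index[name.strip().lower()].add(table_id)
    return index


def search_by_column(index: dict[str, set[str]], column_name: str) -> list[str]:
    """Return the sorted ids of tables that have a column with this exact name."""
    return sorted(index.get(column_name.strip().lower(), set()))


if __name__ == "__main__":
    # Hypothetical extracted tables: id -> list of column headers.
    extracted = {
        "doc1-table2": ["Compound", "Rate constant", "Temperature (K)"],
        "doc7-table1": ["Compound", "Yield (%)"],
    }
    index = build_column_index(extracted)
    print(search_by_column(index, "compound"))
    print(search_by_column(index, "Rate constant"))
```

Matching here is exact after lowercasing and trimming; a real system would likely need to handle abbreviations, units and partial matches as well.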
TableSeer can be tested online at http://chemxseer.ist.psu.edu;
the source code will be made available near the completion of
the project.
The study, "TableSeer: Automatic Table Metadata Extraction
and Searching in Digital Libraries," was presented at the recent
16th International World Wide Web Conference in Alberta, Canada.

|