Download | - View accepted manuscript: From HTML documents to web tables and rules (PDF, 890 KiB)
|
---|
Author | Simon, K.; Lausen, G.; Boley, Harold¹ |
---|
Affiliation | - National Research Council of Canada. Security and Disruptive Technologies
|
---|
Format | Text, Article |
---|
Conference | The Eighth International Conference on Electronic Commerce (ICEC 2006), August 14-16, 2006, Fredericton, New Brunswick, Canada |
---|
Subject | data extraction; data record alignment; rule-based languages |
---|
Abstract | We present a browser-extending Semantic Web extraction system that maps HTML documents to tables and, where possible, to rules. First, the basic data extractor ViPER distills and reorganizes semi-structured information into a tabular data structure, which can again be browsed and/or submitted to further machine processing. Second, exemplifying the latter, the extended knowledge extractor Rex ViPER mines the resulting tables for structural properties and functional dependencies. Rules are generated to obtain a more compact and manageable, often also enriched, knowledge representation. The resulting fully structured information, RuleML-serialized facts and rules, can be stored along with the original documents, queried by rule engines such as OO jDREW and FLORID, and interchanged between Web Services. Thus, Rex ViPER contributes to automating the construction of a machine-processable Semantic Web. |
---|
Publication date | 2006 |
---|
Publisher | National Research Council of Canada. Institute for Information Technology |
---|
Copyright statement | - © 2006 National Research Council of Canada
|
---|
Language | English |
---|
Peer reviewed | Yes |
---|
NRC number | NRCC 49310 |
---|
NPARC number | 5764332 |
---|
Record identifier | 4b3ab6b5-8cb0-4ed8-838b-7ed43067e340 |
---|
Record created | 2009-03-29 |
---|
Record modified | 2024-02-29 |
---|