Truecasing For The Portage System

From National Research Council Canada

Author	Agbago, Akakpo; Kuhn, Roland; Foster, George
Format	Text, Article
Conference	International Conference on Recent Advances in Natural Language Processing (RANLP-05), September 21-24, 2005, Borovets, Bulgaria
Abstract	This paper presents a truecasing technique, that is, a technique for restoring the normal case form of a fully lowercased or partially cased text. The technique combines several statistical components: an N-gram language model, a case mapping model, and a specialized language model for unknown words. The system can also distinguish between “title” and “non-title” lines and apply different statistical models to each type of line. The system was trained on data from the English portion of the Canadian parliamentary Hansard corpus and on English-language texts from a corpus of China-related stories; it was tested on a separate set of texts from the China-related corpus. The system achieved 96% case accuracy when the China-related test corpus had been completely lowercased, which represents an 80% relative error rate reduction over the unigram baseline technique. The technique was subsequently implemented as a module called Portage-Truecasing inside a machine translation system called Portage, and its effect on the overall performance of Portage was tested. In this paper, we explore the truecasing concept and then explain the models used.
Publication date	2005
In	International Conference on Recent Advances in Natural Language Processing (RANLP-05) [Proceedings].
Language	English
NRC number	NRCC 48515
NPARC number	5763859
Record identifier	eab2cda0-07bf-403a-af3c-835ae30583ab
Record created	2009-03-29
Record modified	2020-10-09
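
The abstract compares the system against a unigram baseline and reports case accuracy and relative error rate reduction. The record does not describe how that baseline or those metrics were computed, so the following is only a minimal sketch under stated assumptions: the baseline is taken to map each lowercased token to its most frequent cased form in the training text, unknown tokens are left unchanged, and case accuracy is taken to be the fraction of tokens whose case matches the reference. The function names and the toy sentences are hypothetical and are not taken from the paper or from Portage.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(cased_sentences):
    """Count the cased surface forms seen for every lowercased token.

    Assumption: whitespace tokenization, no special handling of
    sentence-initial capitals or title lines.
    """
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    # Keep only the most frequent cased form for each lowercased token.
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(lowercased_sentence, model):
    """Replace each token by its most frequent cased form.

    Tokens never seen in training are left as they are (an assumption;
    the abstract mentions a dedicated model for unknown words instead).
    """
    return " ".join(model.get(tok, tok) for tok in lowercased_sentence.split())


def case_accuracy(hypothesis, reference):
    """Fraction of reference tokens whose case is reproduced exactly."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum(h == r for h, r in zip(hyp, ref))
    return matches / len(ref)


def relative_error_reduction(baseline_accuracy, system_accuracy):
    """Share of the baseline's errors that the system removes."""
    baseline_error = 1.0 - baseline_accuracy
    system_error = 1.0 - system_accuracy
    return (baseline_error - system_error) / baseline_error


# Hypothetical toy training text (not from the Hansard or China-related corpora).
training = [
    "The committee met in Ottawa .",
    "Members of the committee visited China .",
    "Trade with China grew in the spring .",
]
model = train_unigram_truecaser(training)
restored = truecase("the committee visited ottawa and china .", model)
print(restored)
print(case_accuracy(restored, "The committee visited Ottawa and China ."))
```

In this toy run the sentence-initial "The" is not restored, because the lowercase form "the" is more frequent in the training text; this is the kind of context-dependent error that a unigram model cannot fix and that an N-gram language model can address. As a consistency check on the reported figures, and as an inference not stated in the record, `relative_error_reduction(0.80, 0.96)` evaluates to approximately 0.8, so a baseline accuracy of about 80% would be compatible with the 96% accuracy and 80% relative error rate reduction given in the abstract.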

Date modified: 2024-07-27