Unsupervised learning of semantic orientation from a hundred-billion-word corpus

From National Research Council Canada

Download	View final version: Unsupervised learning of semantic orientation from a hundred-billion-word corpus (PDF, 175 KiB)
DOI	Resolve DOI: https://doi.org/10.4224/8914027
Author	Turney, Peter¹; Littman, M.L.
Affiliation	National Research Council of Canada. NRC Institute for Information Technology
Format	Text, Technical Report
Physical description	9 leaves
Abstract	The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. The method involves issuing queries to a Web search engine and using pointwise mutual information to analyse the results. The algorithm is empirically evaluated using a training corpus of approximately one hundred billion words - the subset of the Web that is indexed by the chosen search engine. Tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80 percent. The 3,596 test words include adjectives, adverbs, nouns, and verbs. The accuracy is comparable with the results achieved by Hatzivassiloglou and McKeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives.
Publication date	2002-05-12
Publisher	National Research Council of Canada
In	Report (National Research Council of Canada. Radio and Electrical Engineering Division : ERB), ERB-1094 (12 May 2002).
Series	Report (National Research Council of Canada. Radio and Electrical Engineering Division : ERB), no. ERB-1094 (12 May 2002).
Language	English
Peer reviewed	No
NRC number	NRCC 44929
NPARC number	8914027
Record identifier	3d270c0f-73ce-4c1f-9641-05e85aff3620
Record created	2009-04-22
Record modified	2023-06-19
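
Illustrative note (not part of the original record): the abstract describes scoring a word by pointwise mutual information (PMI) computed from search-engine results. The Python sketch below shows one way such a score could be formed, assuming the orientation is the summed PMI with a few positive seed words minus the summed PMI with a few negative seed words. The function names, the seed words "good" and "bad", the smoothing constant, and all counts are hypothetical values invented for illustration; they are not taken from the report, and no real search engine is queried.

```python
import math


def pmi(joint, count_a, count_b, total):
    """PMI in bits: log2(p(a, b) / (p(a) * p(b))), estimated from raw counts."""
    return math.log2((joint * total) / (count_a * count_b))


def semantic_orientation(word, pos_seeds, neg_seeds, hits, co_hits, total,
                         smoothing=0.01):
    """Summed PMI with positive seeds minus summed PMI with negative seeds.

    hits maps a word to its (hypothetical) hit count; co_hits maps a
    (word, seed) pair to a (hypothetical) co-occurrence hit count.
    smoothing is an assumed constant that avoids log2(0) for unseen pairs.
    """
    score = 0.0
    for seeds, sign in ((pos_seeds, 1.0), (neg_seeds, -1.0)):
        for seed in seeds:
            joint = co_hits.get((word, seed), 0) + smoothing
            score += sign * pmi(joint, hits[word], hits[seed], total)
    return score


# Toy counts, invented for illustration only (not real search results).
TOTAL_DOCS = 1_000_000
HITS = {"honest": 1000, "disturbing": 1000, "good": 10000, "bad": 10000}
CO_HITS = {
    ("honest", "good"): 200,
    ("honest", "bad"): 20,
    ("disturbing", "good"): 20,
    ("disturbing", "bad"): 200,
}

honest_score = semantic_orientation(
    "honest", ["good"], ["bad"], HITS, CO_HITS, TOTAL_DOCS)
disturbing_score = semantic_orientation(
    "disturbing", ["good"], ["bad"], HITS, CO_HITS, TOTAL_DOCS)

# A score above zero is read as positive orientation, below zero as negative.
print("honest:", "positive" if honest_score > 0 else "negative")
print("disturbing:", "positive" if disturbing_score > 0 else "negative")
```

With these toy counts, "honest" co-occurs more often with the positive seed and "disturbing" with the negative seed, so their scores fall on opposite sides of zero. In a setting like the one the abstract describes, the counts would come from search-engine queries rather than a hard-coded table.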

Date modified: 2024-07-27