Download | View final version: Online near-duplicate detection of news articles (PDF, 397 KiB) |
---|
Link | https://www.aclweb.org/anthology/2020.lrec-1.156/ |
---|
Author | Rodier, Simon; Carter, David |
---|
Affiliation | National Research Council of Canada. Digital Technologies |
---|
Format | Text, Article |
---|
Conference | The 12th Language Resources and Evaluation Conference, LREC 2020, 11–16 May 2020, Marseille, France |
---|
Subject | duplicate detection; near-duplicate detection; news corpora; natural language processing; situational awareness |
---|
Abstract | Near-duplicate documents are particularly common in news media corpora. Editors often update wirefeed articles to address space constraints in print editions or to add local context; journalists often lightly modify previous articles with new information or minor corrections. Near-duplicate documents have potentially significant costs, including bloating corpora with redundant information (biasing techniques built upon such corpora) and requiring additional human and computational analytic resources for marginal benefit. Filtering near-duplicates out of a collection is thus important, and is particularly challenging in applications that require them to be filtered out in real-time with high precision. Previous near-duplicate detection methods typically work offline to identify all near-duplicate pairs in a set of documents. We propose an online system which flags a near-duplicate document by finding its most likely original. This system adapts the shingling algorithm proposed by Broder (1997), and we test it on a challenging dataset of web-based news articles. Our online system presents state-of-the-art F1-scores, and can be tuned to trade precision for recall and vice-versa. Given its performance and online nature, our method can be used in many real-world applications. We present one such application, filtering near-duplicates to improve productivity of human analysts in a situational awareness tool. |
---|
Publication date | 2020-05 |
---|
Publisher | European Language Resources Association |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | ef1d98c6-6606-4b04-9549-5e002aa1826f |
---|
Record created | 2020-11-02 |
---|
Record modified | 2021-09-17 |
---|
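
The abstract describes an online system that flags a near-duplicate by finding its most likely original, adapting the shingling algorithm of Broder (1997). The sketch below is a minimal illustration of that general idea, not the authors' implementation: it uses word-level shingles and Jaccard resemblance. The names `shingles`, `resemblance`, and `OnlineDetector`, the shingle width `w`, the threshold value, the regex tokenizer, and the choice to store only unflagged documents are all assumptions made for this example.

```python
import re
from typing import Dict, FrozenSet, Optional, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, w: int = 3) -> FrozenSet[Shingle]:
    """Return the set of contiguous w-word shingles of a text (assumed tokenizer)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Very short texts yield a single shingle (or none if empty).
        return frozenset([tuple(tokens)]) if tokens else frozenset()
    return frozenset(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))


def resemblance(a: FrozenSet[Shingle], b: FrozenSet[Shingle]) -> float:
    """Jaccard resemblance |A & B| / |A | B| between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class OnlineDetector:
    """Hypothetical online detector: each incoming document is compared
    against the documents seen so far that were not flagged themselves."""

    def __init__(self, threshold: float = 0.5, w: int = 3) -> None:
        self.threshold = threshold  # assumed value; tune for precision vs. recall
        self.w = w
        self.seen: Dict[str, FrozenSet[Shingle]] = {}

    def check(self, doc_id: str, text: str) -> Optional[Tuple[str, float]]:
        """Return (original_id, score) if text is a near-duplicate, else None."""
        doc = shingles(text, self.w)
        best: Optional[Tuple[str, float]] = None
        # Linear scan for clarity; a real-time system would need an index.
        for other_id, other in self.seen.items():
            score = resemblance(doc, other)
            if score >= self.threshold and (best is None or score > best[1]):
                best = (other_id, score)
        if best is None:
            # Only unflagged documents are kept as candidate originals.
            self.seen[doc_id] = doc
        return best
```

Raising `threshold` in this sketch makes flagging stricter, which loosely mirrors the precision/recall trade-off mentioned in the abstract. For example, `OnlineDetector(threshold=0.5, w=2)` returns `None` for a first article and the identifier of that article when a lightly edited copy arrives afterwards.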