Detection of tortured phrases in scientific literature

This paper presents several automatic detection methods for extracting so-called tortured phrases from scientific papers. These tortured phrases, e.g. "flag to clamor" instead of "signal to noise", result from paraphrasing tools used to evade plagiarism detection. We built a dataset and evaluated several strategies for flagging previously undocumented tortured phrases. The proposed and tested methods are based on language models and rely either on embedding similarities or on masked-token prediction. We found that an approach that uses token prediction and propagates the scores to the chunk level gives the best results. With a recall of 0.87 and a precision of 0.61, it could retrieve new tortured phrases to be submitted to domain experts for validation.
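
The idea of scoring tokens by masked prediction and propagating the scores to chunks can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not specify the model, the chunker, the aggregation rule, or the threshold: `toy_masked_prob` is a hypothetical stand-in for a real masked language model, the chunk spans are given by hand, a chunk takes the minimum score of its tokens, and 0.05 is an arbitrary cut-off.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given the tokens and a position, return the
# probability the model assigns to the original token when it is masked.
MaskedProb = Callable[[Sequence[str], int], float]


def token_scores(tokens: Sequence[str], masked_prob: MaskedProb) -> List[float]:
    """Mask each token in turn and record the probability of the original."""
    return [masked_prob(tokens, i) for i in range(len(tokens))]


def chunk_scores(
    tokens: Sequence[str],
    chunks: Sequence[Tuple[int, int]],
    masked_prob: MaskedProb,
) -> List[Tuple[str, float]]:
    """Propagate token scores to chunks given as (start, end) spans.

    Assumption: a chunk is as suspicious as its least likely token,
    so the chunk score is the minimum of its token scores.
    """
    scores = token_scores(tokens, masked_prob)
    return [(" ".join(tokens[s:e]), min(scores[s:e])) for s, e in chunks]


def flag_chunks(
    tokens: Sequence[str],
    chunks: Sequence[Tuple[int, int]],
    masked_prob: MaskedProb,
    threshold: float = 0.05,  # arbitrary illustrative cut-off
) -> List[str]:
    """Return the chunks whose score falls below the threshold."""
    return [
        text
        for text, score in chunk_scores(tokens, chunks, masked_prob)
        if score < threshold
    ]


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    """Toy stand-in for a masked language model (not a real model)."""
    unlikely = {"flag": 0.01, "clamor": 0.02}
    return unlikely.get(tokens[i], 0.6)


tokens = "the flag to clamor ratio was high".split()
chunks = [(1, 5), (5, 7)]  # hand-written spans; a real chunker is assumed
candidates = flag_chunks(tokens, chunks, toy_masked_prob)
print(candidates)
```

In this toy run only the chunk containing the improbable tokens is flagged. A real pipeline would replace `toy_masked_prob` with the predictions of an actual masked language model, and the flagged chunks would remain candidates for expert validation rather than confirmed tortured phrases.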

Domains

Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]; Digital Libraries [cs.DL]; Document and Text Processing

Main file

WIESP.pdf (266.58 KB)

Origin: Files produced by the author(s)

Contributor: Cyril Labbé

https://inria.hal.science/hal-04423458

Submitted on: Thursday, February 1, 2024, 4:21:18 PM

Last modification on: Thursday, April 4, 2024, 9:37:08 PM

Dates and versions

hal-04423458, version 1 (01-02-2024)

Licence

Attribution

Identifiers

HAL Id: hal-04423458, version 1
arXiv: 2402.03370

Cite

Eléna Martel, Martin Lentschat, Cyril Labbé. Detection of tortured phrases in scientific literature. Proceedings of the 2nd Workshop on Information Extraction from Scientific Publications, Nov 2023, Bali, Indonesia. ⟨hal-04423458⟩

Collections

UGA CNRS LIG LIG_GLSI_SIGMA LIG_TDCGE_GETALP LIG_SIDCH
