A large dataset of scientific text reuse in Open-Access publications

Abstract: We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains 91 million cases of reused text passages found in 4.2 million unique open-access publications. Cases range from overlap of as few as eight words to near-duplicate p...

Bibliographic Details
Main Authors: Lukas Gienapp, Wolfgang Kircheis, Bjarne Sievers, Benno Stein, Martin Potthast
Format: Article
Language: English
Published: Nature Portfolio 2023-01-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-022-01908-z