The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing

The NLP4NLP corpus contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ~270 million words. T...


Bibliographic Details
Main Authors:	Joseph Mariani, Gil Francopoulo, Patrick Paroubek, Frédéric Vernier
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2019-02-01
Series:	Frontiers in Research Metrics and Analytics
Subjects:	speech processing; natural language processing; text analytics; bibliometrics; scientometrics; informetrics
Online Access:	https://www.frontiersin.org/article/10.3389/frma.2018.00037/full

author_affiliation	Joseph Mariani (LIMSI-CNRS, Université Paris-Saclay, Orsay, France); Gil Francopoulo (Tagmatica, Paris, France); Patrick Paroubek (LIMSI-CNRS, Université Paris-Saclay, Orsay, France); Frédéric Vernier (LIMSI-CNRS, Université Paris-Saclay, Orsay, France)
collection	DOAJ
description	The NLP4NLP corpus contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ~270 million words. This paper presents an analysis of this corpus regarding the evolution of the research topics, with the identification of the authors who introduced them and of the publication where they were first presented, and the detection of epistemological ruptures. Linking the metadata, the paper content and the references allowed us to propose a measure of innovation for the research topics, the authors and the publications. In addition, it allowed us to study the use of language resources, in the framework of the paradigm shift between knowledge-based approaches and content-based approaches, and the reuse of articles and plagiarism between sources over time. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, articles, resources or publications.
id	doaj.art-2f904a75886f4100aff49b938b2bcbdb
institution	Directory Open Access Journal
issn	2504-0537

Similar Items