Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0
This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of Genomics & Informatics Ann...
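Main Authors: | Sunho Kim, Royoung Kim, Hee-Jo Nam, Ryeo-Gyeong Kim, Enjin Ko, Han-Su Kim, Jihye Shin, Daeun Cho, Yurhee Jin, Soyeon Bae, Ye Won Jo, San Ah Jeong, Yena Kim, Seoyeon Ahn, Bomi Jang, Jiheyon Seong, Yujin Lee, Si Eun Seo, Yujin Kim, Ha-Jeong Kim, Hyeji Kim, Hye-Lynn Sung, Hyoyoung Lho, Jaywon Koo, Jion Chu, Juwon Lim, Youngju Kim, Kyungyeon Lee, Yuri Lim, Meongeun Kim, Seonjeong Hwang, Shinhye Han, Sohyeun Bae, Sua Kim, Suhyeon Yoo, Yeonjeong Seo, Yerim Shin, Yonsoo Kim, You-Jung Ko, Jihee Baek, Hyejin Hyun, Hyemin Choi, Ji-Hye Oh, Da-Young Kim, Hyun-Seok Park |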
---|---|
Format: | Article |
Language: | English |
Published: | Korea Genome Organization, 2020-09-01 |
Series: | Genomics & Informatics |
Subjects: | biomedical text mining, corpus, text analytics |
Online Access: | http://genominfo.org/upload/pdf/gi-2020-18-3-e33.pdf |
_version_ | 1819114649914703872 |
---|---|
author | Sunho Kim, Royoung Kim, Hee-Jo Nam, Ryeo-Gyeong Kim, Enjin Ko, Han-Su Kim, Jihye Shin, Daeun Cho, Yurhee Jin, Soyeon Bae, Ye Won Jo, San Ah Jeong, Yena Kim, Seoyeon Ahn, Bomi Jang, Jiheyon Seong, Yujin Lee, Si Eun Seo, Yujin Kim, Ha-Jeong Kim, Hyeji Kim, Hye-Lynn Sung, Hyoyoung Lho, Jaywon Koo, Jion Chu, Juwon Lim, Youngju Kim, Kyungyeon Lee, Yuri Lim, Meongeun Kim, Seonjeong Hwang, Shinhye Han, Sohyeun Bae, Sua Kim, Suhyeon Yoo, Yeonjeong Seo, Yerim Shin, Yonsoo Kim, You-Jung Ko, Jihee Baek, Hyejin Hyun, Hyemin Choi, Ji-Hye Oh, Da-Young Kim, Hyun-Seok Park |
author_sort | Sunho Kim |
collection | DOAJ |
description | This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is known to be notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University in order to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and simultaneously to acquire tangible and transferable skills. The proposed projects during the hackathon harness an internal database containing different versions of the corpus and annotations. |
first_indexed | 2024-12-22T04:48:40Z |
format | Article |
id | doaj.art-08959ab38bd6445085b440f3881bf177 |
institution | Directory of Open Access Journals |
issn | 2234-0742 |
language | English |
last_indexed | 2024-12-22T04:48:40Z |
publishDate | 2020-09-01 |
publisher | Korea Genome Organization |
record_format | Article |
series | Genomics & Informatics |
spelling | doaj.art-08959ab38bd6445085b440f3881bf177; 2022-12-21T18:38:32Z; eng; Korea Genome Organization; Genomics & Informatics; 2234-0742; 2020-09-01; 18; 3; e33; 10.5808/GI.2020.18.3.e33; 620; Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0; Sunho Kim, Royoung Kim, Hee-Jo Nam, Ryeo-Gyeong Kim, Enjin Ko, Han-Su Kim, Jihye Shin, Daeun Cho, Yurhee Jin, Soyeon Bae, Ye Won Jo, San Ah Jeong, Yena Kim, Seoyeon Ahn, Bomi Jang, Jiheyon Seong, Yujin Lee, Si Eun Seo, Yujin Kim, Ha-Jeong Kim, Hyeji Kim, Hye-Lynn Sung, Hyoyoung Lho, Jaywon Koo, Jion Chu, Juwon Lim, Youngju Kim, Kyungyeon Lee, Yuri Lim, Meongeun Kim, Seonjeong Hwang, Shinhye Han, Sohyeun Bae, Sua Kim, Suhyeon Yoo, Yeonjeong Seo, Yerim Shin, Yonsoo Kim, You-Jung Ko, Jihee Baek, Hyejin Hyun, Hyemin Choi, Ji-Hye Oh, Da-Young Kim, Hyun-Seok Park; Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea; http://genominfo.org/upload/pdf/gi-2020-18-3-e33.pdf; biomedical text mining; corpus; text analytics |
title | Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0 |
title_sort | organizing an in class hackathon to correct pdf to text conversion errors of 1 0 |
topic | biomedical text mining corpus text analytics |
url | http://genominfo.org/upload/pdf/gi-2020-18-3-e33.pdf |