Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first hackathon of Genomics & Informatics Ann...

Bibliographic Details
Main Authors: Sunho Kim, Royoung Kim, Hee-Jo Nam, Ryeo-Gyeong Kim, Enjin Ko, Han-Su Kim, Jihye Shin, Daeun Cho, Yurhee Jin, Soyeon Bae, Ye Won Jo, San Ah Jeong, Yena Kim, Seoyeon Ahn, Bomi Jang, Jiheyon Seong, Yujin Lee, Si Eun Seo, Yujin Kim, Ha-Jeong Kim, Hyeji Kim, Hye-Lynn Sung, Hyoyoung Lho, Jaywon Koo, Jion Chu, Juwon Lim, Youngju Kim, Kyungyeon Lee, Yuri Lim, Meongeun Kim, Seonjeong Hwang, Shinhye Han, Sohyeun Bae, Sua Kim, Suhyeon Yoo, Yeonjeong Seo, Yerim Shin, Yonsoo Kim, You-Jung Ko, Jihee Baek, Hyejin Hyun, Hyemin Choi, Ji-Hye Oh, Da-Young Kim, Hyun-Seok Park
Format: Article
Language: English
Published: Korea Genome Organization 2020-09-01
Series: Genomics & Informatics
Subjects: biomedical text mining, corpus, text analytics
Online Access: http://genominfo.org/upload/pdf/gi-2020-18-3-e33.pdf
collection DOAJ
description This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and at the same time to acquire tangible and transferable skills. The projects proposed during the hackathon harness an internal database containing different versions of the corpus and annotations.
first_indexed 2024-12-22T04:48:40Z
format Article
id doaj.art-08959ab38bd6445085b440f3881bf177
institution Directory Open Access Journal
issn 2234-0742
language English
last_indexed 2024-12-22T04:48:40Z
publishDate 2020-09-01
publisher Korea Genome Organization
record_format Article
series Genomics & Informatics
doi 10.5808/GI.2020.18.3.e33
volume 18
issue 3
article e33
affiliation Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
title Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0
topic biomedical text mining
corpus
text analytics
url http://genominfo.org/upload/pdf/gi-2020-18-3-e33.pdf