Hierarchical multi-label news article classification with distributed semantic model based features

Automatic news categorization is essential for handling the classification of multi-label news articles in online portals. This research employs several methods to improve the performance of a hierarchical multi-label classifier for Indonesian news articles. The first method is using...

Bibliographic Details
Main Authors: Ivana Clairine Irsan, Masayu Leylia Khodra
Format: Article
Language: English
Published: Universitas Ahmad Dahlan 2019-03-01
Series: IJAIN (International Journal of Advances in Intelligent Informatics)
Subjects: Multi-label classification; Hierarchical multi-label classification; CNN; Word embedding; News
Online Access: http://ijain.org/index.php/IJAIN/article/view/168
author Ivana Clairine Irsan
Masayu Leylia Khodra
collection DOAJ
description Automatic news categorization is essential for handling the classification of multi-label news articles in online portals. This research employs several methods to improve the performance of a hierarchical multi-label classifier for Indonesian news articles. The first method uses a Convolutional Neural Network (CNN) to build the top-level classifier. The second method calculates the average of the word vectors obtained from a distributed semantic model. The third method combines lexical and semantic methods to extract document features by multiplying word term frequency (lexical) with the word vector average (semantic). The model built using Calibrated Label Ranking as the multi-label classification method and trained using the Naïve Bayes algorithm has the best F1-measure, 0.7531. The multiplication of word term frequency and the average of word vectors was also used to build this classifier. This configuration improved multi-label classification performance by 4.25% compared to the baseline. The distributed semantic model that gave the best performance in this experiment was a 300-dimensional word2vec model of Wikipedia articles. The performance of the multi-label classification model is also influenced by the news release date: a difference in period between the training and testing data decreases the model's performance.
first_indexed 2024-04-13T14:06:15Z
format Article
id doaj.art-3040b887a8fa418b92bfa28bbc9eb48b
institution Directory Open Access Journal
issn 2442-6571
2548-3161
language English
last_indexed 2024-04-13T14:06:15Z
publishDate 2019-03-01
publisher Universitas Ahmad Dahlan
record_format Article
series IJAIN (International Journal of Advances in Intelligent Informatics)
spelling doaj.art-3040b887a8fa418b92bfa28bbc9eb48b; 2022-12-22T02:43:53Z; eng; Universitas Ahmad Dahlan; IJAIN (International Journal of Advances in Intelligent Informatics); 2442-6571; 2548-3161; 2019-03-01; 5; 1; 40; 47; 10.26555/ijain.v5i1.168; 108; Hierarchical multi-label news article classification with distributed semantic model based features; Ivana Clairine Irsan (Institut Teknologi Bandung); Masayu Leylia Khodra (Institut Teknologi Bandung); http://ijain.org/index.php/IJAIN/article/view/168; Multi-label classification; Hierarchical multi-label classification; CNN; Word embedding; News
title Hierarchical multi-label news article classification with distributed semantic model based features
topic Multi-label classification
Hierarchical multi-label classification
CNN
Word embedding
News
url http://ijain.org/index.php/IJAIN/article/view/168
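
The description above mentions two document features: an average of word vectors, and a combination of term frequency with that average. The record does not give the exact formulas, so the following is only a minimal sketch of one possible reading. It assumes the plain average is taken over the distinct in-vocabulary words of a document, and that the combined feature scales each word's vector by its relative term frequency before summing. The toy 3-dimensional embeddings, the example words, and the function names are hypothetical; the article itself reports 300-dimensional word2vec vectors.

```python
from collections import Counter

# Hypothetical toy embeddings, invented for illustration only.
# The article reports 300-dimensional word2vec vectors instead.
EMBEDDINGS = {
    "bank": [1.0, 0.0, 0.0],
    "pasar": [0.0, 1.0, 0.0],
    "gol": [0.0, 0.0, 1.0],
}
DIM = 3


def average_vector(tokens):
    """Mean of the vectors of the distinct in-vocabulary words (assumed reading)."""
    words = sorted({t for t in tokens if t in EMBEDDINGS})
    if not words:
        return [0.0] * DIM
    return [sum(EMBEDDINGS[w][i] for w in words) / len(words) for i in range(DIM)]


def tf_weighted_vector(tokens):
    """Sum of word vectors, each scaled by its relative term frequency (assumed reading)."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * DIM
    return [
        sum((c / total) * EMBEDDINGS[w][i] for w, c in counts.items())
        for i in range(DIM)
    ]


if __name__ == "__main__":
    doc = ["bank", "bank", "pasar", "tidakada"]  # last token is out of vocabulary
    print(average_vector(doc))
    print(tf_weighted_vector(doc))
```

Under these assumptions, a word that occurs more often pulls the combined feature toward its own vector, while the plain average treats every distinct word equally. Either vector could then be passed to a downstream classifier; that step is not sketched here.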