HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding

Word shape coding (WSC) is a method of document image retrieval (DIR) based on keyword spotting. With this method, a word can be recognized in a document image by identifying only some of its features. In this paper, a hierarchical word spotting method, namely HWS, is presented fo...

Bibliographic Details
Main Authors: Mohammadreza Keyvanpour, Reza Tavoli, Saeed Mozaffari
Format: Article
Language: English
Published: Iran Telecom Research Center 2015-06-01
Series: International Journal of Information and Communication Technology Research
Subjects: tree indexing, information retrieval, document image, word shape coding, farsi document
Online Access: http://ijict.itrc.ac.ir/article-1-102-en.html
_version_ 1811169309159849984
author Mohammadreza Keyvanpour
Reza Tavoli
Saeed Mozaffari
author_facet Mohammadreza Keyvanpour
Reza Tavoli
Saeed Mozaffari
author_sort Mohammadreza Keyvanpour
collection DOAJ
description Word shape coding (WSC) is a method of document image retrieval (DIR) based on keyword spotting. With this method, a word can be recognized in a document image by identifying only some of its features. This paper presents a hierarchical word spotting method, HWS, for Farsi document image retrieval through WSC. HWS retrieves document images by using a new indexing method. First, the words in the document images are shape coded based on topological properties: the number of sub-words, ascenders, descenders, and holes. A new feature used in this paper is the position of dots in the word, which yields six features: one top dot, two top dots, three top dots, one bottom dot, two bottom dots, and three bottom dots. Using these features increases retrieval precision. Then, all of the shape codes are indexed by building a tree, and retrieval is performed by looking up the keyword query in the tree. The results show that the proposed technique is very fast for large volumes of documents. The time complexity of both successful and unsuccessful searches is O(log_k n), which is better than that of the ordinal method. The time complexity of indexing is also O(log_k n). The HWS method is tested on the Bijankhan database; 87867 common words from this database are used to build the dictionary. Test results show an average precision of 0.83 and an average recall of 0.94.
first_indexed 2024-04-10T16:40:16Z
format Article
id doaj.art-a5084144d3b64c4cb755e61c644f1961
institution Directory Open Access Journal
issn 2251-6107
2783-4425
language English
last_indexed 2024-04-10T16:40:16Z
publishDate 2015-06-01
publisher Iran Telecom Research Center
record_format Article
series International Journal of Information and Communication Technology Research
spelling doaj.art-a5084144d3b64c4cb755e61c644f1961 2023-02-08T07:54:43Z eng Iran Telecom Research Center International Journal of Information and Communication Technology Research 2251-6107 2783-4425 2015-06-01 725970 HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding Mohammadreza Keyvanpour0 Reza Tavoli1 Saeed Mozaffari2 Word shape coding (WSC) is a method of document image retrieval (DIR) based on keyword spotting. With this method, a word can be recognized in a document image by identifying only some of its features. This paper presents a hierarchical word spotting method, HWS, for Farsi document image retrieval through WSC. HWS retrieves document images by using a new indexing method. First, the words in the document images are shape coded based on topological properties: the number of sub-words, ascenders, descenders, and holes. A new feature used in this paper is the position of dots in the word, which yields six features: one top dot, two top dots, three top dots, one bottom dot, two bottom dots, and three bottom dots. Using these features increases retrieval precision. Then, all of the shape codes are indexed by building a tree, and retrieval is performed by looking up the keyword query in the tree. The results show that the proposed technique is very fast for large volumes of documents. The time complexity of both successful and unsuccessful searches is O(log_k n), which is better than that of the ordinal method. The time complexity of indexing is also O(log_k n). The HWS method is tested on the Bijankhan database; 87867 common words from this database are used to build the dictionary. Test results show an average precision of 0.83 and an average recall of 0.94. http://ijict.itrc.ac.ir/article-1-102-en.html tree indexing information retrieval document image word shape coding farsi document
spellingShingle Mohammadreza Keyvanpour
Reza Tavoli
Saeed Mozaffari
HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
International Journal of Information and Communication Technology Research
tree indexing
information retrieval
document image
word shape coding
farsi document
title HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
title_full HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
title_fullStr HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
title_full_unstemmed HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
title_short HWS: A Hierarchical Word Spotting Method for Farsi Printed Words Through Word Shape Coding
title_sort hws a hierarchical word spotting method for farsi printed words through word shape coding
topic tree indexing
information retrieval
document image
word shape coding
farsi document
url http://ijict.itrc.ac.ir/article-1-102-en.html
work_keys_str_mv AT mohammadrezakeyvanpour hwsahierarchicalwordspottingmethodforfarsiprintedwordsthroughwordshapecoding
AT rezatavoli hwsahierarchicalwordspottingmethodforfarsiprintedwordsthroughwordshapecoding
AT saeedmozaffari hwsahierarchicalwordspottingmethodforfarsiprintedwordsthroughwordshapecoding
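
Illustrative sketch (not part of the catalog record). The description field above outlines shape coding each word by topological features and indexing the codes in a tree. The Python below is a minimal sketch of that idea under stated assumptions: the names ShapeCode, Node, and ShapeCodeTree, the use of integer counts for every feature, and the one-tree-level-per-feature layout are all hypothetical and are not taken from the paper. Feature extraction from images is out of scope, so the feature values and the "doc:word" labels in the demo are made up by hand.

```python
from dataclasses import dataclass, astuple, field
from typing import Dict, List


@dataclass(frozen=True)
class ShapeCode:
    """Topological features of one word image (hypothetical field names)."""
    sub_words: int
    ascenders: int
    descenders: int
    holes: int
    # Six dot-position features named in the abstract, stored here as counts.
    top_dots_1: int = 0
    top_dots_2: int = 0
    top_dots_3: int = 0
    bottom_dots_1: int = 0
    bottom_dots_2: int = 0
    bottom_dots_3: int = 0


@dataclass
class Node:
    """One tree node: children keyed by the next feature value."""
    children: Dict[int, "Node"] = field(default_factory=dict)
    entries: List[str] = field(default_factory=list)


class ShapeCodeTree:
    """Assumed layout: one tree level per feature, word locations at the leaves."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, code: ShapeCode, entry: str) -> None:
        # Walk down one level per feature, creating nodes as needed.
        node = self.root
        for value in astuple(code):
            node = node.children.setdefault(value, Node())
        node.entries.append(entry)

    def search(self, code: ShapeCode) -> List[str]:
        # Follow the query's feature values; stop early on the first miss.
        node = self.root
        for value in astuple(code):
            node = node.children.get(value)
            if node is None:
                return []
        return list(node.entries)


if __name__ == "__main__":
    tree = ShapeCodeTree()
    # Hand-made codes and labels, for illustration only.
    code_a = ShapeCode(sub_words=2, ascenders=1, descenders=0, holes=1, top_dots_1=1)
    code_b = ShapeCode(sub_words=2, ascenders=1, descenders=0, holes=1, bottom_dots_2=1)
    tree.insert(code_a, "doc1:word5")
    tree.insert(code_b, "doc1:word7")
    print(tree.search(code_a))
```

In this sketch the two demo codes differ only in their dot features, which shows how the dot-position features can separate words that share the same sub-word, ascender, descender, and hole counts. The lookup cost here depends on the fixed number of features rather than on the number of indexed words, so the sketch does not reproduce or verify the O(log_k n) bound stated in the abstract.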