Crawling the German Health Web: Exploratory Study and Graph Analysis

Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of a...

Bibliographic Details
Main Authors: Zowalla, Richard; Wetter, Thomas; Pfeifer, Daniel
Format: Article
Language: English
Published: JMIR Publications 2020-07-01
Series: Journal of Medical Internet Research
Online Access: http://www.jmir.org/2020/7/e17853/
_version_ 1818788150505373696
author Zowalla, Richard
Wetter, Thomas
Pfeifer, Daniel
author_facet Zowalla, Richard
Wetter, Thomas
Pfeifer, Daniel
author_sort Zowalla, Richard
collection DOAJ
description Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3).
Objective: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW’s graph structure covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.
Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non–health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach.
Results: In total, n=22,405 seed URLs with the country-code top-level domains .de (85.36%, 19,126/22,405), .at (6.83%, 1530/22,405), and .ch (7.81%, 1749/22,405) were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) by nonprofit organizations, and 35% (26/75) by private organizations or individuals.
Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow major information hubs and important content providers on the GHW to be determined. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines.
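Several of the figures in the description are simple ratios and can be cross-checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not taken from the study: the helper name `share` and the constant names are hypothetical, and the page counts are the rounded values given in the description. It recomputes the seed-target recall, the overall share of relevant pages, and the edges-per-node ratio of the host graph.

```python
def share(part: float, whole: float) -> float:
    """Return part / whole, rejecting an empty or negative denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Figures copied from the description field of this record.
TARGETS_FOUND, TARGETS_TOTAL = 4105, 5000
# Page counts are rounded to 0.1 million in the description.
RELEVANT_PAGES, NONRELEVANT_PAGES = 13_500_000, 119_500_000
NODES, EDGES = 215_372, 403_175

# Seed-target recall: fraction of known target pages the crawler reached.
seed_target_recall = share(TARGETS_FOUND, TARGETS_TOTAL)

# Share of all fetched pages that the classifier judged relevant.
overall_relevant_share = share(RELEVANT_PAGES, RELEVANT_PAGES + NONRELEVANT_PAGES)

# Edges per node of the host-aggregated graph.
edges_per_node = share(EDGES, NODES)

if __name__ == "__main__":
    print(f"seed-target recall:     {seed_target_recall:.3f}")
    print(f"overall relevant share: {overall_relevant_share:.4f}")
    print(f"edges per node:         {edges_per_node:.3f}")
```

The recall (0.821) and the edges-per-node ratio (1.872, matching the reported average degree) agree with the description. The overall relevant share computed from the rounded totals is about 10.15%, which is a different quantity from the reported average harvest rate of 19.76%; the description does not say how that average was formed, so the sketch does not attempt to reproduce it.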
first_indexed 2024-12-18T14:19:06Z
format Article
id doaj.art-1bca216adbf94786aede67a0e7f6c39b
institution Directory Open Access Journal
issn 1438-8871
language English
last_indexed 2024-12-18T14:19:06Z
publishDate 2020-07-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj.art-1bca216adbf94786aede67a0e7f6c39b 2022-12-21T21:04:54Z eng JMIR Publications Journal of Medical Internet Research 1438-8871 2020-07-01 22 7 e17853 10.2196/17853 Crawling the German Health Web: Exploratory Study and Graph Analysis Zowalla, Richard; Wetter, Thomas; Pfeifer, Daniel http://www.jmir.org/2020/7/e17853/
title Crawling the German Health Web: Exploratory Study and Graph Analysis
title_sort crawling the german health web exploratory study and graph analysis
url http://www.jmir.org/2020/7/e17853/
work_keys_str_mv AT zowallarichard crawlingthegermanhealthwebexploratorystudyandgraphanalysis
AT wetterthomas crawlingthegermanhealthwebexploratorystudyandgraphanalysis
AT pfeiferdaniel crawlingthegermanhealthwebexploratorystudyandgraphanalysis