Effective partitioning of RDF data for distributed query answering

Bibliographic details
Main author: Banda, F
Other authors: Boris, M
Format: Thesis
Language: English
Published: 2021
Summary: <p>The growing popularity of the Resource Description Framework (RDF) as a means of data exchange and integration has driven the growth of RDF datasets. Some large-scale RDF datasets cannot be stored and processed efficiently on a single node. A common approach to processing large RDF datasets is to partition the data across a cluster of shared-nothing servers and use a distributed query evaluation algorithm. It is commonly assumed in the literature that the performance of query processing in such systems is limited mainly by network communication. In this thesis, we show that this assumption does not always hold, and we argue that, when partitioning, an even distribution of workload among servers should take priority over minimising network communication. Moreover, we present a new RDF partitioning method based on Louvain community detection, which drastically reduces communication but does not yield a corresponding decrease in query running times. This is because strongly connected partitions can lead to workload imbalance among the servers. We present a further refinement of our technique that aims to strike a balance between reducing communication and spreading processing more evenly. Our empirical evaluation shows that such an approach can improve load balance and hence reduce both communication and query times.</p>