Abstract: | <p>The growing popularity of the Resource Description Framework (RDF) as a mode for
data exchange and integration has driven the growth of RDF datasets. Some large-scale RDF datasets cannot be stored and processed efficiently on a single node. A common approach to processing large RDF datasets is to partition the data across a cluster of shared-nothing servers and use a distributed query evaluation algorithm. It is commonly assumed in the literature that the performance of query processing in such systems is limited mainly by network communication. In this thesis, we show that this assumption does not always hold, and we argue that, when partitioning, distributing the workload evenly among servers should take priority over minimizing network communication. Moreover, we present a new RDF partitioning method based on Louvain community detection, which drastically reduces communication but does not yield a corresponding decrease in query running times. This is because strongly connected partitions can lead to workload imbalance among the servers. We present a further refinement of our technique that aims to
strike a balance between reducing communication and spreading processing more evenly. Our empirical evaluation shows that this approach can improve load balance and hence reduce both communication and query times.</p>
|