Processing skyline queries in centralised and distributed incomplete databases

Skyline queries provide a flexible query operator that returns data items (skylines) that are not dominated by other data items in all dimensions (attributes) of the database. Most existing skyline techniques determine the skylines by assuming that the values of dimens...


Bibliographic Details
Main Author: Alwan, Ali Amer
Format: Thesis
Language: English
Published: 2013
Subjects: Database management; Querying (Computer science); Data mining
Online Access: http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf
author Alwan, Ali Amer
collection UPM
description Skyline queries provide a flexible query operator that returns data items (skylines) that are not dominated by other data items in all dimensions (attributes) of the database. Most existing skyline techniques determine the skylines by assuming that the values of all dimensions are available (complete) for every data item. However, this assumption does not always hold, particularly for multidimensional databases, as some values may be missing. Incomplete data causes the skyline technique to lose its transitivity property, and dominance tests fail because some data items are incomparable with each other. Furthermore, missing values negatively affect the process of finding skylines, leading to high overhead due to exhaustive pairwise comparisons between the data items. The problem becomes more complicated when multiple tables with incomplete data must be accessed to determine the skylines. Because the tables of a distributed database are spread over various locations, a join operation is needed to identify the skylines, and joining without any filtering results in a huge amount of data to be joined. Moreover, most previous work on skyline queries in incomplete databases focused only on retrieving skylines without estimating the missing values. In other words, the derived skylines have missing values in one or more dimensions, although in many cases users are concerned about the values in these dimensions. This thesis proposes an efficient approach for identifying skylines in an incomplete database. The approach uses data clustering to partition the initial database into a set of distinct clusters. The derived clusters are then further divided into smaller groups, and the local skylines of each cluster are identified. Next, a set of virtual skylines called k-dom, derived from the local skylines, is merged to derive a global k-dom skyline, which is inserted at the top of each cluster to identify the candidate skylines. The final skylines are retrieved after conducting pairwise comparisons among the candidate skylines. The approach is extended to process skyline queries in incomplete distributed databases by pruning the input relations before conducting the join and skyline operations. The thesis also proposes an approach to estimate the missing values in the skylines. This approach mines attribute correlations to generate approximate functional dependencies (AFDs) that capture the relationships between dimensions. In addition, the strength of the probability correlations between dimensions is computed in order to estimate the values. The skylines are then ranked according to the confidence of the generated AFD and the strength of the probability correlations of the dimensions. Several experiments on synthetic and real datasets have been conducted. The results show that our proposed approach for processing skyline queries in an incomplete database reduced the number of pairwise comparisons by 75% to 93% and the processing time by 50% to 89% compared to the previous approach. The approach for processing skyline queries in incomplete distributed databases achieved a 56% to 88% reduction in processing time and an 84% to 90% reduction in network cost compared to the previous approach. Lastly, the results for imputing the missing values of the skylines show that our approach achieved a 25% error rate between the real missing values and the estimated values of the skylines.
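The loss of transitivity mentioned in the description can be illustrated with a minimal Python sketch. It is not the thesis's algorithm. It assumes a common definition of dominance for incomplete data (compare two items only on the dimensions where both have a value, with larger values preferred), and the names `dominates`, `naive_skyline` and the three sample items are hypothetical.

```python
from typing import List, Optional, Sequence

# An item is a sequence of dimension values; None marks a missing value.
Item = Sequence[Optional[float]]


def dominates(a: Item, b: Item) -> bool:
    """Return True if a dominates b on the dimensions both items have.

    Assumptions (not taken from the thesis): larger values are preferred,
    and items with no common non-missing dimension are incomparable.
    """
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return False
    no_worse = all(x >= y for x, y in common)
    strictly_better = any(x > y for x, y in common)
    return no_worse and strictly_better


def naive_skyline(items: Sequence[Item]) -> List[Item]:
    """Exhaustive pairwise comparison: keep items that no other item dominates."""
    return [
        p for p in items
        if not any(dominates(q, p) for q in items if q is not p)
    ]


# Hypothetical three-dimensional items with missing values.
a = (5, None, 2)
b = (None, 4, 3)
c = (4, 6, None)

# a dominates c and c dominates b, yet a does not dominate b:
# the dominance relation is no longer transitive.
print(dominates(a, c), dominates(c, b), dominates(a, b))

# b also dominates a, so dominance is cyclic and every item is dominated.
print(naive_skyline([a, b, c]))
```

Under these assumptions the three items dominate each other in a cycle, so an item cannot be discarded safely just because it is dominated. This is why the naive approach falls back on exhaustive pairwise comparisons, the overhead that the clustering and k-dom steps in the description aim to reduce.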
first_indexed 2024-03-06T08:54:22Z
format Thesis
id upm.eprints-43004
institution Universiti Putra Malaysia
language English
last_indexed 2024-03-06T08:54:22Z
publishDate 2013
record_format dspace
spelling upm.eprints-43004 2016-07-12T02:37:51Z http://psasir.upm.edu.my/id/eprint/43004/ Processing skyline queries in centralised and distributed incomplete databases Alwan, Ali Amer 2013-06 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/43004/1/FSKTM%202013%207R.pdf Alwan, Ali Amer (2013) Processing skyline queries in centralised and distributed incomplete databases. PhD thesis, Universiti Putra Malaysia. Database management Querying (Computer science) Data mining
title Processing skyline queries in centralised and distributed incomplete databases
topic Database management
Querying (Computer science)
Data mining