Performance analysis of data replication and scheduling in data grid

Bibliographic Details
Main Author: Zhang, Junwei
Other Authors: Lee Bu Sung, Francis
Format: Thesis
Language: English
Published: 2010
Online Access: https://hdl.handle.net/10356/38584
Description
Summary: The Grid is an infrastructure that enables dynamic sharing of and coordinated access to resources among different organizations. As a specialization and extension of the Grid, the Data Grid emphasizes the sharing of large-scale data sets and data storage resources. It has evolved into the solution for data-intensive applications such as global climate change, High Energy Physics (HEP), astrophysics, and computational genomics. In these research domains, the size of scientific data is measured in terabytes (1,024 gigabytes) or even petabytes (1,024 terabytes). Such scientific data are stored as large files and replicated across the Data Grid. Scientists located all over the world can download these datasets and analyze them for various purposes. The Hierarchical Data Grid is a class of Data Grid that has been adopted by the European Organization for Nuclear Research (CERN) to support the distribution of large experimental datasets across the globe. There has been a substantial amount of research on replication algorithms for the Hierarchical Data Grid. I have developed a probabilistic model of data replication in a Hierarchical Data Grid environment. The model enables us to evaluate the optimality of a replication algorithm in terms of average response time and average bandwidth cost. The accuracy of the model is verified through simulation.
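
To illustrate the kind of probabilistic evaluation the summary describes, the following Python sketch models a toy three-tier Hierarchical Data Grid in which a request is served by the lowest tier holding a replica. The tier names, hit probabilities, latencies, and bandwidth costs are illustrative assumptions for this sketch, not the model or parameters developed in the thesis.

# Hypothetical sketch of a probabilistic replication model for a three-tier
# Hierarchical Data Grid (leaf site -> regional centre -> root archive).
# All numbers below are assumed values, not taken from the thesis.

# Probability that the requested file has a replica at each tier,
# checked bottom-up; the root is assumed to hold every file.
hit_prob = {"leaf": 0.3, "regional": 0.5, "root": 1.0}

# Assumed transfer latency (seconds) and bandwidth cost (arbitrary units)
# for delivering a file from each tier down to the requesting site.
latency = {"leaf": 0.1, "regional": 1.0, "root": 5.0}
bw_cost = {"leaf": 1.0, "regional": 10.0, "root": 50.0}


def expected_metrics():
    """Return (average response time, average bandwidth cost) per request."""
    avg_time, avg_bw = 0.0, 0.0
    prob_not_served_yet = 1.0  # probability lower tiers missed the request
    for tier in ("leaf", "regional", "root"):
        p_served_here = prob_not_served_yet * hit_prob[tier]
        avg_time += p_served_here * latency[tier]
        avg_bw += p_served_here * bw_cost[tier]
        prob_not_served_yet *= 1.0 - hit_prob[tier]
    return avg_time, avg_bw


if __name__ == "__main__":
    t, b = expected_metrics()
    print(f"average response time: {t:.3f} s, average bandwidth cost: {b:.1f}")

Under these assumed probabilities the request is served at the leaf with probability 0.3, at the regional centre with probability 0.35, and at the root with probability 0.35; a replication algorithm that raises the lower-tier hit probabilities lowers both averages, which is the trade-off such a model lets one quantify.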