RE: flexible rollback recovery in dynamic heterogeneous grid computing
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small, and the maximum work lost by a crashed process is small and bounded.
1.1 ABOUT THE PROJECT
Grid and cluster architectures have gained popularity for computationally intensive parallel applications. However, the complexity of the infrastructure, consisting of computational nodes, mass storage, and interconnection networks, poses great challenges with respect to overall system reliability. Simple tools of reliability analysis show that as the complexity of the system increases, its reliability, and thus its Mean Time to Failure (MTTF), decreases. If one models the system as a series reliability block diagram, the reliability of the entire system is computed as the product of the reliabilities of all system components. For applications executing on large clusters or a Grid, e.g., Grid5000, the long execution times may exceed the MTTF of the infrastructure and thus render the execution infeasible. As an example, let us consider an execution lasting 10 days in a system that does not consider fault tolerance. Under the optimistic assumption that the MTTF of a single node is 2,000 days, the probability of failure of this long execution using 100, 200, or 500 nodes is 0.39, 0.63, or 0.91, respectively, rapidly approaching certain failure. The high failure probabilities are due to the fact that, in the absence of fault-tolerance mechanisms, the failure of a single node will cause the entire execution to fail. Note that this simple example does not even consider network failures, which are typically more likely than computer failures. Fault tolerance is thus a necessity to avoid failure in large applications, such as those found in scientific computing, executing on a Grid or large cluster.
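The failure probabilities in this example can be checked with a few lines of Python. This is only a sketch under one stated assumption: node lifetimes are taken to be exponentially distributed, which the text does not say explicitly. The function name failure_probability and its parameters are illustrative and are not taken from the paper.

```python
import math


def failure_probability(nodes, run_days, node_mttf_days):
    """Probability that at least one of `nodes` fails during the run."""
    # Assumption: exponential lifetimes, so a single node survives the
    # whole run with probability exp(-run_days / MTTF).
    node_reliability = math.exp(-run_days / node_mttf_days)
    # Series reliability block diagram: the system works only if every
    # node works, so system reliability is the product over all nodes.
    system_reliability = node_reliability ** nodes
    return 1.0 - system_reliability


if __name__ == "__main__":
    # The 10-day execution and 2,000-day node MTTF used in the example.
    for n in (100, 200, 500):
        print(n, "nodes:", round(failure_probability(n, 10, 2000), 3))
```

Under this assumption the results agree with the figures quoted above to within rounding. The sketch also makes the scaling visible: the exponent grows linearly with the number of nodes, so adding nodes pushes the failure probability toward 1.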
The fault-tolerance mechanisms also have to be capable of dealing with the specific characteristics of a heterogeneous and dynamic environment. Even if individual clusters are homogeneous, heterogeneity in a Grid is mostly unavoidable, since different participating clusters often use diverse hardware or software architectures. One possible solution to address heterogeneity is to use platform-independent abstractions such as the Java Virtual Machine. However, this does not solve the problem in general. There is a large base of existing applications that have been developed in other languages. Reengineering may not be feasible due to performance or cost reasons. Environments like Microsoft .Net address portability, but only a few scientific applications on Grids or clusters exist for them. Whereas Grids and clusters are dominated by Unix operating systems, e.g., Linux or Solaris, Microsoft .Net is Windows-centric with only recent or partial Unix support.
1.2 ORGANIZATION PROFILE
EdwareUK Ltd is an IT solution provider for a dynamic environment where business and technology strategies converge. Their approach focuses on new ways of doing business, combining IT innovation and adoption while also leveraging an organization’s current IT assets. They work with large global corporations on new products or services and help implement prudent business and technology strategies in today’s environment.
EdwareUK LTD’S RANGE OF EXPERTISE INCLUDES:
• Software Development Services
• Engineering Services
• Systems Integration
• Customer Relationship Management
• Product Development
• Electronic Commerce
• IT Outsourcing
We apply technology with innovation and responsibility to achieve two broad objectives:
• Effectively address the business issues our customers face today.
• Generate new opportunities that will help them stay ahead in the future.
THIS APPROACH RESTS ON:
• A strategy where we architect, integrate and manage technology services and solutions - we call it AIM for success.
• A robust offshore development methodology and reduced demand on customer resources.
• A focus on the use of reusable frameworks to provide cost and time benefits.
They combine the best people, processes and technology to achieve excellent results consistently. We offer customers the advantages of:
They understand the importance of timing, of getting there before the competition. A rich portfolio of reusable, modular frameworks helps jump-start projects. Tried and tested methodology ensures that we follow a predictable, low-risk path to achieve results. Our track record is testimony to complex projects delivered within and even before schedule.
Our teams combine cutting edge technology skills with rich domain expertise. What’s equally important - they share a strong customer orientation that means they actually start by listening to the customer. They’re focused on coming up with solutions that serve customer requirements today and anticipate future needs.
A FULL SERVICE PORTFOLIO:
They offer customers the advantage of being able to architect, integrate and manage technology services. This means that they can rely on one fully accountable source instead of trying to integrate disparate multi-vendor solutions.
EdwareUK LTD provides its services to companies in fields such as production and quality control. With their rich expertise and experience in information technology, they are in the best position to provide software solutions to distinct business requirements.