IBM BLUE GENE TECHNOLOGY
A SEMINAR REPORT
Blue Gene is a massively parallel computer being developed at the IBM Thomas J. Watson Research Center. Blue Gene represents a hundred-fold improvement in performance compared with the fastest supercomputers of today. It will achieve 1 petaFLOPS through unprecedented levels of parallelism, in excess of 4,000,000 threads of execution. The Blue Gene project has two important goals: to advance the understanding of biologically important processes, and to advance knowledge of cellular architectures (massively parallel systems built of single-chip cells that integrate processors, memory and communication) and of the software needed to exploit them effectively. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver a target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.
In November 2001 IBM announced a partnership with Lawrence Livermore National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node machine designed around embedded PowerPC processors. Through the use of system-on-a-chip integration coupled with a highly scalable cellular architecture, Blue Gene/L will deliver 180 or 360 teraFLOPS of peak computing power, depending on the utilization mode. Blue Gene/L represents a new level of scalability for parallel systems. Whereas existing large-scale systems range in size from hundreds to a few thousands of compute nodes, Blue Gene/L makes a jump of almost two orders of magnitude. Several techniques have been proposed for building such a powerful machine. Some of the designs call for extremely powerful (100 GFLOPS) processors based on superconducting technology. The class of designs that we focus on uses current and foreseeable CMOS technology. It is reasonably clear that such machines, in the near future at least, will require a departure from the architectures of the current parallel supercomputers, which use a few thousand commodity microprocessors. With the current technology, it would take around a million microprocessors to achieve petaFLOPS performance. Clearly, power requirements and cost considerations alone preclude this option. The class of machines of interest to us uses a processors-in-memory design: the basic building block is a single chip that includes multiple processors as well as memory and interconnection routing logic. On such machines, the ratio of memory to processors will be substantially lower than the prevalent one. As the technology is assumed to be the current generation one, the number of processors will still have to be close to a million, but the number of chips will be much lower. Using such a design, petaFLOPS performance will be reached within the next 2-3 years, especially since IBM has announced the Blue Gene project aimed at building such a machine.
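The processor counts behind this argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the figure of about 1 GFLOPS per conventional CMOS processor is an assumption inferred from the "around a million microprocessors" estimate, while the 100 GFLOPS figure for superconducting processors is the one quoted above. The function name is hypothetical.

```python
PETAFLOPS = 10**15  # 1 petaFLOPS, in floating-point operations per second


def processors_needed(target_flops: int, per_processor_flops: int) -> int:
    """Smallest number of processors whose combined peak reaches target_flops."""
    return -(-target_flops // per_processor_flops)  # integer ceiling division


# Assumption: roughly 1 GFLOPS peak per conventional CMOS microprocessor.
cmos_count = processors_needed(PETAFLOPS, 10**9)

# From the text: 100 GFLOPS per superconducting processor.
superconducting_count = processors_needed(PETAFLOPS, 100 * 10**9)

print(f"CMOS processors needed:            {cmos_count:,}")
print(f"Superconducting processors needed: {superconducting_count:,}")
```

Under these assumptions the CMOS route needs about a million processors, which is why the design packs several processors onto each chip to keep the chip count manageable.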
The system software for Blue Gene/L is a combination of standard and custom solutions. The software architecture for the machine is divided into three functional entities arranged hierarchically: a computational core, a control infrastructure and a service infrastructure. The I/O nodes (part of the control infrastructure) execute a version of the Linux kernel and are the primary off-load engine for most system services. No user code directly executes on the I/O nodes.
The basic building block of Blue Gene/L is a custom system-on-a-chip that integrates processors, memory and communications logic in the same piece of silicon. The BG/L chip contains two standard 32-bit embedded PowerPC 440 cores, each with private 32 KB L1 instruction and 32 KB L1 data caches. The L2 caches act as prefetch buffers for the L3 cache. Each core drives a custom 128-bit double FPU that can perform four double-precision floating-point operations per cycle. This custom FPU consists of two conventional FPUs joined together, each having a 64-bit register file with 32 registers. One of the conventional FPUs (the primary side) is compatible with the standard PowerPC floating-point instruction set. In most scenarios, only one of the 440 cores is dedicated to running user applications while the second processor drives the networks. At a target speed of 700 MHz the peak performance of a node in this mode is 2.8 GFlop/s. When both cores and FPUs in a chip are used, the peak performance per node is 5.6 GFlop/s. Because the L1 caches of the two cores are not kept coherent by hardware, BG/L provides a variety of synchronization devices in the chip: the lockbox, shared SRAM, the L3 scratchpad and the blind device. The lockbox unit contains a limited number of memory locations for fast atomic test-and-set operations and barriers. 16 KB of SRAM in the chip can be used to exchange data between the cores, and regions of the EDRAM L3 cache can be reserved as an addressable scratchpad. The blind device permits explicit cache management.
The low power characteristics of Blue Gene/L permit very dense packaging. Two nodes share a compute card that also contains SDRAM-DDR memory. Each node supports a maximum of 2 GB of external memory, but in the current configuration each node directly addresses 256 MB at 5.5 GB/s bandwidth with a 75-cycle latency. Sixteen compute cards can be plugged into a node board. A cabinet with two midplanes contains 32 node boards, for a total of 2048 CPUs and a peak performance of 2.9/5.7 TFlop/s.
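The node and cabinet figures above follow directly from the clock speed, the FPU width and the packaging hierarchy. The following is a minimal sketch of that arithmetic using only numbers quoted in the text; the constant and function names are hypothetical.

```python
CLOCK_HZ = 700_000_000   # target clock speed: 700 MHz
FLOPS_PER_CYCLE = 4      # double FPU: four double-precision operations per cycle
CORES_PER_NODE = 2       # two PowerPC 440 cores per chip

NODES_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32


def node_peak_flops(compute_cores: int) -> int:
    """Peak flop/s of one node when `compute_cores` cores run user code."""
    return CLOCK_HZ * FLOPS_PER_CYCLE * compute_cores


nodes_per_cabinet = (
    NODES_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_CABINET
)
cpus_per_cabinet = nodes_per_cabinet * CORES_PER_NODE


def cabinet_peak_flops(compute_cores: int) -> int:
    """Peak flop/s of one cabinet for the given number of compute cores per node."""
    return nodes_per_cabinet * node_peak_flops(compute_cores)


for cores in (1, 2):
    print(
        f"{cores} compute core(s): "
        f"node {node_peak_flops(cores) / 1e9:.1f} GFlop/s, "
        f"cabinet {cabinet_peak_flops(cores) / 1e12:.1f} TFlop/s"
    )
print(f"{nodes_per_cabinet} nodes and {cpus_per_cabinet} CPUs per cabinet")
```

The two modes correspond to one core computing while the other drives the networks, and to both cores computing.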
The complete system has 64 cabinets and 16 TB of memory. In addition to the 64K compute nodes, BG/L contains a number of I/O nodes (1024 in the current design). Compute nodes and I/O nodes are physically identical, although I/O nodes are likely to contain more memory.
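The system-level totals can be cross-checked against the per-node figures given earlier (256 MB and 2.8 or 5.6 GFlop/s per node). The sketch below is only an illustrative consistency check with hypothetical names; note that the computed peaks come out slightly above the nominal 180 and 360 teraFLOPS quoted in the report.

```python
COMPUTE_NODES = 65_536
CABINETS = 64
IO_NODES = 1024
MEMORY_PER_NODE_MB = 256            # directly addressed memory per node
NODE_PEAK_MFLOPS = (2_800, 5_600)   # one compute core, both cores

nodes_per_cabinet = COMPUTE_NODES // CABINETS
compute_nodes_per_io_node = COMPUTE_NODES // IO_NODES

# 1 TB is taken as 1024 * 1024 MB here.
total_memory_tb = COMPUTE_NODES * MEMORY_PER_NODE_MB // (1024 * 1024)

# System peak in Mflop/s for each utilization mode.
system_peak_mflops = [COMPUTE_NODES * peak for peak in NODE_PEAK_MFLOPS]

print(f"nodes per cabinet:          {nodes_per_cabinet}")
print(f"compute nodes per I/O node: {compute_nodes_per_io_node}")
print(f"total memory:               {total_memory_tb} TB")
for peak in system_peak_mflops:
    print(f"system peak:                {peak / 1e6:.1f} TFlop/s")
```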
Networks and communication hardware
The BG/L ASIC supports five different networks.
Hardware and hardware technologies
Hardware technologies for achieving petaflop/s computing performance can be grouped into five main categories: conventional technologies; processing-in-memory (PIM) designs; designs based on superconducting processor technology; special-purpose hardware designs; and schemes that use the aggregate computing power of Web-distributed processors. We consider only technologies that are currently available or that are expected to be available in the near future. Thus, we don't discuss designs based on speculative technologies, such as quantum [6] or macromolecular [7] computing, although they might be important in the long run.
Special-purpose hardware
Researchers on the Grape project at the University of Tokyo have designed and built a family of special-purpose attached processors for performing the gravitational force computations that form the inner loop of N-body simulation problems. The computational astrophysics community has extensively used Grape processors for N-body gravitational simulations. A Grape-4 system consisting of 1,692 processors, each with a peak speed of 640 Mflop/s, was completed in June 1995 and has a peak speed of 1.08 Tflop/s [14]. In 1999, a Grape-5 system won a Gordon Bell Award in the performance per dollar category, achieving $7.3 per Mflop/s on a tree-code astrophysical simulation. A Grape-6 system is planned for completion in 2001 and is expected to have a performance of about 200 Tflop/s. A 1-Pflop/s Grape system is planned for completion by 2003.
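As a quick check, the Grape-4 aggregate peak quoted above is simply the processor count multiplied by the per-processor peak. The variable names in this small sketch are hypothetical.

```python
GRAPE4_PROCESSORS = 1_692
GRAPE4_PEAK_MFLOPS_PER_PROCESSOR = 640

# Aggregate peak in Mflop/s, then converted to Tflop/s (1 Tflop/s = 1e6 Mflop/s).
grape4_peak_mflops = GRAPE4_PROCESSORS * GRAPE4_PEAK_MFLOPS_PER_PROCESSOR
grape4_peak_tflops = grape4_peak_mflops / 1e6

print(f"Grape-4 peak: {grape4_peak_tflops:.2f} Tflop/s")
```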