Patent application number | Description | Published |
20080201657 | SCALABLE PROPERTY VIEWER FOR A MASSIVELY PARALLEL COMPUTER SYSTEM - A method and apparatus for a scalable property viewer for a massively parallel computer system. The property viewer includes a graphical user interface (GUI) to allow the user to view different properties of the computer system with several different types of views. The different views provide the user with both logical and graphical representations of the properties being monitored and allow the user to link between a logical and physical view of the system. The GUI provides the user with a convenient way to view the elements of a large system and determine elements that are different. Different properties could be placed together in the same view with different colors to allow the user to see the interaction of multiple properties. | 08-21-2008 |
20080215916 | TEMPLATE BASED PARALLEL CHECKPOINTING IN A MASSIVELY PARALLEL COMPUTER SYSTEM - A method and apparatus for a template based parallel checkpoint save for a massively parallel super computer system using a parallel variation of the rsync protocol, and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a template checkpoint file that resides in the storage and that was previously produced. Embodiments herein greatly decrease the amount of data that must be transmitted and stored for faster checkpointing and increased efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks with their corresponding checksums from all nodes in the system. The data blocks may be compressed using conventional non-lossy data compression algorithms to further reduce the overall checkpoint size. | 09-04-2008 |
20080270852 | MULTI-DIRECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method check for nodal faults in a group of nodes comprising a center node and all adjacent nodes. The center node concurrently communicates with the immediately adjacent nodes in three dimensions. The communications are analyzed to determine a presence of a faulty node or connection. | 10-30-2008 |
20080288820 | MULTI-DIRECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method check for nodal faults in a group of nodes comprising a center node and all adjacent nodes. The center node concurrently communicates with the immediately adjacent nodes in three dimensions. The communications are analyzed to determine a presence of a faulty node or connection. | 11-20-2008 |
20080313506 | BISECTIONAL FAULT DETECTION SYSTEM - An apparatus and program product logically divide a group of nodes and cause node pairs comprising a node from each section to communicate. Results from the communications may be analyzed to determine performance characteristics, such as bandwidth and proper connectivity. | 12-18-2008 |
20080320329 | ROW FAULT DETECTION SYSTEM - An apparatus and program product check for nodal faults in a row of nodes by causing each node in the row to concurrently communicate with its adjacent neighbor nodes in the row. The communications are analyzed to determine a presence of a faulty node or connection. | 12-25-2008 |
20080320330 | ROW FAULT DETECTION SYSTEM - An apparatus, program product and method check for nodal faults in a row of nodes by causing each node in the row to concurrently communicate with its adjacent neighbor nodes in the row. The communications are analyzed to determine a presence of a faulty node or connection. | 12-25-2008 |
20090037376 | DATABASE RETRIEVAL WITH A UNIQUE KEY SEARCH ON A PARALLEL COMPUTER SYSTEM - An apparatus and method retrieve a database record from an in-memory database of a parallel computer system using a unique key. The parallel computer system performs a simultaneous search on each node of the computer system using the unique key and then utilizes a global combining network to combine the results from the searches of each node to efficiently and quickly search the entire database. | 02-05-2009 |
20090037377 | DATABASE RETRIEVAL WITH A NON-UNIQUE KEY ON A PARALLEL COMPUTER SYSTEM - An apparatus and method retrieve a database record from an in-memory database of a parallel computer system using a non-unique key. The parallel computer system performs a simultaneous search on each node of the computer system using the non-unique key and then utilizes a global combining network to combine the local results from the searches of each node to efficiently and quickly search the entire database. | 02-05-2009 |
20090044052 | CELL BOUNDARY FAULT DETECTION SYSTEM - An apparatus and program product determine a nodal fault along the boundary, or face, of a computing cell. Nodes on adjacent cell boundaries communicate with each other, and the communications are analyzed to determine if a node or connection is faulty. | 02-12-2009 |
20090067334 | MECHANISM FOR PROCESS MIGRATION ON A MASSIVELY PARALLEL COMPUTER - Embodiments of the invention provide a mechanism for process migration on a massively parallel computer system. In particular, embodiments of the invention may be used to update process state data for a migrated compute node, such as MPI (or other communication library) state data, across a full collection of compute nodes present in a given parallel system executing a parallel task. Migrating a process from one compute node to another may be useful to address a variety of sub-optimal operating conditions. For example, one or more processes may be migrated to cure network congestion resulting from a poorly mapped task or when a compute node is predicted to experience a hardware failure. | 03-12-2009 |
20090178053 | DISTRIBUTED SCHEMES FOR DEPLOYING AN APPLICATION IN A LARGE PARALLEL SYSTEM - Embodiments of the invention provide a method for deploying and running an application on a massively parallel computer system, while minimizing the costs associated with latency, bandwidth, and limited memory resources. The executable code of a program may be divided into multiple code fragments and distributed to different compute nodes of a parallel computing system. During program execution, one compute node may fetch code fragments from other compute nodes as necessary. | 07-09-2009 |
20090187984 | DATASPACE PROTECTION UTILIZING VIRTUAL PRIVATE NETWORKS ON A MULTI-NODE COMPUTER SYSTEM - A method and apparatus provide data security on a parallel computer system using virtual private networks. An access setup mechanism sets up access control data in the nodes that describes which virtual networks are protected and what applications have access to the protected private networks. When an application accesses data on a protected virtual network, a network access mechanism determines the data is protected and intercepts the data access. The network access mechanism in the kernel may also execute a rule depending on the kind of access that was attempted to the virtual network. Authorized access to the private networks can be made via a system call to the access control mechanism in the kernel. The access control mechanism enforces policy decisions on which data can be distributed through the system via an access control list or other security policies. | 07-23-2009 |
20100185718 | PERFORMING PROCESS MIGRATION WITH ALLREDUCE OPERATIONS - Compute nodes perform allreduce operations that swap processes at nodes. A first allreduce operation generates a first result and uses a first process from a first compute node, a second process from a second compute node, and zeros from other compute nodes. The first compute node replaces the first process with the first result. A second allreduce operation generates a second result and uses the first result from the first compute node, the second process from the second compute node, and zeros from others. The second compute node replaces the second process with the second result, which is the first process. A third allreduce operation generates a third result and uses the first result from the first compute node, the second result from the second compute node, and zeros from others. The first compute node replaces the first result with the third result, which is the second process. | 07-22-2010 |
20100318835 | BISECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method logically divide a group of nodes and cause node pairs comprising a node from each section to communicate. Results from the communications may be analyzed to determine performance characteristics, such as bandwidth and proper connectivity. | 12-16-2010 |
20110191633 | PARALLEL DEBUGGING IN A MASSIVELY PARALLEL COMPUTING SYSTEM - A method and apparatus are described for parallel debugging on the data nodes of a parallel computer system. A data template associated with the debugger can be used as a reference to the common data on the nodes. The application or data contained on the compute nodes diverges from the data template at the service node during the course of program execution, so that pieces of the data are different at each of the nodes at some time of interest. For debugging, the compute nodes search their own memory image for checksum matches with the template and produce new data blocks with checksums that did not exist in the data template, and a template of references to the original data blocks in the template. Examples herein include an application of the rsync protocol, compression and network broadcast to improve debugging in a massively parallel computer environment. | 08-04-2011 |
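
The three-step swap described in application 20100185718 can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions: the abstract does not name the reduction operator, so bitwise XOR is assumed here because it makes the stated results hold (the second result equals the first process, the third equals the second process). Process images are modeled as plain integers, and the names `allreduce_xor` and `swap_processes` are hypothetical, not taken from the application.

```python
from functools import reduce


def allreduce_xor(contributions):
    """Simulate an allreduce: combine every node's contribution with XOR.

    XOR is an assumption; the abstract does not name the operator.
    """
    return reduce(lambda a, b: a ^ b, contributions, 0)


def swap_processes(images, first, second):
    """Swap the process images held by nodes `first` and `second`.

    `images` is a list with one integer "process image" per node.
    Nodes other than `first` and `second` always contribute zero.
    """
    nodes = list(images)

    def contributions():
        # The two participating nodes send what they currently hold;
        # every other node sends zero.
        return [
            value if index in (first, second) else 0
            for index, value in enumerate(nodes)
        ]

    # First allreduce: the first node keeps the result (P1 xor P2).
    nodes[first] = allreduce_xor(contributions())
    # Second allreduce: the second node keeps the result, which is P1.
    nodes[second] = allreduce_xor(contributions())
    # Third allreduce: the first node keeps the result, which is P2.
    nodes[first] = allreduce_xor(contributions())
    return nodes


if __name__ == "__main__":
    before = [0b1010, 0b0110, 0b1111]
    after = swap_processes(before, 0, 1)
    print(before, after)
```

In this model the swap needs no temporary buffer on either node, which is the usual appeal of an XOR-style exchange; whether the application relies on that property is not stated in the abstract.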