Lakshminarayana B. Arimilli, Austin US

Lakshminarayana B. Arimilli, Austin, TX US

Patent application number	Description	Published
20090063443	System and Method for Dynamically Supporting Indirect Routing Within a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for dynamically routing data through the data processing system. Data is received at a first processor that is to be transmitted to a destination processor. The data that is received includes address information. A lookup is performed in routing table data structures based on the address information to identify candidate paths through which the data is routed to the destination processor. A determination is made as to whether any of the candidate paths are not able to be used to route the data to the destination processor based on a setting of at least one identifier. A path is selected from the identified candidate paths for routing of the data based on a setting of the at least one identifier. Then, the data is transmitted from the first processor along the selected path toward the destination processor.	03-05-2009
20090063444	System and Method for Providing Multiple Redundant Direct Routes Between Supernodes of a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for selecting, from a plurality of routes through the data processing system, a direct route for transmitting data. Data that includes address information is received at a first processor that is to be transmitted to a destination processor. Using routing table data structures, direct route entries are identified that correspond to direct routes for transmitting data. An accessed priority table data structure comprises a priority entry for each entry in the routing table data structures. The priority entry specifies a priority of a corresponding entry in the routing table data structures. A direct route entry is selected that corresponds to a direct route from the routing table data structures, based on specified priorities. Then the data is transmitted from the first processor to the destination processor using a path corresponding to the selected direct route entry.	03-05-2009
20090063445	System and Method for Handling Indirect Routing of Information Between Supernodes of a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for selecting, from a plurality of routes through the data processing system, an indirect route for transmitting data. Data that includes address information is received at a first processor that is to be transmitted to a destination processor. Using routing table data structures, indirect route entries are identified that correspond to indirect routes for transmitting data. An accessed priority table data structure comprises a priority entry for each entry in the routing table data structures. The priority entry specifies a priority of a corresponding entry in the routing table data structures. An indirect route entry is selected that corresponds to an indirect route from the routing table data structures, based on specified priorities. Then the data is transmitted from the first processor to the destination processor using a path corresponding to the selected indirect route entry.	03-05-2009
20090063728	System and Method for Direct/Indirect Transmission of Information Using a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for transmitting data in a data network. A first processor of the data network receives data to be transmitted to a second processor within the data network. A determination is made if the data has previously been routed through an indirect communication link from a source processor, the indirect communication link being a communication link that does not directly couple the source processor to a final destination processor which is to receive the data. A communication link is selected over which to transmit the data from the first processor to the second processor based on results of determining if the data has previously been routed through an indirect communication link. Finally, the data is transmitted from the first processor to the second processor using the selected communication link.	03-05-2009
20090063811	System for Data Processing Using a Multi-Tiered Full-Graph Interconnect Architecture - A system is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted.	03-05-2009
20090063814	System and Method for Routing Information Through a Data Processing System Implementing a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for routing information through the data processing system. Data is received at a source processor within a set of processors that is to be transmitted to a destination processor, where the data includes address information. A first determination is performed as to whether the destination processor is within a same processor book as the source processor based on the address information. A second determination is performed as to whether the destination processor is within a same supernode as the source processor based on the address information if the destination processor is not within the same processor book. A routing path is identified for the data based on results of the first determination, the second determination, and one or more routing table data structures. The data is then transmitted from the source processor along the identified routing path toward the destination processor.	03-05-2009
20090063815	System and Method for Providing Full Hardware Support of Collective Operations in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for performing collective operations. In hardware of a parent processor in a first processor book, a number of other processors are determined in a same or different processor book of the data processing system that is needed to execute the collective operation, thereby establishing a plurality of processors comprising the parent processor and the other processors. In hardware of the parent processor, the plurality of processors are logically arranged as a plurality of nodes in a hierarchical structure. The collective operation is transmitted to the plurality of processors based on the hierarchical structure. In hardware of the parent processor, results are received from the execution of the collective operation from the other processors, a final result is generated of the collective operation based on the received results, and the final result is output.	03-05-2009
20090063816	System and Method for Performing Collective Operations Using Software Setup and Partial Software Execution at Leaf Nodes in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for performing collective operations. In software executing on a parent processor in a first processor book, a number of other processors are determined in a same or different processor book of the data processing system that is needed to execute the collective operation, thereby establishing a plurality of processors comprising the parent processor and the other processors. In software executing on the parent processor, the plurality of processors are logically arranged as a plurality of nodes in a hierarchical structure. The collective operation is transmitted to the plurality of processors based on the hierarchical structure. In hardware of the parent processor, results are received from the execution of the collective operation from the other processors, a final result is generated of the collective operation based on the received results, and the final result is output.	03-05-2009
20090063817	System and Method for Packet Coalescing in Virtual Channels of a Data Processing System in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for packet coalescing in virtual channels of a data processing system. A first processor bundles original data to be transmitted to a destination processor, the original data provided by a first source processor. The first processor transmits the bundle of data to a second processor along a path to the destination processor. The second processor determines if the second processor has additional data destined for the same destination processor, the additional data being provided by a second source processor that is different from the first source processor. Responsive to the second processor having additional data, the second processor unbundles the original data, adds the additional data to the original data, and rebundles the data along with the additional data. Then the second processor transmits the rebundled data to at least one other processor along the path to the destination processor.	03-05-2009
20090063880	System and Method for Providing a High-Speed Message Passing Interface for Barrier Operations in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided performing a Message Passing Interface (MPI) job. A first processor chip receives a set of arrival signals from a set of processor chips executing tasks of the MPI job in the data processing system. The arrival signals identify when a processor chip executes a synchronization operation for synchronizing the tasks for the MPI job. Responsive to receiving the set of arrival signals from the set of processor chips, the first processor chip identifies a fastest processor chip of the set of processor chips whose arrival signal arrived first. An operation of the fastest processor chip is modified based on the identification of the fastest processor chip. The set of processor chips comprises processor chips that are in one of a same processor book or a different processor book of the data processing system.	03-05-2009
20090063885	System and Computer Program Product for Modifying an Operation of One or More Processors Executing Message Passing Interface Tasks - A system and computer program product for modifying an operation of one or more processors executing message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors.	03-05-2009
20090063886	System for Providing a Cluster-Wide System Clock in a Multi-Tiered Full-Graph Interconnect Architecture - A system for providing a cluster-wide system clock in a multi-tiered full graph (MTFG) interconnect architecture are provided. Heartbeat signals transmitted by each of the processor chips in the computing cluster are synchronized. Internal system clock signals are generated in each of the processor chips based on the synchronized heartbeat signals. As a result, the internal system clock signals of each of the processor chips are synchronized since the heartbeat signals, that are the basis for the internal system clock signals, are synchronized. Mechanisms are provided for performing such synchronization using direct couplings of processor chips within the same processor book, different processor books in the same supernode, and different processor books in different supernodes of the MTFG interconnect architecture.	03-05-2009
20090063891	System and Method for Providing Reliability of Communication Between Supernodes of a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for providing reliability of communication. A first processor determines a current state of links coupled to ports of a first processor of the data processing system. Each port of the first processor comprises a plurality of links to a corresponding port on a second processor of the data processing system. The current state of the links indicates a level of error associated with each link. The first processor determines, for each link, if a level of error associated with the link exceeds a threshold. For each link whose level of error exceeds the threshold, the first processor tags the link with an error identifier in a switch associated with the ports of the first processor. The first processor reduces a level of usage for transmitting data on ports associated with links tagged with the error identifier.	03-05-2009
20090064139	Method for Data Processing Using a Multi-Tiered Full-Graph Interconnect Architecture - A method is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted.	03-05-2009
20090064140	System and Method for Providing a Fully Non-Blocking Switch in a Supernode of a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for transmitting data from a first processor of a data processing system to a second processor of the data processing system. In one or more switches, a set of virtual channels is created, the one or more switches comprising, for each processor, a corresponding switch in the one or more switches. The data is transmitted from the first processor to the second processor through a path comprising a subset of processors of a set of processors in the data processing system. In each processor of the subset of processors, the data is stored in a virtual channel of a corresponding switch before transmitting the data to a next processor. The virtual channel of the corresponding switch in which the data is stored corresponds to a position of the processor in the path through which the data is transmitted.	03-05-2009
20090064165	Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks - A method for providing hardware based dynamic load balancing of message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors.	03-05-2009
20090064166	System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks - A system and method for providing hardware based dynamic load balancing of message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors.	03-05-2009
20090064167	System and Method for Performing Setup Operations for Receiving Different Amounts of Data While Processors are Performing Message Passing Interface Tasks - A system and method are provided for performing setup operations for receiving a different amount of data while processors are performing message passing interface (MPI) tasks. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. An MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, setup operations may be performed while processors are performing MPI tasks to prepare for receiving different sized portions of data in a subsequent computation cycle based on the history.	03-05-2009
20090064168	System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks By Modifying Tasks - A system and method are provided for providing hardware based dynamic load balancing of message passing interface (MPI) tasks by modifying tasks. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. Thus, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors.	03-05-2009
20090070617	Method for Providing a Cluster-Wide System Clock in a Multi-Tiered Full-Graph Interconnect Architecture - A method for providing a cluster-wide system clock in a multi-tiered full graph (MTFG) interconnect architecture are provided. Heartbeat signals transmitted by each of the processor chips in the computing cluster are synchronized. Internal system clock signals are generated in each of the processor chips based on the synchronized heartbeat signals. As a result, the internal system clock signals of each of the processor chips are synchronized since the heartbeat signals, that are the basis for the internal system clock signals, are synchronized. Mechanisms are provided for performing such synchronization using direct couplings of processor chips within the same processor book, different processor books in the same supernode, and different processor books in different supernodes of the MTFG interconnect architecture.	03-12-2009
20090198695	Method and Apparatus for Supporting Distributed Computing Within a Multiprocessor System - A locking mechanism for supporting distributed computing within a multiprocessor system is disclosed. A lock control section and a stage control section are assigned to a data block within a system memory. In response to a request for accessing the data block by a processing unit, a determination is made by a memory controller whether or not the lock control section of the data block has been set. If the lock control section of the data block has been set, the access request is denied. Otherwise, if the lock control section of the data block has not been set, another determination is made whether or not a current processing stage of the requesting processing unit matches a processing stage indicated by the stage control section. If the current processing stage of the requesting processing unit does not match the processing stage indicated by the stage control section, the access request is denied; otherwise, the access request is allowed.	08-06-2009
20090198762	Mechanism to Provide Reliability Through Packet Drop Detection - A method and a data processing system for completing checkpoint processing of a distributed job with local tasks communicating with other remote tasks via a host fabric interface (HFI) and assigned HFI window. Each HFI window has a send count and a receive count, which tracks GSM messages that are sent from and received at the HFI window. When a checkpoint is initiated by a master task, each local task forwards the send count and the receive count to the master task. The master task sums the respective counts and then compares the totals to each other. When the send count total is equal to the receive count total, the tasks are permitted to continue processing. However, when the send count total is not equal to the receive count total, the master task notifies each task of the job to rollback to a previous checkpoint or kill the job execution.	08-06-2009
20090198849	Memory Lock Mechanism for a Multiprocessor System - A memory lock mechanism within a multi-processor system is disclosed. A lock control section is initially assigned to a data block within a system memory of the multiprocessor system. In response to a request for accessing the data block by a processing unit within the multiprocessor system, a determination is made by a memory controller whether or not the lock control section of the data block has been set. If the lock control section of the data block has been set, the request for accessing the data block is denied. Otherwise, if the lock control section of the data block has not been set, the lock control section of the data block is set, and the request for accessing the data block is allowed.	08-06-2009
20090198911	DATA PROCESSING SYSTEM, PROCESSOR AND METHOD FOR CLAIMING COHERENCY OWNERSHIP OF A PARTIAL CACHE LINE OF DATA - According to method of data processing in a multiprocessor data processing system, in response to a processor request to modify a target granule of a target cache line of data containing multiple granules, a processing unit originates on an interconnect of the multiprocessor data processing system a data-claim-partial request that requests permission to promote only the target granule of the target cache line to a unique copy with an intent to modify the target granule. In response to a combined response to the data-claim-partial request indicating success (the combined response representing a system-wide response to the data-claim-partial-request), the processing unit promotes only the target granule of the target cache line to a unique copy by updating a coherency state of the target granule and retaining a coherency state of at least one other granule of the target cache line.	08-06-2009
20090198912	DATA PROCESSING SYSTEM, PROCESSOR AND METHOD FOR IMPLEMENTING CACHE MANAGEMENT FOR PARTIAL CACHE LINE OPERATIONS - A method of data processing in a cache memory includes caching a plurality of cache lines of data in a corresponding plurality of entries in a cache array, where each of the plurality of cache lines includes multiple data granules. For each of the plurality of cache entries, a plurality of line coherency state fields indicates an associated coherency state applicable to two or more data granules. For at least a particular cache line among the plurality of cache lines, a granule coherency state field indicates a coherency state for a particular granule of the multiple data granules in the particular cache line, where the coherency state field indicated by the granule coherency state field differs from that indicated for the particular cache line by its line coherency state field.	08-06-2009
20090198914	DATA PROCESSING SYSTEM, PROCESSOR AND METHOD IN WHICH AN INTERCONNECT OPERATION INDICATES ACCEPTABILITY OF PARTIAL DATA DELIVERY - According to at least one embodiment, a method of data processing in a multiprocessor data processing system includes a requesting processing unit initiating an interconnect operation including a memory access request that indicates an acceptability of a variable amount of data to service the interconnect request for data. In response to snooping the memory access request on an interconnect, a snooper selects an amount of data to supply to the requesting processing unit and transmits the selected amount of data to the requesting processing unit. The requesting processing unit receives the selected amount of data and utilizes at least some of the selected amount of data to service a processor request.	08-06-2009
20090198915	DATA PROCESSING SYSTEM, PROCESSOR AND METHOD THAT DYNAMICALLY SELECT A MEMORY ACCESS SIZE - A method of data processing in a processing unit supported by a memory hierarchy includes the processing unit performing a plurality of memory accesses to the memory hierarchy. The plurality of memory accesses includes one or more memory accesses targeting a full cache line of data. The processing unit monitors utilization of data accessed by the plurality of memory accesses, and based upon the utilization of the data, dynamically alters a memory access mode of operation so that a subsequent storage-modifying memory access targets less than a full cache line of data.	08-06-2009
20090198916	Method and Apparatus for Supporting Low-Overhead Memory Locks Within a Multiprocessor System - A method for supporting low-overhead memory locks within a multi-processor system is disclosed. A lock control section is initially assigned to a data block within a system memory of the multiprocessor system. In response to a request for accessing the data block by a processing unit within the multiprocessor system, a determination is made by a memory controller whether or not the lock control section of the data block has been set. If the lock control section of the data block has been set, the request for accessing the data block is ignored. Otherwise, if the lock control section of the data block has not been set, the lock control section of the data block is set, and the request for accessing the data block is allowed.	08-06-2009
20090198918	Host Fabric Interface (HFI) to Perform Global Shared Memory (GSM) Operations - A data processing system enables global shared memory (GSM) operations across multiple nodes with a distributed EA-to-RA mapping of physical memory. Each node has a host fabric interface (HFI), which includes HFI windows that are assigned to at most one locally-executing task of a parallel job. The tasks perform parallel job execution, but map only a portion of the effective addresses (EAs) of the global address space to the local, real memory of the task's respective node. The HFI window tags all outgoing GSM operations (of the local task) with the job ID, and embeds the target node and HFI window IDs of the node at which the EA is memory mapped. The HFI window also enables processing of received GSM operations with valid EAs that are homed to the local real memory of the receiving node, while preventing processing of other received operations without a valid EA-to-RA local mapping.	08-06-2009
20090198920	Processing Units Within a Multiprocessor System Adapted to Support Memory Locks - A processing unit within a multiprocessor system adapted to support memory locks is disclosed. In response to a request for accessing a data block being denied when a lock control section of the data block has been set by a memory controller, a timer countdown is started within a processing unit within a multiprocessor system. The requesting processing unit can relinquish the access request once the timer countdown has reached a timeout period.	08-06-2009
20090198933	Method and Apparatus for Handling Multiple Memory Requests Within a Multiprocessor System - A method for handling multiple memory requests within a multi-processor system is disclosed. A lock control section is initially assigned to a data block within a system memory. In response to a request for accessing the data block by a processing unit, a determination is made whether or not the lock control section of the data block has been set. If the lock control section has been set, another determination is made whether or not the requesting processing unit is located beyond a predetermined distance from a memory controller. If the requesting processing unit is located beyond a predetermined distance from the memory controller, the requesting processing unit is invited to perform other functions; otherwise, the number of the requesting processing unit is placed in a queue table. However, if the lock control section has not been set, the lock control section of the data block is set, and the access request is allowed.	08-06-2009
20090198956	System and Method for Data Processing Using a Low-Cost Two-Tier Full-Graph Interconnect Architecture - A system and method are provided for implementing a two-tier full-graph interconnect architecture. In order to implement a two-tier full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the two-tier full-graph interconnect architecture. Data is then transmitted from one processor to another within the two-tier full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor chip identifier associated with a target processor to which the data is to be transmitted.	08-06-2009
20090198957	System and Method for Performing Dynamic Request Routing Based on Broadcast Queue Depths - A system and method for performing dynamic request routing based on broadcast depth queue information are provided. Each processor chip in the system may use a synchronized heartbeat signal it generates to provide queue depth information to each of the other processor chips in the system. The queue depth information identifies a number of requests or amount of data in each of the queues of a processor chip that originated the heartbeat signal. The queue depth information from each of the processor chips in the system may be used by the processor chips in determining optimal routing paths for data from a source processor chip to a destination processor chip. As a result, the congestion of data for processing at each of the processor chips along each possible routing path may be taken into account when selecting to which processor chip to forward data.	08-06-2009
20090198958	System and Method for Performing Dynamic Request Routing Based on Broadcast Source Request Information - A system and method for performing dynamic request routing based on broadcast source request information are provided. Each processor chip in the system may use a synchronized heartbeat signal it generates to provide source request information to each of the other processor chips in the system. The source request information identifies the number of active source requests sent by the processor chip that originated the heartbeat signal. The source request information from each of the processor chips in the system may be used by the processor chips in determining optimal routing paths for data from a source processor chip to a destination processor chip. As a result, the congestion of data for processing at each of the processor chips along each possible routing path may be taken into account when selecting to which processor chip to forward data.	08-06-2009
20090198960	DATA PROCESSING SYSTEM, PROCESSOR AND METHOD THAT SUPPORT PARTIAL CACHE LINE READS - According to a method of data processing in a multiprocessor data processing system, in response to a processor request to read a target granule of a target cache line of data containing multiple granules, a processing unit originates on an interconnect of the multiprocessor data processing system a partial read request that requests permission to read only the target granule of the target cache line. In response to a combined response to the partial read request indicating success, the combined response representing a system-wide response to the partial read request, the processing unit receives the target granule of the target cache line, supplies the target granule to a requesting processor core, and updates a coherency state of the target granule while retaining a coherency state of at least one other granule of the target cache line.	08-06-2009
20090198971	Heterogeneous Processing Elements - A heterogeneous processing element model is provided where I/O devices look and act like processors. In order to be treated like a processor, an I/O processing element, or other special purpose processing element, must follow some rules and have some characteristics of a processor, such as address translation, security, interrupt handling, and exception processing, for example. The heterogeneous processing element model puts special purpose processing elements on the same playing field as processors, from a programming perspective, operating system perspective, power perspective, as the processors. The operating system can get work to a security engine, for example, in the same way it does to a processor.	08-06-2009
20090199046	Mechanism to Perform Debugging of Global Shared Memory (GSM) Operations - A host fabric interface (HFI) enables debugging of global shared memory (GSM) operations received at a local node from a network fabric. The local node has a memory management unit (MMU), which provides an effective address to real address (EA-to-RA) translation table that is utilized by the HFI to evaluate when EAs of GSM operations/data from a received GSM packet is memory-mapped to RAs of the local memory. The HFI retrieves the EA associated with a GSM operation/data within a received GSM packet. The HFI forwards the EA to the MMU, which determines when the EA is mapped to RAs within the local memory for the local task. The HFI processing logic enables processing of the GSM packet only when the EA of the GSM operation/data within the GSM packet is an EA that has a local RA translation. Non-matching EAs result in an error condition that requires debugging.	08-06-2009
20090199182	Notification by Task of Completion of GSM Operations at Target Node - A method for providing global notification of completion of a global shared memory (GSM) operation during processing by a target task executing at a target node of a distributed system. The distributed system has at least one other node on which an initiating task that generated the GSM operation is homed. The target task receives the GSM operation from the initiating task, via a host fabric interface (HFI) window assigned to the target task. The task initiates execution of the GSM operation on the target node. The task detects completion of the execution of the GSM operation on the target node, and issues a global notification to at least the initiating task. The global notification indicates the completion of the execution of the GSM operation to one or more tasks of a single job distributed across multiple processing nodes.	08-06-2009
20090199191	Notification to Task of Completion of GSM Operations by Initiator Node - In a global shared memory (GSM) environment, a method provides local notification of completion of a global shared memory (GSM) operation processed by a first task executing at a local node of the distributed system. The system includes multiple nodes on which different tasks of a single job execute and perform GSM operations that are received from a second task via a via host fabric interface (HFI) and associated HFR window assigned to the first tasks. The local task initiates execution of a GSM operation on the local node. The task then monitors for and detects a completion of the execution of the GSM operation on the local node. When the task detects completion of the execution of the GSM operation, the task issues an internal notification to inform the locally-executing tasks of the completion of the GSM operation.	08-06-2009
20090199194	Mechanism to Prevent Illegal Access to Task Address Space by Unauthorized Tasks - A method and data processing system for tracking global shared memory (GSM) operations to and from a local node configured with a host fabric interface (HFI) coupled to a network fabric. During task/job initialization, the system OS assigns HFI window(s) to handle the GSM packet generation and GSM packet receipt and processing for each local task. HFI processing logic automatically tags each GSM packet generated by the HFI window with a global job identifier (ID) of the job to which the local task is affiliated. The job ID is embedded within each GSM packet placed on the network fabric. On receipt of a GSM packet from the network fabric, the HFI logic retrieves the embedded job ID and compares the embedded job ID with the ID within the HFI window(s). GSM packets are forwarded to an HFI window only when the embedded job ID matches the HFI window's job ID.	08-06-2009
20090199195	Generating and Issuing Global Shared Memory Operations Via a Send FIFO - A method for issuing global shared memory (GSM) operations from an originating task on a first node coupled to a network fabric of a distributed network via a host fabric interface (HFI). The originating task generates a GSM command within an effective address (EA) space. The task then places the GSM command within a send FIFO. The send FIFO is a portion of real memory having real addresses (RA) that are memory mapped to EAs of a globally executing job. The originating task maintains a local EA-to-RA mapping of only a portion of the real address space of the globally executing job. The task enables the HFI to retrieve the GSM command from the send FIFO into an HFI window allocated to the originating task. The HFI window generates a corresponding GSM packet containing GSM operations and/or data, and the HFI window issues the GSM packet to the network fabric.	08-06-2009
20090199200	Mechanisms to Order Global Shared Memory Operations - A method and data processing system for performing fence operations within a global shared memory (GSM) environment having a local task executing on a processor and providing GSM commands for processing by a host fabric interface (HFI) window that is allocated to the task. The HFI window has one or more registers for use during local fence operations. A first register tracks a first count of task-issued GSM commands, and a second register tracks a second count of GSM operations being processed by the HFI. The processing logic detects a locally-issued fence operation, and responds by performing a series of operations, including: automatically stopping the task from issuing additional GSM commands; monitoring for completion of all the task-issued GSM commands at the HFI; and triggering a resumption of issuance of GSM commands by the task when the completion of all previous task-issued GSM commands is registered by the HFI.	08-06-2009
20090199201	Mechanism to Provide Software Guaranteed Reliability for GSM Operations - In a global shared memory (GSM) environment, an initiating task at a first node with a host fabric interface (HFI) uses epochs to provide reliability of transmission of packets via a network fabric to a target task. The HFI generates a packet for the initiating task addressed to the target task, and automatically inserts a current epoch of the initiating task into the packet. A copy of the current epoch is maintained by the target task, which accepts for processing only packets having the correct epoch, unless the packet is tagged for guaranteed-once delivery. When a packet delivery is accepted, the target task sends a notification to the initiating task. If the initiating task does not receive the notification of delivery for the issued packet, the initiating task updates the epoch at both the target node and the initiating node and re-transmits the packet.	08-06-2009
20090199209	Mechanism for Guaranteeing Delivery of Multi-Packet GSM Message - A target task ensures complete delivery of a global shared memory (GSM) message from an originating task to the target task. The target task's HFI receives a first of multiple GSM packets generated from a single GSM message sent from the originating task. The HFI logic assigns a sequence number and corresponding tuple to track receipt of the complete GSM message. The sequence number is unique relative to other sequence numbers assigned to GSM messages that have not been completely received from the initiating task. The HFI updates a count value within the tuple, which comprises the sequence number and the count value for the first GSM packet and for each subsequent GSM packet received for the GSM message. The HFI determines when receipt of the GSM message is complete by comparing the count value with a count total retrieved from the packet header.	08-06-2009
20100070710	Techniques for Cache Injection in a Processor System - A technique for performing cache injection includes monitoring addresses on a bus. Ownership of input/output data on the bus is acquired by a cache when an address on the bus (that is associated with the input/output data) corresponds to an address of a data block stored in the cache.	03-18-2010
20100070711	Techniques for Cache Injection in a Processor System Using a Cache Injection Instruction - A technique for performing cache injection includes monitoring addresses on a bus in response to a cache injection instruction. Ownership of input/output data on the bus is acquired by a cache when an address on the bus (that is associated with the input/output data) corresponds to an address of a data block associated with the cache injection instruction.	03-18-2010
20100070712	Techniques for Cache Injection in a Processor System with Replacement Policy Position Modification - A technique for performing cache injection includes monitoring, at a cache, addresses on a bus. Ownership of input/output data on the bus is then acquired by the cache when an address on the bus (that is associated with the input/output data) corresponds to an address of a data block stored in the cache. A replacement policy position of the data block is then modified (to increase a probability that the data block is consumed prior to ejection from the cache).	03-18-2010
20100070717	Techniques for Cache Injection in a Processor System Responsive to a Specific Instruction Sequence - A technique for performing cache injection includes monitoring an instruction stream for a specific instruction sequence. Addresses on a bus are then monitored, at a cache, in response to detecting the specific instruction sequence a determined number of times. Ownership of input/output data on the bus is then acquired by the cache when an address on the bus (that is associated with the input/output data) corresponds to an address of a data block stored in the cache.	03-18-2010
20100268788	Remote Asynchronous Data Mover - A distributed data processing system executes multiple tasks within a parallel job, including a first local task on a local node and at least one task executing on a remote node, with a remote memory having real address (RA) locations mapped to one or more of the source effective addresses (EA) and destination EA of a data move operation initiated by a task executing on the local node. On initiation of the data move operation, remote asynchronous data move (RADM) logic identifies that the operation moves data to/from a first EA that is memory mapped to an RA of the remote memory. The local processor/RADM logic initiates a RADM operation that moves a copy of the data directly from/to the first remote memory by completing the RADM operation using the network interface cards (NICs) of the source and destination processing nodes, determined by accessing a data center for the node IDs of remote memory.	10-21-2010
20100269027	USER LEVEL MESSAGE BROADCAST MECHANISM IN DISTRIBUTED COMPUTING ENVIRONMENT - A data processing system is programmed to provide a method for enabling user-level one-to-all message/messaging (OTAM) broadcast within a distributed parallel computing environment in which multiple threads of a single job execute on different processing nodes across a network. The method comprises: generating one or more messages for transmission to at least one other processing node accessible via a network, where the messages are generated by/for a first thread executing at the data processing system (first processing node) and the other processing node executes one or more second threads of a same parallel job as the first thread. An OTAM broadcast is transmitting via a host fabric interface (HFI) of the data processing system as a one-to-all broadcast on the network, whereby the messages are transmitted to a cluster of processing nodes across the network that execute threads of the same parallel job as the first thread.	10-21-2010
20110173258	Collective Acceleration Unit Tree Flow Control and Retransmit - A mechanism is provided for collective acceleration unit tree flow control forms a logical tree (sub-network) among those processors and transfers “collective” packets on this tree. The system supports many collective trees, and each collective acceleration unit (CAU) includes resources to support a subset of the trees. Each CAU has limited buffer space, and the connection between two CAUs is not completely reliable. Therefore, to address the challenge of collective packets traversing on the tree without colliding with each other for buffer space and guaranteeing the end-to-end packet delivery, each CAU in the system effectively flow controls the packets, detects packet loss, and retransmits lost packets.	07-14-2011
20110238956	Collective Acceleration Unit Tree Structure - A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.	09-29-2011
20120266180	Performing Setup Operations for Receiving Different Amounts of Data While Processors are Performing Message Passing Interface Tasks - A system and method are provided for performing setup operations for receiving a different amount of data while processors are performing message passing interface (MPI) tasks. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. An MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, setup operations may be performed while processors are performing MPI tasks to prepare for receiving different sized portions of data in a subsequent computation cycle based on the history.	10-18-2012
20120296915	Collective Acceleration Unit Tree Structure - A mechanism is provided in a collective acceleration unit for performing a collective operation to distribute or collect data among a plurality of participant nodes. The mechanism receives an input collective packet for a collective operation from a neighbor node within a collective tree. The input collective packet comprises a tree identifier and an input data field and wherein the collective tree comprises a plurality of sub trees. The mechanism maps the tree identifier to an index within the collective acceleration unit. The index identifies a portion of resources within the collective acceleration unit and is associated with a set of neighbor nodes in a given sub tree within the collective tree. For each neighbor node the collective acceleration unit stores destination information. The collective acceleration unit performs an operation on the input data field using the portion of resources to effect the collective operation.	11-22-2012

Patent applications by Lakshminarayana B. Arimilli, Austin, TX US

Inventors list

Assignees list

Classification tree browser

Top 100 Inventors

Top 100 Assignees

Lakshminarayana B. Arimilli, Austin US

Lakshminarayana B. Arimilli, Austin, TX US