Patent application number | Description | Published |
20080209128 | METHOD AND APPARATUS FOR DETECTING A CACHE WRAP CONDITION - A method and apparatus for detecting a cache wrap condition in a computing environment having a processor and a cache. A cache wrap condition is detected when the entire contents of a cache have been replaced, relative to a particular starting state. A set-associative cache is considered to have wrapped when all of the sets within the cache have been replaced. The starting point for cache wrap detection is the state of the cache sets at the time of the previous cache wrap. The method and apparatus is preferably implemented in a snoop filter having filter mechanisms that rely upon detecting the cache wrap condition. These snoop filter mechanisms requiring this information are operatively coupled with cache wrap detection logic adapted to detect the cache wrap event, and perform an indication step to the snoop filter mechanisms. In the various embodiments, cache wrap detection logic is implemented using registers and comparators, loadable counters, or a scoreboard data structure. | 08-28-2008 |
20080222364 | SNOOP FILTERING SYSTEM IN A MULTIPROCESSOR SYSTEM - A system and method for supporting cache coherency in a computing environment having multiple processing units, each unit having an associated cache memory system operatively coupled therewith. The system includes a plurality of interconnected snoop filter units, each snoop filter unit corresponding to and in communication with a respective processing unit, with each snoop filter unit comprising a plurality of devices for receiving asynchronous snoop requests from respective memory writing sources in the computing environment; and a point-to-point interconnect comprising communication links for directly connecting memory writing sources to corresponding receiving devices; and, a plurality of parallel operating filter devices coupled in one-to-one correspondence with each receiving device for processing snoop requests received thereat and one of forwarding requests or preventing forwarding of requests to its associated processing unit. Each of the plurality of parallel operating filter devices comprises parallel operating sub-filter elements, each simultaneously receiving an identical snoop request and implementing one or more different snoop filter algorithms for determining those snoop requests for data that are determined not cached locally at the associated processing unit and preventing forwarding of those requests to the processor unit. In this manner, a number of snoop requests forwarded to a processing unit is reduced thereby increasing performance of the computing environment. | 09-11-2008 |
20080244194 | METHOD AND APPARATUS FOR FILTERING SNOOP REQUESTS USING STREAM REGISTERS - A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having a local cache memory associated therewith. A snoop filter device is associated with each processing unit and includes at least one snoop filter primitive implementing a filtering method based on the use of stream register sets and associated stream register comparison logic. From the plurality of stream register sets, at least one stream register set is active, and at least one stream register set is labeled historic at any point in time. In addition, the snoop filter block is operatively coupled with cache wrap detection logic whereby the content of the active stream register set is switched into a historic stream register set upon the cache wrap condition detection, and the content of at least one active stream register set is reset. Each filter primitive implements stream register comparison logic that determines whether a received snoop request is to be forwarded to the processor or discarded. | 10-02-2008 |
20080263280 | LOW COMPLEXITY SPECULATIVE MULTITHREADING SYSTEM BASED ON UNMODIFIED MICROPROCESSOR CORE - A system, method and computer program product for supporting thread level speculative execution in a computing environment having multiple processing units adapted for concurrent execution of threads in speculative and non-speculative modes. Each processing unit includes a cache memory hierarchy of caches operatively connected therewith. The apparatus includes an additional cache level local to each processing unit for use only in a thread level speculation mode, each additional cache for storing speculative results and status associated with its associated processor when handling speculative threads. The additional local cache level at each processing unit are interconnected so that speculative values and control data may be forwarded between parallel executing threads. A control implementation is provided that enables speculative coherence between speculative threads executing in the computing environment. | 10-23-2008 |
20080294850 | METHOD AND APPARATUS FOR FILTERING SNOOP REQUESTS USING A SCOREBOARD - An apparatus for implementing snooping cache coherence that locally reduces the number of snoop requests presented to each cache in a multiprocessor system. A snoop filter device associated with a single processor includes one or more “scoreboard” data structures that make snoop determinations, i.e., for each snoop request from another processor, to determine if a request is to be forwarded to the processor or, discarded. At least one scoreboard is active, and at least one scoreboard is determined to be historic at any point in time. A snoop determination of the queue indicates that an entry may be in the cache, but does not indicate its actual residence status. In addition, the snoop filter block implementing scoreboard data structures is operatively coupled with a cache wrap detection logic means whereby, upon detection of a cache wrap condition, the content of the active scoreboard is copied into a historic scoreboard and the content of at least one active scoreboard is reset. | 11-27-2008 |
20080320228 | METHOD AND APPARATUS FOR EFFICIENT REPLACEMENT ALGORITHM FOR PRE-FETCHER ORIENTED DATA CACHE - Disclosed are a method and apparatus for replacing pre-fetched data in a pre-fetch cache. In one embodiment, each line of the pre-fetch cache will be accessed at most M times. A line accessed M times can be evicted from the cache without any performance loss. In this embodiment, a counter is added to each pre-fetch data line to track how many times it has been accessed. In another embodiment, a displacement bit is added to each pre-fetch data line, and when a defined portion of the data line is accessed, this bit is set to a given value, indicating that the line can be evicted. | 12-25-2008 |
20090006546 | MULTIPLE NODE REMOTE MESSAGING - A method for passing remote messages in a parallel computer system formed as a network of interconnected compute nodes includes that a first compute node (A) sends a single remote message to a remote second compute node (B) in order to control the remote second compute node (B) to send at least one remote message. The method includes various steps including controlling a DMA engine at first compute node (A) to prepare the single remote message to include a first message descriptor and at least one remote message descriptor for controlling the remote second compute node (B) to send at least one remote message, including putting the first message descriptor into an injection FIFO at the first compute node (A) and sending the single remote message and the at least one remote message descriptor to the second compute node (B). | 01-01-2009 |
20090006672 | METHOD AND APPARATUS FOR EFFICIENTLY TRACKING QUEUE ENTRIES RELATIVE TO A TIMESTAMP - An apparatus and method for tracking coherence event signals transmitted in a multiprocessor system. The apparatus comprises a coherence logic unit, each unit having a plurality of queue structures with each queue structure associated with a respective sender of event signals transmitted in the system. A timing circuit associated with a queue structure controls enqueuing and dequeuing of received coherence event signals, and, a counter tracks a number of coherence event signals remaining enqueued in the queue structure and dequeued since receipt of a timestamp signal. A counter mechanism generates an output signal indicating that all of the coherence event signals present in the queue structure at the time of receipt of the timestamp signal have been dequeued. In one embodiment, the timestamp signal is asserted at the start of a memory synchronization operation and, the output signal indicates that all coherence events present when the timestamp signal was asserted have completed. This signal can then be used as part of the completion condition for the memory synchronization operation. | 01-01-2009 |
20090006692 | METHOD AND APPARATUS FOR A CHOOSE-TWO MULTI-QUEUE ARBITER - An apparatus and method for granting one or more requesting entities access to a resource in a predetermined time interval. The apparatus includes a first circuit receiving one or more request signals, implementing logic for assigning a priority to the one or more request signals, and generating a set of first_request signals based on the priorities assigned. One or more priority select circuits receive the set of first_request signals and generate corresponding one or more fixed grant signals representing one or more highest priority request signals when asserted during the predetermined time interval. A second circuit device receives the one or more fixed grant signals and generates one or more grant signals associated with the one or more highest priority request signals assigned, the grant signals enabling one or more respective requesting entities to access the resource in the predetermined time interval, wherein the priority assigned to the one or more request signals changes each successive predetermined time interval. In one embodiment, the assigned priority is based on a numerical pattern, the first circuit changing the numerical pattern with respect to the first_request signals generated at each successive predetermined time interval. | 01-01-2009 |
20090006718 | SYSTEM AND METHOD FOR PROGRAMMABLE BANK SELECTION FOR BANKED MEMORY SUBSYSTEMS - A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures. | 01-01-2009 |
20090006762 | METHOD AND APPARATUS OF PREFETCHING STREAMS OF VARYING PREFETCH DEPTH - Method and apparatus of prefetching streams of varying prefetch depth dynamically changes the depth of prefetching so that the number of multiple streams as well as the hit rate of a single stream are optimized. The method and apparatus in one aspect monitor a plurality of load requests from a processing unit for data in a prefetch buffer, determine an access pattern associated with the plurality of load requests and adjust a prefetch depth according to the access pattern. | 01-01-2009 |
20090006764 | INSERTION OF COHERENCE REQUESTS FOR DEBUGGING A MULTIPROCESSOR - A method and system are disclosed to insert coherence events in a multiprocessor computer system, and to present those coherence events to the processors of the multiprocessor computer system for analysis and debugging purposes. The coherence events are inserted in the computer system by adding one or more special insert registers. By writing into the insert registers, coherence events are inserted in the multiprocessor system as if they were generated by the normal coherence protocol. Once these coherence events are processed, the processing of coherence events can continue in the normal operation mode. | 01-01-2009 |
20090006769 | PROGRAMMABLE PARTITIONING FOR HIGH-PERFORMANCE COHERENCE DOMAINS IN A MULTIPROCESSOR SYSTEM - A multiprocessor computing system and a method of logically partitioning a multiprocessor computing system are disclosed. The multiprocessor computing system comprises a multitude of processing units, and a multitude of snoop units. Each of the processing units includes a local cache, and the snoop units are provided for supporting cache coherency in the multiprocessor system. Each of the snoop units is connected to a respective one of the processing units and to all of the other snoop units. The multiprocessor computing system further includes a partitioning system for using the snoop units to partition the multitude of processing units into a plurality of independent, memory-consistent, adjustable-size processing groups. Preferably, when the processor units are partitioned into these processing groups, the partitioning system also configures the snoop units to maintain cache coherency within each of said groups. | 01-01-2009 |
20090006770 | NOVEL SNOOP FILTER FOR FILTERING SNOOP REQUESTS - A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter device having a plurality of dedicated input ports for receiving snoop requests from dedicated memory writing sources in the multiprocessor computing environment. Each snoop filter device includes a plurality of parallel operating port snoop filters in correspondence with the plurality of dedicated input ports, each port snoop filter implementing one or more parallel operating sub-filter elements that are adapted to concurrently filter snoop requests received from respective dedicated memory writing sources and forward a subset of those requests to its associated processing unit. | 01-01-2009 |
20090006808 | ULTRASCALABLE PETAFLOP PARALLEL SUPERCOMPUTER - A novel massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. Novel use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node. | 01-01-2009 |
20090007119 | METHOD AND APPARATUS FOR SINGLE-STEPPING COHERENCE EVENTS IN A MULTIPROCESSOR SYSTEM UNDER SOFTWARE CONTROL - An apparatus and method are disclosed for single-stepping coherence events in a multiprocessor system under software control in order to monitor the behavior of a memory coherence mechanism. Single-stepping coherence events in a multiprocessor system is made possible by adding one or more step registers. By accessing these step registers, one or more coherence requests are processed by the multiprocessor system. The step registers determine if the snoop unit will operate by proceeding in a normal execution mode, or operate in a single-step mode. | 01-01-2009 |
20090007134 | SHARED PERFORMANCE MONITOR IN A MULTIPROCESSOR SYSTEM - A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor device units, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters, each for counting signals representing occurrences of events from one or more of the plurality of processor units in the multiprocessor system; and a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices. | 01-01-2009 |
20090059955 | SINGLE CHIP PROTOCOL CONVERTER - A single chip protocol converter integrated circuit (IC) capable of receiving packets generated according to a first protocol type and processing said packets to implement protocol conversion and generating converted packets of a second protocol type for output thereof, the process of protocol conversion being performed entirely within the single integrated circuit chip. The single chip protocol converter can be further implemented as a macro core in a system-on-chip (SoC) implementation, wherein the process of protocol conversion is contained within a SoC protocol conversion macro core without requiring the processing resources of a host system. Packet conversion may additionally entail converting packets generated according to a first protocol version level and processing said packets to implement protocol conversion for generating converted packets according to a second protocol version level, but within the same protocol family type. The single chip protocol converter integrated circuit and SoC protocol conversion macro implementation include multiprocessing capability including processor devices that are configurable to adapt and modify the operating functionality of the chip. | 03-05-2009 |
20090077571 | METHOD AND APPARATUS FOR EFFICIENT PERFORMANCE MONITORING OF A LARGE NUMBER OF SIMULTANEOUS EVENTS - A system for monitoring a large number of simultaneous events implements a hybrid counter array device having a first counter portion comprising counter devices, each counter device for receiving signals representing occurrences of events from an event source and providing a first count value corresponding to a lower order bits of the hybrid counter array. A second counter portion comprises a memory array device having addressable memory locations in correspondence with the counter devices, each addressable memory location for storing a second count value representing higher order bits. A control device monitors each of the counter devices and initiates updating a value of a corresponding second count value stored at the corresponding addressable memory location. The system includes interrupt pre-indication for providing fast interrupt trigger to a processor device when a count value related to an event equals a threshold value. A data transfer sub-system additionally enables one or more of: read access or write access to both the count values in the first and second counter portions over a narrow bus, the read/write access for purposes of initializing and determining status of the count values for a monitored event type in response to a processor device request. | 03-19-2009 |
20090116610 | LOW LATENCY COUNTER EVENT INDICATION - A hybrid counter array device for counting events with interrupt indication includes a first counter portion comprising N counter devices, each for counting signals representing event occurrences and providing a first count value representing lower order bits. An overflow bit device associated with each respective counter device is additionally set in response to an overflow condition. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits. An operatively coupled control device monitors each associated overflow bit device and initiates incrementing a second count value stored at a corresponding memory location in response to a respective overflow bit being set. The incremented second count value is compared to an interrupt threshold value stored in a threshold register, and, when the second counter value is equal to the interrupt threshold value, a corresponding “interrupt arm” bit is set to enable a fast interrupt indication. On a subsequent roll-over of the lower bits of that counter, the interrupt will be fired. | 05-07-2009 |
20090116611 | SPACE AND POWER EFFICIENT HYBRID COUNTERS ARRAY - A hybrid counter array device for counting events. The hybrid counter array includes a first counter portion comprising N counter devices, each counter device for receiving signals representing occurrences of events from an event source and providing a first count value corresponding to a lower order bits of the hybrid counter array. The hybrid counter array includes a second counter portion comprising a memory array device having N addressable memory locations in correspondence with the N counter devices, each addressable memory location for storing a second count value representing higher order bits of the hybrid counter array. A control device monitors each of the N counter devices of the first counter portion and initiates updating a value of a corresponding second count value stored at the corresponding addressable memory location in the second counter portion. Thus, a combination of the first and second count values provide an instantaneous measure of number of events received. | 05-07-2009 |
20110082980 | HIGH PERFORMANCE UNALIGNED CACHE ACCESS - A cache memory device and method for operating the same. One embodiment of the cache memory device includes an address decoder decoding a memory address and selecting a target cache line. A first cache array is configured to output a first cache entry associated with the target cache line, and a second cache array coupled to an alignment unit is configured to output a second cache entry associated with the alignment cache line. The alignment unit coupled to the address decoder selects either the target cache line or a neighbor cache line proximate the target cache line as an alignment cache line output. Selection of either the target cache line or a neighbor cache line is based on an alignment bit in the memory address. A tag array cache is split into even and odd cache lines tags, and provides one or two tags for every cache access. | 04-07-2011 |
20110247002 | Dynamic System Scheduling - Resources of a partitionable computer system are partitioned into: (i) a first partition for first jobs, the first jobs being at least one of small and short running; and (ii) a second partition for second jobs, the second jobs being at least one of large and long running. The computer system is run as partitioned in the partitioning step and the partitioning is periodically re-evaluated against at least one threshold for at least one of the partitions. If the periodic re-evaluation suggests that one of the first and second partitions is underutilized, the resources of the partitionable computer system are dynamically re-partitioned to reassign at least some of the resources of the partitionable computer system from the underutilized one of the first and second partitions to another one of the first and second partitions. | 10-06-2011 |
20110247003 | Predictive Dynamic System Scheduling - Resources of a partitionable computer system are partitioned into at least first and second partitions, in accordance with a first or second mode of operation of the partitionable computer system. The system is run in the first or second mode, partitioned in accordance with the partitioning step. Periodically, it is determined whether the computer system should be switched from one mode to the other mode. If so, the computer system is run in the other mode, partitioned in accordance with the other mode. The first and second modes of operation are defined in accordance with historical observations of the partitionable computer system. The periodic determination is carried out based on predictions in accordance with the historical observations. | 10-06-2011 |
20110296421 | METHOD AND APPARATUS FOR EFFICIENT INTER-THREAD SYNCHRONIZATION FOR HELPER THREADS - A monitor bit per hardware thread in a memory location may be allocated, in a multiprocessing computer system having a plurality of hardware threads, the plurality of hardware threads sharing the memory location, and each of the allocated monitor bit corresponding to one of the plurality of hardware threads. A condition bit may be allocated for each of the plurality of hardware threads, the condition bit being allocated in each context of the plurality of hardware threads. In response to detecting the memory location being accessed, it is determined whether a monitor bit corresponding to a hardware thread in the memory location is set. In response to determining that the monitor bit corresponding to a hardware thread is set in the memory location, a condition bit corresponding to a thread accessing the memory location is set in the hardware thread's context. | 12-01-2011 |
20110296431 | METHOD AND APPARATUS FOR EFFICIENT HELPER THREAD STATE INITIALIZATION USING INTER-THREAD REGISTER COPY - This disclosure describes a method and system that may enable fast, hardware-assisted, producer-consumer style communication of values between threads. The method, in one aspect, uses a dedicated hardware buffer as an intermediary storage for transferring values from registers in one thread to registers in another thread. The method may provide a generic, programmable solution that can transfer any subset of register values between threads in any given order, where the source and target registers may or may not be correlated. The method also may allow for determinate access times, since it completely bypasses the memory hierarchy. Also, the method is designed to be lightweight, focusing on communication, and keeping synchronization facilities orthogonal to the communication mechanism. It may be used by a helper thread that performs data prefetching for an application thread, for example, to initialize the upward-exposed reads in the address computation slice of the helper thread code. | 12-01-2011 |
20110302394 | SYSTEM AND METHOD FOR PROCESSING REGULAR EXPRESSIONS USING SIMD AND PARALLEL STREAMS - A system and method for performing regular expression computations includes loading a plurality of input values corresponding to one or more input streams as elements of a vector register implemented on programmable storage media. New state indexes are computed using the input values, and current state values corresponding to different automata by using single instruction, multiple data (SIMD) vector operations. New state values associated with the different automata are determined using the new state indexes to look up new state values such that state transitions for a plurality of regular expressions are processed concurrently. | 12-08-2011 |
20110320765 | VARIABLE WIDTH VECTOR INSTRUCTION PROCESSOR - A computer processor, method, and computer program product for executing vector processing instructions on a variable width vector register file. An example embodiment is a computer processor that includes an instruction execution unit coupled to a variable width vector register file which contains a number of vector registers, the width of the vector registers is changeable during operation of the computer processor. | 12-29-2011 |
20120011348 | Matrix Multiplication Operations Using Pair-Wise Load and Splat Operations - Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand. | 01-12-2012 |
20120060015 | Vector Loads with Multiple Vector Elements from a Same Cache Line in a Scattered Load Operation - Mechanisms for performing a scattered load operation are provided. With these mechanisms, an extended address is received in a cache memory of a processor. The extended address has a plurality of data element address portions that specify a plurality of data elements to be accessed using the single extended address. Each of the plurality of data element address portions is provided to corresponding data element selector logic units of the cache memory. Each data element selector logic unit in the cache memory selects a corresponding data element from a cache line buffer based on a corresponding data element address portion provided to the data element selector logic unit. Each data element selector logic unit outputs the corresponding data element for use by the processor. | 03-08-2012 |
20120060016 | Vector Loads from Scattered Memory Locations - Mechanisms for performing a scattered load operation are provided. With these mechanisms, a gather instruction is received in a logic unit of a processor, the gather instruction specifying a plurality of addresses in a memory from which data is to be loaded into a target vector register of the processor. A plurality of separate load instructions for loading the data from the plurality of addresses in the memory are automatically generated within the logic unit. The plurality of separate load instructions are sent, from the logic unit, to one or more load/store units of the processor. The data corresponding to the plurality of addresses is gathered in a buffer of the processor. The logic unit then writes data stored in the buffer to the target vector register. | 03-08-2012 |
20120082171 | SINGLE CHIP PROTOCOL CONVERTER - A single chip protocol converter integrated circuit (IC) capable of receiving packets generated according to a first protocol type and processing said packets to implement protocol conversion and generating converted packets of a second protocol type for output thereof, the process of protocol conversion being performed entirely within the single integrated circuit chip. The single chip protocol converter can be further implemented as a macro core in a system-on-chip (SoC) implementation, wherein the process of protocol conversion is contained within a SoC protocol conversion macro core without requiring the processing resources of a host system. The single chip protocol converter integrated circuit and SoC protocol conversion macro implementation include multiprocessing capability including processor devices that are configurable to adapt and modify the operating functionality of the chip. | 04-05-2012 |
20120179879 | MECHANISMS FOR EFFICIENT INTRA-DIE/INTRA-CHIP COLLECTIVE MESSAGING - Mechanism of efficient intra-die collective processing across the nodelets with separate shared memory coherency domains is provided. An integrated circuit die may include a hardware collective unit implemented on the integrated circuit die. A plurality of cores on the integrated circuit die is grouped into a plurality of shared memory coherence domains. Each of the plurality of shared memory coherence domains is connected to the collective unit for performing collective operations between the plurality of shared memory coherence domains. | 07-12-2012 |
20120179896 | METHOD AND APPARATUS FOR A HIERARCHICAL SYNCHRONIZATION BARRIER IN A MULTI-NODE SYSTEM - A hierarchical barrier synchronization of cores and nodes on a multiprocessor system, in one aspect, may include providing, by each of a plurality of threads on a chip, an input bit signal to a respective bit in a register, in response to reaching a barrier; determining whether all of the plurality of threads reached the barrier by electrically tying bits of the register together and “AND”ing the input bit signals; determining whether only on-chip synchronization is needed or whether inter-node synchronization is needed; in response to determining that all of the plurality of threads on the chip reached the barrier, notifying the plurality of threads on the chip, if it is determined that only on-chip synchronization is needed; and after all of the plurality of threads on the chip reached the barrier, communicating the synchronization signal to outside of the chip, if it is determined that inter-node synchronization is needed. | 07-12-2012 |
20120198118 | USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY - A device for copying performance counter data includes a hardware path that connects a direct memory access (DMA) unit to a plurality of hardware performance counters and a memory device. Software prepares an injection packet for the DMA unit to perform copying, while the software can perform other tasks. In one aspect, the software that prepares the injection packet runs on a processing core other than the core that gathers the hardware performance counter data. | 08-02-2012 |
20120304020 | SHARED PERFORMANCE MONITOR IN A MULTIPROCESSOR SYSTEM - A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices. | 11-29-2012 |
20120311272 | NOVEL SNOOP FILTER FOR FILTERING SNOOP REQUESTS - A method and apparatus for supporting cache coherency in a multiprocessor computing environment having multiple processing units, each processing unit having one or more local cache memories associated and operatively connected therewith. The method comprises providing a snoop filter device associated with each processing unit, each snoop filter device having a plurality of dedicated input ports for receiving snoop requests from dedicated memory writing sources in the multiprocessor computing environment. Each snoop filter device includes a plurality of parallel operating port snoop filters in correspondence with the plurality of dedicated input ports, each port snoop filter implementing one or more parallel operating sub-filter elements that are adapted to concurrently filter snoop requests received from respective dedicated memory writing sources and forward a subset of those requests to its associated processing unit. | 12-06-2012 |
20130007378 | MECHANISMS FOR EFFICIENT INTRA-DIE/INTRA-CHIP COLLECTIVE MESSAGING - Mechanism of efficient intra-die collective processing across the nodelets with separate shared memory coherency domains is provided. An integrated circuit die may include a hardware collective unit implemented on the integrated circuit die. A plurality of cores on the integrated circuit die is grouped into a plurality of shared memory coherence domains. Each of the plurality of shared memory coherence domains is connected to the collective unit for performing collective operations between the plurality of shared memory coherence domains. | 01-03-2013 |
20130013891 | METHOD AND APPARATUS FOR A HIERARCHICAL SYNCHRONIZATION BARRIER IN A MULTI-NODE SYSTEM - A hierarchical barrier synchronization of cores and nodes on a multiprocessor system, in one aspect, may include providing, by each of a plurality of threads on a chip, an input bit signal to a respective bit in a register, in response to reaching a barrier; determining whether all of the plurality of threads reached the barrier by electrically tying bits of the register together and “AND”ing the input bit signals; determining whether only on-chip synchronization is needed or whether inter-node synchronization is needed; in response to determining that all of the plurality of threads on the chip reached the barrier, notifying the plurality of threads on the chip, if it is determined that only on-chip synchronization is needed; and after all of the plurality of threads on the chip reached the barrier, communicating the synchronization signal to outside of the chip, if it is determined that inter-node synchronization is needed. | 01-10-2013 |
20130073836 | FINE-GRAINED INSTRUCTION ENABLEMENT AT SUB-FUNCTION GRANULARITY - Fine-grained enablement at sub-function granularity. An instruction encapsulates different sub-functions of a function, in which the sub-functions use different sets of registers of a composite register file, and therefore, different sets of functional units. At least one operand of the instruction specifies which set of registers, and therefore, which set of functional units, is to be used in performing the sub-function. The instruction can perform various functions (e.g., move, load, etc.) and a sub-function of the function specifies the type of function (e.g., move-floating point; move-vector; etc.). | 03-21-2013 |
20130080745 | FINE-GRAINED INSTRUCTION ENABLEMENT AT SUB-FUNCTION GRANULARITY - Fine-grained enablement at sub-function granularity. An instruction encapsulates different sub-functions of a function, in which the sub-functions use different sets of registers of a composite register file, and therefore, different sets of functional units. At least one operand of the instruction specifies which set of registers, and therefore, which set of functional units, is to be used in performing the sub-function. The instruction can perform various functions (e.g., move, load, etc.) and a sub-function of the function specifies the type of function (e.g., move-floating point; move-vector; etc.). | 03-28-2013 |
20130086361 | Scalable Decode-Time Instruction Sequence Optimization of Dependent Instructions - Producer-consumer instructions, comprising a first instruction and a second instruction in program order, are fetched requiring in-order execution, the second instruction is modified by the processor so that the first instruction and second instruction can be completed out-of-order, the modification comprising any one of extending an immediate field of the second instruction using immediate field information of the first instruction or providing a source location of the first instruction as an additional source location to source locations of the second instruction. | 04-04-2013 |
20130086362 | Managing a Register Cache Based on an Architected Computer Instruction Set Having Operand First-Use Information - A prefix instruction is executed and passes operands to a next instruction without storing the operands in an architected resource such that the execution of the next instruction uses the operands provided by the prefix instruction to perform an operation, the operands may be prefix instruction immediate field or a target register of the prefix instruction execution. | 04-04-2013 |
20130086363 | Computer Instructions for Activating and Deactivating Operands - An instruction set architecture (ISA) includes instructions for selectively indicating last-use architected operands having values that will not be accessed again, wherein architected operands are made active or inactive after an instruction specified last-use by an instruction, wherein the architected operands are made active by performing a write operation to an inactive operand, wherein the activation/deactivation may be performed by the instruction having the last-use of the operand or another (prefix) instruction. | 04-04-2013 |
20130086364 | Managing a Register Cache Based on an Architected Computer Instruction Set Having Operand Last-User Information - A multi-level register hierarchy is disclosed comprising a first level pool of registers for caching registers of a second level pool of registers in a system wherein programs can dynamically release and re-enable architected registers such that released architected registers need not be maintained by the processor, the processor accessing operands from the first level pool of registers, wherein a last-use instruction is identified as having a last use of an architected register before being released, the last-use architected register being released causes the multi-level register hierarchy to discard any correspondence of an entry to said last use architected register. | 04-04-2013 |
20130086365 | Exploiting an Architected List-Use Operand Indication in a Computer System Operand Resource Pool - A pool of available physical registers is provided for architected registers, wherein operations are performed that activate and deactivate selected architected registers, such that the deactivated selected architected registers need not retain values, and physical registers can be deallocated to the pool, wherein deallocation of physical registers is performed after a last-use by a designated last-use instruction, wherein the last-use information is provided either by the last-use instruction or a prefix instruction, wherein reads to deallocated architected registers return an architected default value. | 04-04-2013 |
20130086367 | Tracking operand liveliness information in a computer system and performance function based on the liveliness information - Operand liveness state information is maintained during context switches for current architected operands of executing programs, the current operand state information indicating whether corresponding current operands are any one of enabled or disabled for use by a first program module, the first program module comprising machine instructions of an instruction set architecture (ISA) for disabling current architected operands, wherein a current operand is accessed by a machine instruction of said first program module, the accessing comprising using the current operand state information to determine whether a previously stored current operand value is accessible by the first program module. | 04-04-2013 |
20130086368 | Using Register Last Use Information to Perform Decode-Time Computer Instruction Optimization - Two computer machine instructions are fetched for execution, but replaced by a single optimized instruction to be executed, wherein a temporary register used by the two instructions is identified as a last-use register, where a last-use register has a value that is not to be accessed by later instructions, whereby the two computer machine instructions are replaced by a single optimized internal instruction for execution, the single optimized instruction not including the last-use register. | 04-04-2013 |
20130086548 | GENERATING COMPILED CODE THAT INDICATES REGISTER LIVENESS - Object code is generated from an internal representation that includes a plurality of source operands. The generating includes performing, for each source operand in the internal representation, determining whether a last use has occurred for the source operand. The determining includes accessing a data flow graph to determine whether all uses of a live range have been emitted. If it is determined that a last use has occurred for the source operand, an architected resource associated with the source operand is marked for last-use indication. A last-use indication is then generated for the architected resource. Instructions and the last-use indications are emitted into the object code. | 04-04-2013 |
20130086598 | GENERATING COMPILED CODE THAT INDICATES REGISTER LIVENESS - Object code is generated from an internal representation that includes a plurality of source operands. The generating includes performing, for each source operand in the internal representation, determining whether a last use has occurred for the source operand. The determining includes accessing a data flow graph to determine whether all uses of a live range have been emitted. If it is determined that a last use has occurred for the source operand, an architected resource associated with the source operand is marked for last-use indication. A last-use indication is then generated for the architected resource. Instructions and the last-use indications are emitted into the object code. | 04-04-2013 |
20130179736 | TICKET CONSOLIDATION - A method of ticket consolidation in a computing environment may in one aspect analyze problem reports, determine whether problems reported by machines are caused by the same or substantially the same run-time configuration error or are occurring on the same physical server, and are within the given sensitivity time window, consolidate the problem tickets and increase the priority of the consolidated ticket. | 07-11-2013 |
20130212139 | MIXED PRECISION ESTIMATE INSTRUCTION COMPUTING NARROW PRECISION RESULT FOR WIDE PRECISION INPUTS - A technique is provided for performing a mixed precision estimate. A processing circuit receives an input of a first precision having a wide precision value. The processing circuit computes an output in an output exponent range corresponding to a narrow precision value based on the input having the wide precision value. | 08-15-2013 |
20130218846 | MANAGING ENTERPRISE DATA QUALITY USING COLLECTIVE INTELLIGENCE - An embodiment of the invention is directed to a method associated with a data processing system disposed to receive and process enterprise data. Responsive to receiving a specified data element, the method determines a data type to be used for the specified data element. The method selectively determines a confidence level of the specified data element, and selects a plurality of subject matter experts (SMEs), wherein the data type of the specified data element is used in selecting each SME. A request is dispatched to each of the SMEs to selectively revise and validate the specified data element. The specified data element is then updated in accordance with each revision provided by an SME in response to one of the requests. | 08-22-2013 |
20130262821 | PERFORMING PREDECODE-TIME OPTIMIZED INSTRUCTIONS IN CONJUNCTION WITH PREDECODE TIME OPTIMIZED INSTRUCTION SEQUENCE CACHING - A method for performing predecode-time optimized instructions in conjunction with predecode time optimized instruction sequence caching. The method includes receiving a first instruction of an instruction sequence and a second instruction of the instruction sequence and determining if the first instruction and the second instruction can be optimized. In response to the determining that the first instruction and second instruction can be optimized, the method includes performing a pre-decode optimization on the instruction sequence and generating a new second instruction, wherein the new second instruction is not dependent on a target operand of the first instruction, and storing a pre-decoded first instruction and a pre-decoded new second instruction in an instruction cache. In response to determining that the first instruction and second instruction cannot be optimized, the method includes storing the pre-decoded first instruction and a pre-decoded second instruction in the instruction cache. | 10-03-2013 |
20130262822 | CACHING OPTIMIZED INTERNAL INSTRUCTIONS IN LOOP BUFFER - Embodiments of the invention relate to a computer system for storing an internal instruction loop in a loop buffer. The computer system includes a loop buffer and a processor. The computer system is configured to perform a method including fetching instructions from memory to generate an internal instruction to be executed, detecting a beginning of a first instruction loop in the instructions, determining that a first internal instruction loop corresponding to the first instruction loop is not stored in the loop buffer, fetching the first instruction loop, optimizing one or more instructions corresponding to the first instruction loop to generate a first optimized internal instruction loop, and storing the first optimized internal instruction loop in the loop buffer based on the determination that the first internal instruction loop is not stored in the loop buffer. | 10-03-2013 |
20130262823 | INSTRUCTION MERGING OPTIMIZATION - A computer system for optimizing instructions includes a processor including an instruction execution unit configured to execute instructions and an instruction optimization unit configured to optimize instructions and memory to store machine instructions to be executed by the instruction execution unit. The computer system is configured to perform a method including analyzing machine instructions from among a stream of instructions to be executed by the instruction execution unit, the machine instructions including a memory load instruction and a data processing instruction to perform a data processing function based on the memory load instruction, identifying the machine instructions as being eligible for optimization, merging the machine instructions into a single optimized internal instruction, and executing the single optimized internal instruction to perform a memory load function and a data processing function corresponding to the memory load instruction and the data processing instruction. | 10-03-2013 |
20130262839 | INSTRUCTION MERGING OPTIMIZATION - A computer system for optimizing instructions is configured to identify two or more machine instructions as being eligible for optimization, to merge the two or more machine instructions into a single optimized internal instruction that is configured to perform functions of the two or more machine instructions, and to execute the single optimized internal instruction to perform the functions of the two or more machine instructions. Being eligible includes determining that the two or more machine instructions include a first instruction specifying a first target register and a second instruction specifying the first target register as a source register and a target register. The second instruction is a next sequential instruction of the first instruction in program order, wherein the first instruction specifies a first function to be performed, and the second instruction specifies a second function to be performed. | 10-03-2013 |
20130262840 | INSTRUCTION MERGING OPTIMIZATION - A computer-implemented method includes determining that two or more instructions of an instruction stream are eligible for optimization. Eligibility is based on a first instruction specifying a first target register and a second instruction specifying the first target register as a source register and a target register. The method includes merging the two or more machine instructions into a single optimized internal instruction that is configured to perform first and second functions of two or more machine instructions employing operands specified by the two or more machine instructions. The single optimized internal instruction specifies the first target register only as a single target register and the single optimized internal instruction specifies the first and second functions to be performed. The method includes executing the single optimized internal instruction to perform the first and second functions of the two or more instructions. | 10-03-2013 |
20130262841 | INSTRUCTION MERGING OPTIMIZATION - A computer-implemented method includes determining that two or more instructions of an instruction stream are eligible for optimization, where the two or more instructions include a memory load instruction and a data processing instruction to process data based on the memory load instruction. The method includes merging, by a processor, the two or more instructions into a single optimized internal instruction and executing the single optimized internal instruction to perform a memory load function and a data processing function corresponding to the memory load instruction and the data processing instruction. | 10-03-2013 |
20130263145 | METHOD AND APPARATUS FOR EFFICIENT INTER-THREAD SYNCHRONIZATION FOR HELPER THREADS - A monitor bit per hardware thread in a memory location may be allocated, in a multiprocessing computer system having a plurality of hardware threads, the plurality of hardware threads sharing the memory location, and each allocated monitor bit corresponding to one of the plurality of hardware threads. A condition bit may be allocated for each of the plurality of hardware threads, the condition bit being allocated in each context of the plurality of hardware threads. In response to detecting the memory location being accessed, it is determined whether a monitor bit corresponding to a hardware thread in the memory location is set. In response to determining that the monitor bit corresponding to a hardware thread is set in the memory location, a condition bit corresponding to a thread accessing the memory location is set in the hardware thread's context. | 10-03-2013 |
20130268741 | POWER REDUCTION IN SERVER MEMORY SYSTEM - A system and method for reducing power consumption of memory chips outside of a host processor device in operative communication with the memory chips via a memory controller. The memory can operate in modes, such that via the memory controller, the stored data can be localized and moved at various granularities, among ranks established in the chips, to result in fewer operating ranks. Memory chips may then be turned on and off based on host memory access usage levels at each rank in the chip. Host memory access usage levels at each rank in the chip are tracked by performance counters established for association with each rank of a memory chip. Turning on and off of the memory chips is based on a mapping maintained between ranks and address locations corresponding to sub-sections within each rank receiving the host processor access requests. | 10-10-2013 |
20130294771 | MULTI-NODE SYSTEM NETWORKS WITH OPTICAL SWITCHES - A system and method for optical switching of networks in a multi-node computing system with programmable magneto-optical switches that enable optical signal routing on optical pathways. The system includes a network of optical links interconnecting nodes with switching elements that are controlled by electrical control signals. Data transmission is along the optical links and an optical pathway is determined by the electrical control signals, which are launched ahead of the optical signal. If links are available, an optical pathway is reserved, and the electrical signal sets the necessary optical switches for the particular optical pathway. This eliminates the need for optical-electrical-optical conversion at each node in order to route data packets through the network. If a link or optical pathway is not available, the system tries to find an alternative path. If no alternative path is available, the system reserves buffering. After transmission, all reservations are released. | 11-07-2013 |
20130326180 | MECHANISM FOR OPTIMIZED INTRA-DIE INTER-NODELET MESSAGING COMMUNICATION - Point-to-point intra-nodelet messaging support for nodelets on a single chip that obey MPI semantics may be provided. In one aspect, a local buffering mechanism is employed that obeys standard communication protocols for the network communications between the nodelets integrated in a single chip. Sending messages from one nodelet to another nodelet on the same chip may be performed not via the network, but by exchanging messages in the point-to-point messaging buckets between the nodelets. The messaging buckets need not be part of the memory system of the nodelets. Specialized hardware controllers may be used for moving data between the nodelets and each messaging bucket, and ensuring correct operation of the network protocol. | 12-05-2013 |
20130332771 | METHODS AND APPARATUS FOR VIRTUAL MACHINE RECOVERY - Methods and apparatus for recovery of virtual machine failure. A succession of data images is captured, with each of the data images comprising an operating system of the virtual machine. The data images are images of data elements chosen based at least in part on their suitability for virtual machine restoration. Upon detection of a virtual machine failure, an attempt is made to restore the virtual machine using the highest ranked data image. If the attempt fails, further attempts are made using lower ranked data images, until an attempt is successful or all available data images have been used. | 12-12-2013 |
20140039957 | HANDLING CONSOLIDATED TICKETS - Handling problem tickets in a computing environment, in one aspect, may comprise identifying a plurality of tickets generated in the computing environment that are candidates for consolidation. The identifying may be done based on whether the tickets have the same or similar root cause, whether they are generated from virtual machines having the same configuration, and/or one or more other criteria. The tickets which are candidates for consolidation may be grouped into a bundled group, and marked as bundled. Resolving a ticket from the bundled group may potentially resolve all tickets from the bundled group. | 02-06-2014 |
20140039958 | HANDLING CONSOLIDATED TICKETS - Handling problem tickets in a computing environment, in one aspect, may comprise identifying a plurality of tickets generated in the computing environment that are candidates for consolidation. The identifying may be done based on whether the tickets have the same or similar root cause, whether they are generated from virtual machines having the same configuration, and/or one or more other criteria. The tickets which are candidates for consolidation may be grouped into a bundled group, and marked as bundled. Resolving a ticket from the bundled group may potentially resolve all tickets from the bundled group. | 02-06-2014 |
20140047216 | Scalable Decode-Time Instruction Sequence Optimization of Dependent Instructions - Producer-consumer instructions, comprising a first instruction and a second instruction in program order, are fetched requiring in-order execution, the second instruction is modified by the processor so that the first instruction and second instruction can be completed out-of-order, the modification comprising any one of extending an immediate field of the second instruction using immediate field information of the first instruction or providing a source location of the first instruction as an additional source location to source locations of the second instruction. | 02-13-2014 |
20140047219 | Managing A Register Cache Based on an Architected Computer Instruction Set having Operand Last-User Information - A multi-level register hierarchy is disclosed comprising a first level pool of registers for caching registers of a second level pool of registers in a system wherein programs can dynamically release and re-enable architected registers such that released architected registers need not be maintained by the processor, the processor accessing operands from the first level pool of registers, wherein a last-use instruction is identified as having a last use of an architected register before being released, the last-use architected register being released causes the multi-level register hierarchy to discard any correspondence of an entry to said last use architected register. | 02-13-2014 |
20140059394 | TICKET CONSOLIDATION FOR MULTI-TIERED APPLICATIONS - Consolidating problem tickets for a multi-tiered application may comprise identifying a plurality of correlated virtual machines that are running one or more application components of the multi-tiered application. Problem reports may be identified that are generated by one or more of the plurality of correlated virtual machines and caused by a failure of a same single component of the multi-tiered application. The identified problem reports may be consolidated into a single ticket and placed into a ticket handling system. | 02-27-2014 |
20140059395 | TICKET CONSOLIDATION FOR MULTI-TIERED APPLICATIONS - Consolidating problem tickets for a multi-tiered application may comprise identifying a plurality of correlated virtual machines that are running one or more application components of the multi-tiered application. Problem reports may be identified that are generated by one or more of the plurality of correlated virtual machines and caused by a failure of a same single component of the multi-tiered application. The identified problem reports may be consolidated into a single ticket and placed into a ticket handling system. | 02-27-2014 |
20140089636 | CACHING OPTIMIZED INTERNAL INSTRUCTIONS IN LOOP BUFFER - Embodiments of the invention relate to a computer system for storing an internal instruction loop in a loop buffer. The computer system includes a loop buffer and a processor. The computer system is configured to perform a method including fetching instructions from memory to generate an internal instruction to be executed, detecting a beginning of a first instruction loop in the instructions, determining that a first internal instruction loop corresponding to the first instruction loop is not stored in the loop buffer, fetching the first instruction loop, optimizing one or more instructions corresponding to the first instruction loop to generate a first optimized internal instruction loop, and storing the first optimized internal instruction loop in the loop buffer based on the determination that the first internal instruction loop is not stored in the loop buffer. | 03-27-2014 |
20140089886 | USING MULTIPLE TECHNICAL WRITERS TO PRODUCE A SPECIFIED SOFTWARE DOCUMENTATION PACKAGE - An embodiment of the invention produces software documentation that includes first and second sections. Skills a technical writer needs are determined, wherein preparation of the first and second sections requires different skill sets. A database is searched to select technical writers qualified to prepare each of the multiple document sections, wherein the database contains the identities and qualifications of persons qualified to be technical writers. Preparation of the first and second sections is then assigned to first and second writers having first and second skill sets, respectively. Each prepared section is validated for incorporation into the software documentation. | 03-27-2014 |
20140089898 | USING MULTIPLE TECHNICAL WRITERS TO PRODUCE A SPECIFIED SOFTWARE DOCUMENTATION PACKAGE - An embodiment of the invention produces software documentation that includes first and second sections. Skills a technical writer needs are determined, wherein preparation of the first and second sections requires different skill sets. A database is searched to select technical writers qualified to prepare each of the multiple document sections, wherein the database contains the identities and qualifications of persons qualified to be technical writers. Preparation of the first and second sections is then assigned to first and second writers having first and second skill sets, respectively. Each prepared section is validated for incorporation into the software documentation. | 03-27-2014 |
20140095833 | Prefix Computer Instruction for Compatibly Extending Instruction Functionality - A prefix instruction is executed and passes operands to a next instruction without storing the operands in an architected resource, such that the execution of the next instruction uses the operands provided by the prefix instruction to perform an operation. The operands may be a prefix instruction immediate field or a target register of the prefix instruction execution. | 04-03-2014 |
20140095835 | PERFORMING PREDECODE-TIME OPTIMIZED INSTRUCTIONS IN CONJUNCTION WITH PREDECODE TIME OPTIMIZED INSTRUCTION SEQUENCE CACHING - A method for performing predecode-time optimized instructions in conjunction with predecode time optimized instruction sequence caching. The method includes receiving a first instruction of an instruction sequence and a second instruction of the instruction sequence and determining if the first instruction and the second instruction can be optimized. In response to determining that the first instruction and second instruction can be optimized, the method includes performing a pre-decode optimization on the instruction sequence and generating a new second instruction, wherein the new second instruction is not dependent on a target operand of the first instruction, and storing a pre-decoded first instruction and a pre-decoded new second instruction in an instruction cache. In response to determining that the first instruction and second instruction cannot be optimized, the method includes storing the pre-decoded first instruction and a pre-decoded second instruction in the instruction cache. | 04-03-2014 |
20140095848 | Tracking Operand Liveliness Information in a Computer System and Performing Function Based on the Liveliness Information - Operand liveness state information is maintained during context switches for current architected operands of executing programs, the current operand state information indicating whether corresponding current operands are any one of enabled or disabled for use by a first program module, the first program module comprising machine instructions of an instruction set architecture (ISA) for disabling current architected operands, wherein a current operand is accessed by a machine instruction of said first program module, the accessing comprising using the current operand state information to determine whether a previously stored current operand value is accessible by the first program module. | 04-03-2014 |
20140101216 | MIXED PRECISION ESTIMATE INSTRUCTION COMPUTING NARROW PRECISION RESULT FOR WIDE PRECISION INPUTS - A technique is provided for performing a mixed precision estimate. A processing circuit receives an input of a first precision having a wide precision value. The processing circuit computes an output in an output exponent range corresponding to a narrow precision value based on the input having the wide precision value. | 04-10-2014 |
20140108768 | Computer instructions for Activating and Deactivating Operands - An instruction set architecture (ISA) includes instructions for selectively indicating last-use architected operands having values that will not be accessed again, wherein architected operands are made active or inactive after an instruction specified last-use by an instruction, wherein the architected operands are made active by performing a write operation to an inactive operand, wherein the activation/deactivation may be performed by the instruction having the last-use of the operand or another (prefix) instruction. | 04-17-2014 |
20140108772 | Exploiting an Architected Last-Use Operand Indication in a System Operand Resource Pool - A pool of available physical registers is provided for architected registers, wherein operations are performed that activate and deactivate selected architected registers, such that the deactivated selected architected registers need not retain values, and physical registers can be deallocated to the pool, wherein deallocation of physical registers is performed after a last-use by a designated last-use instruction, wherein the last-use information is provided either by the last-use instruction or a prefix instruction, wherein reads to deallocated architected registers return an architected default value. | 04-17-2014 |
20140164740 | Branch-Free Condition Evaluation - A compare instruction of an instruction set architecture (ISA), when executed, tests one or more operands for an instruction-defined condition. The result of the test is stored as an operand, with leading zeros, in a general register of the ISA. The general register is identified (explicitly or implicitly) by the compare instruction. Thus, the result of the test can be manipulated by standard register operations of the computer system. In a superscalar processor, no special “condition code” renaming is required, as the standard register renaming takes care of out-of-order processing of the conditions. | 06-12-2014 |
20140164747 | Branch-Free Condition Evaluation - A compare instruction of an instruction set architecture (ISA), when executed, tests one or more operands for an instruction-defined condition. The result of the test is stored as an operand, with leading zeros, in a general register of the ISA. The general register is identified (explicitly or implicitly) by the compare instruction. Thus, the result of the test can be manipulated by standard register operations of the computer system. In a superscalar processor, no special “condition code” renaming is required, as the standard register renaming takes care of out-of-order processing of the conditions. | 06-12-2014 |
20140237045 | EMBEDDING GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK - Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes. In one embodiment, the method comprises taking inputs from a set of receivers of the nodes, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of senders of the nodes. Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. In one embodiment, the method comprises adding to a torus network a central collective logic to route messages among at least a group of nodes in a tree structure. | 08-21-2014 |
20150089152 | MANAGING HIGH-CONFLICT CACHE LINES IN TRANSACTIONAL MEMORY COMPUTING ENVIRONMENTS - Cache lines in a computing environment with transactional memory are configurable with a coherency mode. Cache lines in full-line coherency mode are operated or managed with full-line granularity. Cache lines in sub-line coherency mode are operated or managed as sub-cache line portions of a full cache line. When a transaction accessing a cache line in full-line coherency mode results in a transactional abort, the cache line may be placed in sub-line coherency mode if the cache line is a high-conflict cache line. The cache line may be associated with a counter in a conflict address detection table that is incremented whenever a transaction conflict is detected for the cache line. The cache line may be a high-conflict cache line when the counter satisfies a high-conflict criterion, such as reaching a threshold value. The cache line may be returned to full-line coherency mode when a reset criterion is satisfied. | 03-26-2015 |
20150089153 | IDENTIFYING HIGH-CONFLICT CACHE LINES IN TRANSACTIONAL MEMORY COMPUTING ENVIRONMENTS - Cache lines in a computing environment with transactional memory are configurable with a coherency mode and are associated with a high-conflict indicator. Cache lines in full-line coherency mode are operated or managed with full-line granularity. Cache lines in sub-line coherency mode are operated or managed as sub-cache line portions of a full cache line. A cache line is placed in sub-line coherency mode based on examining the high-conflict indicator. A transaction accessing a memory address in a cache line in sub-line coherency mode marks only the sub-cache line portion associated with the memory address as transactionally accessed. The high-conflict indicator may be included in a set of descriptive bits associated with the cache line. A copy of the high-conflict indicator for a cache line in a first cache may be updated with the high-conflict indicator for the cache line in a second cache. | 03-26-2015 |
20150089154 | MANAGING HIGH-COHERENCE-MISS CACHE LINES IN MULTI-PROCESSOR COMPUTING ENVIRONMENTS - Cache lines in a multi-processor computing environment are configurable with a coherency mode. Cache lines in full-line coherency mode are operated or managed with full-line granularity. Cache lines in sub-line coherency mode are operated or managed as sub-cache line portions of a full cache line. A high-coherence-miss cache line may be placed in sub-line coherency mode. A cache line may be associated with a counter in a coherence miss detection table that is incremented whenever an access of the cache line results in a coherence request. The cache line may be a high-coherence-miss cache line when the counter satisfies a high-coherence-miss criterion, such as reaching a threshold value. The cache line may be returned to full-line coherency mode when a reset criterion is satisfied. | 03-26-2015 |
20150089155 | CENTRALIZED MANAGEMENT OF HIGH-CONTENTION CACHE LINES IN MULTI-PROCESSOR COMPUTING ENVIRONMENTS - Cache lines in a multi-processor computing environment are configurable with a coherency mode. Cache lines in full-line coherency mode are operated or managed with full-line granularity. Cache lines in sub-line coherency mode are operated or managed as sub-cache line portions of a full cache line. Communications detected on a coherence interconnect may indicate that a cache line is associated with performance-reducing events. A high-contention cache line may be placed in sub-line coherency mode. Caches accessing the cache line are notified that the cache line is in sub-line coherency mode. The cache line may be associated with a counter in a centralized detection table that is incremented based on detecting the communications. The cache line may be a high-contention cache line when the counter satisfies a high-contention criterion, such as reaching a threshold value. The cache line may be returned to full-line coherency mode when a reset criterion is satisfied. | 03-26-2015 |
20150089159 | MULTI-GRANULAR CACHE MANAGEMENT IN MULTI-PROCESSOR COMPUTING ENVIRONMENTS - Cache lines in a multi-processor computing environment are configurable with a coherency mode. Cache lines in full-line coherency mode are operated or managed with full-line granularity. Cache lines in sub-line coherency mode are operated or managed as sub-cache line portions of a full cache line. Each cache is associated with a directory having a number of directory entries and with a side table having a smaller number of entries. The directory entry for a cache line associates the cache line with a tag and a set of full-line descriptive bits. Creating a side table entry for the cache line places the cache line in sub-line coherency mode. The side table entry associates each of the sub-cache line portions of the cache line with a set of sub-line descriptive bits. Removing the side table entry may return the cache line to full-line coherency mode. | 03-26-2015 |
20150089193 | PREDICTIVE FETCHING AND DECODING FOR SELECTED RETURN INSTRUCTIONS - Predictive fetching and decoding for selected instructions. A determination is made as to whether an instruction to be executed in a pipelined processor is a selected return instruction, the pipelined processor having a plurality of stages including an execute stage. Based on the instruction being the selected return instruction, a predicted return address is obtained from a data structure, the predicted return address being an address of an instruction to which it is predicted that processing is to be returned. Additionally, based on the instruction being the selected return instruction, an operating state for the instruction at the predicted return address is predicted. The instruction is fetched at the predicted return address, prior to the selected return instruction reaching the execute stage, and decoding of the fetched instruction is initiated based on the predicted operating state. | 03-26-2015 |
20150089194 | PREDICTIVE FETCHING AND DECODING FOR SELECTED INSTRUCTIONS - Predictive fetching and decoding for selected instructions (e.g., operating system instructions, hypervisor instructions or other such instructions). A determination is made that a selected instruction, such as a system call instruction, an asynchronous interrupt, a return from system call instruction or return from asynchronous interrupt, is to be executed. Based on determining that such an instruction is to be executed, a predicted address is determined for the selected instruction, which is the address to which processing transfers in order to provide the requested services. Then, fetching of instructions beginning at the predicted address prior to execution of the selected instruction is commenced. Further, speculative state relating to a selected instruction, including, for instance, an indication of the privilege level of the selected instruction or instructions executed on behalf of the selected instruction, is predicted and maintained. | 03-26-2015 |
20150089208 | PREDICTOR DATA STRUCTURE FOR USE IN PIPELINED PROCESSING - A predictor data structure is used for pipelined processing by a pipelined processor. The predictor data structure includes a predicted address to be used in return from execution of a selected instruction, and a predicted operating state associated with the predicted address. Based on determining a selected return instruction is to be executed, the predicted address to which processing is to be returned is obtained from the predictor data structure. Further, based on determining the selected return instruction is to be executed, a transitional operating state to be entered based on the predicted operating state stored in the predictor data structure is predicted, wherein at least one of the predicted address and the predicted transitional operating state are to be used to validate execution of the selected return instruction. | 03-26-2015 |