| Patent application number | Description | Published |
| 20090013133 | CACHE LINE MARKING WITH SHARED TIMESTAMPS - Embodiments of the present invention provide a system that marks cache lines using shared timestamps. During operation, the system starts a transaction for a thread, wherein starting the transaction involves recording the value of an active timestamp and incrementing a transaction or overflow counter (TO_counter) corresponding to the recorded value. The system then places load-marks on cache lines which are loaded during the transaction. While placing the load-marks, the system writes the recorded value into metadata corresponding to the cache lines. Upon completing the transaction for the thread, the system decrements the TO_counter corresponding to the recorded value and resumes non-transactional execution for the thread without removing the load-marks from cache lines which were load-marked during the transaction. | 01-08-2009 |
| 20090019231 | Method and Apparatus for Implementing Virtual Transactional Memory Using Cache Line Marking - Embodiments of the present invention implement virtual transactional memory using cache line marking. The system starts by executing a starvation-avoiding transaction for a thread. While executing the starvation-avoiding transaction, the system places starvation-avoiding load-marks on cache lines which are loaded from and places starvation-avoiding store-marks on cache lines which are stored to. Next, while swapping a page out of a memory and to a disk during the starvation-avoiding transaction, the system determines if one or more cache lines in the page have a starvation-avoiding load-mark or a starvation-avoiding store-mark. If so, upon swapping the page into the memory from the disk, the system places a starvation-avoiding load-mark on each cache line that had a starvation-avoiding load-mark and places a starvation-avoiding store-mark on each cache line that had a starvation-avoiding store-mark. | 01-15-2009 |
| 20090019272 | STORE QUEUE ARCHITECTURE FOR A PROCESSOR THAT SUPPORTS SPECULATIVE EXECUTION - Embodiments of the present invention provide a system that buffers stores on a processor that supports speculative execution. The system starts by buffering a store into an entry in the store queue during a speculative execution mode. If an entry for the store does not already exist in the store queue, the system writes the store into an available entry in the store queue and updates a byte mask for the entry. Otherwise, if an entry for the store already exists in the store queue, the system merges the store into the existing entry in the store queue and updates the byte mask for the entry to include information about the newly merged store. The system then forwards the data from the store queue to subsequent dependent loads. | 01-15-2009 |
| 20090113131 | METHOD AND APPARATUS FOR TRACKING LOAD-MARKS AND STORE-MARKS ON CACHE LINES - Embodiments of the present invention provide a system that handles load-marked and store-marked cache lines. Upon asserting a load-mark or a store-mark for a cache line during a given phase of operation, the system adds an entry to a private buffer and in doing so uses an address of the cache line as a key for the entry in the private buffer. The system also updates the entry in the private buffer with information about the load-mark or store-mark and uses pointers for the entry and for the last entry added to the private buffer to add the entry to a sequence of private buffer entries placed during the phase of operation. The system then uses the entries in the private buffer to remove the load-marks and store-marks from cache lines when the phase of operation is completed. | 04-30-2009 |
| 20090119461 | MAINTAINING CACHE COHERENCE USING LOAD-MARK METADATA - Embodiments of the present invention provide a system that maintains load-marks on cache lines. The system includes: (1) a cache which accommodates a set of cache lines, wherein each cache line includes metadata for load-marking the cache line, and (2) a local cache controller for the cache. Upon determining that a remote cache controller has made a request for a cache line that would cause the local cache controller to invalidate a copy of the cache line in the cache, the local cache controller determines if there is a load-mark in the metadata for the copy of the cache line. If not, the local cache controller invalidates the copy of the cache line. Otherwise, the local cache controller signals a denial of the invalidation of the cache line and retains the copy of the cache line and the load-mark in the metadata for the copy of the cache line. | 05-07-2009 |
| 20090265533 | Branch Prediction Mechanisms Using Multiple Hash Functions - In one embodiment, the branch prediction mechanism includes a first storage including a first plurality of locations for storing a first set of partial prediction information. The branch prediction mechanism also includes a second storage including a second plurality of locations for storing a second set of partial prediction information. Further, the branch prediction mechanism includes a control unit that performs a first hash function on input branch information to generate a first index for accessing a selected location within the first storage. The control unit also performs a second hash function on the input branch information to generate a second index for accessing a selected location within the second storage. Lastly, the control unit further provides a prediction value based on corresponding partial prediction information in the selected locations of the first and the second storages. | 10-22-2009 |
| 20090292968 | Hard Component Failure Detection and Correction - In one embodiment, a memory controller comprises a check bit encoder circuit coupled to receive a data block to be written to memory, a check/correct circuit coupled to receive an encoded data block read from the memory, and a hard failure detection circuit coupled to the check/correct circuit. The check bit encoder circuit is configured to generate a corresponding encoded data block comprising the data block, a first plurality of check bits, and a second plurality of check bits. The check/correct circuit is configured to detect an error in the encoded data block responsive to the first check bits, the second check bits, and the data block within the encoded data block, which is logically arranged as an array of R rows and N columns, wherein R and N are positive integers. Each of the first check bits covers a respective row of the array, and the check/correct circuit is configured to generate a first syndrome corresponding to the first plurality of check bits. A presence of more than one binary one in the first syndrome indicates a multi-bit error. Responsive to detecting the multi-bit error, the hard failure detection circuit is configured to perform a plurality of memory read/write operations to the memory locations in which the encoded data block is stored to identify a hard error failure in the memory. | 11-26-2009 |
| 20100023701 | CACHE LINE DUPLICATION IN RESPONSE TO A WAY PREDICTION CONFLICT - Embodiments of the present invention provide a system that handles way mispredictions in a multi-way cache. The system starts by receiving requests to access cache lines in the multi-way cache. For each request, the system makes a prediction of a way in which the cache line resides based on a corresponding entry in the way prediction table. The system then checks for the presence of the cache line in the predicted way. Upon determining that the cache line is not present in the predicted way, but is present in a different way, and hence the way was mispredicted, the system increments a corresponding record in a conflict detection table. Upon detecting that a record in the conflict detection table indicates that a number of mispredictions equals a predetermined value, the system copies the corresponding cache line from the way where the cache line actually resides into the predicted way. | 01-28-2010 |
| 20100106912 | COHERENCE PROTOCOL WITH DYNAMIC PRIVATIZATION - Embodiments of the present invention provide a system that maintains coherence between cache lines in a computer system by using dynamic privatization. During operation, the system starts by receiving a request for a read-only copy of a cache line from a processor. The system then determines if the processor has privately requested the cache line a predetermined number of times. If so, the system provides a copy of the cache line to the processor in an exclusive state. Otherwise, the system provides a copy of the cache line to the processor in a shared state. | 04-29-2010 |
| 20100122032 | SELECTIVELY PERFORMING LOOKUPS FOR CACHE LINES - Embodiments of the present invention provide a system that selectively performs lookups for cache lines. During operation, the system by maintains a lower-level cache and a higher-level cache in accordance with a set of rules that dictate conditions under which cache lines are held in the lower-level cache and the higher-level cache. The system next performs a lookup for cache line A in the lower level cache. The system then discovers that the lookup for cache line A missed in the lower-level cache, but that cache line B is present in the lower-level cache. Next, in accordance with the set of rules, the system determines, without performing a lookup for cache line A in the higher-level cache, that cache line A is guaranteed not to be present and valid in the higher-level cache because cache line B is present in the lower-level cache. | 05-13-2010 |
| 20100125707 | DEADLOCK AVOIDANCE DURING STORE-MARK ACQUISITION - Some embodiments of the present invention provide a system that avoids deadlock while attempting to acquire store-marks on cache lines. During operation, the system keeps track of store-mark requests that arise during execution of a thread, wherein a store-mark on a cache line indicates that one or more associated store buffer entries are waiting to be committed to the cache line. In this system, store-mark requests are processed in a pipelined manner, which allows a store-mark request to be initiated before preceding store-mark requests for the same thread complete. Next, if a store-mark request fails, within a bounded amount of time, the system removes or prevents store-marks associated with younger store-mark requests for the same thread, thereby avoiding a potential deadlock that can arise when one or more other threads attempt to store-mark the same cache lines. | 05-20-2010 |
| 20100153655 | STORE QUEUE WITH STORE-MERGING AND FORWARD-PROGRESS GUARANTEES - Some embodiments of the present invention provide a system that performs stores in a memory system. During operation, the system performs a store for a first thread, which involves creating an entry for the store in a store queue for the first thread. It also involves attempting to store-mark a corresponding cache line for the first thread by sending a store-mark request for the first thread to the memory system, wherein a store-mark on the cache line indicates that one or more store queue entries are waiting to be committed to the cache line. If the attempt to store-mark the cache line fails because a second thread holds a store-mark on the cache line, and if obtaining the store-mark will ensure forward progress for the first thread, the system forces the second thread to release the store-mark, so the first thread can acquire a store-mark for the cache line. | 06-17-2010 |
| 20100180084 | CACHE-COHERENCY PROTOCOL WITH HELD STATE - A new “held” (“H”) cache-coherency state is introduced for directory-based multiprocessor systems. Using the held state enables embodiments of the present invention to track sharers that have a shared copy of a cache line after a directory runs out of space for holding information that identifies processors that have received shared copies of the cache line (e.g., pointers to sharers of the cache line). In these embodiments, when a directory entry is full, the system provides subsequent shared copies of the cache line to sharers in the held state and tracks the identity of the held-copy owners in a data field in the entry for the cache line in a home node. | 07-15-2010 |
| 20100199048 | SPECULATIVE WRITESTREAM TRANSACTION - Embodiments of the present invention provide a system that performs a speculative writestream transaction. The system starts by receiving, at a home node, a writestream ordered (WSO) request to start a WSO transaction from a processing subsystem. The WSO request identifies a cache line to be written during the WSO transaction. The system then sends an acknowledge signal to the processing subsystem to enable the processing subsystem to proceed with the WSO transaction. During the WSO transaction, the system receives a second WSO request to start a WSO transaction. The second WSO request identifies the same cache line as to be written during the subsequent WSO transaction. In response to receiving the second WSO request, the system sends an abort signal to cause the processing subsystem to abort the WSO transaction. | 08-05-2010 |
| 20100205609 | USING TIME STAMPS TO FACILITATE LOAD REORDERING - Some embodiments of the present invention provide a system that supports load reordering in a processor. The system maintains at least one counter value for each thread which is used to assign time stamps for the thread. While performing a load for the thread, the system reads a time stamp from a cache line to which the load is directed. Next, if the counter value is equal to the time stamp, the system performs the load. Otherwise, if the counter value is greater-than the time stamp, the system performs the load and increases the time stamp to be greater-than-or-equal-to the counter. Finally, if the load is a speculative load, which is speculatively performed earlier than an older load in program order, and the counter value is less-than the time stamp, the system fails speculative execution for the thread. | 08-12-2010 |
| 20100241814 | BANDWIDTH-EFFICIENT DIRECTORY-BASED COHERENCE PROTOCOL - Some embodiments of the present invention provide a system that processes a request for a cache line in a multiprocessor system that supports a directory-based cache-coherence scheme. During operation, the system receives the request for the cache line from a requesting node at a home node, wherein the home node maintains directory information for all or a subset of the address space which includes the cache line. Next, the system performs an action at the home node, which causes a valid copy of the cache line to be sent to the requesting node. The system then completes processing of the request at the home node without waiting for an acknowledgment indicating that the requesting node received the valid copy of the cache line. | 09-23-2010 |
| 20100325374 | DYNAMICALLY CONFIGURING MEMORY INTERLEAVING FOR LOCALITY AND PERFORMANCE ISOLATION - Embodiments of the present invention provide a system that dynamically reconfigures memory. During operation, the system determines that a virtual memory page is to be reconfigured from an original virtual-address-to-physical-address mapping to a new virtual-address-to-physical-address mapping. The system then determines a new real address mapping for a set of virtual addresses in the virtual memory page by selecting a range of real addresses for the virtual addresses that are arranged according to the new virtual-address-to-physical-address mapping. Next, the system temporarily disables accesses to the virtual memory page. Then, the system copies data from real address locations indicated by the original virtual-address-to-physical-address mapping to real address locations indicated by the new virtual-address-to-physical-address mapping. Next, the system updates the real-address-to-physical-address mapping for the page, and re-enables accesses to the virtual memory page. | 12-23-2010 |
| 20100332471 | Bloom Bounders for Improved Computer System Performance - A system and method for space and time efficient bound calculation is disclosed. The method comprises inserting a plurality of key/value pairs into a “Bloom bounder”, each key/value pair comprising a key and a value. For each pair, the inserting includes calculating a plurality of hash values, each calculated by applying a different one of a plurality of hash functions to the key, and selectively updating one or more data arrays based on the plurality of hash values and the value received key/value pair. A bound may then be determined for a given query key by analyzing information in the one or more data arrays to determine a bound value, such that for every received key/value pair with a key matching the query key, the corresponding value is less than or equal to the bound value. | 12-30-2010 |
| 20100332765 | HIERARCHICAL BLOOM FILTERS FOR FACILITATING CONCURRENCY CONTROL - Some embodiments provide a system that facilitates concurrency control in a computer system. During operation, the system generates a set of signatures associated with memory accesses in the computer system. To generate the signatures, the system creates a set of hierarchical Bloom filters (HBFs) corresponding to the signatures, and populates the HBFs using addresses associated with the memory accesses. Next, the system compares the HBFs to detect a potential conflict associated with the memory accesses. Finally, the system manages concurrent execution in the computer system based on the detected potential conflict. | 12-30-2010 |
| 20100332766 | SUPPORTING EFFICIENT SPIN-LOCKS AND OTHER TYPES OF SYNCHRONIZATION IN A CACHE-COHERENT MULTIPROCESSOR SYSTEM - Some embodiments of the present invention provide a system that acquires a lock in a shared memory multiprocessor system. During operation, the system loads the lock into a cache associated with the thread and then reads a value of the lock. If the value indicates that the lock is currently held by another thread, the system periodically executes an instruction that tests a status of the lock. If the status indicates the lock is valid, the system continues to test the status of the lock. Otherwise, if the status indicates that the lock was invalidated by a store, the system attempts to acquire the lock by executing an atomic operation. On the other hand, if the status indicates that the lock was invalidated by an atomic operation, or that the lock is not present in the cache, the system repeats the loading and reading operations. | 12-30-2010 |
| 20100332944 | FACILITATING ERROR DETECTION AND CORRECTION AFTER A MEMORY COMPONENT FAILURE - Some embodiments of the present invention provide a system that can be reconfigured to provide error detection and correction after a failure of a memory component in a memory system. During operation, the system accesses a block of data from the memory system, wherein each block of data in the memory system includes an array of bits logically organized into R rows and C columns, including two checkbit columns containing checkbits, and C-2 data-bit columns containing data bits, wherein each column is stored in a different memory component, and wherein the checkbits are generated from the data bits to provide block-level detection and correction for a failed memory component. Next, upon examining the block of data, the system determines that a specific memory component in the memory system has failed. If the failed memory component contains a data-bit column for the block of data, the system uses checkbits from the two checkbit columns to correct the data-bit column, and then stores the corrected data-bit column. Next, the system generates and stores new checkbits in a functioning memory component, wherein the new checkbits provide single-error-correction and double-error-detection for erroneous bits in the block of data, but do not provide for detection and correction of a failed memory component. | 12-30-2010 |
| 20100332945 | FACILITATING PROBABILISTIC ERROR DETECTION AND CORRECTION AFTER A MEMORY COMPONENT FAILURE - Some embodiments of the present invention provide a system that provides error detection and correction after a failure of a memory component in a memory system. During operation, the system accesses a block of data from the memory system, wherein the memory system is previously determined to have a specific failed memory component. Each block of data in the memory system includes an array of bits logically organized into R rows and C columns, including a row checkbit column including row-checkbits for each of the R rows, an inner checkbit column including X| 12-30-2010 | |
| 20100333108 | PARALLELIZING LOOPS WITH READ-AFTER-WRITE DEPENDENCIES - Some embodiments provide a system that increases parallelization in a computer program. During operation, the system obtains a binary associative operator and a ordered set of elements associated with a prefix operation in the computer program. Next, the system divides the elements into multiple sets of contiguous iterations based on a number of processors used to execute the computer program. The system then performs, in parallel on the processors, a set of local reductions on the contiguous iterations using the binary associative operator. Afterwards, the system calculates a set of boundary prefixes between the contiguous iterations using the local reductions. Finally, the system applies, in parallel on the processors, the boundary prefixes to the contiguous iterations using the binary associative operator to obtain a set of prefixes for the prefix operation. | 12-30-2010 |
| 20110035561 | STORE QUEUE WITH TOKEN TO FACILITATE EFFICIENT THREAD SYNCHRONIZATION - Some embodiments of the present invention provide a system for operating a store queue, wherein the store queue buffers stores that are waiting to be committed to a memory system in a processor. During operation, the system examines an entry at the head of the store queue. If the entry contains a membar token, the system examines an unacknowledged counter that keeps track of the number of store operations that have been sent from the store queue to the memory system but have not been acknowledged as being committed to the memory system. If the unacknowledged counter is non-zero, the system waits until the unacknowledged counter equals zero, and then removes the membar token from the store queue. | 02-10-2011 |
| 20110047346 | EFFICIENT INTERLEAVING BETWEEN A NON-POWER-OF-TWO NUMBER OF ENTITIES - Some embodiments of the present invention provide a system that maps an address to an entity, wherein the mapping interleaves addresses between a number of entities. During operation, the system receives an address A from a set of X consecutive addresses, wherein the address A is to be mapped to an entity E in a set of Y entities, and wherein Y need not be a power of two. Next, the system obtains F=floor(log | 02-24-2011 |