Patent application number | Description | Published |
20120198122 | GUEST TO NATIVE BLOCK ADDRESS MAPPINGS AND MANAGEMENT OF NATIVE CODE STORAGE - A method for managing mappings of storage on a code cache for a processor. The method includes storing a plurality of guest address to native address mappings as entries in a conversion look aside buffer, wherein the entries indicate guest addresses that have corresponding converted native addresses stored within a code cache memory, and receiving a subsequent request for a guest address at the conversion look aside buffer. The conversion look aside buffer is indexed to determine whether there exists an entry that corresponds to the index, wherein the index comprises a tag and an offset that is used to identify the entry that corresponds to the index. Upon a hit on the tag, the corresponding entry is accessed to retrieve a pointer to the code cache memory corresponding block of converted native instructions. The corresponding block of converted native instructions are fetched from the code cache memory for execution. | 08-02-2012 |
20120198157 | GUEST INSTRUCTION TO NATIVE INSTRUCTION RANGE BASED MAPPING USING A CONVERSION LOOK ASIDE BUFFER OF A PROCESSOR - A method for translating instructions for a processor. The method includes accessing a plurality of guest instructions that comprise multiple guest branch instructions, and assembling the plurality of guest instructions into a guest instruction block. The guest instruction block is converted into a corresponding native conversion block. The native conversion block is stored into a native cache. A mapping of the guest instruction block to corresponding native conversion block is stored in a conversion look aside buffer. Upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates whether the guest instruction has a corresponding converted native instruction in the native cache. The converted native instruction is forwarded for execution in response to the hit. | 08-02-2012 |
20120198168 | VARIABLE CACHING STRUCTURE FOR MANAGING PHYSICAL STORAGE - A method for managing a variable caching structure for managing storage for a processor. The method includes using a multi-way tag array to store a plurality of pointers for a corresponding plurality of different size groups of physical storage of a storage stack, wherein the pointers indicate guest addresses that have corresponding converted native addresses stored within the storage stack, and allocating a group of storage blocks of the storage stack, wherein the size of the allocation is in accordance with a corresponding size of one of the plurality of different size groups. Upon a hit on the tag, a corresponding entry is accessed to retrieve a pointer that indicates where in the storage stack a corresponding group of storage blocks of converted native instructions reside. The converted native instructions are then fetched from the storage stack for execution. | 08-02-2012 |
20120198209 | GUEST INSTRUCTION BLOCK WITH NEAR BRANCHING AND FAR BRANCHING SEQUENCE CONSTRUCTION TO NATIVE INSTRUCTION BLOCK - A method for translating instructions for a processor. The method includes accessing a plurality of guest instructions that comprise multiple guest branch instructions comprising at least one guest far branch, and building an instruction sequence from the plurality of guest instructions by using branch prediction on the at least one guest far branch. The method further includes assembling a guest instruction block from the instruction sequence. The guest instruction block is translated to a corresponding native conversion block, wherein an at least one native far branch that corresponds to the at least one guest far branch and wherein the at least one native far branch includes an opposite guest address for an opposing branch path of the at least one guest far branch. Upon encountering a missprediction, a correct instruction sequence is obtained by accessing the opposite guest address. | 08-02-2012 |
20120246448 | MEMORY FRAGMENTS FOR SUPPORTING CODE BLOCK EXECUTION BY USING VIRTUAL CORES INSTANTIATED BY PARTITIONABLE ENGINES - A system for executing instructions using a plurality of memory fragments for a processor. The system includes a global front end scheduler for receiving an incoming instruction sequence, wherein the global front end scheduler partitions the incoming instruction sequence into a plurality of code blocks of instructions and generates a plurality of inheritance vectors describing interdependencies between instructions of the code blocks. The system further includes a plurality of virtual cores of the processor coupled to receive code blocks allocated by the global front end scheduler, wherein each virtual core comprises a respective subset of resources of a plurality of partitionable engines, wherein the code blocks are executed by using the partitionable engines in accordance with a virtual core mode and in accordance with the respective inheritance vectors. A plurality memory fragments are coupled to the partitionable engines for providing data storage. | 09-27-2012 |
20120246450 | REGISTER FILE SEGMENTS FOR SUPPORTING CODE BLOCK EXECUTION BY USING VIRTUAL CORES INSTANTIATED BY PARTITIONABLE ENGINES - A system for executing instructions using a plurality of register file segments for a processor. The system includes a global front end scheduler for receiving an incoming instruction sequence, wherein the global front end scheduler partitions the incoming instruction sequence into a plurality of code blocks of instructions and generates a plurality of inheritance vectors describing interdependencies between instructions of the code blocks. The system further includes a plurality of virtual cores of the processor coupled to receive code blocks allocated by the global front end scheduler, wherein each virtual core comprises a respective subset of resources of a plurality of partitionable engines, wherein the code blocks are executed by using the partitionable engines in accordance with a virtual core mode and in accordance with the respective inheritance vectors. A plurality register file segments are coupled to the partitionable engines for providing data storage. | 09-27-2012 |
20120246657 | EXECUTING INSTRUCTION SEQUENCE CODE BLOCKS BY USING VIRTUAL CORES INSTANTIATED BY PARTITIONABLE ENGINES - A method for executing instructions using a plurality of virtual cores for a processor. The method includes receiving an incoming instruction sequence using a global front end scheduler, and partitioning the incoming instruction sequence into a plurality of code blocks of instructions. The method further includes generating a plurality of inheritance vectors describing interdependencies between instructions of the code blocks, and allocating the code blocks to a plurality of virtual cores of the processor, wherein each virtual core comprises a respective subset of resources of a plurality of partitionable engines. The code blocks are executed by using the partitionable engines in accordance with a virtual core mode and in accordance with the respective inheritance vectors. | 09-27-2012 |
20120297170 | DECENTRALIZED ALLOCATION OF RESOURCES AND INTERCONNNECT STRUCTURES TO SUPPORT THE EXECUTION OF INSTRUCTION SEQUENCES BY A PLURALITY OF ENGINES - A method for decentralized resource allocation in an integrated circuit. The method includes receiving a plurality of requests from a plurality of resource consumers of a plurality of partitionable engines to access a plurality resources, wherein the resources are spread across the plurality of engines and are accessed via a global interconnect structure. At each resource, a number of requests for access to said each resource are added. At said each resource, the number of requests are compared against a threshold limiter. At said each resource, a subsequent request that is received that exceeds the threshold limiter is canceled. Subsequently, requests that are not canceled within a current clock cycle are implemented. | 11-22-2012 |
20120297396 | INTERCONNECT STRUCTURE TO SUPPORT THE EXECUTION OF INSTRUCTION SEQUENCES BY A PLURALITY OF ENGINES - A global interconnect system. The global interconnect system includes a plurality of resources having data for supporting the execution of multiple code sequences and a plurality of engines for implementing the execution of the multiple code sequences. A plurality of resource consumers are within each of the plurality of engines. A global interconnect structure is coupled to the plurality of resource consumers and coupled to the plurality of resources to enable data access and execution of the multiple code sequences, wherein the resource consumers access the resources through a per cycle utilization of the global interconnect structure. | 11-22-2012 |
20130024619 | MULTILEVEL CONVERSION TABLE CACHE FOR TRANSLATING GUEST INSTRUCTIONS TO NATIVE INSTRUCTIONS - A method for translating instructions for a processor. The method includes accessing a guest instruction and performing a first level translation of the guest instruction using a first level conversion table. The method further includes outputting a resulting native instruction when the first level translation proceeds to completion. A second level translation of the guest instruction is performed using a second level conversion table when the first level translation does not proceed to completion, wherein the second level translation further processes the guest instruction based upon a partial translation from the first level conversion table. The resulting native instruction is output when the second level translation proceeds to completion. | 01-24-2013 |
20130024661 | HARDWARE ACCELERATION COMPONENTS FOR TRANSLATING GUEST INSTRUCTIONS TO NATIVE INSTRUCTIONS - A hardware based translation accelerator. The hardware includes a guest fetch logic component for accessing guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling guest instructions into a guest instruction block; and conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block. The hardware further includes a native cache coupled to the conversion tables for storing the corresponding native conversion block, and a conversion look aside buffer coupled to the native cache for storing a mapping of the guest instruction block to corresponding native conversion block, wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache. | 01-24-2013 |
20130238874 | SYSTEMS AND METHODS FOR ACCESSING A UNIFIED TRANSLATION LOOKASIDE BUFFER - Systems and methods for accessing a unified translation lookaside buffer (TLB) are disclosed. A method includes receiving an indicator of a level one translation lookaside buffer (L | 09-12-2013 |
20130311759 | INSTRUCTION SEQUENCE BUFFER TO ENHANCE BRANCH PREDICTION EFFICIENCY - A method for outputting alternative instruction sequences. The method includes tracking repetitive hits to determine a set of frequently hit instruction sequences for a microprocessor. A frequently miss-predicted branch instruction is identified, wherein the predicted outcome of the branch instruction is frequently wrong. An alternative instruction sequence for the branch instruction target is stored into a buffer. On a subsequent hit to the branch instruction where the predicted outcome of the branch instruction was wrong, the alternative instruction sequence is output from the buffer. | 11-21-2013 |
20140032844 | SYSTEMS AND METHODS FOR FLUSHING A CACHE WITH MODIFIED DATA - Systems and methods for flushing a cache with modified data are disclosed. Responsive to a request to flush data from a cache with modified data to a next level cache that does not include the cache with modified data, the cache with modified data is accessed using an index and a way and an address associated with the index and the way is secured. Using the address, the cache with modified data is accessed a second time and an entry that is associated with the address is retrieved from the cache with modified data. The entry is placed into a location of the next level cache. | 01-30-2014 |
20140032845 | SYSTEMS AND METHODS FOR SUPPORTING A PLURALITY OF LOAD ACCESSES OF A CACHE IN A SINGLE CYCLE - A method for supporting a plurality of load accesses is disclosed. A plurality of requests to access a data cache is accessed, and in response, a tag memory is accessed that maintains a plurality of copies of tags for each entry in the data cache. Tags are identified that correspond to individual requests. The data cache is accessed based on the tags that correspond to the individual requests. A plurality of requests to access the same block of the plurality of blocks causes an access arbitration that is executed in the same clock cycle as is the access of the tag memory. | 01-30-2014 |
20140032846 | SYSTEMS AND METHODS FOR SUPPORTING A PLURALITY OF LOAD AND STORE ACCESSES OF A CACHE - Systems and methods for supporting a plurality of load and store accesses of a cache are disclosed. Responsive to a request of a plurality of requests to access a block of a plurality of blocks of a load cache, the block of the load cache and a logically and physically paired block of a store coalescing cache are accessed in parallel. The data that is accessed from the block of the load cache is overwritten by the data that is accessed from the block of the store coalescing cache by merging on a per byte basis. Access is provided to the merged data. | 01-30-2014 |
20140032856 | SYSTEMS AND METHODS FOR MAINTAINING THE COHERENCY OF A STORE COALESCING CACHE AND A LOAD CACHE - A method for maintaining the coherency of a store coalescing cache and a load cache is disclosed. As a part of the method, responsive to a write-back of an entry from a level one store coalescing cache to a level two cache, the entry is written into the level two cache and into the level one load cache. The writing of the entry into the level two cache and into the level one load cache is executed at the speed of access of the level two cache. | 01-30-2014 |
20140075168 | INSTRUCTION SEQUENCE BUFFER TO STORE BRANCHES HAVING RELIABLY PREDICTABLE INSTRUCTION SEQUENCES - A method for outputting reliably predictable instruction sequences. The method includes tracking repetitive hits to determine a set of frequently hit instruction sequences for a microprocessor, and out of that set, identifying a branch instruction having a series of subsequent frequently executed branch instructions that form a reliably predictable instruction sequence. The reliably predictable instruction sequence is stored into a buffer. On a subsequent hit to the branch instruction, the reliably predictable instruction sequence is output from the buffer. | 03-13-2014 |
20140108729 | SYSTEMS AND METHODS FOR LOAD CANCELING IN A PROCESSOR THAT IS CONNECTED TO AN EXTERNAL INTERCONNECT FABRIC - Systems and methods for load canceling in a processor that is connected to an external interconnect fabric are disclosed. As a part of a method for load canceling in a processor that is connected to an external bus, and responsive to a flush request and a corresponding cancellation of pending speculative loads from a load queue, a type of one or more of the pending speculative loads that are positioned in the instruction pipeline external to the processor, is converted from load to prefetch. Data corresponding to one or more of the pending speculative loads that are positioned in the instruction pipeline external to the processor is accessed and returned to cache as prefetch data. The prefetch data is retired in a cache location of the processor. | 04-17-2014 |
20140108730 | SYSTEMS AND METHODS FOR NON-BLOCKING IMPLEMENTATION OF CACHE FLUSH INSTRUCTIONS - Systems and methods for non-blocking implementation of cache flush instructions are disclosed. As a part of a method, data is accessed that is received in a write-back data holding buffer from a cache flushing operation, the data is flagged with a processor identifier and a serialization flag, and responsive to the flagging, the cache is notified that the cache flush is completed. Subsequent to the notifying, access is provided to data then present in the write-back data holding buffer to determine if data then present in the write-back data holding buffer is flagged. | 04-17-2014 |
20140108739 | SYSTEMS AND METHODS FOR IMPLEMENTING WEAK STREAM SOFTEARE DATA AND INSTRUCTION PREFETCHING USING A HARDWARE DATA PREFETCHER - A method for weak stream software data and instruction prefetching using a hardware data prefetcher is disclosed. A method includes, determining if software includes software prefetch instructions, using a hardware data prefetcher, and, accessing the software prefetch instructions if the software includes software prefetch instructions. Using the hardware data prefetcher, weak stream software data and instruction prefetching operations are executed based on the software prefetch instructions, free of training operations. | 04-17-2014 |
20140269753 | METHOD FOR IMPLEMENTING A LINE SPEED INTERCONNECT STRUCTURE - A method for line speed interconnect processing. The method includes receiving initial inputs from an input communications path, performing a pre-sorting of the initial inputs by using a first stage interconnect parallel processor to create intermediate inputs, and performing the final combining and splitting of the intermediate inputs by using a second stage interconnect parallel processor to create resulting outputs. The method further includes transmitting the resulting outputs out of the second stage at line speed. | 09-18-2014 |
20140281242 | METHODS, SYSTEMS AND APPARATUS FOR PREDICTING THE WAY OF A SET ASSOCIATIVE CACHE - A method for predicting a way of a set associative shadow cache is disclosed. As a part of a method, a request to fetch a first far taken branch instruction of a first cache line from an instruction cache is received, and responsive to a hit in the instruction cache, a predicted way is selected from a way array using a way that corresponds to the hit in the instruction cache. A second cache line is selected from a shadow cache using the predicted way and the first cache line and the second cache line are forwarded in the same clock cycle. | 09-18-2014 |
20140281411 | METHOD FOR DEPENDENCY BROADCASTING THROUGH A SOURCE ORGANIZED SOURCE VIEW DATA STRUCTURE - A method for dependency broadcasting through a source organized source view data structure. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating a source organized source view data structure, wherein the source view data structure stores sources corresponding to the instruction blocks as recorded by the plurality of register templates; upon dispatch of one block of the instruction blocks, broadcasting a number belonging to the one block to a row of the source view data structure that relates that block and marking the sources of the row accordingly; and updating the dependency information of remaining instruction blocks in accordance with the broadcast. | 09-18-2014 |
20140281412 | METHOD FOR POPULATING AND INSTRUCTION VIEW DATA STRUCTURE BY USING REGISTER TEMPLATE SNAPSHOTS - A method for populating an instruction view data structure by using register template snapshots. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating and instruction view data structure, wherein the instruction view data structure stores instructions corresponding to the instruction blocks as recorded by the plurality of register templates; and using the instruction view data structure to feed a plurality of stacked execution units of execution stage in accordance with the readiness of instruction sources of the instruction blocks. | 09-18-2014 |
20140281426 | METHOD FOR POPULATING A SOURCE VIEW DATA STRUCTURE BY USING REGISTER TEMPLATE SNAPSHOTS - A method for populating a source view data structure by using register template snapshots. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating a source view data structure, wherein the source view data structure stores sources corresponding to the instruction blocks as recorded by the plurality of register templates; and determining which of the plurality of instruction blocks are ready for dispatch by using the populated source view data structure. | 09-18-2014 |
20140281427 | METHOD FOR IMPLEMENTING A REDUCED SIZE REGISTER VIEW DATA STRUCTURE IN A MICROPROCESSOR - A method for implementing a reduced size register view data structure in a microprocessor. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating a register view data structure, wherein the register view data structure stores destinations corresponding to the instruction blocks as recorded by the plurality of register templates; and using the register view data structure to track a machine state in accordance with the execution of the plurality of instruction blocks, wherein the register view data structure is a reduced size register view data structure by only storing register template snapshots containing branches or by storing deltas between changing register template snapshots. | 09-18-2014 |
20140281428 | METHOD FOR POPULATING REGISTER VIEW DATA STRUCTURE BY USING REGISTER TEMPLATE SNAPSHOTS - A method for populating a register view data structure by using register template snapshots. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating a register view data structure, wherein the register view data structure stores destinations corresponding to the instruction blocks as recorded by the plurality of register templates; and using the register view data structure to track a machine state in accordance with the execution of the plurality of instruction blocks. | 09-18-2014 |
20140281436 | METHOD FOR EMULATING A GUEST CENTRALIZED FLAG ARCHITECTURE BY USING A NATIVE DISTRIBUTED FLAG ARCHITECTURE - A method for emulating a guest centralized flag architecture by using a native distributed flag architecture. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein each of the instruction blocks comprise two half blocks; scheduling the instructions of the instruction block to execute in accordance with a scheduler; and using a distributed flag architecture to emulate a centralized flag architecture for the emulation of guest instruction execution. | 09-18-2014 |
20140281438 | METHOD FOR A DELAYED BRANCH IMPLEMENTATION BY USING A FRONT END TRACK TABLE - A method for a delayed branch implementation by using a front end track table. The method includes receiving an incoming instruction sequence using a global front end, wherein the instruction sequence includes at least one branch, creating a delayed branch in response to receiving the one branch, and using a front end track table to track both the delayed branch the one branch. | 09-18-2014 |
20140282546 | METHODS, SYSTEMS AND APPARATUS FOR SUPPORTING WIDE AND EFFICIENT FRONT-END OPERATION WITH GUEST-ARCHITECTURE EMULATION - Methods for supporting wide and efficient front-end operation with guest architecture emulation are disclosed. As a part of a method for supporting wide and efficient front-end operation, upon receiving a request to fetch a first far taken branch instruction, a cache line that includes the first far taken branch instruction, a next cache line and a cache line located at the target of the first far taken branch instruction is read. Based on information that is accessed from a data table, the cache line and either the next cache line or the cache line located at the target is fetched in a single cycle. | 09-18-2014 |
20140282592 | METHOD FOR EXECUTING MULTITHREADED INSTRUCTIONS GROUPED INTO BLOCKS - A method for executing multithreaded instructions grouped into blocks. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein the instructions of the instruction blocks are interleaved with multiple threads; scheduling the instructions of the instruction block to execute in accordance with the multiple threads; and tracking execution of the multiple threads to enforce fairness in an execution pipeline. | 09-18-2014 |
20140282601 | METHOD FOR DEPENDENCY BROADCASTING THROUGH A BLOCK ORGANIZED SOURCE VIEW DATA STRUCTURE - A method for dependency broadcasting through a block organized source view data structure. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; populating a block organized source view data structure, wherein the source view data structure stores sources corresponding to the instruction blocks as recorded by the plurality of register templates; upon dispatch of one block of the instruction blocks, broadcasting a number belonging to the one block to a column of the source view data structure that relates that block and marking the column accordingly; and updating the dependency information of remaining instruction blocks in accordance with the broadcast. | 09-18-2014 |
20140317387 | METHOD FOR PERFORMING DUAL DISPATCH OF BLOCKS AND HALF BLOCKS - A method for executing dual dispatch of blocks and half blocks. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein each of the instruction blocks comprise two half blocks; scheduling the instructions of the instruction block to execute in accordance with a scheduler; and performing a dual dispatch of the two half blocks for execution on an execution unit. | 10-23-2014 |
20140324937 | METHOD FOR A STAGE OPTIMIZED HIGH SPEED ADDER - A method for fast parallel adder processing. The method includes receiving parallel inputs from a communications path, wherein each input comprises one bit, adding the inputs using a parallel structure, wherein the parallel structure is optimized to accelerate the addition by utilizing a characteristic that the inputs are one bit each, and transmitting the resulting outputs to a subsequent stage. | 10-30-2014 |
20140344554 | MICROPROCESSOR ACCELERATED CODE OPTIMIZER AND DEPENDENCY REORDERING METHOD - A dependency reordering method. The method includes accessing an input sequence of instructions, initializing three registers, and loading instruction numbers into a first register. The method further includes loading destination register numbers into a second register, broadcasting values from the first register to a position in a third register in accordance with a position number in the second register, overwriting positions in the third register in accordance with position numbers in the second register, and using information in the third register to populate a dependency matrix for grouping dependent instructions from the sequence of instructions. | 11-20-2014 |
20150039859 | MICROPROCESSOR ACCELERATED CODE OPTIMIZER - A method for accelerating code optimization a microprocessor. The method includes fetching an incoming microinstruction sequence using an instruction fetch component and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. Optimization processing is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The optimized microinstruction sequence is output to a microprocessor pipeline for execution. A copy of the optimized microinstruction sequence is stored into a sequence cache for subsequent use upon a subsequent hit optimized microinstruction sequence. | 02-05-2015 |
20150095588 | LOCK-BASED AND SYNCH-BASED METHOD FOR OUT OF ORDER LOADS IN A MEMORY CONSISTENCY MODEL USING SHARED MEMORY RESOURCES - In a processor, a lock-based method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of cores; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register. | 04-02-2015 |
20150095591 | METHOD AND SYSTEM FOR FILTERING THE STORES TO PREVENT ALL STORES FROM HAVING TO SNOOP CHECK AGAINST ALL WORDS OF A CACHE - In a processor, a method for filtering stores to prevent all stores from having to snoop check against all words of a cache. The method includes implementing a cache wherein stores snoop the caches for address matches to maintain coherency; marking a portion of a cache line if a given core out of a plurality of cores loads from that portion by using an access mask; checking the access mask upon execution of subsequent stores to the cache line; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask. | 04-02-2015 |
20150100734 | SEMAPHORE METHOD AND SYSTEM WITH OUT OF ORDER LOADS IN A MEMORY CONSISTENCY MODEL THAT CONSTITUTES LOADS READING FROM MEMORY IN ORDER - In a processor, a method for using a semaphore with out of order loads in a memory consistency model that constitutes loads reading from memory in order. The method includes implementing a memory resource that can be accessed by a plurality of cores; implementing an access mask that functions by tracking which words of a cache line have pending loads, wherein the cache line includes the memory resource, wherein an out of order load sets a mask bit within the access mask when accessing a word of the cache line, and clears the mask bit when that out of order load retires. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register. | 04-09-2015 |
20150134934 | VIRTUAL LOAD STORE QUEUE HAVING A DYNAMIC DISPATCH WINDOW WITH A DISTRIBUTED STRUCTURE - An out of order processor. The processor includes a distributed load queue and a distributed store queue that maintain single program sequential semantics while allowing an out of order dispatch of loads and stores across a plurality of cores and memory fragments; wherein the processor allocates other instructions besides loads and stores beyond the actual physical size limitation of the load/store queue; and wherein the other instructions can be dispatched and executed even though intervening loads or stores do not have spaces in the load store queue. | 05-14-2015 |
20150186144 | ACCELERATED CODE OPTIMIZER FOR A MULTIENGINE MICROPROCESSOR - A method for accelerating code optimization a microprocessor. The method includes fetching an incoming microinstruction sequence using an instruction fetch component and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. Optimization processing is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The plurality of dependent code groups are then output to a plurality of engines of the microprocessor for execution in parallel. A copy of the optimized microinstruction sequence is stored into a sequence cache for subsequent use upon a subsequent hit optimized microinstruction sequence. | 07-02-2015 |
20150205605 | LOAD STORE BUFFER AGNOSTIC TO THREADS IMPLEMENTING FORWARDING FROM DIFFERENT THREADS BASED ON STORE SENIORITY - In a processor, a thread agnostic unified store queue and a unified load queue method for out of order loads in a memory consistency model using shared memory resources. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores, wherein the plurality of cores share a unified store queue and a unified load queue; and implementing an access mask that functions by tracking which words of a cache line are accessed via a load, wherein the cache line includes the memory resource, wherein the load sets a mask bit within the access mask when accessing a word of the cache line, and wherein the mask bit blocks accesses from other loads from a plurality of cores. The method further includes checking the access mask upon execution of subsequent stores from the plurality of cores to the cache line, wherein stores from different threads can forward to loads of different threads while still maintaining in order memory consistency semantics; and causing a miss prediction when a subsequent store to the portion of the cache line sees a prior mark from a load in the access mask, wherein the subsequent store will signal a load queue entry corresponding to that load by using a tracker register and a thread ID register. | 07-23-2015 |
20150248294 | FAST UNALIGNED MEMORY ACCESS - Fast unaligned memory access. In accordance with a first embodiment of the present invention, a computing device includes a load queue memory structure configured to queue load operations and a store queue memory structure configured to queue store operations. The computing device includes also includes at least one bit configured to indicate the presence of an unaligned address component for an entry of said load queue memory structure, and at least one bit configured to indicate the presence of an unaligned address component for an entry of said store queue memory structure. The load queue memory may also include memory configured to indicate data forwarding of an unaligned address component from said store queue memory structure to said load queue memory structure. | 09-03-2015 |
20150286576 | CACHE REPLACEMENT POLICY - Cache replacement policy. In accordance with a first embodiment of the present invention, an apparatus comprises a queue memory structure configured to queue cache requests that miss a second cache after missing a first cache. The apparatus comprises additional memory associated with the queue memory structure is configured to record an evict way of the cache requests for the cache. The apparatus may be further configured to lock the evict way recorded in the additional memory, for example, to prevent reuse of the evict way. The apparatus may be further configured to unlock the evict way responsive to a fill from the second cache to the cache. The additional memory may be a component of a higher level cache. | 10-08-2015 |
20150301954 | SYSTEMS AND METHODS FOR ACCESSING A UNIFIED TRANSLATION LOOKASIDE BUFFER - Systems and methods for accessing a unified translation lookaside buffer (TLB) are disclosed. A method includes receiving an indicator of a level one translation lookaside buffer (L1TLB) miss corresponding to a request for a virtual address to physical address translation, searching a cache that includes virtual addresses and page sizes that correspond to translation table entries (TTEs) that have been evicted from the L1TLB, where a page size is identified, and searching a second level TLB and identifying a physical address that is contained in the second level TLB. Access is provided to the identified physical address. | 10-22-2015 |
20150324213 | METHOD AND APPARATUS FOR PROVIDING HARDWARE SUPPORT FOR SELF-MODIFYING CODE - A method and apparatus for providing support for self modifying guest code. The apparatus includes a memory, a hardware buffer, and a processor. The processor is configured to convert guest code to native code and store converted native code equivalent of the guest code into a code cache portion of the processor. The processor is further configured to maintain the hardware buffer configured for tracking respective locations of converted code in a code cache. The hardware buffer is updated based a respective access to a respective location in the memory associated with a respective location of converted code in the code cache. The processor is further configured to perform a request to modify a memory location after accessing the hardware buffer. | 11-12-2015 |
20160026444 | SYSTEM CONVERTER THAT IMPLEMENTS A REORDERING PROCESS THROUGH JIT (JUST IN TIME) OPTIMIZATION THAT ENSURES LOADS DO NOT DISPATCH AHEAD OF OTHER LOADS THAT ARE TO THE SAME ADDRESS - A system for an agnostic runtime architecture. The system includes a system emulation/virtualization converter, an application code converter, and a converter wherein a system emulation/virtualization converter and an application code converter implement a system emulation process, and wherein the system converter implements a system and application conversion process for executing code from a guest image, wherein the system converter or the system emulator. The system further includes a reordering process through JIT (just in time) optimization that ensures loads do not dispatch ahead of other loads that are to the same address, wherein a load will check for a same address of subsequent loads from a same thread, and a thread checking process that enable other thread store checks against the entire load queue and a monitor extension. | 01-28-2016 |
20160026445 | SYSTEM CONVERTER THAT IMPLEMENTS A RUN AHEAD RUN TIME GUEST INSTRUCTION CONVERSION/DECODING PROCESS AND A PREFETCHING PROCESS WHERE GUEST CODE IS PRE-FETCHED FROM THE TARGET OF GUEST BRANCHES IN AN INSTRUCTION SEQUENCE - A system for an agnostic runtime architecture. The system includes a system emulation/virtualization converter, an application code converter, and a converter wherein a system emulation/virtualization converter and an application code converter implement a system emulation process, and wherein the system converter implements a system and application conversion process for executing code from a guest image, wherein the system converter or the system emulator. The system further includes a run ahead run time guest such an conversion/decoding process, and a prefetching process where guest code is pre-fetched from the target of guest branches in an instruction sequence. | 01-28-2016 |
20160026482 | USING A PLURALITY OF CONVERSION TABLES TO IMPLEMENT AN INSTRUCTION SET AGNOSTIC RUNTIME ARCHITECTURE - A system for an agnostic runtime architecture is disclosed. The system includes a system emulation/virtualization converter, an application code converter, and a system converter wherein the system emulation/virtualization converter and the application code converter implement a system emulation process, and wherein the system converter implements a system conversion process for executing code from a guest image. The system converter further comprises a guest fetch logic component for accessing a plurality of guest instructions, a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling the plurality of guest instructions into a guest instruction block, and a plurality of conversion tables including a first level conversion table and a second level conversion table coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block. The system further includes a native cache coupled to the conversion tables for storing the corresponding native conversion block, a conversion look aside buffer coupled to the native cache for storing a mapping of the guest instruction block to corresponding native conversion block. Upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache, and in response to the hit the conversion look aside buffer forwards the translated native instruction for execution. | 01-28-2016 |
20160026483 | SYSTEM FOR AN INSTRUCTION SET AGNOSTIC RUNTIME ARCHITECTURE - A system for an agnostic runtime architecture. The system includes a close to bare metal JIT conversion layer, a runtime native instruction assembly component included within the conversion layer for receiving instructions from a guest virtual machine, and a runtime native instruction sequence formation component included within the conversion layer for receiving instructions from native code. The system further includes a dynamic sequence block-based instruction mapping component included within the conversion layer for code cache allocation and metadata creation, and is coupled to receive inputs from the runtime native instruction assembly component and the runtime native instruction sequence formation component, and wherein the dynamic sequence block-based instruction mapping component receives resulting processed instructions from the runtime native instruction assembly component and the runtime native instruction sequence formation component and allocates the resulting processed instructions to a processor for execution. | 01-28-2016 |
20160026484 | SYSTEM CONVERTER THAT EXECUTES A JUST IN TIME OPTIMIZER FOR EXECUTING CODE FROM A GUEST IMAGE - A system for an agnostic runtime architecture. The system includes a system emulation/virtualization converter, an application code converter, and a system converter wherein the system emulation/virtualization converter and the application code converter implement a system emulation process, and wherein the system converter implements a system conversion process for executing code from a guest image. The system converter executes a JIT optimizer, and wherein the JIT optimizer ensures loads are not dispatch ahead of other loads that are to a same memory address by checking for the same address from subsequent loads from a same thread. | 01-28-2016 |
20160026486 | AN ALLOCATION AND ISSUE STAGE FOR REORDERING A MICROINSTRUCTION SEQUENCE INTO AN OPTIMIZED MICROINSTRUCTION SEQUENCE TO IMPLEMENT AN INSTRUCTION SET AGNOSTIC RUNTIME ARCHITECTURE - A system for an agnostic runtime architecture. The system includes a system emulation/virtualization converter, an application code converter, and a system converter wherein the system emulation/virtualization converter and the application code converter implement a system emulation process, and wherein the system converter implements a system conversion process for executing code from a guest image. The system converter further comprises an instruction fetch component for fetching an incoming microinstruction sequence, a decoding component coupled to the instruction fetch component to receive the fetched macro instruction sequence and decode into a microinstruction sequence, and an allocation and issue stage coupled to the decoding component to receive the microinstruction sequence perform optimization processing by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. A microprocessor pipeline is coupled to the allocation and issue stage to receive and execute the optimized microinstruction sequence. A sequence cache is coupled to the allocation and issue stage to receive and store a copy of the optimized microinstruction sequence for subsequent use upon a subsequent hit on the optimized microinstruction sequence, and a hardware component is coupled for moving instructions in the incoming microinstruction sequence. | 01-28-2016 |
20160026487 | USING A CONVERSION LOOK ASIDE BUFFER TO IMPLEMENT AN INSTRUCTION SET AGNOSTIC RUNTIME ARCHITECTURE - A system for an agnostic runtime architecture. The system includes a system emulation/virtualization converter, an application code converter, and a converter, wherein a system emulation/virtualization converter and an application code converter implement a system emulation process. The system converter implements a system and application conversion process for executing code from a guest image, wherein the system converter or the system emulator accesses a plurality of guest instructions that comprise multiple guest branch instructions, and assembles the plurality of guest instructions into a guest instruction block. The system converter also translates the guest instruction block into a corresponding native conversion block, stores the native conversion block into a native cache, and stores a mapping of the guest instruction block to corresponding native conversion block in a conversion look aside buffer. Upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache, and forwards the converted native instruction for execution in response to the hit. | 01-28-2016 |
20160041908 | SYSTEMS AND METHODS FOR MAINTAINING THE COHERENCY OF A STORE COALESCING CACHE AND A LOAD CACHE - A method for maintaining the coherency of a store coalescing cache and a load cache is disclosed. As a part of the method, responsive to a write-back of an entry from a level one store coalescing cache to a level two cache, the entry is written into the level two cache and into the level one load cache. The writing of the entry into the level two cache and into the level one load cache is executed at the speed of access of the level two cache. | 02-11-2016 |
20160041913 | SYSTEMS AND METHODS FOR SUPPORTING A PLURALITY OF LOAD AND STORE ACCESSES OF A CACHE - Systems and methods for supporting a plurality of load and store accesses of a cache are disclosed. Responsive to a request of a plurality of requests to access a block of a plurality of blocks of a load cache, the block of the load cache and a logically and physically paired block of a store coalescing cache are accessed in parallel. The data that is accessed from the block of the load cache is overwritten by the data that is accessed from the block of the store coalescing cache by merging on a per byte basis. Access is provided to the merged data. | 02-11-2016 |
20160041930 | SYSTEMS AND METHODS FOR SUPPORTING A PLURALITY OF LOAD ACCESSES OF A CACHE IN A SINGLE CYCLE - A method for supporting a plurality of load accesses is disclosed. A plurality of requests to access a data cache is accessed, and in response, a tag memory is accessed that maintains a plurality of copies of tags for each entry in the data cache. Tags are identified that correspond to individual requests. The data cache is accessed based on the tags that correspond to the individual requests. A plurality of requests to access the same block of the plurality of blocks causes an access arbitration that is executed in the same clock cycle as is the access of the tag memory. | 02-11-2016 |
Patent application number | Description | Published |
20130091340 | Apparatus and Method for Processing an Instruction Matrix Specifying Parallel and Dependent Operations - A matrix of execution blocks form a set of rows and columns. The rows support parallel execution of instructions and the columns support execution of dependent instructions. The matrix of execution blocks process a single block of instructions specifying parallel and dependent instructions. | 04-11-2013 |
20140181475 | Parallel Processing of a Sequential Program Using Hardware Generated Threads and Their Instruction Groups Executing on Plural Execution Units and Accessing Register File Segments Using Dependency Inheritance Vectors Across Multiple Engines - A unified architecture for dynamic generation, execution, synchronization and parallelization of complex instruction formats includes a virtual register file, register cache and register file hierarchy. A self-generating and synchronizing dynamic and static threading architecture provides efficient context switching. | 06-26-2014 |
20140281116 | Method and Apparatus to Speed up the Load Access and Data Return Speed Path Using Early Lower Address Bits - A microprocessor implemented method for processing a load instruction is disclosed. The method comprises computing a virtual address corresponding to the load instruction. Next, it comprises performing a lookup of a set associative translation lookaside buffer (TLB) and a set associative data cache memory in parallel using early calculated lower address bits of the virtual address. Subsequently, it comprises retrieving a set of entries from the TLB corresponding to a first group of lower address bits transmitted to the TLB, wherein the set of entries comprise a plurality of virtual addresses and corresponding physical addresses. Further, it comprises finding a matching entry for the virtual address in the set of entries using upper bits of the virtual address, wherein the matching entry comprises a physical address corresponding to the virtual address. Finally, it comprises finding a matching entry in the data cache memory using the physical address. | 09-18-2014 |
20140281388 | Method and Apparatus for Guest Return Address Stack Emulation Supporting Speculation - A microprocessor implemented method for maintaining a guest return address stack in an out-of-order microprocessor pipeline is disclosed. The method comprises mapping a plurality of instructions in a guest address space into a corresponding plurality of instructions in a native address space. For each function call instruction in the native address space fetched during execution, the method also comprises performing the following: (a) pushing a current entry into a guest return address stack (GRAS) responsive to a function call, wherein the GRAS is maintained at the fetch stage of the pipeline, and wherein the current entry comprises information regarding both a guest target return address and a corresponding native target return address associated with the function call; (b) popping the current entry from the GRAS in response to processing a return instruction; and (c) fetching instructions from the native target return address in the current entry after the popping from the GRAS. | 09-18-2014 |
20140281409 | METHOD AND APPARATUS FOR NEAREST POTENTIAL STORE TAGGING - A method for performing memory disambiguation in an out-of-order microprocessor pipeline is disclosed. The method comprises storing a tag with a load operation, wherein the tag is an identification number representing a store instruction nearest to the load operation, wherein the store instruction is older with respect to the load operation and wherein the store has potential to result in a RAW violation in conjunction with the load operation. The method also comprises issuing the load operation from an instruction scheduling module. Further, the method comprises acquiring data for the load operation speculatively after the load operation has arrived at a load store queue module. Finally, the method comprises determining if an identification number associated with a last contiguous issued store with respect to the load operation is equal to or greater than the tag and gating a validation process for the load operation in response to the determination. | 09-18-2014 |
20140281410 | Method and Apparatus to Allow Early Dependency Resolution and Data Forwarding in a Microprocessor - A microprocessor implemented method for performing early dependency resolution and data forwarding is disclosed. The method comprises mapping a plurality of instructions in a guest address space into a corresponding plurality of instructions in a native address space. For each current guest branch instruction in the native address space fetched during execution, performing (a) determining a youngest prior guest branch target stored in a guest branch target register, wherein the guest branch register is operable to speculatively store a plurality of prior guest branch targets corresponding to prior guest branch instructions; (b) determining a current branch target for a respective current guest branch instruction by adding an offset value for the respective current guest branch instruction to the youngest prior guest branch target; and (c) creating an entry in the guest branch target register for the current branch target. | 09-18-2014 |
20140281416 | METHOD FOR IMPLEMENTING A REDUCED SIZE REGISTER VIEW DATA STRUCTURE IN A MICROPROCESSOR - A method for implementing a reduced size register view data structure in a microprocessor. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of multiplexers to access ports of a scheduling array to store the instruction blocks as a series of chunks. | 09-18-2014 |
20140281422 | Method and Apparatus for Sorting Elements in Hardware Structures - A method for sorting elements in hardware structures is disclosed. The method comprises selecting a plurality of elements to order from an unordered input queue (UIQ) within a predetermined range in response to finding a match between at least one most significant bit of the predetermined range and corresponding bits of a respective identifier associated with each of the plurality of elements. The method further comprises presenting each of the plurality of elements to a respective multiplexer. Further the method comprises generating a select signal for an enabled multiplexer in response to finding a match between at least one least significant bit of a respective identifier associated with each of the plurality of elements and a port number of the ordered queue. Finally, the method comprises forwarding a packet associated with a selected element identifier to a matching port number of the ordered queue from the enabled multiplexer. | 09-18-2014 |
20140304492 | METHOD AND APPARATUS TO INCREASE THE SPEED OF THE LOAD ACCESS AND DATA RETURN SPEED PATH USING EARLY LOWER ADDRESS BITS - A microprocessor implemented method for resolving dependencies for a load instruction in a load store queue (LSQ) is disclosed. The method comprises initiating a computation of a virtual address corresponding to the load instruction in a first clock cycle. It also comprises transmitting early calculated lower address bits of the virtual address to a load store queue (LSQ) in the same cycle as the initiating. Finally, it comprises performing a partial match in the LSQ responsive to and using the lower address bits to find a prior aliasing store, wherein the prior aliasing store stores to a same address as the load instruction. | 10-09-2014 |
20150046683 | METHOD FOR USING REGISTER TEMPLATES TO TRACK INTERDEPENDENCIES AMONG BLOCKS OF INSTRUCTIONS - A method for executing instructions using register templates to track interdependencies among blocks of instructions. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; and using a register template to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions. | 02-12-2015 |
20150046686 | METHOD FOR EXECUTING BLOCKS OF INSTRUCTIONS USING A MICROPROCESSOR ARCHITECTURE HAVING A REGISTER VIEW, SOURCE VIEW, INSTRUCTION VIEW, AND A PLURALITY OF REGISTER TEMPLATES - A method for executing blocks of instructions using a microprocessor architecture having a register view, source view, instruction view, and a plurality of register templates. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks; using a plurality of register templates to track instruction destinations and instruction sources by populating the register template with block numbers corresponding to the instruction blocks, wherein the block numbers corresponding to the instruction blocks indicate interdependencies among the blocks of instructions; using a register view data structure, wherein the register view data structure stores destinations corresponding to the instruction blocks; using a source view data structure, wherein the source view data structure stores sources corresponding to the instruction blocks; and using an instruction view data structure, wherein the instruction view data structure stores instructions corresponding to the instruction blocks. | 02-12-2015 |
20150095615 | INSTRUCTION DEFINITION TO IMPLEMENT LOAD STORE REORDERING AND OPTIMIZATION - A method for forwarding data from the store instructions to a corresponding load instruction in an out of order processor. The method includes accessing an incoming sequence of instructions, and of said sequence of instructions, splitting store instructions into a store address instruction and a store data instruction, wherein the store address performs address calculation and fetch, and wherein the store data performs a load of register contents to a memory address. The method further includes, of said sequence of instructions, splitting load instructions into a load address instruction and a load data instruction, wherein the load address performs address calculation and fetch, and wherein the load data performs a load of memory address contents into a register, and reordering the store address and load address instructions earlier and further away from LD/SD the instruction sequence to enable earlier dispatch and execution of the loads and the stores. | 04-02-2015 |
20150095618 | VIRTUAL LOAD STORE QUEUE HAVING A DYNAMIC DISPATCH WINDOW WITH A UNIFIED STRUCTURE - An out of order processor. The processor includes a virtual load store queue for allocating a plurality of loads and a plurality of stores, wherein more loads and more stores can be accommodated beyond an actual physical size of the load store queue of the processor; wherein the processor allocates other instructions besides loads and stores beyond the actual physical size limitation of the load/store queue; and wherein the other instructions can be dispatched and executed even though intervening loads or stores do not have spaces in the load store queue. | 04-02-2015 |
20150095629 | METHOD AND SYSTEM FOR IMPLEMENTING RECOVERY FROM SPECULATIVE FORWARDING MISS-PREDICTIONS/ERRORS RESULTING FROM LOAD STORE REORDERING AND OPTIMIZATION - A method for forwarding data from the store instructions to a corresponding load instruction in an out of order processor. The method includes accessing an incoming sequence of instructions; reordering the instructions in accordance with processor resources for dispatch and execution; ensuring a closest earlier store in machine order for to a corresponding load, by determining if said store has an actual age but said corresponding load does not have an actual age, then said store is earlier than said corresponding load; if said corresponding load has an actual age but said store does not have an actual age, then said corresponding load is earlier than said store; if neither said corresponding load or said store have an actual age, then a virtual identifier table is used to determine which is earlier; and if both said corresponding load and said store have actual ages, then the actual ages are used to determine which is earlier. | 04-02-2015 |
20150100765 | DISAMBIGUATION-FREE OUT OF ORDER LOAD STORE QUEUE - In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching. The method further includes in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning for the store retirement buffer for a first match; and forwarding data from the first match to the subsequent load. | 04-09-2015 |
20150100766 | REORDERED SPECULATIVE INSTRUCTION SEQUENCES WITH A DISAMBIGUATION-FREE OUT OF ORDER LOAD STORE QUEUE - In a processor, a disambiguation-free out of order load store queue method. The method includes implementing a memory resource that can be accessed by a plurality of asynchronous cores; implementing a store retirement buffer, wherein stores from a store queue have entries in the store retirement buffer in original program order; and implementing speculative execution, wherein results of speculative execution can be saved in the store retirement/reorder buffer as a speculative state. The method further includes, upon dispatch of a subsequent load from a load queue, searching the store retirement buffer for address matching; and, in cases where there are a plurality of address matches, locating a correct forwarding entry by scanning for the store retirement buffer for a first match, and forwarding data from the first match to the subsequent load. Once speculative outcomes are known, the speculative state is retired to memory. | 04-09-2015 |
20150269118 | Apparatus and Method for Processing an Instruction Matrix Specifying Parallel and Dependent Operations - A matrix of execution blocks form a set of rows and columns. The rows support parallel execution of instructions and the columns support execution of dependent instructions. The matrix of execution blocks process a single block of instructions specifying parallel and dependent instructions. | 09-24-2015 |