Patent application number | Description | Published |
20110072248 | UNANIMOUS BRANCH INSTRUCTIONS IN A PARALLEL THREAD PROCESSOR - One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch. | 03-24-2011 |
20110072249 | UNANIMOUS BRANCH INSTRUCTIONS IN A PARALLEL THREAD PROCESSOR - One embodiment of the present invention sets forth a mechanism for managing thread divergence in a thread group executing on a multithreaded processor. A unanimous branch instruction, when executed, causes all the active threads in the thread group to branch only when each thread in the thread group agrees to take the branch. In such a manner, thread divergence is eliminated. A branch-any instruction, when executed, causes all the active threads in the thread group to branch when at least one thread in the thread group agrees to take the branch. | 03-24-2011 |
20110078381 | Cache Operations and Policies For A Multi-Threaded Client - A method for managing a parallel cache hierarchy in a processing unit. The method includes receiving an instruction that includes a cache operations modifier that identifies a level of the parallel cache hierarchy in which to cache data associated with the instruction, and implementing a cache replacement policy based on the cache operations modifier. | 03-31-2011 |
20110078406 | Unified Addressing and Instructions for Accessing Parallel Memory Spaces - One embodiment of the present invention sets forth a technique for unifying the addressing of multiple distinct parallel memory spaces into a single address space for a thread. A unified memory space address is converted into an address that accesses one of the parallel memory spaces for that thread. A single type of load or store instruction may be used that specifies the unified memory space address for a thread instead of using a different type of load or store instruction to access each of the distinct parallel memory spaces. | 03-31-2011 |
20110078415 | Efficient Predicated Execution For Parallel Processors - The invention set forth herein describes a mechanism for predicated execution of instructions within a parallel processor executing multiple threads or data lanes. Each thread or data lane executing within the parallel processor is associated with a predicate register that stores a set of 1-bit predicates. Each of these predicates can be set using different types of predicate-setting instructions, where each predicate-setting instruction specifies one or more source operands, at least one operation to be performed on the source operands, and one or more destination predicates for storing the result of the operation. An instruction can be guarded by a predicate that may influence whether the instruction is executed for a particular thread or data lane or how the instruction is executed for a particular thread or data lane. | 03-31-2011 |
20110078690 | Opcode-Specified Predicatable Warp Post-Synchronization - One embodiment of the present invention sets forth a method for synchronizing divergent executing threads. The method includes receiving a plurality of instructions that includes at least one set-synchronization instruction and at least one instruction that includes a synchronization command, and determining an active mask that indicates which threads in a plurality of threads are active and which threads in the plurality of threads are disabled. For each instruction included in the plurality of instructions, the instruction is transmitted to each of the active threads included in the plurality of threads. If the instruction is a set-synchronization instruction, then a synchronization token, the active mask, and the synchronization point are each pushed onto a stack. Alternatively, if the instruction is a predicated instruction that includes a synchronization command, then each active thread that executes the predicated instruction is monitored to determine when the active mask has been updated to indicate that each active thread, after executing the predicated instruction, has been disabled. | 03-31-2011 |
20130166882 | METHODS AND APPARATUS FOR SCHEDULING INSTRUCTIONS WITHOUT INSTRUCTION DECODE - Systems and methods for scheduling instructions without instruction decode. In one embodiment, a multi-core processor includes a scheduling unit in each core for scheduling instructions from two or more threads scheduled for execution on that particular core. As threads are scheduled for execution on the core, instructions from the threads are fetched into a buffer without being decoded. The scheduling unit includes a macro-scheduler unit for performing a priority sort of the two or more threads and a micro-scheduler arbiter for determining the highest order thread that is ready to execute. The macro-scheduler unit and the micro-scheduler arbiter use pre-decode data to implement the scheduling algorithm. The pre-decode data may be generated by decoding only a small portion of the instruction or received along with the instruction. Once the micro-scheduler arbiter has selected an instruction to dispatch to the execution unit, a decode unit fully decodes the instruction. | 06-27-2013 |
20140049549 | EFFICIENT PLACEMENT OF TEXTURE BARRIER INSTRUCTIONS - One embodiment of the present invention sets forth a technique for placing texture barrier instructions within a thread program to advantageously enable efficient and correct operation of the thread program. A thread program compiler statically determines a pending request count needed to progress beyond a particular texture barrier instruction, which blocks execution of subsequent instructions that depend on previously requested data. Each instance of the thread program blocks execution at the barrier instruction until a pending request count condition is satisfied. This technique may advantageously reduce power consumption in a graphics processing unit by eliminating power consumption associated with conventional, generalized scoreboard resources. | 02-20-2014 |
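
The unanimous-branch and branch-any semantics described in the first two entries can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented hardware mechanism: the function names `branch_all`, `branch_any`, and `next_pc`, the list-of-booleans representation of the active mask, and the treatment of inactive threads (their votes are ignored) are all hypothetical choices made for illustration.

```python
from typing import List


def branch_all(active: List[bool], wants_branch: List[bool]) -> bool:
    """Unanimous branch: taken only if every active thread votes to branch.

    Assumption: inactive threads do not vote. With no active threads the
    result is vacuously True; real hardware would not issue the instruction
    for an empty thread group, so that case is outside this model.
    """
    return all(w for a, w in zip(active, wants_branch) if a)


def branch_any(active: List[bool], wants_branch: List[bool]) -> bool:
    """Branch-any: taken if at least one active thread votes to branch."""
    return any(w for a, w in zip(active, wants_branch) if a)


def next_pc(pc: int, target: int, taken: bool) -> int:
    """Every active thread gets the same next program counter (hypothetical
    helper), so the group stays converged whichever way the vote goes."""
    return target if taken else pc + 1
```

Because the decision is computed once for the whole group and applied to every active thread, no thread ends up on a different path from the others, which is the sense in which the abstract says divergence is eliminated.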