Patent application number | Description | Published |
20090015589 | Store Misaligned Vector with Permute - Embodiments of the invention provide logic within the store data path between a processor and a memory array. The logic may be configured to misalign vector data as it is stored to memory. By misaligning vector data as it is stored to memory, memory bandwidth may be maximized while processing bandwidth required to store vector data misaligned is minimized. Furthermore, embodiments of the invention provide logic within the load data path which allows vector data which is stored misaligned to be aligned as it is loaded into a vector register. By aligning misaligned vector data as it is loaded into a vector register, memory bandwidth may be maximized while processing bandwidth required to align misaligned vector data may be minimized. | 01-15-2009 |
20090037694 | Load Misaligned Vector with Permute and Mask Insert - Embodiments of the invention provide logic within the store data path between a processor and a memory array. The logic may be configured to misalign vector data as it is stored to memory. By misaligning vector data as it is stored to memory, memory bandwidth may be maximized while processing bandwidth required to store vector data misaligned is minimized. Furthermore, embodiments of the invention provide logic within the load data path which allows vector data which is stored misaligned to be aligned as it is loaded into a vector register. By aligning misaligned vector data as it is loaded into a vector register, memory bandwidth may be maximized while processing bandwidth required to align misaligned vector data may be minimized. | 02-05-2009 |
20090049113 | Method and Apparatus for Implementing a Multiple Operand Vector Floating Point Summation to Scalar Function - Embodiments of the invention provide methods and apparatus for executing a multiple operand instruction. Executing the multiple operand instruction comprises computing an arithmetic result of a pair of operands in each processing lane of a vector unit. The arithmetic results generated in each processing lane of the vector unit may be transferred to a dot product unit. The dot product unit may compute an arithmetic result using the arithmetic result computed by each processing lane of the vector unit to generate an arithmetic result of more than two operands. | 02-19-2009 |
20090063608 | Full Vector Width Cross Product Using Recirculation for Area Optimization - Embodiments of the invention are generally related to the field of image processing, and more specifically to vector units for supporting image processing. A vector unit may comprise a plurality of operand multiplexers associated with each vector processing lane of the vector unit. The operand multiplexers may select vector operands from one or more register files for performing a cross product operation. A first multiply operation may be performed in a first pipeline stage by multiplying a first set of operands in a multiplier. In a second pipeline stage, a second multiply operation may be performed by multiplying a second set of operands. The results of the first multiply operation and the second multiply operation may be transferred to an adder to complete the cross product instruction. | 03-05-2009 |
20090070398 | Method and Apparatus for an Area Efficient Transcendental Estimate Algorithm - A method, computer-readable medium, and an apparatus for generating a transcendental value. The method includes receiving an input containing an input value and an opcode and determining whether the opcode corresponds to a trigonometric operation or a power-of-two operation. The method also includes calculating a fractional value and an integer value from the input value, generating the transcendental value based on the fractional value by adding at least a portion of the fractional value with at least one of a shifted fractional value produced by shifting the portion of the fractional value and a constant value, and providing the transcendental value in response to the request. In this fashion, the same circuit area may be used to carry out both trigonometric and power-of-two calculations, leading to greater circuit area savings and performance advantages while not sacrificing significant accuracy. | 03-12-2009 |
20090083357 | Method and Apparatus Implementing a Floating Point Weighted Average Function - A method, computer-readable medium, and an apparatus for implementing a floating point weighted average function. The method includes receiving an input containing 2 | 03-26-2009 |
20090106525 | Design Structure for Scalar Precision Float Implementation on the "W" Lane of Vector Unit - A design structure embodied in a machine readable storage medium for designing, manufacturing, and/or testing a design for image processing, and more specifically to vector units for supporting image processing is provided. A combined vector/scalar unit is provided wherein one or more processing lanes of the vector unit are used for performing scalar operations. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated and a significant amount of chip area is saved. | 04-23-2009 |
20090106527 | Scalar Precision Float Implementation on the "W" Lane of Vector Unit - Embodiments of the invention are generally related to image processing, and more specifically to vector units for supporting image processing. A combined vector/scalar unit is provided wherein one or more processing lanes of the vector unit are used for performing scalar operations. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated and a significant amount of chip area is saved. | 04-23-2009 |
20090113181 | Method and Apparatus for Executing Instructions - A method and apparatus for executing instructions in a processor are provided. In one embodiment of the invention, the method includes receiving a plurality of instructions. The plurality of instructions includes first instructions in a first thread and second instructions in a second thread. The method further includes forming a common issue group including an instruction of a first instruction type and an instruction of a second instruction type. The method also includes issuing the common issue group to a first execution unit and a second execution unit. The instruction of the first instruction type is issued to the first execution unit and the instruction of the second instruction type is issued to the second execution unit. | 04-30-2009 |
20090150647 | Processing Unit Incorporating Vectorizable Execution Unit - A vectorizable execution unit is capable of being operated in a plurality of modes, with the processing lanes in the vectorizable execution unit grouped into different combinations of logical execution units in different modes. By doing so, processing lanes can be selectively grouped together to operate as different types of vector execution units and/or scalar execution units, and if desired, dynamically switched during runtime to process various types of instruction streams in a manner that is best suited for each type of instruction stream. As a consequence, a single vectorizable execution unit may be configurable, e.g., via software control, to operate either as a vector execution unit or as a plurality of scalar execution units. | 06-11-2009 |
20090182987 | Processing Unit Incorporating Multirate Execution Unit - A multirate execution unit is capable of being operated in a plurality of modes, with the execution unit being capable of being clocked at multiple different rates relative to a multithreaded issue unit such that, in applications where maximum performance is desired, the execution unit can be clocked at a rate that is faster than the clock rate for the multithreaded issue unit, and in applications where a lower power profile is desired, the execution unit can be throttled back to a slower rate to reduce the power consumption of the execution unit. When the execution unit is clocked at a faster rate than the multithreaded issue unit, the issue unit is permitted to issue more instructions per cycle than when the execution unit is throttled to the slower rate to increase overall instruction throughput. | 07-16-2009 |
20090240920 | Execution Unit with Data Dependent Conditional Write Instructions - An execution unit supports data dependent conditional write instructions that write data to a target only when a particular condition is met. In one implementation, a data dependent conditional write instruction identifies a condition as well as data to be tested against that condition. The data is tested against that condition, and the result of the test is used to selectively enable or disable a write to a target associated with the data dependent conditional write instruction. Then, a write is attempted while the write to the target is enabled or disabled such that the write will update the contents of the target only when the write is selectively enabled as a result of the test. By doing so, dependencies are typically avoided, as is use of an architected condition register that might otherwise introduce branch prediction mispredict penalties, enabling improved performance with z-buffer test and similar types of algorithms. | 09-24-2009 |
20090300335 | Execution Unit With Inline Pseudorandom Number Generator - A circuit arrangement and method couple a hardware-based pseudorandom number generator (PRNG) to an execution unit in such a manner that pseudorandom numbers generated by the PRNG may be selectively output to the execution unit for use as an operand during the execution of instructions by the execution unit. A PRNG may be coupled to an input of an operand multiplexer that outputs to an operand input of an execution unit so that operands provided by instructions supplied to the execution unit are selectively overridden with pseudorandom numbers generated by the PRNG. Furthermore, overridden operands provided by instructions supplied to the execution unit may be used as seed values for the PRNG. In many instances, an instruction executed by an execution unit may be able to perform an arithmetic operation using both an operand specified by the instruction and a pseudorandom number generated by the PRNG during the execution of the instruction, so that the generation of the pseudorandom number and the performance of the arithmetic operation occur during a single pass of an execution unit. | 12-03-2009 |
20090315908 | Anisotropic Texture Filtering with Texture Data Prefetching - A circuit arrangement and method utilize texture data prefetching to prefetch texture data used by an anisotropic filtering algorithm. In particular, stride-based prefetching may be used to prefetch texture data for use in anisotropic filtering, where the value of the stride, or difference between successive accesses, is based upon a distance in a memory address space between sample points taken along the line of anisotropy used in an anisotropic filtering algorithm. | 12-24-2009 |
20100031009 | Floating Point Execution Unit for Calculating a One Minus Dot Product Value in a Single Pass - A floating point execution unit calculates a one minus dot product value in a single pass. As such, the dependency that otherwise would be required to perform the calculations is eliminated, resulting in a substantially faster performance of such calculations. The floating point execution unit may be used, for example, to accelerate pixel shading algorithms such as Fresnel and electron microscope effects. | 02-04-2010 |
20100100712 | Multi-Execution Unit Processing Unit with Instruction Blocking Sequencer Logic - A processing unit includes multiple execution units and sequencer logic that is disposed downstream of instruction buffer logic, and that is responsive to a sequencer instruction present in an instruction stream. In response to such an instruction, the sequencer logic issues a plurality of instructions associated with a long latency operation to one execution unit, while blocking instructions from the instruction buffer logic from being issued to that execution unit. In addition, the blocking of instructions from being issued to the execution unit does not affect the issuance of instructions to any other execution unit, and as such, other instructions from the instruction buffer logic are still capable of being issued to and executed by other execution units even while the sequencer logic is issuing the plurality of instructions associated with the long latency operation. | 04-22-2010 |
20100125719 | Instruction Target History Based Register Address Indexing - A circuit arrangement and method support instruction target history based register address indexing, whereby register addresses to be used by an instruction are decoded using a target history table of previous target register addresses, and an index into the target history table supplied by an index value in the instruction. An instruction may include at least one index value that identifies a previously used register address. During execution of the instruction, the index is retrieved from the instruction, and then a register address is retrieved from the target history table using the index. | 05-20-2010 |
20100191937 | Implied Storage Operation Decode Using Redundant Target Address Detection - A logic arrangement and method to support implied storage operation decode uses redundant target address detection, whereby target addresses of previous instructions are compared with the target address of the current instruction, and if equal, and the target addresses of previous instructions are not used as sources, the current instruction is decoded as a store instruction. This allows a redundant operation in an instruction set architecture to be redefined as a store instruction, freeing up opcodes normally used for store instructions to be used for other instructions. | 07-29-2010 |
20120169755 | Anisotropic Texture Filtering with Texture Data Prefetching - A circuit arrangement and method utilize texture data prefetching to prefetch texture data used by an anisotropic filtering algorithm. In particular, stride-based prefetching may be used to prefetch texture data for use in anisotropic filtering, where the value of the stride, or difference between successive accesses, is based upon a distance in a memory address space between sample points taken along the line of anisotropy used in an anisotropic filtering algorithm. | 07-05-2012 |
20120303691 | Execution Unit With Inline Pseudorandom Number Generator - A circuit arrangement and method couple a hardware-based pseudorandom number generator (PRNG) to an execution unit in such a manner that pseudorandom numbers generated by the PRNG may be selectively output to the execution unit for use as an operand during the execution of instructions by the execution unit. A PRNG may be coupled to an input of an operand multiplexer that outputs to an operand input of an execution unit so that operands provided by instructions supplied to the execution unit are selectively overridden with pseudorandom numbers generated by the PRNG. Furthermore, overridden operands provided by instructions supplied to the execution unit may be used as seed values for the PRNG. | 11-29-2012 |
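
Three of the abstracts listed above describe operations that are easy to picture in software: the one minus dot product value (20100031009), the data dependent conditional write used for z-buffer tests (20090240920), and stride-based texture prefetching along the line of anisotropy (20090315908 and 20120169755). The Python sketch below is only an illustrative software model of those ideas, not the circuits the applications describe. All function names are hypothetical, and the "nearer depth wins" test and the row-major texture layout are assumptions made for the example.

```python
# Illustrative software models only. The listed applications describe
# hardware; every name, the depth test, and the memory layout are assumptions.

def one_minus_dot(a, b):
    """Return 1 - (a . b), the value 20100031009 computes in a single pass."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def conditional_write(zbuffer, index, depth):
    """Model a data dependent conditional write (20090240920).

    The data (depth) is tested against a condition, and the result of the
    test enables or disables the write to the target. The condition used
    here (new depth smaller than the stored depth) is an assumed z-test.
    Returns True when the target was updated.
    """
    enabled = depth < zbuffer[index]
    if enabled:
        zbuffer[index] = depth
    return enabled


def texel_address(x, y, width, texel_size, base=0):
    """Byte address of texel (x, y), assuming a row-major texture layout."""
    return base + (y * width + x) * texel_size


def prefetch_addresses(first, second, width, texel_size, count, base=0):
    """Addresses a stride-based prefetcher could request (20090315908).

    first and second are two successive sample points on the line of
    anisotropy. The stride is the distance in the memory address space
    between them, and it is reused for the following accesses.
    """
    start = texel_address(*first, width, texel_size, base)
    stride = texel_address(*second, width, texel_size, base) - start
    return [start + i * stride for i in range(count)]
```

For instance, `one_minus_dot` applied to a surface normal and a view vector gives the kind of term a Fresnel-style shader would use, and `prefetch_addresses((2, 1), (3, 2), 64, 4, 3)` lists three addresses spaced by the byte distance between the two sample points in a 64-texel-wide texture with 4-byte texels.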