Patent application number | Description | Published |
20100058266 | 3-Stack Floorplan for Floating Point Unit - A 3-stack floorplan for a floating point unit includes: an aligner located in the center of the floating point unit; a frontend located directly above the aligner; a multiplier located directly below the frontend and next to the aligner; an adder located directly next to the multiplier and directly below the aligner; a normalizer located directly above the adder; and a rounder located directly above the normalizer. | 03-04-2010 |
20100095099 | SYSTEM AND METHOD FOR STORING NUMBERS IN FIRST AND SECOND FORMATS IN A REGISTER FILE - A system and a method for storing numbers in a register file are provided. The system and the method store single precision numbers in double precision format in a register file that is shared between floating point computational units and computational units not supporting floating point numbers. | 04-15-2010 |
20100146023 | SHIFTER WITH ALL-ONE AND ALL-ZERO DETECTION - A shifter that includes a plurality of shift stages positioned within the shifter, and receiving and shifting input data to generate a shifted result, and a detection circuit coupled at an input of a final shift stage of the plurality of shifters, in a final stage within the shifter. The detection circuit receives a partially shifted vector at the input of the final shift stage along with a predetermined shift amount, and performing an all-one or all-zero detection operation using a portion of the partially shifted vector and the predetermined shift amount, in parallel, to a shifting operation performed by the final shift stage to generate the shifted result. | 06-10-2010 |
20100146471 | FAST ROUTING OF CUSTOM MACROS - A system for creating layout and wiring diagrams for an integrated circuit (IC) includes a placement engine configured to receive a hierarchical schematic and to create a placed layout. The system also includes a flat layout engine configured to receive the hierarchical schematic and to create a flat layout and a back annotation engine coupled to the placement engine and the flat layout engine, the back annotation engine configured to receive the hierarchical placed layout and the flat unplaced layout and to create a flat placed layout there from. | 06-10-2010 |
20100174764 | REUSE OF ROUNDER FOR FIXED CONVERSION OF LOG INSTRUCTIONS - A method for converting a signed fixed point number into a floating point number that includes reading an input number corresponding to a signed fixed point number to be converted, determining whether the input number is less than zero, setting a sign bit based upon whether the input number is less than zero or greater than or equal to zero, computing a first intermediate result by exclusive-ORing the input number with the sign bit, computing leading zeros of the first intermediate result, padding the first intermediate result based upon the sign bit, computing a second intermediate result by shifting the padded first intermediate result to the left by the leading zeros, computing an exponent portion and a fraction portion, conditionally incrementing the fraction portion based on the sign bit, correcting the exponent portion and the fraction portion if the incremented fraction portion overflows, and returning the floating point number. | 07-08-2010 |
20130204916 | RESIDUE-BASED ERROR DETECTION FOR A PROCESSOR EXECUTION UNIT THAT SUPPORTS VECTOR OPERATIONS - A residue generating circuit for an execution unit that supports vector operations includes an operand register and a residue generator coupled to the operand register. The residue generator includes a first residue generation tree coupled to a first section of the operand register and a second residue generation tree coupled to a second section of the operand register. The first residue generation tree is configured to generate a first residue for first data included in the first section of the operand register. The second residue generation tree is configured to generate a second residue for second data included in a second section of the operand register. The first section of the operand register includes a different number of register bits than the second section of the operand register. Configuring a residue-generation tree to be split into multiple residue generation trees facilitates residue checking multiple independent vector instruction operands within a full width dataflow in parallel or alternatively residue checking a full width operand of a standard instruction. | 08-08-2013 |
20140164462 | RESIDUE-BASED ERROR DETECTION FOR A PROCESSOR EXECUTION UNIT THAT SUPPORTS VECTOR OPERATIONS - A residue generating circuit for an execution unit that supports vector operations includes an operand register and a residue generator coupled to the operand register. The residue generator includes a first residue generation tree coupled to a first section of the operand register and a second residue generation tree coupled to a second section of the operand register. The first residue generation tree is configured to generate a first residue for first data included in the first section of the operand register. The second residue generation tree is configured to generate a second residue for second data included in a second section of the operand register. The first section of the operand register includes a different number of register bits than the second section of the operand register. | 06-12-2014 |
Patent application number | Description | Published |
20100017635 | ZERO INDICATION FORWARDING FOR FLOATING POINT UNIT POWER REDUCTION - A method, system and computer program product for reducing power consumption when processing mathematical operations. Power may be reduced in processor hardware devices that receive one or more operands from an execution unit that executes instructions. A circuit detects when at least one operand of multiple operands is a zero operand, prior to the operand being forwarded to an execution component for completing a mathematical operation. When at least one operand is a zero operand or at least one operand is “unordered”, a flag is set that triggers a gating of a clock signal. The gating of the clock signal disables one or more processing stages and/or devices, which perform the mathematical operation. Disabling the stages and/or devices enables computing the correct result of the mathematical operation on a reduced data path. When a device(s) is disabled, the device may be powered off until the device is again required by subsequent operations. | 01-21-2010 |
20100023573 | EFFICIENT FORCING OF CORNER CASES IN A FLOATING POINT ROUNDER - The forcing of the result or output of a rounder portion of a floating point processor occurs only in a fraction non-increment data path within the rounder and not in the fraction increment data path within the rounder. The fraction forcing is active on a corner case such as a disabled overflow exception. A disabled overflow exception may be detected by inspecting the normalized exponent. If a disabled overflow exception is detected, the round mode is selected to execute only in the non-increment data path thereby preventing the fraction increment data path from being selected. | 01-28-2010 |
20100063985 | NORMALIZER SHIFT PREDICTION FOR LOG ESTIMATE INSTRUCTIONS - A floating point processor unit includes a shift amount calculation circuit within a normalizer portion of the floating point unit, wherein the shift amount calculation circuit is utilized to compute the normalizer shift amount for a log estimate instruction that runs as a pipelinable instruction. | 03-11-2010 |
20100063987 | SUPPORTING MULTIPLE FORMATS IN A FLOATING POINT PROCESSOR - In a binary floating point processor, the exponents of each of the various types of operands are recoded into an internal format, by biasing the exponents with the minimum exponent value of the result precision (“Emin”), i.e., the recoded value of the exponent is the represented value of the exponent minus Emin. Emin depends only on the result precision of the instruction that is currently being executed in the binary floating point processor. The exponent computations are then performed in this new format. The underflow check for all result precisions is a check against zero and overflow checks are performed against a positive number that depends on the result precision. The exponent values are in a 2's complement representation, so the underflow check simply becomes a check of the sign bit. | 03-11-2010 |
20100100713 | FAST FLOATING POINT COMPARE WITH SLOWER BACKUP FOR CORNER CASES - A floating point processor unit executes a floating point compare instruction with two operands of the same or different precision by comparing the two operands in integer format, which speeds up the execution of the floating point compare instruction significantly. The floating point processor now executes the floating point compare instruction at least twice as fast or faster (e.g., two clock cycles instead of five clock cycles in the prior art) for nearly most operand cases (e.g., 99% of all cases). Only the rare corner cases require additional operations on one of the operands and thus require additional cycles of execution time because the integer compare operation will not work for these corner cases. This is due to the fact that one operand is a single precision subnormal number in an unnormalized representation (i.e., has two representations) and the other operand is in the SP subnormal range such that the integer compare operation will fail. | 04-22-2010 |
20120110271 | MECHANISM TO SPEED-UP MULTITHREADED EXECUTION BY REGISTER FILE WRITE PORT REALLOCATION - Various systems and processes may be used to speed up multi-threaded execution. In certain implementations, a system and process may include the ability to write results of a first group of execution units associated with a first register file into the first register file using a first write port of the first register file and write results of a second group of execution units associated with a second register file into the second register file using a first write port of the second register file. The system, apparatus, and process may also include the ability to connect, in a shared register file mode, results of the second group of execution units to a second write port of the first register file and connect, in a split register file mode, results of a part of the first group of execution units to the second write port of the first register file. | 05-03-2012 |
20120128149 | APPARATUS AND METHOD FOR CALCULATING AN SHA-2 HASH FUNCTION IN A GENERAL PURPOSE PROCESSOR - Various systems, apparatuses, processes, and/or products may be used to calculate an SHA-2 hash function in a general-purpose processor. In some implementations, a system, apparatus, process, and/or product may include the ability to calculate at least one SHA-2 sigma function by using an execution unit adapted for performing a processor instruction, the execution unit including an integrated circuit primarily designed for calculating the SHA-2 sigma function(s), and calculating the SHA-2 hash function with general-purpose hardware processing components of the processor based on the sigma function(s). In certain implementations, the calculation of the SHA-2 sigma function(s) can be performed by the integrated circuit within a single instruction, allowing for a faster calculation of the SHA-2 hash function. | 05-24-2012 |
20120150933 | METHOD AND DATA PROCESSING UNIT FOR CALCULATING AT LEAST ONE MULTIPLY-SUM OF TWO CARRY-LESS MULTIPLICATIONS OF TWO INPUT OPERANDS, DATA PROCESSING PROGRAM AND COMPUTER PROGRAM PRODUCT - Various systems, apparatuses, processes, and programs may be used to calculate a multiply-sum of two carry-less multiplications of two input operands. In particular implementations, a system, apparatus, process, and program may include the ability to use input data busses for the input operands and an output data bus for an overall calculation result, each bus including a width of 2n bits, where n is an integer greater than one. The system, apparatus, process, and program may also calculate the carry-less multiplications of the two input operands for a lower level of a hierarchical structure and calculating the at least one multiply-sum and at least one intermediate multiply-sum for a higher level of the structure based on the carry-less multiplications of the lower level. A certain number of multiply-sums may be output as an overall calculation result dependent on mode of operation using the full width of said output data bus. | 06-14-2012 |
20130159666 | REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, and execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between and output and an input of the execution core unit and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction. | 06-20-2013 |
20130332501 | Fused Multiply-Adder with Booth-Encoding - A fused multiply-adder is disclosed. The fused multiply-adder includes a Booth encoder, a fraction multiplier, a carry corrector, and an adder. The Booth encoder initially encodes a first operand. The fraction multiplier multiplies the Booth-encoded first operand by a second operand to produce partial products, and then reduces the partial products into a set of redundant sum and carry vectors. The carry corrector then generates a carry correction factor for correcting the carry vectors. The adder adds the redundant sum and carry vectors and the carry correction factor to a third operand to yield a final result. | 12-12-2013 |
20140075153 | REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, and execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between and output and an input of the execution core unit and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction. | 03-13-2014 |
20140095568 | Fused Multiply-Adder with Booth-Encoding - A fused multiply-adder is disclosed. The fused multiply-adder includes a Booth encoder, a fraction multiplier, a carry corrector, and an adder. The Booth encoder initially encodes a first operand. The fraction multiplier multiplies the Booth-encoded first operand by a second operand to produce partial products, and then reduces the partial products into a set of redundant sum and carry vectors. The carry corrector then generates a carry correction factor for correcting the carry vectors. The adder adds the redundant sum and carry vectors and the carry correction factor to a third operand to yield a final result. | 04-03-2014 |
20140136815 | VERIFICATION OF A VECTOR EXECUTION UNIT DESIGN - A method for verification of a vector execution unit design. The method includes issuing an instruction into a first instance and a second instance of a vector execution unit. The method includes issuing a random operand into a first lane of the first instance of the vector execution unit and into a second lane of the second instance of the vector execution unit. The method further includes receiving results from execution of the instruction and the random operand in both the first and the second instance of the vector execution unit and comparing the received results. | 05-15-2014 |
20140156969 | VERIFICATION OF A VECTOR EXECUTION UNIT DESIGN - A method for verification of a vector execution unit design. The method includes issuing an instruction into a first instance and a second instance of a vector execution unit. The method includes issuing a random operand into a first lane of the first instance of the vector execution unit and into a second lane of the second instance of the vector execution unit. The method further includes receiving results from execution of the instruction and the random operand in both the first and the second instance of the vector execution unit and comparing the received results. | 06-05-2014 |
20150067298 | SPLITABLE AND SCALABLE NORMALIZER FOR VECTOR DATA - A hardware circuit component configured to support vector operations in a scalar data path. The hardware circuit component configured to operate in a vector mode configuration and in a scalar mode configuration. The hardware circuit component configured to split the scalar mode configuration into a left half and a right half of the vector mode configuration. The hardware circuit component configured to perform one or more bit shifts over one or more stages of interconnected multiplexers in the vector mode configuration. The hardware circuit component configured to include duplicated coarse shift multiplexers at bit positions that receive data from both the left half and the right half of the vector mode configuration, resulting in one or more coarse shift multiplexers sharing the bit position. | 03-05-2015 |
20150067299 | SPLITABLE AND SCALABLE NORMALIZER FOR VECTOR DATA - A hardware circuit component configured to support vector operations in a scalar data path. The hardware circuit component configured to operate in a vector mode configuration and in a scalar mode configuration. The hardware circuit component configured to split the scalar mode configuration into a left half and a right half of the vector mode configuration. The hardware circuit component configured to perform one or more bit shifts over one or more stages of interconnected multiplexers in the vector mode configuration. The hardware circuit component configured to include duplicated coarse shift multiplexers at bit positions that receive data from both the left half and the right half of the vector mode configuration, resulting in one or more coarse shift multiplexers sharing the bit position. | 03-05-2015 |