Entries |
Document | Title | Date |
20080222390 | Low Noise Coding for Digital Data Interface - A digital data interface system comprises a data transmitter configured to transmit a data word across a plurality of data lines. The data word can comprise a plurality of digital data bits having a bit number order from a lowest bit number to a highest bit number with the lowest ordered bit numbers having higher noise content and the highest ordered bit numbers having higher harmonic content. The system also comprises an encoder configured to arrange the plurality of digital data bits as serialized data sets to be transmitted over each of the plurality of data lines by the data transmitter with consecutive data bits of at least one serialized data set being matched such that bits with the higher harmonic content are matched with bits of the higher noise content to substantially mitigate at least one of the noise content and the harmonic content of the data word. | 09-11-2008 |
20090094439 | Data processing apparatus and method employing multiple register sets - A data processing apparatus and method employing multiple register sets is disclosed. The data processing apparatus has processing logic for performing data processing operations and a register bank for storing data associated with the processing logic. The register bank has at least one register group, each register group having a plurality of register sets. The processing logic has an operating state associated with each register group defining how that register group is used, a first operating state being a state in which each register set in the register group is used to support an independent execution thread of the processing logic, and a second operating state being a state in which the register sets of the register group are collectively used to support a single execution thread of the processing logic. Control logic is provided to control how the register sets of each register group are used dependent on the operating state associated with that register group. This has been found to provide a particularly efficient use of the registers within the data processing apparatus. | 04-09-2009 |
20090113176 | Method of reducing data path width restrictions on instruction sets - A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers. | 04-30-2009 |
20090125704 | DESIGN STRUCTURE FOR DYNAMICALLY SELECTING COMPILED INSTRUCTIONS - A design structure embodied in a machine readable medium used in a design process includes an apparatus for dynamically selecting compiled instructions for execution, the apparatus including an input for receiving static instructions for execution on a first execution unit and receiving dynamic instructions for execution on a second execution unit; and an instruction selection element adapted to evaluate throughput performance of the static instructions and dynamic instructions based on current states of the execution units and select the static instructions or the dynamic instructions for execution at runtime on the first execution unit or the second execution unit, respectively, based on the throughput performance of the instructions. | 05-14-2009 |
20090138677 | System for Native Code Execution - A process, apparatus, and system to execute a program in an array of processor nodes that include an agent node and an executor node. A virtual program of tokens of different types represents the program and is provided in a memory. The types include a run type that includes native code instructions of the executor node. A token is loaded from the memory and executed in the agent node based on its type. In particular, if the token is an optional stop type, execution ends, and if the token is a run type, the native code instructions in the token are sent to the executor node. The native code instructions are executed in the executor node as received from the agent node. And such loading and execution continues in this manner indefinitely or until a stop type token is executed. | 05-28-2009 |
20090144525 | APPARATUS AND METHOD FOR SCHEDULING THREADS IN MULTI-THREADING PROCESSORS - A multi-threading processor is provided. The multi-threading processor includes a first instruction fetch unit to receive a first thread and a second instruction fetch unit to receive a second thread. A multi-thread scheduler is coupled to the instruction fetch units and an execution unit. The multi-thread scheduler determines the width of the execution unit and the execution unit executes the threads accordingly. | 06-04-2009 |
20090177866 | SYSTEM AND METHOD FOR FUNCTIONALLY REDUNDANT COMPUTING SYSTEM HAVING A CONFIGURABLE DELAY BETWEEN LOGICALLY SYNCHRONIZED PROCESSORS - A method of operating a computer system. A first processor sends a first unit of binary information to an input/output (I/O) unit. The I/O unit then conveys the first unit of binary information to a functional unit in the computer system. A system response from the functional unit is then received by the I/O unit, which forwards the system response to the first processor. The system response is also stored in a first buffer. After a predetermined delay time has elapsed, the system response is then forwarded to the second processor. | 07-09-2009 |
20090204790 | BUFFER MANAGEMENT FOR REAL-TIME STREAMING - Technologies are described herein for buffer management during real-time streaming. A video frame buffer stores video frames generated by a real-time streaming video capture device. New video frames received from the video capture device are stored in the video frame buffer prior to processing by a video processing pipeline that processes frames stored in the video frame buffer. A buffer manager determines whether a new video frame has been received from the video capture device and stored in the video frame buffer. When the buffer manager determines that a new video frame has arrived at the video frame buffer, it then determines whether the video processing pipeline has an unprocessed video frame. If the video processing pipeline has an unprocessed video frame, the buffer manager discards the new video frame stored in the video frame buffer or performs other processing on the new video frame. | 08-13-2009 |
20100082944 | Multi-thread processor - In an exemplary aspect, the present invention provides a multi-thread processor including a plurality of hardware threads each of which generates an independent instruction flow, a thread scheduler that outputs a thread selection signal in accordance with a first or second schedule, the thread selection signal designating a hardware thread to be executed in a next execution cycle among the plurality of hardware threads, a first selector that selects one of the plurality of hardware threads according to the thread selection signal and outputs an instruction generated by the selected hardware thread, and an execution pipeline that executes an instruction output from the first selector, wherein when the multi-thread processor is in a first state, the thread scheduler selects the first schedule, and when the multi-thread processor is in a second state, the thread scheduler selects the second schedule. | 04-01-2010 |
20100082945 | Multi-thread processor and its hardware thread scheduling method - A multi-thread processor in accordance with an exemplary aspect of the present invention includes a plurality of hardware threads each of which generates an independent instruction flow, a thread scheduler that outputs a thread selection signal TSEL designating a hardware thread to be executed in a next execution cycle, a first selector that outputs an instruction generated by a hardware thread selected according to the thread selection signal, and an execution pipeline that executes an instruction output from the first selector, wherein the thread scheduler specifies execution of at least one hardware thread selected in a fixed manner in a predetermined first execution period, and specifies execution of an arbitrary hardware thread in a second execution period. | 04-01-2010 |
20100169610 | PROCESSOR - The processor according to the present invention is a processor having a forwarding function and includes an attribute information holding unit that holds attribute information regarding inhibition of writing to a register and a register write inhibition circuit that holds, when forwarding is performed, the writing of the data forwarded according to attribute information. The attribute information holding unit holds the attribute information by relating the attribute information to at least one register. Alternatively, the attribute information holding unit is a part of plural pipeline buffers and passes the attribute information along with the data to be forwarded, to a pipeline buffer in a subsequent stage. | 07-01-2010 |
20100217957 | Structured Virtual Registers for Embedded Controller Devices - Techniques for using structured virtual registers in embedded systems are described. A virtual register structure definition provides a map of virtual registers within an embedded controller. The virtual registers are externally accessible and correspond to memory locations within the embedded controller. In various embodiments, an embedded controller and/or an external entity may store data in or read data from the virtual registers using the virtual register structure definition. The problems of manual tracking of virtual register addresses and manual transcription of virtual register addresses to program code are ameliorated. When the virtual register map changes, logical references in program code to particular virtual registers need not necessarily be changed. | 08-26-2010 |
20100299497 | APPARATUS FOR EFFICIENTLY DETERMINING INSTRUCTION LENGTH WITHIN A STREAM OF X86 INSTRUCTION BYTES - An apparatus efficiently determines the length of an instruction within a stream of instruction bytes processed by a microprocessor having a variable instruction length instruction set architecture. The apparatus includes combinatorial logic associated with each instruction byte of the stream, each configured to receive the associated instruction byte and the next instruction byte of the stream and to generate in response thereto a first length, a second length, and a select control. A multiplexor associated with each of the combinatorial logic selects and outputs one of the following inputs based on the select control received from the combinatorial logic: a zero input and the second length received from the combinatorial logic associated with each of the next three instruction bytes of the stream. An adder associated with each of the combinatorial logic and multiplexor adds the first length and the output of the multiplexor to generate the length of the instruction. | 11-25-2010 |
20100299498 | INFORMATION PROCESSING APPARATUS AND INFORMATION PROCESSING METHOD - An information processing apparatus includes: a first pipeline having first nodes, and moving data held in each first node to a first node located in a first direction; a second pipeline having second nodes corresponding to the first nodes, and moving data held in each second node to a second node located in a second direction that is opposite to the first direction; a first comparison unit arranged to compare data held in a node of interest with data held in a second node corresponding to the node of interest, where the node of interest is one of the first nodes; and a second comparison unit arranged to compare the data held in the node of interest with data held in a second node located one node on an upstream or downstream side of the second node corresponding to the node of interest. | 11-25-2010 |
20110055521 | MICROPROCESSOR HAVING AT LEAST ONE APPLICATION SPECIFIC FUNCTIONAL UNIT AND METHOD TO DESIGN SAME - Customisable embedded processors that are available on the market make it possible for designers to speed up execution of applications by using Application-specific Functional Units (AFUs), implementing Instruction-Set Extensions (ISEs). Furthermore, techniques for automatic ISE identification have been improving; many algorithms have been proposed for choosing, given the application's source code, the best ISEs under various constraints. Read and write ports between the AFUs and the processor register file are an expensive asset, fixed in the micro-architecture—some processors indeed only allow two read ports and one write port—and yet, on the other hand, a large availability of inputs and outputs to and from the AFUs exposes high speedup. Here we present a solution to the limitation of actual register file ports by serialising register file access and therefore addressing multi-cycle read and write. It does so in an innovative way for two reasons: (1) it exploits and brings forward the progress in ISE identification under constraint, and (2) it combines register file access serialisation with pipelining in order to obtain the best global solution. Our method consists of scheduling graphs—corresponding to ISEs—under input/output constraint | 03-03-2011 |
20110066826 | IMAGE DATA PROCESSING APPARATUS - An image data processing apparatus includes: a plurality of operational processing circuits each of which is configured to have a variable circuit configuration and to execute operational processing on image data; and a control section that controls each of the operational processing circuits such that each of the operational processing circuits executes one of a plurality of types of operational processing performed on image data in a predetermined order. The control section controls each of the operational processing circuits so that when image data to be newly given to one of the operational processing circuits is interrupted, said one of the operational processing circuits and another one of the operational processing circuits execute operational processing by taking partial charge of the operational processing. | 03-17-2011 |
20110161629 | Arithmetic processor, information processor, and pipeline control method of arithmetic processor - An arithmetic processor includes a first pipeline unit configured to execute a first instruction that is input; a second pipeline unit configured to execute a second instruction that is input; a registration unit into which an aborted instruction is registered, the aborted instruction being the first instruction when the first pipeline unit is unable to complete the first instruction or the second instruction when the second pipeline unit is unable to complete the second instruction; a determination unit configured to make a determination as to which one of the first pipeline unit and the second pipeline unit is operating under a lower load; and an input unit configured to input, in the first pipeline unit or the second pipeline unit that is determined as operating under the lower load by the determination unit, the aborted instruction that is registered in the registration unit. | 06-30-2011 |
20110185156 | EXECUTING WATCHPOINT EVENTS FOR DEBUGGING IN A "BREAK BEFORE MAKE" MANNER - A processor (e.g., a Digital Signal Processor (DSP) core) rewinds a pipeline of instructions upon a watchpoint event in an instruction being processed. The program execution ceases at the instruction in which the watchpoint event occurred, while the instruction and subsequent instructions are cancelled, keeping the hardware components associated with executing the program in their previous states, prior to the watchpoint. The rewind is such that the program is refetched to enable execution to continue from the instruction in which the watchpoint event occurred. The watchpoint event is executed in a “break before make” manner. | 07-28-2011 |
20110185157 | MULTIFUNCTION HEXADECIMAL INSTRUCTION FORM SYSTEM AND PROGRAM PRODUCT - A new zSeries floating-point unit has a fused multiply-add dataflow capable of supporting two architectures and fused MULTIPLY and ADD and MULTIPLY and SUBTRACT in both RRF and RXF formats for the fused functions. Both binary and hexadecimal floating-point instructions are supported for a total of 6 formats. The floating-point unit is capable of performing a multiply-add instruction for hexadecimal or binary every cycle with a latency of 5 cycles. This supports two architectures with two internal formats with their own biases. This has eliminated format conversion cycles and has optimized the width of the dataflow. The unit is optimized for both hexadecimal and binary floating-point architecture supporting a multiply-add/subtract per cycle. | 07-28-2011 |
20110264891 | MICROPROCESSOR THAT FUSES MOV/ALU/JCC INSTRUCTIONS - A microprocessor receives first, second, and third program-adjacent macroinstructions. The first macroinstruction moves a first operand to a first register from a second register. The second macroinstruction performs an arithmetic/logic operation using the first operand in the second register and a second operand in a third register to generate a result, loads the result back into the first register, and updates condition codes based on the result. The third macroinstruction conditionally jumps to a target address. An instruction translator simultaneously translates the first, second, and third program-adjacent macroinstructions into a single micro-operation for execution by an execution unit. The micro-operation performs the arithmetic/logic operation using the first operand in the second register and the second operand in the third register to generate the result, loads the result back into the first register, updates the condition codes based on the result, and conditionally jumps to the target address. | 10-27-2011 |
20110276783 | THREAD FAIRNESS ON A MULTI-THREADED PROCESSOR WITH MULTI-CYCLE CRYPTOGRAPHIC OPERATIONS - Systems and methods for efficient execution of operations in a multi-threaded processor. Each thread may include a blocking instruction. A blocking instruction blocks other threads from utilizing hardware resources for an appreciable amount of time. One example of a blocking type instruction is a Montgomery multiplication cryptographic instruction. Each thread can operate in a thread-based mode that allows the insertion of stall cycles during the execution of blocking instructions, during which other threads may utilize the previously blocked hardware resources. At times when multiple threads are scheduled to execute blocking instructions, the thread-based mode may be changed to increase throughput for these multiple threads. For example, the mode may be changed to disallow the insertion of stall cycles. Therefore, the time for sequential operation of the blocking instructions corresponding to the multiple threads may be reduced. | 11-10-2011 |
20110302391 | DIGITAL SIGNAL PROCESSOR - A digital signal processor comprises an instruction analysis unit, a digital signal processor (DSP) core and a memory unit. The instruction analysis unit receives an instruction and determines the required bit width M for the data process corresponding to the instruction. The DSP core performs the M-bit data process based on the bit width M determined by the instruction analysis unit, and the memory unit stores multiple data and performs the M-bit access based on the bit width M determined by the instruction analysis unit thereby allowing the DSP core to access, and at least one available space in the memory unit will be adjusted such that only the access space having the bit width M for the operation corresponding to the instruction will be open in each access, thereby effectively achieving the effect of power-saving. | 12-08-2011 |
20120084533 | Efficient Parallel Floating Point Exception Handling In A Processor - Methods and apparatus are disclosed for handling floating point exceptions in a processor that executes single-instruction multiple-data (SIMD) instructions. In one embodiment a numerical exception is identified for a SIMD floating point operation and SIMD micro-operations are initiated to generate two packed partial results of a packed result for the SIMD floating point operation. A SIMD denormalization micro-operation is initiated to combine the two packed partial results and to denormalize one or more elements of the combined packed partial results to generate a packed result for the SIMD floating point operation having one or more denormal elements. Flags are set and stored with packed partial results to identify denormal elements. In one embodiment a SIMD normalization micro-operation is initiated to generate a normalized pseudo internal floating point representation prior to the SIMD floating point operation when it uses multiplication. | 04-05-2012 |
20120102301 | PREDICATE COUNT AND SEGMENT COUNT INSTRUCTIONS FOR PROCESSING VECTORS - The described embodiments comprise a PredCount instruction and a SegCount instruction. When executed by a processor, the PredCount instruction causes the processor to analyze a predicate vector to determine a number of active elements in the predicate vector that exhibit a predetermined condition (e.g., that are set to a predetermined value) and to return a result indicating that number. When executed by a processor, the SegCount instruction causes the processor to determine a number of times that a GeneratePredicates instruction would be executed to generate a full set of predicates using active elements of an input vector. | 04-26-2012 |
20120110305 | Register Renamer that Handles Multiple Register Sizes Aliased to the Same Storage Locations - A processor may include a physical register file and a register renamer. The register renamer may be organized into even and odd banks of entries, where each entry stores an identifier of a physical register. The register renamer may be indexed by a register number of an architected register, such that the renamer maps a particular architected register to a corresponding physical register. Individual entries of the renamer may correspond to architected register aliases of a given size. Renaming aliases that are larger than the given size may involve accessing multiple entries of the renamer, while renaming aliases that are smaller than the given size may involve accessing a single renamer entry. | 05-03-2012 |
20120117358 | Software Selectable Adjustment of SIMD Parallelism - Selective power control of one or more processing elements matches a degree of parallelism to requirements of a task performed in a highly parallel programmable data processor. For example, when program operations require less than the full width of the data path, a software instruction of the program sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down to conserve power. At a later time, when the added capacity is needed, execution of another software instruction sets the mode of operation to that of the wider data path, typically the full width, and the mode change reactivates the previously shut-down processing element. | 05-10-2012 |
20120124334 | SIMD PROCESSOR FOR PERFORMING DATA FILTERING AND/OR INTERPOLATION - Data processing circuit containing an instruction execution circuit having an instruction set comprising a SIMD instruction. The instruction execution circuit comprises arithmetic circuits, arranged to perform N respective identical operations in parallel in response to the SIMD instruction. The SIMD instruction selects a first one and a second one of the registers. The SIMD instruction defines a first and second series of N respective SIMD instruction operands of the SIMD instruction from the addressed registers. Each arithmetic circuit receives a respective first operand and a respective second operand from the first and second series respectively. The instruction execution circuit selects the first and second series so they partially overlap. Positioning the operands is under program control. | 05-17-2012 |
20120137108 | SYSTEMS AND METHODS INTEGRATING BOOLEAN PROCESSING AND MEMORY - The present disclosure relates to placing a Boolean Processor on a chip with memory to eliminate memory latency issues in computing systems. An asynchronous implementation of a Boolean Processor Switched Memory can theoretically operate at terahertz speed and vastly improve the rate at which computationally relevant data is fed to a microprocessor or microcontroller. Boolean Processor Enhanced Memories hold the promise of increasing memory throughput by several orders of magnitude and shifting the burden of “catching up” to microprocessors and microcontrollers. | 05-31-2012 |
20120144162 | SYSTEMS AND METHODS FOR DETERMINING COMPUTE KERNELS FOR AN APPLICATION IN A PARALLEL-PROCESSING COMPUTER SYSTEM - A runtime system implemented in accordance with the present invention provides an application platform for parallel-processing computer systems. Such a runtime system enables users to leverage the computational power of parallel-processing computer systems to accelerate/optimize numeric and array-intensive computations in their application programs. This enables greatly increased performance of high-performance computing (HPC) applications. | 06-07-2012 |
20120166765 | PREDICTING BRANCHES FOR VECTOR PARTITIONING LOOPS WHEN PROCESSING VECTOR INSTRUCTIONS - While fetching the instructions from a loop in program code, a processor calculates a number of times that a backward-branching instruction at the end of the loop will actually be taken when the fetched instructions are executed. Upon determining that the backward-branching instruction has been predicted taken more than the number of times that the branch instruction will actually be taken, the processor immediately commences a mispredict operation for the branch instruction, which comprises: (1) flushing fetched instructions from the loop that will not be executed from the processor, and (2) commencing fetching instructions from an instruction following the branch instruction. | 06-28-2012 |
20120179895 | METHOD AND APPARATUS FOR FAST DECODING AND ENHANCING EXECUTION SPEED OF AN INSTRUCTION - Method and apparatus for fast decoding of microinstructions are disclosed. An integrated circuit is disclosed wherein microinstructions are queued for execution in an execution unit having multiple pipelines where each pipeline is configured to execute a set of supported microinstructions. The execution unit receives microinstruction data including an operation code (opcode) or a complex opcode. The execution unit executes the microinstruction multiple times wherein the microinstruction is executed at least once to get an address value and at least once to get a result of an operation. The execution unit processes complex opcodes by utilizing both a load/store support and a simple opcode support by splitting the complex opcode into load/store and simple opcode components and creating an internal source/destination between the two components. | 07-12-2012 |
20120198208 | SHARED FUNCTION MULTI-PORTED ROM APPARATUS AND METHOD - Various embodiments may be disclosed that may share a ROM pull down logic circuit among multiple ports of a processing core. The processing core may include an execution unit (EU) having an array of read only memory (ROM) pull down logic storing math functions. The ROM pull down logic circuit may implement single instruction, multiple data (SIMD) operations. The ROM pull down logic circuit may be operatively coupled with each of the multiple ports in a multi-port function sharing arrangement. Sharing the ROM pull down logic circuit reduces the need to duplicate logic and may result in a savings of chip area as well as a savings of power. | 08-02-2012 |
20120210099 | RUNNING UNARY OPERATION INSTRUCTIONS FOR PROCESSING VECTORS - During operation, a processor generates a result vector. In particular, the processor records a value from an element at a key element position in an input vector into a base value. Next, for each active element in the result vector to the right of the key element position, the processor generates a result vector by setting the element in the result vector equal to a result of performing a unary operation on the base value a number of times equal to a number of relevant elements. The number of relevant elements is determined from the key element position to and including a predetermined element in the result vector, where the predetermined element in the result vector may be one of: a first element to the left of the element in the result vector; or the element in the result vector. | 08-16-2012 |
20120260067 | MICROPROCESSOR THAT PERFORMS X86 ISA AND ARM ISA MACHINE LANGUAGE PROGRAM INSTRUCTIONS BY HARDWARE TRANSLATION INTO MICROINSTRUCTIONS EXECUTED BY COMMON EXECUTION PIPELINE - A microprocessor includes a hardware instruction translator that translates x86 ISA and ARM ISA machine language program instructions into microinstructions, which are encoded in a distinct manner from the x86 and ARM instructions. An execution pipeline executes the microinstructions to generate x86/ARM-defined results. The microinstructions are distinct from the results generated by the execution of the microinstructions by the execution pipeline. The translator directly provides the microinstructions to the execution pipeline for execution. Each time the microprocessor performs one of the x86 ISA and ARM ISA instructions, the translator translates it into the microinstructions. An indicator indicates either x86 or ARM as a boot ISA. After reset, the microprocessor initializes its architectural state, fetches its first instructions from a reset address, and translates them all as defined by the boot ISA. An instruction cache caches the x86 and ARM instructions and provides them to the translator. | 10-11-2012 |
20120260068 | APPARATUS AND METHOD FOR HANDLING OF MODIFIED IMMEDIATE CONSTANT DURING INSTRUCTION TRANSLATION - An ISA-defined instruction includes an immediate field having first and second portions specifying first and second values, which instructs the microprocessor to perform an operation using a constant value as one of its source operands. The constant value is the first value rotated/shifted by a number of bits based on the second value. An instruction translator translates the instruction into one or more microinstructions. An execution pipeline executes the microinstructions generated by the instruction translator. The instruction translator, rather than the execution pipeline, generates the constant value for the execution pipeline as a source operand of at least one of the microinstructions for execution by the execution pipeline. Alternatively, if the immediate field value is not within a predetermined subset of values known by the instruction translator, the instruction translator generates, rather than the constant, a second microinstruction for execution by the execution pipeline to generate the constant. | 10-11-2012 |
20120290816 | Optimized Scalar Promotion with Load and Splat SIMD Instructions - Mechanisms for optimizing scalar code executed on a single instruction multiple data (SIMD) engine are provided. Placement of vector operation-splat operations may be determined based on an identification of scalar and SIMD operations in an original code representation. The original code representation may be modified to insert the vector operation-splat operations based on the determined placement of vector operation-splat operations to generate a first modified code representation. Placement of separate splat operations may be determined based on identification of scalar and SIMD operations in the first modified code representation. The first modified code representation may be modified to insert or delete separate splat operations based on the determined placement of the separate splat operations to generate a second modified code representation. SIMD code may be output based on the second modified code representation for execution by the SIMD engine. | 11-15-2012 |
20120311302 | APPARATUS AND METHOD FOR PROCESSING OPERATIONS IN PARALLEL USING A SINGLE INSTRUCTION MULTIPLE DATA PROCESSOR - A parallel operation processing apparatus and method using a Single Instruction Multiple Data (SIMD) processor are provided. The parallel operation processing apparatus may combine input data of source nodes in a current column with input data of source nodes in a previous column, and may store the combined input data. | 12-06-2012 |
20120311303 | Processor for Executing Wide Operand Operations Using a Control Register and a Results Register - A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers. | 12-06-2012 |
20130013894 | DATA PROCESSOR - A RISC data processor in which the number of flags generated by each instruction is increased so that a decrease of flag-generating instructions exceeds an increase of flag-using instructions in quantity, thereby achieving the decrease in instructions. An instruction for generating flags according to operands' data sizes is defined, and an instruction set handled by the RISC data processor includes an instruction capable of executing an operation on operands in more than one data size. An identical operation process is conducted on the small-size operand and on low-order bits of the large-size operand, and flags are generated capable of coping with the respective data sizes regardless of the data size of each operand subjected to the operation. Thus, a reduction in instruction code space of the RISC data processor can be achieved. | 01-10-2013 |
20130067199 | CONTROL REGISTER MAPPING IN HETEROGENEOUS INSTRUCTION SET ARCHITECTURE PROCESSOR - A microprocessor capable of running both x86 instruction set architecture (ISA) machine language programs and Advanced RISC Machines (ARM) ISA machine language programs. The microprocessor includes a mode indicator that indicates whether the microprocessor is currently fetching instructions of an x86 ISA or ARM ISA machine language program. The microprocessor also includes a plurality of model-specific registers (MSRs) that control aspects of the operation of the microprocessor. When the mode indicator indicates the microprocessor is currently fetching x86 ISA machine language program instructions, each of the plurality of MSRs is accessible via an x86 ISA RDMSR/WRMSR instruction that specifies an address of the MSR. When the mode indicator indicates the microprocessor is currently fetching ARM ISA machine language program instructions, each of the plurality of MSRs is accessible via an ARM ISA MRRC/MCRR instruction that specifies the address of the MSR. | 03-14-2013 |
20130138922 | REGISTER MANAGEMENT IN AN EXTENDED PROCESSOR ARCHITECTURE - Systems and methods are disclosed for enhancing the throughput of a processor by minimizing the number of transfers of data associated with data transfer between a register file and a memory stack. The register file used by a processor running an application is partitioned into a number of blocks. A subset of the blocks of the register file is defined in an application binary interface enabling the subset to be pre-allocated and exposed to the application binary interface. Optionally, blocks other than the subset are not exposed to the application binary interface so that the data relating to application function switch or a context switch is not transferred between the unexposed blocks and a memory stack. | 05-30-2013 |
20130145121 | DYNAMICALLY CONFIGURABLE PLACEMENT ENGINE - A stream application may allocate processing elements to one or more compute nodes (or hosts) to achieve a desired optimization goal. Each optimization mode may define processing element selection criteria and/or host selection criteria. When allocating a processing element to a host, a scheduler may place each processing element individually. Accordingly, the scheduler may use the processing element selection criteria for selecting which processing element in the stream application to allocate next. The scheduler may then determine, based on one or more constraints, which host the processing element can be placed on. If the scheduler determines that multiple hosts are suitable candidates for the processing element, it may use the host selection criteria to pick one of the candidate hosts that further optimize the stream application to meet the desired goal. Examples of different optimization goals that may be achieved using processing element and host selection criteria include optimizing performance, decreasing maintenance and operating costs, increasing solvability, sharing limited computer resources with other applications, and the like. | 06-06-2013 |
20130191614 | PERFORMING A CYCLIC REDUNDANCY CHECKSUM OPERATION RESPONSIVE TO A USER-LEVEL INSTRUCTION - In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed. | 07-25-2013 |
20130198489 | PROCESSING ELEMENT MANAGEMENT IN A STREAMING DATA SYSTEM - Stream applications may inefficiently use the hardware resources that execute the processing elements of the data stream. For example, a compute node may host four processing elements and execute each using a CPU. However, other CPUs on the compute node may sit idle. To take advantage of these available hardware resources, a stream programmer may identify one or more processing elements that may be cloned. The cloned processing elements may be used to generate a different execution path that is parallel to the execution path that includes the original processing elements. Because the cloned processing elements contain the same operators as the original processing elements, the data stream that was previously flowing through only the original processing element may be split and sent through both the original and cloned processing elements. In this manner, the parallel execution path may use underutilized hardware resources to increase the throughput of the data stream. | 08-01-2013 |
20130219150 | Parsing Data Representative of a Hardware Design into Commands of a Hardware Design Environment - A method for implementing a hardware design that includes using a computer for receiving structured data that includes a representation of a basic hardware structure and a complex hardware structure that includes the basic hardware structure, parsing the structured data and generating, based on a result of the parsing, commands of a hardware design environment. | 08-22-2013 |
20130227250 | SIMD ACCELERATOR FOR DATA COMPARISON - Some example embodiments include an apparatus for comparing a first operand to a second operand. The apparatus includes a SIMD accelerator configured to compare first multiple parts (e.g., bytes) of the first operand to second multiple parts (e.g., bytes) of the second operand. The SIMD accelerator includes a ones' complement subtraction logic and a twos' complement logic configured to perform logic operations on the multiple parts of the first operand and the multiple parts of the second operand to generate a group of carry out and propagate data across bits of the multiple parts. At least a portion of the group of carry out and propagate data is reused in the group of logic operations. | 08-29-2013 |
20130262820 | EVENT LOGGER FOR JUST-IN-TIME STATIC TRANSLATION SYSTEM - Systems and methods for event logging in a just-in-time static translation system are disclosed. One method includes executing a workload in a computing system having a native instruction set architecture, the workload stored in one or more banks of non-native instructions. At least a portion of the workload is further included in one or more banks of native instructions and executing the workload comprises executing at least part of the workload from the one or more banks of native instructions. The method also includes determining an amount of time during execution of the workload in which the execution of the workload occurs from the one or more banks of native instructions. The method includes generating a log including performance statistics generated during execution of the workload, the performance statistics including the amount of time. | 10-03-2013 |
20130332703 | Shared Register Pool For A Multithreaded Microprocessor - A method of sharing a plurality of registers in a shared register pool among a plurality of microprocessor threads begins with a determination that a first instruction to be executed by a microprocessor in a first microprocessor thread requires a first logical register. Next a determination is made that a second instruction to be executed by the microprocessor in a second microprocessor thread requires a second logical register. A first physical register in the shared register pool is allocated to the first microprocessor thread for execution of the first instruction and the first logical register is mapped to the first physical register. A second physical register in the shared register pool is allocated to the second microprocessor thread for execution of the second instruction. Finally, the second logical register is mapped to the second physical register. | 12-12-2013 |
20140019718 | VECTORIZED PATTERN SEARCHING - Embodiments of computer-implemented methods, systems, computing devices, and computer-readable media are described herein for vectorized searching for a pattern P within a set of data T, the pattern P having a length m. In various embodiments, the vectorized search may include a shift of a sliding window into T by a distance d that is greater than m on determination, based on one or more ordered vectorized comparisons of portions of P and T, that no potential match of P is found within the sliding window. In various embodiments, d and m may be positive integers. In various embodiments, the one or more ordered vectorized comparisons may include one or more single instruction multiple data (“SIMD”) instructions supported by the processor. | 01-16-2014 |
20140019719 | GENERALIZED BIT MANIPULATION INSTRUCTIONS FOR A COMPUTER PROCESSOR - Methods of bit manipulation within a computer processor are disclosed. Improved flexibility in bit manipulation proves helpful in computing elementary functions critical to the performance of many programs and for other applications. In one embodiment, a unit of input data is shifted/rotated and multiple non-contiguous bit fields from the unit of input data are inserted in an output register. In another embodiment, one of two units of input data is optionally shifted or rotated, the two units of input data are partitioned into a plurality of bit fields, bitwise operations are performed on each bit field, and pairs of bit fields are combined with either an AND or an OR bitwise operation. Embodiments are also disclosed to simultaneously perform these processes on multiple units and pairs of units of input data in a Single Instruction, Multiple Data processing environment capable of performing logical operations on floating point data. | 01-16-2014 |
20140047213 | METHOD AND SYSTEM FOR MEMORY OVERLAYS FOR PORTABLE FUNCTION POINTERS - A system and method for implementing memory overlays for portable pointer variables. The method includes providing a program executable by a heterogeneous processing system comprising a plurality of processors running a plurality of instruction set architectures (ISAs). The method also includes providing a plurality of processor specific functions associated with a function pointer in the program. The method includes executing the program by a first processor. The method includes dereferencing the function pointer by mapping the function pointer to a corresponding processor specific feature based on which processor in the plurality of processors is executing the program. | 02-13-2014 |
20140068227 | SYSTEMS, APPARATUSES, AND METHODS FOR EXTRACTING A WRITEMASK FROM A REGISTER - Embodiments of systems, apparatuses, and methods for performing in a computer processor mask extraction from a general purpose register in response to a single mask extraction from a general purpose register instruction that includes a source general purpose register operand, a destination writemask register operand, an immediate value, and an opcode are described. | 03-06-2014 |
20140108768 | Computer instructions for Activating and Deactivating Operands - An instruction set architecture (ISA) includes instructions for selectively indicating last-use architected operands having values that will not be accessed again, wherein architected operands are made active or inactive after an instruction specified last-use by an instruction, wherein the architected operands are made active by performing a write operation to an inactive operand, wherein the activation/deactivation may be performed by the instruction having the last-use of the operand or another (prefix) instruction. | 04-17-2014 |
20140281386 | CHAINING BETWEEN EXPOSED VECTOR PIPELINES - Embodiments include a method for chaining data in an exposed-pipeline processing element. The method includes separating a multiple instruction word into a first sub-instruction and a second sub-instruction, receiving the first sub-instruction and the second sub-instruction in the exposed-pipeline processing element. The method also includes issuing the first sub-instruction at a first time, issuing the second sub-instruction at a second time different than the first time, the second time being offset to account for a dependency of the second sub-instruction on a first result from the first sub-instruction, the first pipeline performing the first sub-instruction at a first clock cycle and communicating the first result from performing the first sub-instruction to a chaining bus coupled to the first pipeline and a second pipeline, the communicating at a second clock cycle subsequent to the first clock cycle that corresponds to a total number of latch pipeline stages in the first pipeline. | 09-18-2014 |
20140317381 | METHOD OF PROCESSING IMMEDIATE VALUE IN EISC PROCESSOR - Disclosed is a method of operating an immediate value in an extendable instruction set computer (EISC) processor, comprising: checking whether or not an unsigned immediate value is used to generate an extension register (ER) value for operating an immediate value; and generating the ER value by performing zero extension for the unsigned immediate value using an unsigned load extension register with immediate (ULERI) instruction if the unsigned immediate value is used. It is possible to improve operational efficiency by preventing an LERI instruction from being unnecessarily executed when an immediate value is operated using a 16-bit instruction in the EISC processor. | 10-23-2014 |
20160092238 | COPROCESSOR FOR OUT-OF-ORDER LOADS - Systems and methods for implementing certain load instructions, such as vector load instructions by cooperation of a main processor and a coprocessor. The load instructions which are identified by the main processor for offloading to the coprocessor are committed in the main processor without receiving corresponding load data. Post-commit, the load instructions are processed in the coprocessor, such that latencies incurred in fetching the load data are hidden from the main processor. By implementing an out-of-order load data buffer associated with an in-order instruction buffer, the coprocessor is also configured to avoid stalls due to long latencies which may be involved in fetching the load data from levels of memory hierarchy, such as L2, L3, L4 caches, main memory, etc. | 03-31-2016 |
20160098277 | COMPRESSING INSTRUCTION QUEUE FOR A MICROPROCESSOR - A compressing instruction queue for a microprocessor including a queue and redirect logic. The queue includes a matrix of storage locations including N rows and M columns for storing microinstructions of the microprocessor in sequential order. The redirect logic is configured to receive and write multiple microinstructions per cycle of a clock signal into sequential storage locations of the queue without leaving unused storage locations and beginning at a first available storage location in the queue. The redirect logic performs redirection and compression to eliminate empty locations or holes in the queue and to reduce the number of write ports interfaced with each storage location of the queue. | 04-07-2016 |
20160162294 | RECONFIGURABLE PROCESSORS AND METHODS FOR COLLECTING COMPUTER PROGRAM INSTRUCTION EXECUTION STATISTICS - Reconfigurable processors and methods for collecting computer program instruction execution statistics are disclosed. According to an aspect, a method includes providing a reconfigurable processor configured to execute a set of central processing unit (CPU) instructions that each have a function. The method also includes modifying the function of one or more of the CPU instructions that identifies an instruction address and a destination address pair of the CPU instruction(s) based on a defined test case. Further, the method includes using the reconfigurable processor to execute the set of CPU instructions. The method also includes identifying an instruction address and destination address pair of the CPU instruction(s) having the modified function when the CPU instruction(s) having the modified function is executed during execution of the set of CPU instructions. | 06-09-2016 |
20160378480 | Systems, Methods, and Apparatuses for Improving Performance of Status Dependent Computations - Embodiments for systems, methods, and apparatuses for improving performance of status dependent computations are detailed. In an embodiment, a hardware apparatus comprises decoder hardware to decode an instruction, operand retrieval hardware to retrieve data from at least one source operand associated with the instruction decoded by the decoder hardware, and execution hardware to execute the decoded instruction to generate a result including at least one status bit and to cause the result and at least one status bit to be stored in a single destination physical storage location, wherein the at least one status bit and result are accessible through a read of the single register. | 12-29-2016 |
20160378483 | REUSE OF DECODED INSTRUCTIONS - Systems and methods are disclosed for reusing fetched and decoded instructions in block-based processor architectures. In one example of the disclosed technology, a system includes a plurality of block-based processor cores and an instruction scheduler. A respective core is capable of executing one or more instruction blocks of a program. The instruction scheduler can be configured to identify a given instruction block of the program that is resident on a first processor core of the processor cores and is to be executed again. The instruction scheduler can be configured to adjust a mapping of instruction blocks in flight so that the given instruction block is re-executed on the first processor core without re-fetching the given instruction block. | 12-29-2016 |