Patent application number | Description | Published |
20090083702 | System and Method for Selective Code Generation Optimization for an Advanced Dual-Representation Polyhedral Loop Transformation Framework - A system and method for selective code generation optimization for an advanced dual-representation polyhedral loop transformation framework are provided. The mechanisms of the illustrative embodiments address the weaknesses of the known polyhedral loop transformation based approaches by providing mechanisms for performing code generation transformations on individual statement instances in an intermediate representation generated by the polyhedral loop transformation optimization of the source code. These code generation transformations have the important property that they do not change program order of the statements in the intermediate representation. This property allows the result of the code generation transformations to be provided back to the polyhedral loop transformation mechanisms in a program statement view, via a new re-entrance path of the illustrative embodiments, for additional optimization. | 03-26-2009 |
20110055484 | Detecting Task Complete Dependencies Using Underlying Speculative Multi-Threading Hardware - Mechanisms are provided for tracking dependencies of threads in a multi-threaded computer program execution. The mechanisms detect a dependency of a first thread's execution on results of a second thread's execution in an execution flow of the multi-threaded computer program. The mechanisms further store, in a hardware thread dependency vector storage associated with the first thread's execution, an identifier of the dependency by setting at least one bit in the hardware thread dependency vector storage corresponding to the second thread. Moreover, the mechanisms schedule tasks performed by the multi-threaded computer program based on the hardware thread dependency vector storage to minimize squashing of threads. | 03-03-2011 |
20110219222 | Building Approximate Data Dependences with a Moving Window - Mechanisms for building approximate data dependences using a moving look-back window are provided. The mechanisms track dependence information for memory accesses over iterations of execution of a portion of code. The mechanisms receive a memory access of an iteration of the portion of code, the memory access having an address for access the memory and an access type indicating at least one of a read or a write access type. An entry in a moving look-back window data structure is generated corresponding to a memory location accessed by the memory access. The entry comprises at least an identification of the address, the access type, and an iteration number corresponding to the iteration of the memory access. The moving look-back window data structure is utilized to determine dependence information for memory accesses over a plurality of iterations of the portion of code. | 09-08-2011 |
20110320785 | Binary Rewriting in Software Instruction Cache - Mechanisms are provided for dynamically rewriting branch instructions in a portion of code. The mechanisms execute a branch instruction in the portion of code. The mechanisms determine if a target instruction of the branch instruction, to which the branch instruction branches, is present in an instruction cache associated with the processor. Moreover, the mechanisms directly branch execution of the portion of code to the target instruction in the instruction cache, without intervention from an instruction cache runtime system, in response to a determination that the target instruction is present in the instruction cache. In addition, the mechanisms redirect execution of the portion of code to the instruction cache runtime system in response to a determination that the target instruction cannot be determined to be present in the instruction cache. | 12-29-2011 |
20110320786 | Dynamically Rewriting Branch Instructions in Response to Cache Line Eviction - Mechanisms are provided for evicting cache lines from an instruction cache of the data processing system. The mechanisms store, for a portion of code in a current cache line, a linked list of call sites that directly or indirectly target the portion of code in the current cache line. A determination is made as to whether the current cache line is to be evicted from the instruction cache. The linked list of call sites is processed to identify one or more rewritten branch instructions having associated branch stubs, that either directly or indirectly target the portion of code in the current cache line. In addition, the one or more rewritten branch instructions are rewritten to restore the one or more rewritten branch instructions to an original state based on information in the associated branch stubs. | 12-29-2011 |
20110321002 | Rewriting Branch Instructions Using Branch Stubs - Mechanisms are provided for rewriting branch instructions in a portion of code. The mechanisms receive a portion of source code having an original branch instruction. The mechanisms generate a branch stub for the original branch instruction. The branch stub stores information about the original branch instruction including an original target address of the original branch instruction. Moreover, the mechanisms rewrite the original branch instruction so that a target of the rewritten branch instruction references the branch stub. In addition, the mechanisms output compiled code including the rewritten branch instruction and the branch stub for execution by a computing device. The branch stub is utilized by the computing device at runtime to determine if execution of the rewritten branch instruction can be redirected directly to a target instruction corresponding to the original target address in an instruction cache of the computing device without intervention by an instruction cache runtime system. | 12-29-2011 |
20110321021 | Arranging Binary Code Based on Call Graph Partitioning - Mechanisms are provided for arranging binary code to reduce instruction cache conflict misses. These mechanisms generate a call graph of a portion of code. Nodes and edges in the call graph are weighted to generate a weighted call graph. The weighted call graph is then partitioned according to the weights, affinities between nodes of the call graph, and the size of cache lines in an instruction cache of the data processing system, so that binary code associated with one or more subsets of nodes in the call graph are combined into individual cache lines based on the partitioning. The binary code corresponding to the partitioned call graph is then output for execution in a computing device. | 12-29-2011 |
20120198169 | Binary Rewriting in Software Instruction Cache - Mechanisms are provided for dynamically rewriting branch instructions in a portion of code. The mechanisms execute a branch instruction in the portion of code. The mechanisms determine if a target instruction of the branch instruction, to which the branch instruction branches, is present in an instruction cache associated with the processor. Moreover, the mechanisms directly branch execution of the portion of code to the target instruction in the instruction cache, without intervention from an instruction cache runtime system, in response to a determination that the target instruction is present in the instruction cache. In addition, the mechanisms redirect execution of the portion of code to the instruction cache runtime system in response to a determination that the target instruction cannot be determined to be present in the instruction cache. | 08-02-2012 |
20120198170 | Dynamically Rewriting Branch Instructions in Response to Cache Line Eviction - Mechanisms are provided for evicting cache lines from an instruction cache of the data processing system. The mechanisms store, for a portion of code in a current cache line, a linked list of call sites that directly or indirectly target the portion of code in the current cache line. A determination is made as to whether the current cache line is to be evicted from the instruction cache. The linked list of call sites is processed to identify one or more rewritten branch instructions having associated branch stubs, that either directly or indirectly target the portion of code in the current cache line. In addition, the one or more rewritten branch instructions are rewritten to restore the one or more rewritten branch instructions to an original state based on information in the associated branch stubs. | 08-02-2012 |
20120198429 | Arranging Binary Code Based on Call Graph Partitioning - Mechanisms are provided for arranging binary code to reduce instruction cache conflict misses. These mechanisms generate a call graph of a portion of code. Nodes and edges in the call graph are weighted to generate a weighted call graph. The weighted call graph is then partitioned according to the weights, affinities between nodes of the call graph, and the size of cache lines in an instruction cache of the data processing system, so that binary code associated with one or more subsets of nodes in the call graph are combined into individual cache lines based on the partitioning. The binary code corresponding to the partitioned call graph is then output for execution in a computing device. | 08-02-2012 |
20120204016 | Rewriting Branch Instructions Using Branch Stubs - Mechanisms are provided for rewriting branch instructions in a portion of code. The mechanisms receive a portion of source code having an original branch instruction. The mechanisms generate a branch stub for the original branch instruction. The branch stub stores information about the original branch instruction including an original target address of the original branch instruction. Moreover, the mechanisms rewrite the original branch instruction so that a target of the rewritten branch instruction references the branch stub. In addition, the mechanisms output compiled code including the rewritten branch instruction and the branch stub for execution by a computing device. The branch stub is utilized by the computing device at runtime to determine if execution of the rewritten branch instruction can be redirected directly to a target instruction corresponding to the original target address in an instruction cache of the computing device without intervention by an instruction cache runtime system. | 08-09-2012 |
20120246654 | Constant Time Worker Thread Allocation Via Configuration Caching - Mechanisms are provided for allocating threads for execution of a parallel region of code. A request for allocation of worker threads to execute the parallel region of code is received from a master thread. Cached thread allocation information identifying prior thread allocations that have been performed for the master thread are accessed. Worker threads are allocated to the master thread based on the cached thread allocation information. The parallel region of code is executed using the allocated worker threads. | 09-27-2012 |
20130151822 | Efficient Enqueuing of Values in SIMD Engines with Permute Unit - Mechanisms, in a data processing system having a processor, for generating enqueued data for performing computations of a conditional branch of code are provided. Mask generation logic of the processor operates to generate a mask representing a subset of iterations of a loop of the code that results in a condition of the conditional branch being satisfied. The mask is used to select data elements from an input data element vector register corresponding to the subset of iterations of the loop of the code that result in the condition of the conditional branch being satisfied. Furthermore, the selected data elements are used to perform computations of the conditional branch of code. Iterations of the loop of the code that do not result in the condition of the conditional branch being satisfied are not used as a basis for performing computations of the conditional branch of code. | 06-13-2013 |
20130283250 | Thread Specific Compiler Generated Customization of Runtime Support for Application Programming Interfaces - Mechanisms are provided for generating a customized runtime library for source code. Source code is analyzed to identify a region of code implementing an application programming interface or programming standard of interest. An invocation tree data structure is generated based on results of analysis of functions of the application programming interface or programming standard of interest that the region of code invokes. A custom runtime library is generated based on the invocation tree data structure. The custom runtime library comprises only a subset of runtime library functions, less than a total number of runtime library functions for the application programming interface or programming standard of interest, actually invoked by the region of code and does not include all runtime library functions in the total number of runtime library functions for the application programming interface or programming standard of interest. | 10-24-2013 |
20140068581 | OPTIMIZED DIVISION OF WORK AMONG PROCESSORS IN A HETEROGENEOUS PROCESSING SYSTEM - A compiler implemented by a computer performs optimized division of work across heterogeneous processors. The compiler divides source code into code sections and characterizes each of the code sections based on pre-defined criteria. Each of the code sections is characterized as at least one of: allocate to a main processor, allocate to a processing element, allocate to one of a parameterized main processor and a parameterized processing element, and indeterminate. The compiler analyzes side-effects and costs of executing the code sections on allocated processors, and transforms the code sections based on results of the analyzing. The transforming includes re-characterizing the code sections for alternate execution in a runtime environment. | 03-06-2014 |
20140068582 | OPTIMIZED DIVISION OF WORK AMONG PROCESSORS IN A HETEROGENEOUS PROCESSING SYSTEM - A compiler implemented by a computer performs optimized division of work across heterogeneous processors. The compiler divides source code into code sections and characterizes each of the code sections based on pre-defined criteria. Each of the code sections is characterized as at least one of: allocate to a main processor, allocate to a processing element, allocate to one of a parameterized main processor and a parameterized processing element, and indeterminate. The compiler analyzes side-effects and costs of executing the code sections on allocated processors, and transforms the code sections based on results of the analyzing. The transforming includes re-characterizing the code sections for alternate execution in a runtime environment. | 03-06-2014 |
20140136857 | POWER-CONSTRAINED COMPILER CODE GENERATION AND SCHEDULING OF WORK IN A HETEROGENEOUS PROCESSING SYSTEM - A heterogeneous processing system includes a compiler for performing power-constrained code generation and scheduling of work in the heterogeneous processing system. The compiler produces source code that is executable by a computer. The compiler performs a method. The method includes dividing a power budget for the heterogeneous processing system into a discrete number of power tokens. Each of the power tokens has an equal value of units of power. The method also includes determining a power requirement for executing a code segment on a processing element of the heterogeneous processing system. The determining is based on characteristics of the processing element and the code segment. The method further includes allocating, to the processing element at runtime, at least one of the power tokens to satisfy the power requirement. | 05-15-2014 |
20140136858 | POWER-CONSTRAINED COMPILER CODE GENERATION AND SCHEDULING OF WORK IN A HETEROGENEOUS PROCESSING SYSTEM - An active memory system includes a computer and an active memory device including layers of memory forming a three-dimensional memory device and individual columns of chips forming vaults in communication with a processing element and logic. The processing element is configured to communicate to the chips and other processing elements. The active memory system also includes a compiler configured to implement a method. The method includes dividing a power budget for the active memory device into a discrete number of power tokens, each of the power tokens having an equal value of units of power. The method also includes determining a power requirement for executing a code segment on the processing element of the active memory device based on characteristics of the processing element and the code segment. The method further includes allocating, to the processing element at runtime, one or more power tokens to satisfy the power requirement. | 05-15-2014 |