Patent application number | Description | Published |
20080201699 | Efficient Data Reorganization to Satisfy Data Alignment Constraints - Vectorizing misaligned references in compiled code for SIMD architectures that support only aligned loads and stores is presented. In the framework presented herein, a loop is first simdized as if the memory unit imposes no alignment constraints. The compiler then inserts data reorganization operations to satisfy the actual alignment requirement of the hardware. Finally, the code generation algorithm generates SIMD codes based on the data reorganization graph, addressing realistic issues such as runtime alignments, unknown loop bounds, residue iteration counts, and multiple statements with arbitrary alignment combinations. Beyond generating a valid simdization, a preferred embodiment further improves the quality of the generated codes. Four stream-shift placement policies are disclosed, which minimize the number of data reorganization generated by the alignment handling. | 08-21-2008 |
20080222391 | Apparatus and Method for Optimizing Scalar Code Executed on a SIMD Engine by Alignment of SIMD Slots - An apparatus and method for optimizing scalar code executed on a single instruction multiple data (SIMD) engine is provided that aligns the slots of SIMD registers. With the apparatus and method, a compiler is provided that parses source code and, for each statement in the program, generates an expression tree. The compiler inspects all storage inputs to scalar operations in the expression tree to determine their alignment in the SIMD registers. This alignment is propagated up the expression tree from the leaves. When the alignments of two operands in the expression tree are the same, the resulting alignment is the shared value. When the alignments of two operands in the expression tree are different, one operand is shifted. For shifted operands, a shift operation is inserted in the expression tree. The executable code is then generated for the expression tree and shifts are inserted where indicated. | 09-11-2008 |
20080229291 | Compiler Implemented Software Cache Apparatus and Method in which Non-Aliased Explicitly Fetched Data are Excluded - A compiler implemented software cache apparatus and method in which non-aliased explicitly fetched data are excluded are provided. With the mechanisms of the illustrative embodiments, a compiler uses a forward data flow analysis to prove that there is no alias between the cached data and explicitly fetched data. Explicitly fetched data that has no alias in the cached data are excluded from the software cache. Explicitly fetched data that has aliases in the cached data are allowed to be stored in the software cache. In this way, there is no runtime overhead to maintain the correctness of the two copies of data. Moreover, the number of lines of the software cache that must be protected from eviction is decreased. This leads to a decrease in the amount of computation cycles required by the cache miss handler when evicting cache lines during cache miss handling. | 09-18-2008 |
20080229295 | Ensuring Maximum Code Motion of Accesses to DMA Buffers - A “kill” intrinsic that may be used in programs for designating specific data objects as having been “killed” by a preceding action is provided. The concept of a data object being “killed” is that the compiler is informed that no operations (e.g., loads and stores) on that data object, or its aliases, can be moved across the point in the program flow where the data object is designated as having been “killed.” The “kill” intrinsic limits the reordering capability of an optimization scheduler of a compiler with regard to operations performed on “killed” data objects. The “kill” intrinsic may be used with DMA operations. Data objects being DMA'ed from a local store of a processor may be “killed” through use of the “kill” intrinsic prior to submitting the DMA request. Data objects being DMA'ed to the local store of the processor may be “killed” after verifying the transfer completes. | 09-18-2008 |
20080229298 | Compiler Method for Employing Multiple Autonomous Synergistic Processors to Simultaneously Operate on Longer Vectors of Data - A compiler includes a mechanism for employing multiple synergistic processors to execute long vectors. The compiler receives a single source program. The compiler identifies vectorizable loop code in the single source program and extracts the vectorizable loop code from the single source program. The compiler then compiles the extracted vectorizable loop code for a plurality of synergistic processors. The compiler also compiles a remainder of the single source program for a principal processor to form an executable main program such that the executable main program controls operation of the executable vectorizable loop code on the plurality of synergistic processors. | 09-18-2008 |
20080256521 | Apparatus and Method for Partitioning Programs Between a General Purpose Core and One or More Accelerators - An apparatus and method for partitioning programs between a general purpose core and one or more accelerators are provided. With the apparatus and method, a compiler front end is provided for converting a program source code in a corresponding high level programming language into an intermediate code representation. This intermediate code representation is provided to an interprocedural optimizer which determines which core processor or accelerator each portion of the program should execute on and partitions the program into sub-programs based on this set of decisions. The interprocedural optimizer may further add instructions to the partitions to coordinate and synchronize the sub-programs as required. Each sub-program is compiled on an appropriate compiler backend for the instruction set architecture of the particular core processor or accelerator selected to execute the sub-program. The compiled sub-programs and then linked to thereby generate an executable program. | 10-16-2008 |
20080282064 | System and Method for Speculative Thread Assist in a Heterogeneous Processing Environment - A system and method for speculative assistance to a thread in a heterogeneous processing environment is provided. A first set of instructions is identified in a source code representation (e.g., a source code file) that is suitable for speculative execution. The identified set of instructions are analyzed to determine the processing requirements. Based on the analysis, a processor type is identified that will be used to execute the identified first set of instructions based. The processor type is selected from more than one processor types that are included in the heterogeneous processing environment. The heterogeneous processing environment includes more than one heterogeneous processing cores in a single silicon substrate. The various processing cores can utilize different instruction set architectures (ISAs). An object code representation is then generated for the identified first set of instructions with the object code representation being adapted to execute on the determined type of processor. | 11-13-2008 |
20090055588 | Performing Useful Computations While Waiting for a Line in a System with a Software Implemented Cache - Mechanisms for performing useful computations during a software cache reload operation are provided. With the illustrative embodiments, in order to perform software caching, a compiler takes original source code, and while compiling the source code, inserts explicit cache lookup instructions into appropriate portions of the source code where cacheable variables are referenced. In addition, the compiler inserts a cache miss handler routine that is used to branch execution of the code to a cache miss handler if the cache lookup instructions result in a cache miss. The cache miss handler, prior to performing a wait operation for waiting for the data to be retrieved from the backing store, branches execution to an independent subroutine identified by a compiler. The independent subroutine is executed while the data is being retrieved from the backing store such that useful work is performed. | 02-26-2009 |
20090119652 | Computer Program Functional Partitioning System for Heterogeneous Multi-processing Systems - The present invention provides for a system for computer program functional partitioning for heterogeneous multi-processing systems. At least one system parameter of a computer system comprising one or more disparate processing nodes is identified. Computer program code comprising a program to be run on the computer system is received. A whole program representation is generated based on received computer program code. At least one single-entry-single-exit (SESE) region is identified based on the whole program representation. At least one node-specific SESE region is identified based on identified SESE regions and the at least one system parameter. Each node-specific SESE region is grouped into a node-specific subroutine. Each node-specific subroutine is compiled based on a specified node characteristic. The computer program code is modified based on the node-specific subroutines and the modified computer program code is compiled. | 05-07-2009 |
20090158019 | Computer Program Code Size Partitioning System for Multiple Memory Multi-Processing Systems - The present invention provides for a system for computer program code size partitioning for multiple memory multi-processor systems. At least one system parameter of a computer system comprising one or more disparate processing nodes is identified. Computer program code comprising a program to be run on the computer system is received. A program representation based on received computer program code is generated. At least one single-entry-single-exit (SESE) region is identified based on the whole program representation. At least one SESE region of less than a certain size (store-size-specific) is identified based on identified SESE regions and the at least one system parameter. Each store-size-specific SESE region is grouped into a node-specific subroutine. The non node-specific parts of the computer program code are modified based on the partitioning into node-specific subroutines. The modified computer program code including each node-specific subroutine is compiled based on a specified node characteristic. | 06-18-2009 |
20110145503 | ON-LINE OPTIMIZATION OF SOFTWARE INSTRUCTION CACHE - A method for computing includes executing a program, including multiple cacheable lines of executable code, on a processor having a software-managed cache. A run-time cache management routine running on the processor is used to assemble a profile of inter-line jumps occurring in the software-managed cache while executing the program. Based on the profile, an optimized layout of the lines in the code is computed, and the lines of the program are re-ordered in accordance with the optimized layout while continuing to execute the program. | 06-16-2011 |