| Patent application number | Description | Published |
| 20080209127 | System and method for efficient implementation of software-managed cache - A system and method for an efficient implementation of a software-managed cache is presented. When an application thread executes on a simple processor, the application thread uses a conditional data select instruction for eliminating a conditional branch instruction when accessing a software-managed cache. An application thread issues a conditional data select instruction (DMA transfer) after a cache directory lookup, wherein the size of the requested data is dependent upon the outcome of the cache directory lookup. When the cache directory lookup results in a cache hit, the application thread requests a transfer of zero bits of data, which results in a DMA controller (DMAC) performing a no-op instruction. When the cache directory lookup results in a cache miss, the application thread requests a data block transfer the size of a corresponding cache line. | 08-28-2008 |
| 20080250414 | Dynamically Partitioning Processing Across A Plurality of Heterogeneous Processors - A program is into at least two object files: one object file for each of the supported processor environments. During compilation, code characteristics, such as data locality, computational intensity, and data parallelism, are analyzed and recorded in the object file. During run time, the code characteristics are combined with runtime considerations, such as the current load on the processors and the size of the data being processed, to arrive at an overall value. The overall value is then used to determine which of the processors will be assigned the task. The values are assigned based on the characteristics of the various processors. For example, if one processor is better at handling intensive computations against large streams of data, programs that are highly computationally intensive and process large quantities of data are weighted in favor of that processor. The corresponding object is then loaded and executed on the assigned processor. | 10-09-2008 |
| 20080271003 | Balancing Computational Load Across a Plurality of Processors - Computational load is balanced across a plurality of processors. Source code subtasks are compiled into byte code subtasks whereby the byte code subtasks are translated into processor-specific object code subtasks at runtime. The processor-type selection is based upon one of three approaches which are 1) a brute force approach, 2) higher-level approach, or 3) processor availability approach. Each object code subtask is loaded in a corresponding processor type for execution. In one embodiment, a compiler stores a pointer in a byte code file that references the location of a byte code subtask. In this embodiment, the byte code subtask is stored in a shared library and, at runtime, a runtime loader uses the pointer to identify the location of the byte code subtask in order to translate the byte code subtask. | 10-30-2008 |
| 20080297506 | Ray Tracing with Depth Buffered Display - An image is generated that includes ray traced pixel data and rasterized pixel data. A synergistic processing unit (SPU) uses a rendering algorithm to generate ray traced data for objects that require high-quality image rendering. The ray traced data is fragmented, whereby each fragment includes a ray traced pixel depth value and a ray traced pixel color value. A rasterizer compares ray traced pixel depth values to corresponding rasterized pixel depth values, and overwrites ray traced pixel data with rasterized pixel data when the corresponding rasterized fragment is “closer” to a viewing point, which results in composite data. A display subsystem uses the resultant composite data to generate an image on a user's display. | 12-04-2008 |
| 20090037620 | Apparatus and Method for Efficient Communication of Producer/Consumer Buffer Status - An apparatus and method for efficient communication of producer/consumer buffer status are provided. With the apparatus and method, devices in a data processing system notify each other of updates to head and tail pointers of a shared buffer region when the devices perform operations on the shared buffer region using signal notification channels of the devices. Thus, when a producer device that produces data to the shared buffer region writes data to the shared buffer region, an update to the head pointer is written to a signal notification channel of a consumer device. When a consumer device reads data from the shared buffer region, the consumer device writes a tail pointer update to a signal notification channel of the producer device. In addition, channels may operate in a blocking mode so that the corresponding device is kept in a low power state until an update is received over the channel. | 02-05-2009 |
| 20090138689 | Partitioning Processor Resources Based on Memory Usage - Processor resources are partitioned based on memory usage. A compiler determines the extent to which a process is memory-bound and accordingly divides the process into a number of threads. When a first thread encounters a prolonged instruction, the compiler inserts a conditional branch to a second thread. When the second thread encounters a prolonged instruction, a conditional branch to a third thread is executed. This continues until the last thread conditionally branches back to the first thread. An indirect segmented register file is used so that the “return to” and “branch to” logical registers within each thread are the same (e.g., R | 05-28-2009 |
| 20090327613 | System and Method for a Software Managed Cache in a Multiprocessing Environment - A method for implementing a software-managed cache comprises determining an object identifier (ID) for each of a first set of objects of a plurality of objects resident in a local memory, to generate a first cache table, the first cache table comprising a plurality of entries. Each object comprises an object ID and an effective address. The method receives a request for an object, the request comprising an object ID. The method compares the received object ID with the entries in the first cache table. In the event the received object ID matches an entry in the first cache table, the method returns the matching entry in response to the request. In the event the received object ID does not match an entry in the first cache table, the method calculates an effective address in the local memory of the object associated with the object ID. | 12-31-2009 |
| 20100033493 | System and Method for Iterative Interactive Ray Tracing in a Multiprocessor Environment - A method comprises receiving scene model data including a scene geometry model and a plurality of pixel data describing objects arranged in a scene. The method generates a primary ray based on a selected first pixel data. In the event the primary ray intersects an object in the scene, the method determines primary hit color data and generates a plurality of secondary rays. The method groups the secondary packets and arranges the packets in a queue based on the octant of each direction vector in the secondary ray packet. The method generates secondary color data based on the secondary ray packets in the queue and generates a pixel color based on the primary hit color data, and the secondary color data. The method generates an image based on the pixel color for the pixel data. | 02-11-2010 |
| 20100141652 | System and Method for Photorealistic Imaging Using Ambient Occlusion - Scene model data, including a scene geometry model and a plurality of pixel data describing objects arranged in a scene, is received. A first pixel data of the plurality of pixel data is selected. A primary pixel color and a primary ray are generated based on the first pixel data. If the primary ray intersects an object in the scene, an intersection point, P is determined. A surface normal, N, is determined based on the object intersected and the intersection point, P. A primary hit color is determined based on the intersection point, P. The primary pixel color is modified based on the primary hit color. A plurality of ambient occlusion (AO) rays are generated based on the intersection point, P and the surface normal, N, with each AO ray having a direction, D. For each AO ray, the AO ray direction is reversed, D, the AO ray origin, O, is set to a point outside the scene. Each AO ray is marched from the AO ray origin into the scene to the intersection point, P. If an AO ray intersects an object before reaching point P, that AO ray is excluded from ambient occlusion calculations. If an AO ray does not intersect an object before reaching point P, that ray is included in ambient occlusion calculations. Ambient occlusion is estimated based on included AO rays. The primary pixel color is shaded based on the ambient occlusion and the primary hit color and an image is generated based on the primary pixel color for the pixel data. | 06-10-2010 |
| 20100141665 | System and Method for Photorealistic Imaging Workload Distribution - A graphics client receives a frame, the frame comprising scene model data. A server load balancing factor is set based on the scene model data. A prospective rendering factor is set based on the scene model data. The frame is partitioned into a plurality of server bands based on the server load balancing factor and the prospective rendering factor. The server bands are distributed to a plurality of compute servers. Processed server bands are received from the compute servers. A processed frame is assembled based on the received processed server bands. The processed frame is transmitted for display to a user as an image. | 06-10-2010 |
| 20100166322 | Security Screening Image Analysis Simplification Through Object Pattern Identification - A mechanism is provided for security screening image analysis simplification through object pattern identification. Popular consumer electronics and other items are scanned in a control system, which creates an electronic signature for each known object. The system may reduce the signature to a hash value and place each signature for each known object in a “known good” storage set. For example, popular mobile phones, laptop computers, digital cameras, and the like may be scanned for the known good signature database. At the time of scan, such as at an airport, objects in a bag may be rotated to a common axis alignment and transformed to the same signature or hash value to match against the known good signature database. If an item matches, the scanning system marks it as a known safe object. | 07-01-2010 |
| 20110161548 | Efficient Multi-Level Software Cache Using SIMD Vector Permute Functionality - A cache manager receives a request for data, which includes a requested effective address. The cache manager determines whether the requested effective address matches a most recently used effective address stored in a mapped tag vector. When the most recently used effective address matches the requested effective address, the cache manager identifies a corresponding cache location and retrieves the data from the identified cache location. However, when the most recently used effective address fails to match the requested effective address, the cache manager determines whether the requested effective address matches a subsequent effective address stored in the mapped tag vector. When the cache manager determines a match to a subsequent effective address, the cache manager identifies a different cache location corresponding to the subsequent effective address and retrieves the data from the different cache location. | 06-30-2011 |
| 20110161608 | METHOD TO CUSTOMIZE FUNCTION BEHAVIOR BASED ON CACHE AND SCHEDULING PARAMETERS OF A MEMORY ARGUMENT - Disclosed are a method, a system and a computer program product of operating a data processing system that can include or be coupled to multiple processor cores. In one or more embodiments, each of multiple memory objects can be populated with work items and can be associated with attributes that can include information which can be used to describe data of each memory object and/or which can be used to process data of each memory object. The attributes can be used to indicate one or more of a cache policy, a cache size, and a cache line size, among others. In one or more embodiments, the attributes can be used as a history of how each memory object is used. The attributes can be used to indicate cache history statistics (e.g., a hit rate, a miss rate, etc.). | 06-30-2011 |
| 20110161734 | PROCESS INTEGRITY IN A MULTIPLE PROCESSOR SYSTEM - Disclosed are a method, a system and a computer program product of operating a data processing system that can include or be coupled to multiple processor cores. In one or more embodiments, an error can be determined while two or more processor cores are processing a first group of two or more work items, and the error can be signaled to an application. The application can determine a state of progress of processing the two or more work items and at least one dependency from the state of progress. In one or more embodiments, a second group of two or more work items that are scheduled for processing can be unscheduled, in response to determining the error. In one or more embodiments, the application can process at least one work item that caused the error, and the second group of two or more work items can be rescheduled for processing. | 06-30-2011 |
| 20110161935 | METHOD FOR MANAGING HARDWARE RESOURCES WITHIN A SIMULTANEOUS MULTI-THREADED PROCESSING SYSTEM - A method for managing hardware resources and threads within a data processing system is disclosed. Compilation attributes of a function are collected during and after the compilation of the function. The pre-processing attributes of the function are also collected before the execution of the function. The collected attributes of the function are then analyzed, and a runtime configuration is assigned to the function based of the result of the attribute analysis. The runtime configuration may include, for example, the designation of the function to be executed under either a single-threaded mode or a simultaneous multi-threaded mode. During the execution of the function, real-time attributes of the function are being continuously collected. If necessary, the runtime configuration under which the function is being executed can be changed based on the real-time attributes collected during the execution of the function. | 06-30-2011 |
| 20110161943 | METHOD TO DYNAMICALLY DISTRIBUTE A MULTI-DIMENSIONAL WORK SET ACROSS A MULTI-CORE SYSTEM - A method provides efficient dispatch/completion of an N Dimensional (ND) Range command in a data processing system (DPS). The method comprises: a compiler generating one or more commands from received program instructions; ND Range work processing (WP) logic determining when a command generated by the compiler will be implemented over an ND configuration of operands, where N is greater than one (1); automatically decomposing the ND configuration of operands into a one (1) dimension (1D) work element comprising P sequentially ordered work items that each represent one of the operands; placing the 1D work element within a command queue of the DPS; enabling sequential dispatching of 1D work items in ordered sequence from to one or more processing units; and generating an ND Range output by mapping the 1D work output result to an ND position corresponding to an original location of the operand represented by the 1D work item. | 06-30-2011 |
| 20110161970 | METHOD TO REDUCE QUEUE SYNCHRONIZATION OF MULTIPLE WORK ITEMS IN A SYSTEM WITH HIGH MEMORY LATENCY BETWEEN COMPUTE NODES - Disclosed are a method, a system and a computer program product of operating a data processing system that can include or be coupled to multiple processor cores. The multiple processor cores can be coupled to a memory that can include multiple priority queues associated with multiple respective priorities and store multiple work items. Work items stored in the multiple priority queues can be associated with a bit mask which is associated with a respective priority queue and can be routed to respective groups of one or more processors based on the associated bit mask. In one or more embodiments, at least two groups of processor cores can include at least one processor core that is common to both of the at least two groups of processor cores. | 06-30-2011 |
| 20110161975 | REDUCING CROSS QUEUE SYNCHRONIZATION ON SYSTEMS WITH LOW MEMORY LATENCY ACROSS DISTRIBUTED PROCESSING NODES - A method for efficient dispatch/completion of a work element within a multi-node data processing system. The method comprises: selecting specific processing units from among the processing nodes to complete execution of a work element that has multiple individual work items that may be independently executed by different ones of the processing units; generating an allocated processor unit (APU) bit mask that identifies at least one of the processing units that has been selected; placing the work element in a first entry of a global command queue (GCQ); associating the APU mask with the work element in the GCQ; and responsive to receipt at the GCQ of work requests from each of the multiple processing nodes or the processing units, enabling only the selected specific ones of the processing nodes or the processing units to be able to retrieve work from the work element in the GCQ. | 06-30-2011 |
| 20110161976 | METHOD TO REDUCE QUEUE SYNCHRONIZATION OF MULTIPLE WORK ITEMS IN A SYSTEM WITH HIGH MEMORY LATENCY BETWEEN PROCESSING NODES - A method efficiently dispatches/completes a work element within a multi-node, data processing system that has a global command queue (GCQ) and at least one high latency node. The method comprises: at the high latency processor node, work scheduling logic establishing a local command/work queue (LCQ) in which multiple work items for execution by local processing units can be staged prior to execution; a first local processing unit retrieving via a work request a larger chunk size of work than can be completed in a normal work completion/execution cycle by the local processing unit; storing the larger chunk size of work retrieved in a local command/work queue (LCQ); enabling the first local processing unit to locally schedule and complete portions of the work stored within the LCQ; and transmitting a next work request to the GCQ only when all the work within the LCQ has been dispatched by the local processing units. | 06-30-2011 |