Patent application number | Description | Published |
20120331470 | EMITTING COHERENT OUTPUT FROM MULTIPLE THREADS FOR PRINTF - One embodiment of the present invention sets forth a technique for emitting coherent output from multiple threads for the printf() function. Additionally, parallel (not divergent) execution of the threads for the printf() function is maintained when possible to improve run-time performance. Processing of the printf() function is separated into two tasks: gathering of the per-thread data, and formatting of the gathered data according to the formatting codes for display. The threads emit a coherent stream of contiguous segments, where each segment includes the format string for the printf() function and the gathered data for a thread. The coherent stream is written by the threads and read by a display processor. The display processor executes a single thread to format the gathered data according to the format string for display. | 12-27-2012 |
20130014118 | SIMULTANEOUS SUBMISSION TO A MULTI-PRODUCER QUEUE BY MULTIPLE THREADS - One embodiment of the present invention sets forth a technique for ensuring that multiple producer threads may simultaneously write entries in a shared queue and one or more consumers may read valid data from the shared queue. Additionally, writing of the shared queue by the multiple producer threads may occur in parallel and the one or more consumer threads may read the shared queue while the producer threads write the shared queue. A “wait-free” mechanism allows any producer thread that writes a shared queue to advance an inner pointer that is used by a consumer thread to read valid data from the shared queue. | 01-10-2013 |
20130046951 | PARALLEL DYNAMIC MEMORY ALLOCATION USING A NESTED HIERARCHICAL HEAP - One embodiment of the present invention sets forth a technique for dynamically allocating memory using a nested hierarchical heap. A lock-free mechanism is used to access a hierarchical heap data structure for allocating and deallocating memory from the heap. The heap is organized as a series of levels of fixed-size blocks, where all blocks at a given level are the same size. At each lower level of the hierarchy, a collection of N blocks in the lower level equals the size of a single block at the level above. When a thread requests an allocation, one or more blocks at only one level are allocated to the thread. When threads are finished using an allocation, each thread deallocates the respective allocated blocks. When all of the blocks for a level have been deallocated, defragmentation is performed at that level. | 02-21-2013 |
20130198419 | LOCK-FREE FIFO - One embodiment of the present invention sets forth a technique that allows multiple producers and/or consumers to access a first-in first-out sub-system (FIFO) using a “lock-free” mechanism. When two or more producers attempt to push data onto the FIFO simultaneously, only one of the producers succeeds. Similarly, when two or more consumers attempt to pop data from the FIFO simultaneously, only one of the consumers succeeds. However, each producer and consumer is provided with an indication of whether their respective access was successful. Unsuccessful accesses may be retried in the following clock cycle, so that simultaneous accesses are serialized. | 08-01-2013 |
20130198479 | PARALLEL DYNAMIC MEMORY ALLOCATION USING A LOCK-FREE POP-ONLY FIFO - One embodiment of the present invention sets forth a technique for dynamically allocating memory using one or more lock-free pop-only FIFOs. One or more lock-free FIFOs are populated with FIFO nodes, where each FIFO node represents a memory allocation of a predetermined size. Each particular lock-free FIFO includes memory allocations of a single size. Different lock-free FIFOs may include memory allocations for different sizes to service allocation requests for different size memory allocations. A lock-free mechanism is used to pop FIFO nodes from the FIFO. The use of the lock-free FIFO allows multiple consumers to simultaneously attempt to pop the head FIFO node without first obtaining a lock to ensure exclusive access of the FIFO. | 08-01-2013 |
20130198480 | Parallel Dynamic Memory Allocation Using A Lock-Free FIFO - One embodiment of the present invention sets forth a technique for dynamically allocating memory using one or more lock-free FIFOs. One or more lock-free FIFOs are populated with FIFO nodes, where each FIFO node represents a memory allocation of a predetermined size. Each particular lock-free FIFO includes memory allocations of a single size. Different lock-free FIFOs may include memory allocations for different sizes to service allocation requests for different size memory allocations. A lock-free mechanism is used to pop FIFO nodes from the FIFO. The use of the lock-free FIFO allows multiple consumers to simultaneously attempt to pop the head FIFO node without first obtaining a lock to ensure exclusive access of the FIFO. | 08-01-2013 |
20130298133 | TECHNIQUE FOR COMPUTATIONAL NESTED PARALLELISM - One embodiment of the present invention sets forth a technique for performing nested kernel execution within a parallel processing subsystem. The technique involves enabling a parent thread to launch a nested child grid on the parallel processing subsystem, and enabling the parent thread to perform a thread synchronization barrier on the child grid for proper execution semantics between the parent thread and the child grid. This technique advantageously enables the parallel processing subsystem to perform a richer set of programming constructs, such as conditionally executed and nested operations and externally defined library functions without the additional complexity of CPU involvement. | 11-07-2013 |
20130305233 | METHOD AND SYSTEM FOR SEPARATE COMPILATION OF DEVICE CODE EMBEDDED IN HOST CODE - Embodiments of the present invention provide a novel solution that supports the separate compilation of host code and device code used within a heterogeneous programming environment. Embodiments of the present invention are operable to link device code embedded within multiple host object files using a separate device linking operation. Embodiments of the present invention may extract device code from the respective host object files and then link it together to form linked device code. This linked device code may then be embedded back into a host object generated by embodiments of the present invention, which may then be passed to a host linker to form a host executable file. As such, device code may be split into multiple files and then linked together to form a final executable file by embodiments of the present invention. | 11-14-2013 |
20140075160 | SYSTEM AND METHOD FOR SYNCHRONIZING THREADS IN A DIVERGENT REGION OF CODE - A system and method are provided for synchronizing threads in a divergent region of code within a multi-threaded parallel processing system. The method includes, prior to any thread entering a divergent region, generating a count that represents a number of threads that will enter the divergent region. The method also includes using the count within the divergent region to synchronize the threads in the divergent region. | 03-13-2014 |
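The two-phase printf scheme of application 20120331470 (gather per-thread data first, format later in a single display thread) can be sketched on CPU threads. This is an illustrative sketch, not the patented implementation: the `Segment` layout, the single-`int` argument, and all names here are assumptions for brevity.

```cpp
#include <atomic>
#include <cstdio>
#include <string>
#include <vector>

// Phase 1 stores only raw data: a format string plus the per-thread
// argument, written as one contiguous segment of a shared stream.
struct Segment {
    const char* fmt;   // format string for this thread's printf
    int arg;           // gathered per-thread data (one int, for brevity)
};

class PrintfStream {
    std::vector<Segment> stream_;
    std::atomic<size_t> next_{0};
public:
    explicit PrintfStream(size_t capacity) : stream_(capacity) {}

    // Called by worker threads: claim the next slot and store raw data.
    // No formatting happens here, so threads stay convergent.
    void emit(const char* fmt, int arg) {
        size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        stream_[i] = Segment{fmt, arg};
    }

    // Called by the single display thread, after the workers are done:
    // all formatting work is concentrated in this one thread.
    std::string format_all() {
        std::string out;
        char buf[64];
        for (size_t i = 0; i < next_.load(); ++i) {
            std::snprintf(buf, sizeof buf, stream_[i].fmt, stream_[i].arg);
            out += buf;
        }
        return out;
    }
};
```

The design choice mirrors the abstract: emitting threads do only a cheap, uniform store (so they need not diverge on formatting-code branches), while the expensive, inherently serial `snprintf` work runs once in the reader.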
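The level arithmetic in application 20130046951 (N blocks at one level equal a single block at the level above; an allocation claims blocks at exactly one level via a lock-free mechanism) can be illustrated with a small sketch. The constants `kBase`, `kN`, `kLevels`, the bitmask representation, and all names are assumptions, not the patented structure.

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

class HierarchicalHeap {
    static constexpr size_t kBase = 64;  // smallest block size, level 0
    static constexpr size_t kN = 8;      // N child blocks = 1 parent block
    static constexpr int kLevels = 4;
    std::atomic<uint64_t> free_mask_[kLevels];  // 1 bit per free block

public:
    HierarchicalHeap() {
        for (int l = 0; l < kLevels; ++l)
            free_mask_[l].store(~uint64_t{0});   // all blocks start free
    }

    // Block size grows by a factor of kN per level going up.
    static size_t block_size(int level) {
        size_t s = kBase;
        for (int l = 0; l < level; ++l) s *= kN;
        return s;
    }

    // Smallest level whose block can hold `bytes`; the request is then
    // served from that single level only.
    static int level_for(size_t bytes) {
        for (int l = 0; l < kLevels; ++l)
            if (block_size(l) >= bytes) return l;
        return -1;  // larger than the biggest block
    }

    // Lock-free claim of one block at `level`: find a free bit and CAS
    // it off, so only one of several racing threads gets each block.
    int allocate_block(int level) {
        uint64_t mask = free_mask_[level].load(std::memory_order_relaxed);
        while (mask != 0) {
            int bit = 0;
            while (!((mask >> bit) & 1)) ++bit;  // lowest free block
            uint64_t want = mask & ~(uint64_t{1} << bit);
            if (free_mask_[level].compare_exchange_weak(
                    mask, want, std::memory_order_acq_rel))
                return bit;
            // CAS failure reloaded `mask`; retry with the fresh value.
        }
        return -1;  // level exhausted
    }

    void deallocate_block(int level, int bit) {
        free_mask_[level].fetch_or(uint64_t{1} << bit,
                                   std::memory_order_release);
    }
};
```

With `kBase = 64` and `kN = 8`, levels hold 64-, 512-, 4096-, and 32768-byte blocks, so a 100-byte request lands at level 1; per the abstract, defragmentation (coalescing N freed children into one parent block) would run once a level's blocks are all free, which this sketch omits.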
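The access pattern claimed in application 20130198419 (simultaneous pushes or pops race, exactly one wins, and every participant learns whether it succeeded so losers can retry) matches a standard software construction: a bounded FIFO with per-slot sequence numbers, in the style of Vyukov's bounded MPMC queue. The sketch below is that well-known software analogue under assumed names, not the patented hardware mechanism.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>   // N must be a power of two
class LockFreeFifo {
    struct Cell { std::atomic<size_t> seq; T value; };
    Cell cells_[N];
    std::atomic<size_t> head_{0}, tail_{0};
public:
    LockFreeFifo() {
        for (size_t i = 0; i < N; ++i) cells_[i].seq.store(i);
    }

    // One attempt: returns false if the FIFO is full or another
    // producer won the race; the caller may simply retry.
    bool try_push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        Cell& c = cells_[t & (N - 1)];
        if (c.seq.load(std::memory_order_acquire) != t)
            return false;                       // slot not free: full, or lost
        if (!tail_.compare_exchange_strong(t, t + 1,
                                           std::memory_order_relaxed))
            return false;                       // another producer won the CAS
        c.value = v;
        c.seq.store(t + 1, std::memory_order_release);  // publish to consumers
        return true;
    }

    // Symmetric single attempt for consumers.
    bool try_pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        Cell& c = cells_[h & (N - 1)];
        if (c.seq.load(std::memory_order_acquire) != h + 1)
            return false;                       // slot not ready: empty, or lost
        if (!head_.compare_exchange_strong(h, h + 1,
                                           std::memory_order_relaxed))
            return false;                       // another consumer won the CAS
        out = c.value;
        c.seq.store(h + N, std::memory_order_release);  // recycle the slot
        return true;
    }
};
```

The boolean return is the "indication of whether their respective access was successful" from the abstract: a losing thread's attempt has no side effect, so retrying on the next attempt (the abstract's next clock cycle) serializes simultaneous accesses.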
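The count-based synchronization of application 20140075160 can be transposed from GPU warps to CPU threads as a sketch: before any thread enters the divergent branch, compute how many will enter; inside the branch, a barrier sized by that count synchronizes only the entrants. The `CountBarrier` type, the even-thread predicate, and `run_demo` are illustrative assumptions.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Barrier for a known number of participants: each entrant arrives,
// then spins until all `expected` entrants have arrived.
struct CountBarrier {
    std::atomic<int> arrived{0};
    const int expected;
    explicit CountBarrier(int n) : expected(n) {}
    void arrive_and_wait() {
        arrived.fetch_add(1, std::memory_order_acq_rel);
        while (arrived.load(std::memory_order_acquire) < expected)
            std::this_thread::yield();
    }
};

// Returns true iff every entrant, after the barrier, observed the
// pre-barrier writes of every other entrant.
bool run_demo(int num_threads) {
    auto enters = [](int tid) { return tid % 2 == 0; };  // divergent branch

    // Step 1: BEFORE any thread enters, count the threads that will.
    int count = 0;
    for (int t = 0; t < num_threads; ++t) count += enters(t);

    CountBarrier barrier(count);
    std::vector<std::atomic<int>> wrote(num_threads);
    for (auto& w : wrote) w.store(0);
    std::atomic<bool> ok{true};

    std::vector<std::thread> threads;
    for (int tid = 0; tid < num_threads; ++tid)
        threads.emplace_back([&, tid] {
            if (!enters(tid)) return;          // threads diverge here
            wrote[tid].store(1, std::memory_order_release);
            barrier.arrive_and_wait();         // Step 2: sync only entrants
            for (int t = 0; t < num_threads; ++t)
                if (enters(t) &&
                    wrote[t].load(std::memory_order_acquire) != 1)
                    ok.store(false);
        });
    for (auto& th : threads) th.join();
    return ok.load();
}
```

Sizing the barrier by a pre-computed count is what avoids deadlock: a barrier expecting all `num_threads` threads would wait forever inside a branch that only some threads take.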