Patent application number | Description | Published |
20090304268 | System and Method for Parallelizing and Accelerating Learning Machine Training and Classification Using a Massively Parallel Accelerator - A method and system for training an apparatus to recognize a pattern includes providing the apparatus with a host processor executing steps of a machine learning process; providing the apparatus with an accelerator including at least two processors; inputting training pattern data into the host processor; determining coefficient changes in the machine learning process with the host processor using the training pattern data; transferring the training data to the accelerator; determining kernel dot-products with the at least two processors of the accelerator using the training data; and transferring the dot-products back to the host processor. | 12-10-2009 |
20110029471 | DYNAMICALLY CONFIGURABLE, MULTI-PORTED CO-PROCESSOR FOR CONVOLUTIONAL NEURAL NETWORKS - A coprocessor and method for processing convolutional neural networks includes a configurable input switch coupled to an input. A set of convolver elements is enabled in accordance with the input switch. An output switch is configured to receive outputs from the set of convolver elements to provide data to output branches. A controller is configured to provide control signals to the input switch and the output switch such that the set of convolver elements is rendered active and a number of output branches are selected for a given cycle in accordance with the control signals. | 02-03-2011 |
20110119467 | MASSIVELY PARALLEL, SMART MEMORY BASED ACCELERATOR - Systems and methods for massively parallel processing on an accelerator that includes a plurality of processing cores. Each processing core includes multiple processing chains configured to perform parallel computations, each of which includes a plurality of interconnected processing elements. The cores further include multiple smart memory blocks configured to store and process data, each memory block accepting the output of one of the plurality of processing chains. The cores communicate with at least one off-chip memory bank. | 05-19-2011 |
20110173155 | DATA AWARE SCHEDULING ON HETEROGENEOUS PLATFORMS - Systems and methods for data-aware scheduling of applications on a heterogeneous platform having at least one central processing unit (CPU) and at least one accelerator. Such systems and methods include a function call handling module configured to intercept, analyze, and schedule library calls on a processing element. The function call handling module further includes a function call interception module configured to intercept function calls to predefined libraries, a function call analysis module configured to analyze argument size and location, and a function call redirection module configured to schedule library calls and data transfers. The systems and methods also use a memory unification module, configured to keep data coherent between memories associated with the at least one CPU and the at least one accelerator based on the output of the function call redirection module. | 07-14-2011 |
20120079298 | ENERGY EFFICIENT HETEROGENEOUS SYSTEMS - Low-power systems and methods are disclosed for executing application software on a general purpose processor and a plurality of accelerators with a runtime controller. The runtime controller splits a workload across the processor and the accelerators to minimize energy. The method includes building one or more performance models in an application-agnostic manner, monitoring system performance in real time, and adjusting the workload split to minimize energy while conforming to a target quality of service (QoS). | 03-29-2012 |
20130091507 | OPTIMIZING DATA WAREHOUSING APPLICATIONS FOR GPUS USING DYNAMIC STREAM SCHEDULING AND DISPATCH OF FUSED AND SPLIT KERNELS - Systems and methods are disclosed for managing a processor and one or more co-processors for a database application whose queries have been processed into an intermediate representation (IR) containing kernels of the database application that have been fused and split; such kernels are dynamically scheduled on CUDA streams and dynamically dispatched to GPU devices by estimating execution time, in order to achieve high performance. | 04-11-2013 |
20130191612 | INTERFERENCE-DRIVEN RESOURCE MANAGEMENT FOR GPU-BASED HETEROGENEOUS CLUSTERS - Systems and methods are disclosed that share coprocessor resources between two or more applications in a computing cluster using a job selector to receive jobs from a job queue; a node selector coupled to the job selector; an offline profiler with an interference prediction model; a coprocessor dynamic interference detection module; and a coprocessor interference response module. | 07-25-2013 |
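Several of the applications above turn on the same decision: given a library call, estimate the cost of running it on the host CPU versus an accelerator, counting data movement, and redirect the call to the cheaper device (the function call redirection of 20110173155, the dispatch-by-estimated-execution-time of 20130091507). A minimal Python sketch of that cost comparison follows; the throughput and bandwidth figures and the `pick_device` helper are illustrative assumptions, not the claimed implementations.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    gflops: float     # assumed sustained compute throughput
    link_gbps: float  # host<->device transfer bandwidth

def pick_device(arg_bytes, flops, resident_on, devices):
    """Data-aware redirection sketch: run the call on the device with the
    lowest estimated transfer-plus-compute time."""
    best, best_cost = None, float("inf")
    for dev in devices:
        # No transfer cost if the arguments already live on this device.
        transfer = 0.0 if dev.name == resident_on else arg_bytes * 8 / (dev.link_gbps * 1e9)
        compute = flops / (dev.gflops * 1e9)
        if transfer + compute < best_cost:
            best, best_cost = dev, transfer + compute
    return best.name, best_cost

# A 4096x4096 single-precision matrix multiply whose inputs reside in host memory.
cpu = Device("cpu", gflops=50.0, link_gbps=float("inf"))  # no transfer needed
gpu = Device("gpu", gflops=1000.0, link_gbps=64.0)
n = 4096
print(pick_device(arg_bytes=3 * n * n * 4, flops=2 * n**3,
                  resident_on="cpu", devices=[cpu, gpu]))  # -> ('gpu', ...)
```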
Patent application number | Description | Published |
20120124591 | SCHEDULER AND RESOURCE MANAGER FOR COPROCESSOR-BASED HETEROGENEOUS CLUSTERS - A system and method for scheduling client-server applications onto heterogeneous clusters includes storing at least one client request of at least one application in a pending request list on a computer readable storage medium. A priority metric is computed for each application, where the computed priority metric is applied to each client request belonging to that application. The priority metric is determined based on estimated performance of the client request and load on the pending request list. The at least one client request of the at least one application is scheduled based on the priority metric onto one or more heterogeneous resources. | 05-17-2012 |
20120233486 | LOAD BALANCING ON HETEROGENEOUS PROCESSING CLUSTERS IMPLEMENTING PARALLEL EXECUTION - Methods and systems for managing data loads on a cluster of processors that implement an iterative procedure through parallel processing of data for the procedure are disclosed. One method includes monitoring, for at least one iteration of the procedure, completion times of a plurality of different processing phases that are undergone by each of the processors in a given iteration. The method further includes determining whether a load imbalance factor threshold is exceeded in the given iteration based on the completion times for the given iteration. In addition, the data is repartitioned by reassigning the data to the processors based on predicted dependencies between assigned data units of the data and completion times of a plurality of the processors for at least two of the phases. Further, the parallel processing is implemented on the cluster of processors in accordance with the reassignment. | 09-13-2012 |
20140208072 | USER-LEVEL MANAGER TO HANDLE MULTI-PROCESSING ON MANY-CORE COPROCESSOR-BASED SYSTEMS - A method is disclosed to manage a multi-processor system with one or more multiple-core coprocessors by intercepting coprocessor offload infrastructure application program interface (API) calls; scheduling user processes to run on one of the coprocessors; scheduling offloads within user processes to run on one of the coprocessors; and affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user. | 07-24-2014 |
20140208327 | METHOD FOR SIMULTANEOUS SCHEDULING OF PROCESSES AND OFFLOADING COMPUTATION ON MANY-CORE COPROCESSORS - A method is disclosed to manage a multi-processor system with one or more manycore devices, by managing real-time bag-of-tasks applications for a cluster, wherein each task runs on a single server node and uses the offload programming model, and wherein each task has a deadline and three specific resource requirements: total processing time, a certain number of manycore devices, and peak memory on each device; when a new task arrives, querying each node scheduler to determine which node can best accept the task, with each node scheduler responding with an estimated completion time and a confidence level, wherein the node schedulers use an urgency-based heuristic to schedule each task and its offloads; responding to an accept/reject query phase, wherein the cluster scheduler sends the task requirements to each node and queries whether the node can accept the task with an estimated completion time and confidence level; and scheduling tasks and offloads using an aging- and urgency-based heuristic, wherein the aging guarantees fairness, and the urgency prioritizes tasks and offloads so that the maximal number of deadlines is met. | 07-24-2014 |
20140208331 | METHODS OF PROCESSING CORE SELECTION FOR APPLICATIONS ON MANYCORE PROCESSORS - A runtime method is disclosed that dynamically sets up core containers and thread-to-core affinity for processes running on manycore coprocessors. The method is completely transparent to user applications and incurs low runtime overhead. The method is implemented within a user-space middleware that also performs scheduling and resource management for both offload and native applications using the manycore coprocessors. | 07-24-2014 |
20140237477 | SIMULTANEOUS SCHEDULING OF PROCESSES AND OFFLOADING COMPUTATION ON MANY-CORE COPROCESSORS - Methods and systems for scheduling jobs to manycore nodes in a cluster include selecting a job to run according to the job's wait time and the job's expected execution time; sending job requirements to all nodes in a cluster, where each node includes a manycore processor; determining at each node whether said node has sufficient resources to ever satisfy the job requirements and, if no node has sufficient resources, deleting the job; creating a list of nodes that have sufficient free resources at a present time to satisfy the job requirements; and assigning the job to a node, based on a difference between an expected execution time and associated confidence value for each node and a hypothetical fastest execution time and associated hypothetical maximum confidence value. | 08-21-2014 |
20150066988 | SCALABLE PARALLEL SORTING ON MANYCORE-BASED COMPUTING SYSTEMS - Systems and methods for sorting data, including chunking unsorted data such that each chunk is of a size that fits within a last-level cache of the system. One or more threads are instantiated in each physical core of the system, and chunks assigned to physical cores are distributed evenly across the threads on those cores. Subchunks in the physical cores are sorted using vector intrinsics, the subchunks being data assigned to the threads in the physical cores, and the subchunks are merged to generate sorted large chunks. A binary tree, which includes leaf nodes that correspond to the sorted large chunks, is built, leaf nodes are assigned to threads, and tree nodes are assigned to a circular buffer, wherein the circular buffer is lock- and synchronization-free. The sorted large chunks are merged to generate sorted data as output. | 03-05-2015 |
20150113542 | KNAPSACK-BASED SHARING-AWARE SCHEDULER FOR COPROCESSOR-BASED COMPUTE CLUSTERS - A method is provided for controlling a compute cluster having a plurality of nodes. Each of the plurality of nodes has a respective computing device with a main server and one or more coprocessor-based hardware accelerators. The method includes receiving a plurality of jobs for scheduling. The method further includes scheduling the plurality of jobs across the plurality of nodes responsive to a knapsack-based sharing-aware schedule generated by a knapsack-based sharing-aware scheduler. The knapsack-based sharing-aware schedule is generated to co-locate together on a same computing device certain ones of the plurality of jobs that are mutually compatible based on a set of requirements whose fulfillment is determined using a knapsack-based sharing-aware technique that uses memory as a knapsack capacity and minimizes makespan while adhering to coprocessor memory and thread resource constraints. | 04-23-2015 |
20150212733 | SYSTEMS AND METHODS FOR SWAPPING PINNED MEMORY BUFFERS - Systems and methods for swapping pinned memory regions out and in between main memory and a separate storage location in a system, including establishing an offload buffer in an interposing library; and swapping out pinned memory regions by transferring offload buffer data from a coprocessor memory to a host processor memory, and unregistering and unmapping a memory region employed by the offload buffer from the interposing library, wherein the interposing library is pre-loaded on the coprocessor and collects and stores information employed during the swapping out. The pinned memory regions are swapped in by mapping and re-registering the files to the memory region employed by the offload buffer, and transferring the offload buffer data from the host processor memory back to the re-registered memory region. | 07-30-2015 |
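The node schedulers of 20140208327 and 20140237477 rank work with an aging- and urgency-based heuristic: tasks nearer their deadlines run sooner, and accumulated wait time boosts priority so that no task starves. The Python sketch below shows one plausible form of such a score; the `aging_weight` knob and the exact slack formula are assumptions for illustration, not the claimed heuristic.

```python
import time

class UrgencyAgingScheduler:
    """Sketch: serve the task with the smallest deadline slack (urgency),
    discounted by time spent waiting (aging) to guarantee fairness."""

    def __init__(self, aging_weight=0.1):
        self.aging_weight = aging_weight  # assumed tuning knob
        self.pending = []                 # (task_id, arrival, deadline, est_runtime)

    def submit(self, task_id, deadline, est_runtime, now=None):
        arrival = now if now is not None else time.time()
        self.pending.append((task_id, arrival, deadline, est_runtime))

    def next_task(self, now=None):
        if not self.pending:
            return None
        now = now if now is not None else time.time()
        def score(task):
            _, arrival, deadline, est_runtime = task
            slack = deadline - now - est_runtime  # smaller slack = more urgent
            age = now - arrival                   # longer wait = bigger boost
            return slack - self.aging_weight * age
        best = min(self.pending, key=score)
        self.pending.remove(best)
        return best[0]

sched = UrgencyAgingScheduler()
sched.submit("old_batch", deadline=500, est_runtime=10, now=0)
sched.submit("tight_job", deadline=30, est_runtime=20, now=100)
print(sched.next_task(now=120))  # "tight_job": its negative slack dominates
```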
Patent application number | Description | Published |
20090164841 | System And Method For Improving The Yield of Integrated Circuits Containing Memory - A system and method for increasing the yield of integrated circuits containing memory partitions the memory into regions and then independently tests each region to determine which, if any, of the memory regions contain one or more memory failures. The test results are stored for later retrieval. Prior to using the memory, software retrieves the test results and uses only the memory regions that contain no memory failures. A consequence of this approach is that integrated circuits containing memory that would have been discarded for containing memory failures now may be used. This approach also does not significantly impact die area. | 06-25-2009 |
20100158031 | METHODS AND APPARATUS FOR TRANSMISSION OF GROUPS OF CELLS VIA A SWITCH FABRIC - In one embodiment, a method can include receiving at an egress schedule module a request to schedule transmission of a group of cells from an ingress queue through a switch fabric of a multi-stage switch. The ingress queue can be associated with an ingress stage of the multi-stage switch. The egress schedule module can be associated with an egress stage of the multi-stage switch. The method can also include determining, in response to the request, that an egress port at the egress stage of the multi-stage switch is available to transmit the group of cells from the multi-stage switch. | 06-24-2010 |
20110149977 | AVOIDING UNFAIR ADVANTAGE IN WEIGHTED ROUND ROBIN (WRR) SCHEDULING - A network device includes multiple queues to store packets to be scheduled, and a weighted round-robin (WRR) scheduler. The WRR scheduler performs a first WRR scheduling iteration including processing of at least one packet from a particular queue of the multiple queues, identifies the particular queue as an empty queue during the performing of the first WRR scheduling iteration, identifies the particular queue as a non-empty queue after the identifying the particular queue as the empty queue, and performs a second WRR scheduling iteration including processing of only one packet of a group of packets from the particular queue of the multiple queues. | 06-23-2011 |
20110216773 | WORK-CONSERVING PACKET SCHEDULING IN NETWORK DEVICES - In general, techniques are described for performing work-conserving packet scheduling in network devices. For example, a network device comprising queues that store packets and a control unit may implement these techniques. The control unit stores data defining hierarchically-ordered nodes, which include leaf nodes from which one or more of the queues depend. The control unit executes first and second dequeue operations concurrently to traverse the hierarchically-ordered nodes and schedule processing of packets stored to the queues. During execution, the first dequeue operation masks at least one of the selected ones of the leaf nodes from which one of the queues depends based on scheduling data stored by the control unit. The scheduling data indicates valid child node counts in some instances. The masking occurs to exclude the node from consideration by the second dequeue operation concurrently executing with the first dequeue operation, which may preserve work in certain instances. | 09-08-2011 |
20110252284 | OPTIMIZATION OF PACKET BUFFER MEMORY UTILIZATION - A method performed by an I/O unit connected to another I/O unit in a network device. The method includes receiving a packet; segmenting the packet into a group of data blocks; storing the group of data blocks in a data memory; generating data protection information for a data block of the group of data blocks; creating a control block for the data block; storing, in a control memory, a group of data items for the control block, the group of data items including information associated with a location, of the data block, within the data memory and the data protection information for the data block; performing a data integrity check on the data block, using the data protection information, to determine whether the data block contains a data error; and outputting the data block when the data integrity check indicates that the data block does not contain a data error. | 10-13-2011 |
20130073931 | OPTIMIZATION OF PACKET BUFFER MEMORY UTILIZATION - A method performed by an I/O unit connected to another I/O unit in a network device. The method includes receiving a packet; segmenting the packet into a group of data blocks; storing the group of data blocks in a data memory; generating data protection information for a data block of the group of data blocks; creating a control block for the data block; storing, in a control memory, a group of data items for the control block, the group of data items including information associated with a location, of the data block, within the data memory and the data protection information for the data block; performing a data integrity check on the data block, using the data protection information, to determine whether the data block contains a data error; and outputting the data block when the data integrity check indicates that the data block does not contain a data error. | 03-21-2013 |
20130121343 | METHODS AND APPARATUS FOR TRANSMISSION OF GROUPS OF CELLS VIA A SWITCH FABRIC - In one embodiment, a method can include receiving at an egress schedule module a request to schedule transmission of a group of cells from an ingress queue through a switch fabric of a multi-stage switch. The ingress queue can be associated with an ingress stage of the multi-stage switch. The egress schedule module can be associated with an egress stage of the multi-stage switch. The method can also include determining, in response to the request, that an egress port at the egress stage of the multi-stage switch is available to transmit the group of cells from the multi-stage switch. | 05-16-2013 |
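Application 20110149977's fairness fix is concrete enough to sketch: under weighted round-robin, a queue that drains during one iteration and then refills is served only a single packet in the next iteration, rather than its full weight, so a momentary empty period cannot translate into an unfair share of service. A minimal Python model follows; the class name and the details of the one-packet cap policy are assumptions based only on the abstract.

```python
from collections import deque

class FairWRR:
    """Weighted round-robin over named queues, with the cap described in
    the abstract: a queue seen empty during an iteration is limited to one
    packet the next time it is served."""

    def __init__(self, weights):
        self.weights = weights                     # queue name -> weight
        self.queues = {q: deque() for q in weights}
        self.capped = set()                        # queues capped to one packet

    def enqueue(self, q, packet):
        self.queues[q].append(packet)

    def run_iteration(self):
        sent, next_capped = [], set()
        for q, weight in self.weights.items():
            quota = 1 if q in self.capped else weight
            for _ in range(quota):
                if not self.queues[q]:
                    next_capped.add(q)  # went empty: cap it next iteration
                    break
                sent.append(self.queues[q].popleft())
        self.capped = next_capped
        return sent

wrr = FairWRR({"a": 3, "b": 3})
wrr.enqueue("a", "a0")                       # queue a drains after one packet
for i in range(3):
    wrr.enqueue("b", f"b{i}")
print(wrr.run_iteration())                   # ['a0', 'b0', 'b1', 'b2']
wrr.enqueue("a", "a1"); wrr.enqueue("a", "a2")
for i in range(3, 6):
    wrr.enqueue("b", f"b{i}")
print(wrr.run_iteration())                   # a capped: ['a1', 'b3', 'b4', 'b5']
```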