Inventors list

Assignees list

Classification tree browser

Top 100 Inventors

Top 100 Assignees

Steinmacher-Burow

Burkhard Steinmacher-Burow, Wenau DE

Patent application number	Description	Published
20140122980	COLLECTIVE NETWORK FOR COMPUTER STRUCTURES - A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.	05-01-2014

Burkhard Steinmacher-Burow, Esslingen DE

Patent application number	Description	Published
20090006296	DMA ENGINE FOR REPEATING COMMUNICATION PATTERNS - A parallel computer system is constructed as a network of interconnected compute nodes to operate a global message-passing application for performing communications across the network. Each of the compute nodes includes one or more individual processors with memories which run local instances of the global message-passing application operating at each compute node to carry out local processing operations independent of processing operations carried out at other compute nodes. Each compute node also includes a DMA engine constructed to interact with the application via Injection FIFO Metadata describing multiple Injection FIFOs where each Injection FIFO may containing an arbitrary number of message descriptors in order to process messages with a fixed processing overhead irrespective of the number of message descriptors included in the Injection FIFO.	01-01-2009
20090006546	MULTIPLE NODE REMOTE MESSAGING - A method for passing remote messages in a parallel computer system formed as a network of interconnected compute nodes includes that a first compute node (A) sends a single remote message to a remote second compute node (B) in order to control the remote second compute node (B) to send at least one remote message. The method includes various steps including controlling a DMA engine at first compute node (A) to prepare the single remote message to include a first message descriptor and at least one remote message descriptor for controlling the remote second compute node (B) to send at least one remote message, including putting the first message descriptor into an injection FIFO at the first compute node (A) and sending the single remote message and the at least one remote message descriptor to the second compute node (B).	01-01-2009
20110119521	REPRODUCIBILITY IN A MULTIPROCESSOR SYSTEM - Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed: a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deterministic chip interfaces, zero-impact communication with the system, precise stop of the system and a scan of the system state.	05-19-2011
20110179229	STORE-OPERATE-COHERENCE-ON-VALUE - A system, method and computer program product for performing various store-operate instructions in a parallel computing environment that includes a plurality of processors and at least one cache memory device. A queue in the system receives, from a processor, a store-operate instruction that specifies under which condition a cache coherence operation is to be invoked. A hardware unit in the system runs the received store-operate instruction. The hardware unit evaluates whether a result of the running the received store-operate instruction satisfies the condition. The hardware unit invokes a cache coherence operation on a cache memory address associated with the received store-operate instruction if the result satisfies the condition. Otherwise, the hardware unit does not invoke the cache coherence operation on the cache memory device.	07-21-2011
20110219215	ATOMICITY: A MULTI-PRONGED APPROACH - In a multiprocessor system with speculative execution, atomicity can be approached in several fashions. One approach is to have atomic instructions that achieve multiple functions and are guaranteed to complete. Another approach is to have blocks of code that are grouped to succeed or fail together. A system can incorporate more than one such approach. In implementing more than one approach, the system may prioritize one over another. When conflict detection is done through a directory lookup in cache memory, atomic instructions and atomicity related operations may be implemented in a cache data array access pipeline in that cache memory. This implementation may include feedback to the pipeline for implementing multiple functions within an atomic instruction and also for cascading atomic instructions.	09-08-2011
20110246582	Message Passing with Queues and Channels - In an embodiment, a send thread receives an identifier that identifies a destination node and a pointer to data. The send thread creates a first send request in response to the receipt of the identifier and the data pointer. The send thread selects a selected channel from among a plurality of channels. The selected channel comprises a selected hand-off queue and an identification of a selected message unit. Each of the channels identifies a different message unit. The selected hand-off queue is randomly accessible. If the selected hand-off queue contains an available entry, the send thread adds the first send request to the selected hand-off queue. If the selected hand-off queue does not contain an available entry, the send thread removes a second send request from the selected hand-off queue and sends the second send request to the selected message unit.	10-06-2011
20110265098	Message Passing with Queues and Channels - In an embodiment, a reception thread receives a source node identifier, a type, and a data pointer from an application and, in response, creates a receive request. If the source node identifier specifies a source node, the reception thread adds the receive request to a fast-post queue. If a message received from a network does not match a receive request on a posted queue, a polling thread adds a receive request that represents the message to an unexpected queue. If the fast-post queue contains the receive request, the polling thread removes the receive request from the fast-post queue. If the receive request that was removed from the fast-post queue does not match the receive request on the unexpected queue, the polling thread adds the receive request that was removed from the fast-post queue to the posted queue. The reception thread and the polling thread execute asynchronously from each other.	10-27-2011
20130290473	REMOTE PROCESSING AND MEMORY UTILIZATION - According to one embodiment of the present invention, a system for operating memory includes a first node coupled to a second node by a network, the system configured to perform a method including receiving the remote transaction message from the second node in a processing element in the first node via the network, wherein the remote transaction message bypasses a main processor in the first node as it is transmitted to the processing element. In addition, the method includes accessing, by the processing element, data from a location in a memory in the first node based on the remote transaction message, and performing, by the processing element, computations based on the data and the remote transaction message.	10-31-2013
20140047060	REMOTE PROCESSING AND MEMORY UTILIZATION - According to one embodiment of the present invention, a system for operating memory includes a first node coupled to a second node by a network, the system configured to perform a method including receiving the remote transaction message from the second node in a processing element in the first node via the network, wherein the remote transaction message bypasses a main processor in the first node as it is transmitted to the processing element. In addition, the method includes accessing, by the processing element, data from a location in a memory in the first node based on the remote transaction message, and performing, by the processing element, computations based on the data and the remote transaction message.	02-13-2014
20140143519	STORE OPERATION WITH CONDITIONAL PUSH - According to one embodiment, a method for a store operation with a conditional push of a tag value to a queue is provided. The method includes configuring a queue that is accessible by an application, setting a value at an address in a memory device including a memory and a controller, receiving a request for an operation using the value at the address and performing the operation. The method also includes the controller writing a result of the operation to the address, thus changing the value at the address, the controller determining if the result of the operation meets a condition and the controller pushing a tag value to the queue based on the condition being met, where the tag value in the queue indicates to the application that the condition is met.	05-22-2014
20140156959	CONCURRENT ARRAY-BASED QUEUE - According to one embodiment, a method for implementing an array-based queue in memory of a memory system that includes a controller includes configuring, in the memory, metadata of the array-based queue. The configuring comprises defining, in metadata, an array start location in the memory for the array-based queue, defining, in the metadata, an array size for the array-based queue, defining, in the metadata, a queue top for the array-based queue and defining, in the metadata, a queue bottom for the array-based queue. The method also includes the controller serving a request for an operation on the queue, the request providing the location in the memory of the metadata of the queue.	06-05-2014
20140237045	EMBEDDING GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK - Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes. In one embodiment, the method comprises taking inputs from a set of receivers of the nodes, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of senders of the nodes. Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. In one embodiment, the method comprises adding to a torus network a central collective logic to route messages among at least a group of nodes in a tree structure.	08-21-2014

Patent applications by Burkhard Steinmacher-Burow, Esslingen DE

Burkhard Steinmacher-Burow, Yorktown Heights, NY US

Patent application number	Description	Published
20110119445	HEAP/STACK GUARD PAGES USING A WAKEUP UNIT - A method and system for providing a memory access check on a processor including the steps of detecting accesses to a memory device including level-1 cache using a wakeup unit. The method includes invalidating level-1 cache ranges corresponding to a guard page, and configuring a plurality of wakeup address compare (WAC) registers to allow access to selected WAC registers. The method selects one of the plurality of WAC registers, and sets up a WAC register related to the guard page. The method configures the wakeup unit to interrupt on access of the selected WAC register. The method detects access of the memory device using the wakeup unit when a guard page is violated. The method generates an interrupt to the core using the wakeup unit, and determines the source of the interrupt. The method detects the activated WAC registers assigned to the violated guard page, and initiates a response.	05-19-2011

Burkhard Steinmacher-Burow US

Patent application number	Description	Published
20110219188	CACHE AS POINT OF COHERENCE IN MULTIPROCESSOR SYSTEM - In a multiprocessor system, a conflict checking mechanism is implemented in the L2 cache memory. Different versions of speculative writes are maintained in different ways of the cache. A record of speculative writes is maintained in the cache directory. Conflict checking occurs as part of directory lookup. Speculative versions that do not conflict are aggregated into an aggregated version in a different way of the cache. Speculative memory access requests do not go to main memory.	09-08-2011

Burkhard Steinmacher-Burow, Boeblingen DE

Patent application number	Description	Published
20110072241	FAST CONCURRENT ARRAY-BASED STACKS, QUEUES AND DEQUES USING FETCH-AND-INCREMENT-BOUNDED AND A TICKET LOCK PER ELEMENT - Implementation primitives for concurrent array-based stacks, queues, double-ended queues (deques) and wrapped deques are provided. In one aspect, each element of the stack, queue, deque or wrapped deque data structure has its own ticket lock, allowing multiple threads to concurrently use multiple elements of the data structure and thus achieving high performance. In another aspect, new synchronization primitives FetchAndIncrementBounded (Counter, Bound) and FetchAndDecrementBounded (Counter, Bound) are implemented. These primitives can be implemented in hardware and thus promise a very fast throughput for queues, stacks and double-ended queues.	03-24-2011
20110119399	DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK - A computer implemented method and a system for routing data packets in a multi-dimensional computer network. The method comprises routing a data packet among nodes along one dimension towards a root node, each node having input and output communication links, said root node not having any outgoing uplinks, and determining at each node if the data packet has reached a predefined coordinate for the dimension or an edge of the subrectangle for the dimension, and if the data packet has reached the predefined coordinate for the dimension or the edge of the subrectangle for the dimension, determining if the data packet has reached the root node, and if the data packet has not reached the root node, routing the data packet among nodes along another dimension towards the root node.	05-19-2011
20110119526	LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS - A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.	05-19-2011
20110173399	DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS - A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.	07-14-2011
20110173411	TLB EXCLUSION RANGE - A system and method for accessing memory are provided. The system comprises a lookup buffer for storing one or more page table entries, wherein each of the one or more page table entries comprises at least a virtual page number and a physical page number; a logic circuit for receiving a virtual address from said processor, said logic circuit for matching the virtual address to the virtual page number in one of the page table entries to select the physical page number in the same page table entry, said page table entry having one or more bits set to exclude a memory range from a page.	07-14-2011
20110173413	EMBEDDING GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK - Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes. In one embodiment, the method comprises taking inputs from a set of receivers of the nodes, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of senders of the nodes. Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. In one embodiment, the method comprises adding to a torus network a central collective logic to route messages among at least a group of nodes in a tree structure.	07-14-2011
20110173420	PROCESSOR RESUME UNIT - A system for enhancing performance of a computer includes a computer system having a data storage device. The computer system includes a program stored in the data storage device and steps of the program are executed by a processor. An external unit is external to the processor for monitoring specified computer resources. The external unit is configured to detect a specified condition using the processor. The processor including one or more threads. The thread resumes an active state from a pause state using the external unit when the specified condition is detected by the external unit.	07-14-2011
20110173421	MULTI-INPUT AND BINARY REPRODUCIBLE, HIGH BANDWIDTH FLOATING POINT ADDER IN A COLLECTIVE NETWORK - To add floating point numbers in a parallel computing system, a collective logic device receives the floating point numbers from computing nodes. The collective logic devices converts the floating point numbers to integer numbers. The collective logic device adds the integer numbers and generating a summation of the integer numbers. The collective logic device converts the summation to a floating point number. The collective logic device performs the receiving, the converting the floating point numbers, the adding, the generating and the converting the summation in one pass. One pass indicates that the computing nodes send inputs only once to the collective logic device and receive outputs only once from the collective logic device.	07-14-2011
20110173422	PAUSE PROCESSOR HARDWARE THREAD UNTIL PIN - A system and method for enhancing performance of a computer which includes a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program are executed by a processer. The processor processes instructions from the program. A wait state in the processor waits for receiving specified data. A thread in the processor has a pause state wherein the processor waits for specified data. A pin in the processor initiates a return to an active state from the pause state for the thread. A logic circuit is external to the processor, and the logic circuit is configured to detect a specified condition. The pin initiates a return to the active state of the thread when the specified condition is detected using the logic circuit.	07-14-2011
20110179199	SUPPORT FOR NON-LOCKING PARALLEL RECEPTION OF PACKETS BELONGING TO THE SAME RECEPTION FIFO - A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in a process of being stored in the associated memory by another DMA engine unit of the plurality, and prevent the one DMA engine unit from indicating completion of storing the current received packet in the reception memory FIFO (rmFIFO) until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-blocking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.	07-21-2011
20110219208	MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER - A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.	09-08-2011
20120185672	LOCAL-ONLY SYNCHRONIZING OPERATIONS - Performing a series of successive synchronizing operations by a core on data shared by a plurality of cores may include a first core indicating an upcoming synchronizing operation on shared data. A second memory layer stores the shared data and tracks the first core's ownership of the shared data. The second memory layer is shared via coherency operations among the first core and one or more second cores. The first core may perform one or more synchronization operations on the shared data without requiring interaction from the second memory layer.	07-19-2012
20130024648	TLB EXCLUSION RANGE - A system and method for accessing memory are provided. The system comprises a lookup buffer for storing one or more page table entries, wherein each of the one or more page table entries comprises at least a virtual page number and a physical page number; a logic circuit for receiving a virtual address from said processor, said logic circuit for matching the virtual address to the virtual page number in one of the page table entries to select the physical page number in the same page table entry, said page table entry having one or more bits set to exclude a memory range from a page.	01-24-2013
20150089145	MULTIPLE CORE PROCESSING WITH HIGH THROUGHPUT ATOMIC MEMORY OPERATIONS - A processor comprising multiple processor cores and a bus for exchanging data between the multiple processor cores is disclosed. Each of the multiple processor cores includes: at least one processor register; a cache for storing at least one cache line of memory; a load store unit for executing a memory command to exchange data between the cache and the at least one processor register; an atomic memory operation unit for executing an atomic memory operation on the at least one cache line of memory; and a high throughput register for storing a status indicating a high throughput or a normal status. The load store unit is operable to transfer the atomic memory operation to the atomic memory operation unit of a designated processor core if the atomic memory operation status is the high throughput status using the bus.	03-26-2015

Patent applications by Burkhard Steinmacher-Burow, Boeblingen DE

Burkhard D. Steinmacher-Burow, Mount Kisco, NY US

Patent application number	Description	Published
20080313408	LOW LATENCY MEMORY ACCESS AND SYNCHRONIZATION - A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.	12-18-2008

Patent applications by Burkhard D. Steinmacher-Burow, Mount Kisco, NY US

Burkhard D. Steinmacher-Burow, Esslingen DE

Patent application number	Description	Published
20080307121	Direct Memory Access Transfer Completion Notification - Methods, compute nodes, and computer program products are provided for direct memory access (‘DMA’) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (‘FIFO’) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.	12-11-2008

Burkhard D. Steinmacher-Burow, Wenau DE

Patent application number	Description	Published
20110219280	COLLECTIVE NETWORK FOR COMPUTER STRUCTURES - A system and method for enabling high-speed, low-latency global collective communications among interconnected processing nodes. The global collective network optimally enables collective reduction operations to be performed during parallel algorithm operations executing in a computer structure having a plurality of the interconnected processing nodes. Router devices are included that interconnect the nodes of the network via links to facilitate performance of low-latency global processing operations at nodes of the virtual network and class structures. The global collective network may be configured to provide global barrier and interrupt functionality in asynchronous or synchronized manner. When implemented in a massively-parallel supercomputing structure, the global collective network is physically and logically partitionable according to needs of a processing algorithm.	09-08-2011