Patent application number | Description | Published |
20110145701 | METHOD AND APPARATUS FOR DETECTING PAGINATION CONSTRUCTS INCLUDING A HEADER AND A FOOTER IN LEGACY DOCUMENTS - A method is provided for identifying header/footer content of a document, in order to sequence text fragments comprising recognizable text blocks derived from the document. The textual variability of lines composed of text blocks, including the different kinds of text blocks within each line, is analyzed. Header/footer zones are defined by textual content having a low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text boxes for similarity and proximity and clustering the text boxes that satisfy a predetermined similarity value, wherein the clustered text boxes are deemed to comprise pagination constructs. | 06-16-2011 |
20110276874 | SYSTEM AND METHOD FOR UNSUPERVISED GENERATION OF PAGE TEMPLATES - A computer-implemented method and system for generation of page templates are provided. The method includes providing a document in computer memory. Using a computer processor, page elements within the document are identified and labeled. For each page of the document, a set of geometric relations between pairs of page elements co-occurring on the page is computed, and the set of geometric relations is associated with the page. The method also includes generating a set of page template candidates based at least in part on the computed geometric relations, selecting page templates from the set of page template candidates, and outputting the selected page templates. | 11-10-2011 |
20120039536 | OPTICAL CHARACTER RECOGNITION WITH TWO-PASS ZONING - An image of a paginated document is zoned to identify text zones. First-pass character recognition is performed on the text zones to generate textual content corresponding to the paginated document. The image of the paginated document is re-zoned based on the textual content to identify one or more new text zones. Second-pass character recognition is performed on at least the new text zones to generate updated textual content corresponding to the paginated document. | 02-16-2012 |
20120079370 | SYSTEM AND METHOD FOR PAGE FRAME DETECTION - A system and method for page frame detection for pages of a document are disclosed. The method includes receiving a set of document pages for a document, each page having at least one detected object. For each page in the set, the method includes determining dimensions of a bounding box which encompasses the detected objects of the page and determining margin dimensions, based on a position of the bounding box on the page. A page frame is computed as a combination of bounding box dimensions and margin dimensions, based on frequencies of the bounding box dimensions and margin dimensions computed for the set of pages. The computed page frame is matched to pages of the document. Information based on the matching, such as content of text objects within the matched page frame, can be output. | 03-29-2012 |
20120159313 | SYSTEM AND METHOD FOR LOGICAL STRUCTURING OF DOCUMENTS BASED ON TRAILING AND LEADING PAGES - A system, method, and computer program product for determining the structure of a document are provided. The method includes receiving a set of document pages for a document and linking one page frame to each of a plurality of document pages in the set. For each document page linked to a page frame, a content bounding box surrounding the content on the document page is identified, and the document page is categorized, based at least in part on the geometrical relationship between the page frame and the content bounding box of the document page. The document page can then be identified as a logical cut based at least in part on the categorization of the document page. Information, such as a table of contents or updated table of contents, can then be output, based on the determined logical unit(s) of the document. | 06-21-2012 |
20120317470 | GENERATE-AND-TEST METHOD FOR COLUMN SEGMENTATION - A system, method, and computer program product for segmenting a document are disclosed. The method considers a zone of a document, such as a page frame or other zone which is a predetermined ratio thereof, and while there are remaining elements in the zone, iteratively tests different segmentations of the zone into n candidate columns, and computes a width of a gutter for each n-candidate. Provided that the computed gutter width meets a threshold test, which may be based on the arrangement of the elements in the columns, and that the candidate columns for the n-candidate each contain at least a threshold number of elements, elements are assigned to respective ones of n segmented columns within which they are located. For example, line elements are arranged in blocks of text within the columns, enabling a reading order for sequences of text, such as complete sentences and paragraphs, to be computed. | 12-13-2012 |
20120324341 | DETECTION AND EXTRACTION OF ELEMENTS CONSTITUTING IMAGES IN UNSTRUCTURED DOCUMENT FILES - A method and a system for detecting and extracting images in an electronic document are disclosed. The method includes receiving an electronic document comprising a plurality of pages and, for each of at least one of the pages of the document, identifying elements of the page. The identified elements include a set of graphical elements and a set of text elements. The method may include identifying and excluding, from the set of graphical elements, those which serve as graphical page constructs and/or text formatting elements. The page can then be segmented, based on (remaining) graphical elements and identified white spaces, to generate a set of image blocks, each including a respective one or more of the graphical elements. Text elements that are associated with a respective image block are identified as captions. Overlapping candidate images, each including an image block and its caption(s), if any, are then grouped to form a new image. The new image can thus include candidate images which would, without the identification of their caption(s), each be treated as a respective image. | 12-20-2012 |
20130114914 | SIGNATURE MARK DETECTION - A system and method for detection of signature marks in documents are provided. The method includes selecting candidate text objects in document pages and identifying a sequence of elements therein. The sequence has a numbering pattern including an incremental part and optionally a fixed part. Missing elements between two detected elements of the sequence are permitted. For an identified sequence, a model of the sequence is generated, which includes: the numbering pattern of the sequence; an increment, computed based on the distance between pages on which consecutive elements of the sequence are identified, a valid sequence having an increment greater than 1; and a first page, which corresponds to the page of the document on which the sequence starts. The sequence is then validated with the model, allowing elements of the sequence in the pages of the document to be identified as signature marks. | 05-09-2013 |
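
The sequence model described in the signature mark entry above (an increment derived from page distances, a first page, and tolerance for missing elements) can be illustrated with a minimal sketch. It is not the method claimed in the application. It assumes candidates have already been reduced to hypothetical `(page_index, number)` pairs with any fixed part of the numbering pattern stripped, and the function name `fit_sequence_model` is invented for illustration.

```python
from collections import Counter

def fit_sequence_model(candidates):
    """Fit a simple sequence model to (page_index, number) candidates.

    Returns a dict with the increment (pages per numbering step), the
    first page, the first number, and the matching candidates, or None
    if no sequence with an increment greater than 1 is found.
    Illustrative assumption only; not taken from the application.
    """
    candidates = sorted(candidates)
    steps = Counter()
    # Vote on pages-per-step using consecutive detections; a gap in the
    # numbers (missing elements) is allowed as long as it divides evenly.
    for (p1, n1), (p2, n2) in zip(candidates, candidates[1:]):
        dn, dp = n2 - n1, p2 - p1
        if dn > 0 and dp % dn == 0:
            steps[dp // dn] += 1
    if not steps:
        return None
    increment = steps.most_common(1)[0][0]
    # Ordinary page numbers advance once per page, so reject them here.
    if increment <= 1:
        return None
    # Validate: keep the candidates that lie on the same line
    # page = offset + number * increment.
    offsets = Counter(p - n * increment for p, n in candidates)
    offset = offsets.most_common(1)[0][0]
    matched = [(p, n) for p, n in candidates if p - n * increment == offset]
    first_page, first_number = matched[0]
    return {
        "increment": increment,
        "first_page": first_page,
        "first_number": first_number,
        "matched": matched,
    }

# Element 3 (expected on page 33) is missing; (50, 7) is a stray candidate.
model = fit_sequence_model([(1, 1), (17, 2), (49, 4), (50, 7), (65, 5)])
print(model)
```

In this toy input the gatherings are 16 pages long, so the fitted increment is 16 and the stray candidate on page 50 is excluded by the validation step, while a run of ordinary page numbers such as `(1, 1), (2, 2), (3, 3)` is rejected because its increment is 1.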