Patent application number | Description | Published |
20080249999 | Interactive cleaning for automatic document clustering and categorization - Documents are clustered or categorized to generate a model associating documents with classes. Outlier measures are computed for the documents indicative of how well each document fits into the model. Outlier documents are identified to a user based on the outlier measures and a user selected outlier criterion. Ambiguity measures are computed for the documents indicative of a number of classes with which each document has similarity under the model. If a document is annotated with a label class, a possible corrective label class is identified if the annotated document has higher similarity with the possible corrective label class under the model than with the annotated label class. The clustering or categorizing is repeated, adjusted based on received user input, to generate an updated model associating documents with classes. Outlier and ambiguity measures are also calculated at runtime for new documents classified using the model. | 10-09-2008 |
20100014762 | CATEGORIZER WITH USER-CONTROLLABLE CALIBRATION - A calibrated categorizer comprises: a multi-class categorizer configured to output class probabilities for an input object corresponding to a set of classes; a class probabilities rescaler configured to rescale class probabilities to generate rescaled class probabilities; and a rescaling model learner configured to learn calibration parameters for the class probabilities rescaler based on (i) class probabilities output by the multi-class categorizer for a calibration set of class-labeled objects, (ii) confidence measures output by the multi-class categorizer for the calibration set of class-labeled objects, and (iii) class labels of the calibration set of class-labeled objects, the class probabilities rescaler calibrated by the learned calibration parameters defining a calibrated class probabilities rescaler. In a method embodiment, class probabilities are generated for an input object corresponding to a set of classes using a classifier trained on a first set of objects, and are rescaled to form rescaled class probabilities using a rescaling algorithm calibrated using a second set of objects different from the first set of objects. The method may further entail thresholding the rescaled class probabilities using thresholds calibrated using the second set of objects or a third set of objects. | 01-21-2010 |
20100070521 | QUERY TRANSLATION THROUGH DICTIONARY ADAPTATION - Cross-lingual information retrieval is disclosed, comprising: translating a received query from a source natural language into a target natural language; performing a first information retrieval operation on a corpus of documents in the target natural language using the translated query to retrieve a set of pseudo-feedback documents in the target natural language; re-translating the received query from the source natural language into the target natural language using a translation model derived from the set of pseudo-feedback documents in the target natural language; and performing a second information retrieval operation on the corpus of documents in the target natural language using the re-translated query to retrieve an updated set of documents in the target natural language. | 03-18-2010 |
20100082615 | CROSS-MEDIA SIMILARITY MEASURES THROUGH TRANS-MEDIA PSEUDO-RELEVANCE FEEDBACK AND DOCUMENT RERANKING - A multimedia information retrieval system includes a storage and an electronic processing device. The latter is configured to perform a process including: computing values of a pairwise similarity measure quantifying pairwise similarity of documents of a multimedia reference repository; storing the computed values in the storage; performing an initial information retrieval process respective to the multimedia reference repository to return a set of initial repository documents; and identifying a set of top ranked documents of the multimedia reference repository based at least on the stored computed values pertaining to the set of initial repository documents. | 04-01-2010 |
20100312725 | SYSTEM AND METHOD FOR ASSISTED DOCUMENT REVIEW - A system and method for reviewing documents are provided. A collection of documents is partitioned into sets of documents for review by a plurality of reviewers. For each set, documents in the set are displayed on a display device for review by a reviewer and temporarily organized through grouping and sorting. The reviewer's labels for the displayed documents are received. Based on the reviewer's labels, a class from a plurality of classes is assigned to each of the reviewed documents. A classifier model stored in computer memory is progressively trained, based on features extracted from the reviewed documents in the set and their assigned classes. Prior to review of all documents in the set, a calculated subset of documents for which the classifier model assigns a class different from the one assigned based on the reviewer's label is returned for a second review by a reviewer. Models generated from one or more other document sets can be used to assess the review of a first of the sets. | 12-09-2010 |
20100313124 | MANIPULATION OF DISPLAYED OBJECTS BY VIRTUAL MAGNETISM - A computer implemented tactile user interface (TUI) and a method of manipulating objects with a virtual magnet are provided. The TUI includes a display comprising a touch-screen. The display is configured for displaying a set of graphic objects, each graphic object representing a respective one of a set of items, such as documents, e.g., text documents or images. A virtual magnet is caused to move on the display, in response to touching on the touch-screen, e.g., by dragging a finger or other implement across. The magnet is associated with a particular function command such that a subset of the graphic objects exhibits a response to the virtual magnet (e.g., is caused to move, relative to the virtual magnet or exhibits another visible response), each graphic object in the subset moving or otherwise responding as a function of an attribute of the underlying item represented by the graphic object. | 12-09-2010 |
20110072012 | SYSTEM AND METHOD FOR INFORMATION SEEKING IN A MULTIMEDIA COLLECTION - An apparatus and method facilitate combined query based searching with serendipitous browsing in a multimedia collection. A user selects objects to label from a local map, which may include representations of objects retrieved from the collection as being responsive to a text- or image-based query. The text and image portions of the object can be independently labeled. Unlabeled objects are scored and ranked based on the applied labels of labeled objects, which may take into account cross-media pseudo-relevance and user selectable (or default) parameters, such as a forgetting factor, which tends to place greater weight on more recently labeled objects, and a modality parameter, which places greater weight on the modality (text, image, or hybrid) currently selected by the user. The local map is modified, based on the ranking, optionally after reranking of objects to improve the diversity of the displayed objects. | 03-24-2011 |
20120203752 | LARGE SCALE UNSUPERVISED HIERARCHICAL DOCUMENT CATEGORIZATION USING ONTOLOGICAL GUIDANCE - A classification method includes constructing queries from category descriptors representing categories of a taxonomy of hierarchically organized categories. The query constructed for a category c includes a query component based on descriptors of the category c and at least one query component based on descriptors of an ancestor or descendant category of the category c. A documents database is queried using the constructed queries to retrieve pseudo-relevant documents. Language models for the categories of the taxonomy are extracted from the pseudo-relevant documents by inferring a hierarchical topic model representing the taxonomy. An input document is classified by optimizing mixture weights of a weighted combination of categories of the hierarchical topic model respective to the input document. | 08-09-2012 |
20130103681 | Relevant persons identification leveraging both textual data and social context - A set of documents is annotated by metadata specifying persons associated with documents and their social roles in the documents. The annotated documents define a group of representation modes including at least one content type and at least one social role. An electronic processing device computes a relevance score for a person of interest using a set of queries each having a target social role by performing a sequence of operations that includes the following operations: computing similarities between documents and queries with respect to at least one similarity mode of the group of representation modes; enriching queries or documents to identify and aggregate nearest neighbor documents that are most similar with respect to at least one enrichment mode of the group of representation modes; aggregating over documents; aggregating over queries; and aggregating over at least one of (i) enrichment modes, (ii) similarity modes, and (iii) target social roles. | 04-25-2013 |
20130262465 | FULL AND SEMI-BATCH CLUSTERING - A method for clustering documents is provided. Each document is represented by a multidimensional data point. Each data point is initially assigned to a respective cluster and serves as that cluster's initial representative point. Thereafter, in an iterative process, the data points are clustered among the clusters, by assigning the data points to the clusters based on a comparison measure of each data point with the cluster or its representative point, and a threshold of the comparison measure. Based on this clustering, a new representative point for each of the clusters can be computed. Optionally, overlapping clusters are merged. For the next iteration, the new representative points are used as the representative points. An assignment of the documents to the clusters is output, based on a clustering of the data points in the latest iteration. Multiple batches may be processed, retaining the initial clusters to which the original batch was assigned. | 10-03-2013 |
20130311467 | SYSTEM AND METHOD FOR RESOLVING ENTITY COREFERENCE - A method and a system for coreference resolution are provided. The method includes receiving a set of document clusters, each cluster in the set of document clusters including a set of text documents. Instances of each of a set of candidate named entities are identified in the document clusters. For pairs of the candidate named entities, at least one socio-temporal feature is computed that is based on the similarity of the distributions of identified instances of the respective candidate named entities among the document clusters. A decision on merging the candidate named entities into a common real named entity is based on the socio-temporal features. | 11-21-2013 |
20140032558 | CATEGORIZATION OF MULTI-PAGE DOCUMENTS BY ANISOTROPIC DIFFUSION - A computer implemented system and method are provided for refining category scores for pages of a sequence of document pages that potentially includes document boundaries. The method uses initial category scores provided by a categorizer that considers one page at a time or concatenated pairs of pages (called bipages). The category scores represent the probability that a page belongs to a particular category. The method uses anisotropic diffusion to refine the initial page category scores using the scores of neighboring pages as a function of the probability that there is a boundary between the pages. The method may be performed iteratively. | 01-30-2014 |
20140074455 | METHOD AND SYSTEM FOR MOTIF EXTRACTION IN ELECTRONIC DOCUMENTS - A method, system, and computer program product for extracting text motifs from electronic documents are disclosed. A user provides a largest-maximal repeat or a super-maximal repeat as a first text block. The occurrences of the first text block are detected to identify the second text blocks in the vicinity of the occurrences of the first text block on the basis of pre-defined parameters. The text motifs are determined by combining the first text block and the second text block. Finally, the text motifs are extracted from the electronic documents. | 03-13-2014 |
20140280207 | MAILBOX SEARCH ENGINE USING QUERY MULTI-MODAL EXPANSION AND COMMUNITY-BASED SMOOTHING - A retrieval method on a database of documents including text and names of participants associated with the documents includes: receiving a text query facet of keywords and a persons query facet of participant names; computing an enriched text query as an aggregation of the text query facet, a monomodal expansion of the text query facet based on the keywords, a cross-modal expansion of the text query facet based on the participant names, and a topic expansion of the text query facet based on a topic model associating words and topics; computing an enriched persons query as an aggregation of the persons query facet, a monomodal expansion of the persons query facet based on the participant names, a cross-modal expansion of the persons query facet based on the keywords, and a community expansion of the persons query facet based on a community model associating persons and communities. | 09-18-2014 |
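
The score refinement described in application 20140032558 above (neighboring pages influence each other less when a document boundary between them is likely) can be illustrated with a minimal Python sketch. This is not the patented method: the function name `diffuse_scores`, the `rate` and `iterations` parameters, and the specific update rule (neighbor differences weighted by one minus the boundary probability) are all assumptions chosen for illustration.

```python
def diffuse_scores(scores, boundary_prob, rate=0.25, iterations=10):
    """Smooth per-page category scores along a page sequence.

    scores        -- one list of category scores per page
    boundary_prob -- boundary_prob[i] is the assumed probability of a
                     document boundary between page i and page i + 1
    rate          -- hypothetical step size; keep it <= 0.5 so each update
                     stays a convex combination of neighboring scores
    """
    current = [list(row) for row in scores]
    n = len(current)
    for _ in range(iterations):
        updated = [list(row) for row in current]
        for i in range(n):
            for k in range(len(current[i])):
                flow = 0.0
                if i > 0:
                    # Conductance is low when a boundary is likely.
                    conductance = 1.0 - boundary_prob[i - 1]
                    flow += conductance * (current[i - 1][k] - current[i][k])
                if i < n - 1:
                    conductance = 1.0 - boundary_prob[i]
                    flow += conductance * (current[i + 1][k] - current[i][k])
                updated[i][k] = current[i][k] + rate * flow
        current = updated
    return current


# Four pages, two categories; a certain boundary before the last page.
pages = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.1, 0.9]]
boundaries = [0.0, 0.0, 1.0]
refined = diffuse_scores(pages, boundaries)
```

In this example the second page's noisy score is pulled toward its neighbors, while the last page keeps its scores because the boundary probability of 1.0 blocks any exchange with the page before it.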