Patent application number | Description | Published |
20080215561 | SCORING RELEVANCE OF A DOCUMENT BASED ON IMAGE TEXT - A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, document classification, image search, and image classification. | 09-04-2008 |
20080215563 | Pseudo-Anchor Text Extraction for Vertical Search - A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine learning based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products and images that lack explicit URLs and anchor text information. | 09-04-2008 |
20080250009 | ASSESSING MOBILE READINESS OF A PAGE USING A TRAINED SCORER - A method and system for ranking pages of a search result based on the mobile readiness of the pages is provided. A mobile-readiness system receives an indication of pages that are to be ranked. The mobile-readiness system evaluates the mobile readiness for each of the pages. Mobile readiness indicates suitability of the page for a mobile device. The mobile-readiness system then ranks the pages based on the generated mobile readiness and some other criterion such as a relevance score or an importance score. The mobile-readiness system may train a classifier to classify pages based on their mobile readiness. | 10-09-2008 |
20080256068 | METHOD AND SYSTEM FOR CALCULATING IMPORTANCE OF A BLOCK WITHIN A DISPLAY PAGE - A method and system for identifying the importance of information areas of a display page is provided. An importance system identifies information areas or blocks of a web page. A block of a web page represents an area of the web page that appears to relate to a similar topic. The importance system provides the characteristics or features of a block to an importance function that generates an indication of the importance of that block to its web page. The importance system “learns” the importance function by generating a model based on the features of blocks and the user-specified importance of those blocks. To learn the importance function, the importance system asks users to provide an indication of the importance of blocks of web pages in a collection of web pages. | 10-16-2008 |
20080313165 | SCALABLE MODEL-BASED PRODUCT MATCHING - Aspects of the subject matter described herein relate to matching product information to products. In aspects, a product matching component receives product information. The product matching component normalizes the product information and obtains keywords from the product information. By querying a database of recognized products, the keywords are used to obtain a list of products that potentially match the product information. A confidence level is assigned to each of the potential matches in the list. A match may be returned for the highest matched product or for a selectable number of products whose confidence level(s) exceed a selectable threshold. | 12-18-2008 |
20090012956 | Retrieval of Structured Documents - This disclosure relates to performing a query for a search term of a database containing a plurality of structured documents. Structured documents that do not include the search term are filtered out during an initial search. Matched structured documents, which are those structured documents that do contain the search term, are evaluated by ranking the individual elements based on how well each individual element matches the search term. The ranking of the individual elements is indicated to the user, who can access the individual elements. | 01-08-2009 |
20090024607 | QUERY SELECTION FOR EFFECTIVELY LEARNING RANKING FUNCTIONS - A learning system for a search ranking function model may include a computer program that iteratively refines the model using new queries and associated documents from an unlabeled training set. The unlabeled training set may include a set of queries for which the associated documents have not been labeled as “relevant” or otherwise labeled. The new queries may be selected based on a similarity to and an accuracy of each neighbor from a labeled training set, such as a labeled validation set. Upon selection, the documents associated with the new queries may be labeled. The new queries and their associated documents may be added to the labeled training set, and a refined model may be learned based on the augmented labeled training set. The model may be iteratively refined until it is determined that the model is adequate. | 01-22-2009 |
20100145956 | PSEUDO-ANCHOR TEXT EXTRACTION - A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine learning based approach. The pseudo-anchor texts are made available for searching and used to help rank the objects in a search result to improve search performance. The method may be used in vertical search of objects such as published articles, products and images that lack explicit URLs and anchor text information. | 06-10-2010 |
20100281009 | HIERARCHICAL CONDITIONAL RANDOM FIELDS FOR WEB EXTRACTION - A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system searches for the labeling of records and elements that has the highest probability of being correct. | 11-04-2010 |
20110078131 | EXPERIMENTAL WEB SEARCH SYSTEM - Described is the running of search-related experiments on a full (or partial) offline snapshot copy of the search engine documents of an actual production system. A snapshot experimentation subsystem runs experimental code related to web searches on the offline data, including to run experimental index building code to build an experimental index (e.g., to test a new document feature), and/or to run experimental search-related code, such as to rank search results according to experimental ranking code, to implement an experimental search strategy, and/or to generate experimental captions. | 03-31-2011 |
20110078132 | FLEXIBLE INDEXING AND RANKING FOR SEARCH - Described is a flexible framework for index building and document retrieval in a search environment that allows different search scenario applications to reuse index building and document retrieval code for non-scenario-specific functionality. Interfaces to various functionality of an index builder and retrieval engine are defined. An application calls the interfaces to specify custom code to perform a search scenario when needed, or use default code when non-scenario-specific functionality may be used. | 03-31-2011 |
20110078162 | WEB-SCALE ENTITY SUMMARIZATION - Described is summarizing a web entity (e.g., a person, place, product or so forth) based upon the entity's appearance in web documents (e.g., on the order of hundreds of millions or billions of webpages). Webpages are separated into blocks, which are then processed according to various features to filter the number of blocks to further process, and to rank the remaining blocks that are most relevant to the entity. A redundancy removal mechanism removes redundant blocks, leaving a set of remaining blocks that are used to provide a summary of information that is relevant to the entity. | 03-31-2011 |
20110078554 | WEBPAGE ENTITY EXTRACTION THROUGH JOINT UNDERSTANDING OF PAGE STRUCTURES AND SENTENCES - Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity. | 03-31-2011 |
20110087660 | SCORING RELEVANCE OF A DOCUMENT BASED ON IMAGE TEXT - A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, document classification, image search, and image classification. | 04-14-2011 |
20110137886 | Data-Centric Search Engine Architecture - Described is a data-centric web search engine technology/architecture, in which document metadata, including offline-extracted metadata, is used as part of a search indexing and ranking pipeline. A web data management component receives crawled documents and extracts document metadata from the documents. An indexing component uses the document metadata to build an index for the documents. A serving component uses the index and the document metadata to serve content, e.g., search results. Also described is the use of query metadata extracted from queries of a query log for use in the pipeline. | 06-09-2011 |
20110191381 | Interactive System for Extracting Data from a Website - Described is a technology for efficiently labeling a webpage. A wrapper tool labels records of a webpage at the record level. If a wrapper already exists that is appropriate for labeling a record, the wrapper tool automatically labels that record. For unlabeled records, the tool provides a user interface to label those records, and updates the set of existing wrappers with a new wrapper that is generated based upon the labeling operation; the new wrapper is then applied to any unlabeled records if appropriate for those records. As a result, a user typically needs to label only relatively few records, with the wrappers generated for those records automatically used to label the other unlabeled records of the webpage. | 08-04-2011 |
20110209048 | INTERACTIVE SYNCHRONIZATION OF WEB DATA AND SPREADSHEETS - Interactive synchronization of Web data and spreadsheets is usable to build data wrappers based on any type of data found in a document. Such data wrappers can be used to interact with source documents, crawl a network for additional data, map data from across domains, and/or synchronize data from dynamic Web documents. | 08-25-2011 |
20110238644 | Using Anchor Text With Hyperlink Structures for Web Searches - This document describes tools for adjusting anchor text weight to provide more relevant search engine results. Specifically, these tools take advantage of a site-relationship model to consider relationships not only between an anchor text source site and a destination page but also relationships between multiple anchor text source sites to improve web searches. Consideration of these relationships aids in determining a new anchor text weight, which in turn results in more relevant search results. | 09-29-2011 |
20110251984 | WEB-SCALE ENTITY RELATIONSHIP EXTRACTION - Methods and systems for Web-scale entity relationship extraction are usable to build large-scale entity relationship graphs from any data corpora stored on a computer-readable medium or accessible through a network. Such entity relationship graphs may be used to navigate previously undiscoverable relationships among entities within data corpora. Additionally, the entity relationship extraction may be configured to utilize discriminative models to jointly model correlated data found within the selected corpora. | 10-13-2011 |
20110264658 | WEB OBJECT RETRIEVAL BASED ON A LANGUAGE MODEL - A method and system is provided for determining relevance of an object to a term based on a language model. The relevance system provides records extracted from web pages that relate to the object. To determine the relevance of the object to a term, the relevance system first determines, for each record of the object, a probability of generating that term using a language model of the record of that object. The relevance system then calculates the relevance of the object to the term by combining the probabilities. The relevance system may also weight the probabilities based on the accuracy or reliability of the extracted information for each data source. | 10-27-2011 |
20110283205 | AUTOMATED SOCIAL NETWORKING GRAPH MINING AND VISUALIZATION - The automated social networking graph mining and visualization technique described herein mines social connections and allows creation of a social networking graph from general (not necessarily social-application specific) Web pages. The technique uses the distances between a person's/entity's name and related people's/entities' names on one or more Web pages to determine connections between people/entities and the strengths of the connections. In one embodiment, the technique lays out these connections, and then clusters them, in a 2-D layout of a social networking graph that represents the Web connection strengths among the related people's or entities' names, by using a force-directed model. | 11-17-2011 |
20120030206 | Employing Topic Models for Semantic Class Mining - A topic modeling architecture is used to discover high-quality semantic classes from a large collection of raw semantic classes (RASCs) for use in generating responses to queries. A specific semantic class is identified from a collection of RASCs, and a preprocessing operation is conducted to remove one or more items with a semantic class frequency less than a predetermined threshold. A topic model is then applied to the specific semantic class for each of the items that remain in the specific semantic class after the preprocessing operation. A postprocessing operation is then conducted on the items of the specific semantic class to merge and sort the results of the topic model and generate final semantic classes for use by a search engine to respond to a query. | 02-02-2012 |
20120109950 | METHOD AND SYSTEM FOR CALCULATING IMPORTANCE OF A BLOCK WITHIN A DISPLAY PAGE - A method and system for identifying the importance of information areas of a display page is provided. An importance system identifies information areas or blocks of a web page. A block of a web page represents an area of the web page that appears to relate to a similar topic. The importance system provides the characteristics or features of a block to an importance function that generates an indication of the importance of that block to its web page. The importance system “learns” the importance function by generating a model based on the features of blocks and the user-specified importance of those blocks. To learn the importance function, the importance system asks users to provide an indication of the importance of blocks of web pages in a collection of web pages. | 05-03-2012 |
20120303557 | INTERACTIVE FRAMEWORK FOR NAME DISAMBIGUATION - A “Name Disambiguator” provides various techniques for implementing an interactive framework for resolving or disambiguating entity names (associated with objects such as publications) for entity searches where two or more same or similar names may refer to different entities. More specifically, the Name Disambiguator uses a combination of user input and automatic models to address the disambiguation problem. In various embodiments, the Name Disambiguator uses a two-part process, including: 1) a global SVM trained from large sets of documents or objects in a simulated interactive mode, and 2) further personalization of local SVM models (associated with individual names or groups of names such as, for example, a group of coauthors) derived from the global SVM model. The result of this process is that large sets of documents or objects are rapidly and accurately condensed or clustered into ordered sets that are organized by entity names. | 11-29-2012 |
20130086024 | Query Reformulation Using Post-Execution Results Analysis - Systems, methods, devices, and media are described to facilitate the training and employing of a three-class classifier for post-execution search query reformulation. In some embodiments, the classifier is trained through a supervised learning process, based on a training set of queries mined from a query log. Query reformulation candidates are determined for each query in the training set, and searches are performed using each reformulation candidate and the un-reformulated training query. The resulting document lists are analyzed to determine ranking and topic drift features, and to calculate a quality classification. The features and classification for each reformulation candidate are used to train the classifier in an offline mode. In some embodiments, the classifier is employed in an online mode to dynamically perform query reformulation on user-submitted queries. | 04-04-2013 |
20130173605 | Extracting Query Dimensions from Search Results - Techniques are described for automatically mining query dimensions from web pages resulting from execution of a search query. Lists of items such as words, terms, or phrases are extracted from the web pages based on the recognition of free text, metadata tag, or repeated region patterns within the web page text. Extracted item lists are weighted according to document matching and/or inverse document frequency, and item lists are clustered based on shared or similar items within the lists to generate query dimensions. The generated query dimensions, and the items within each query dimension, are ranked according to quality, and high-quality query dimensions are provided for display alongside top search results. | 07-04-2013 |
20130339344 | WEB-SCALE ENTITY RELATIONSHIP EXTRACTION - Techniques for displaying a relationship graph are described herein. In one example, a search term may be used to obtain a plurality of documents from a network, such as the Internet. A plurality of entities, and relationships between at least some of those entities, may be extracted from the documents. In an example user interface, representations of a plurality of entities may be displayed, such as by shapes (e.g., circles) labeled to identify people or organizations. Edges (e.g., lines) may be used to connect different representations of entities and to thereby indicate a relationship between the connected entities. In a particular example, input from movement of a cursor over an edge may result in display of a description of a relationship between the connected entities. In a further particular example, size of each entity may be related to a number of connections each has with others. | 12-19-2013 |
20140207746 | Adaptive Query Suggestion - When a user-submitted query is received, a set of candidate queries is identified. For each of the candidate queries, features are extracted that reflect a measure of effectiveness of that candidate query. The candidate queries are rank ordered based on the measure of effectiveness, and one or more of the top-ranked candidate queries are presented as suggested alternatives to the user-submitted query. | 07-24-2014 |
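
The last entry above (Adaptive Query Suggestion) describes ranking candidate queries by a measure of effectiveness and presenting the top-ranked ones. A minimal Python sketch of that flow follows. The feature names (`result_count`, `click_rate`) and the scoring rule are hypothetical illustrations, not taken from the patent application, which does not specify them in its abstract.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate query with two hypothetical effectiveness features."""
    query: str
    result_count: int   # assumed feature: number of results the query returns
    click_rate: float   # assumed feature: fraction of past users who clicked a result


def effectiveness(candidate: Candidate) -> float:
    """Hypothetical measure: click rate, forced to zero when no results exist."""
    if candidate.result_count == 0:
        return 0.0
    return candidate.click_rate


def suggest(candidates: list[Candidate], top_k: int = 2) -> list[str]:
    """Rank candidates by effectiveness (highest first) and return the top_k queries."""
    ranked = sorted(candidates, key=effectiveness, reverse=True)
    return [c.query for c in ranked[:top_k]]


if __name__ == "__main__":
    pool = [
        Candidate("cheap flights", result_count=10, click_rate=0.2),
        Candidate("cheep flights", result_count=0, click_rate=0.9),
        Candidate("low cost flights", result_count=5, click_rate=0.5),
    ]
    print(suggest(pool))
```

A real system would presumably learn the effectiveness measure from many features rather than hard-code it; the fixed rule here only keeps the sketch self-contained.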