Rui Cai, Beijing CN

Patent application number	Description	Published
20090265363	FORUM WEB PAGE CLUSTERING BASED ON REPETITIVE REGIONS - Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.	10-22-2009
20090277322	Scalable Music Recommendation by Search - An exemplary method includes providing a music collection of a particular scale, determining a distance parameter for locality sensitive hashing based at least in part on the scale of the music collection and constructing an index for the music collection. Another exemplary method includes providing a song, extracting snippets from the song, analyzing time-varying timbre characteristics of the snippets and constructing one or more queries based on the analyzing. Such exemplary methods may be implemented by a portable device configured to maintain an index, to perform searches based on selected songs or portions of songs and to generate playlists from search results. Other exemplary methods, devices, systems, etc., are also disclosed.	11-12-2009
20090281906	Music Recommendation using Emotional Allocation Modeling - An exemplary method includes defining a vocabulary for emotions; extracting descriptions for songs; generating distributions for the songs in an emotion space based at least in part on the vocabulary and the extracted descriptions; extracting salient words from a document; generating a distribution for the document in an emotion space based at least in part on the vocabulary and the extracted salient words; and matching the distribution for the document to one or more of the distributions for the songs. Various other exemplary methods, devices, systems, etc., are also disclosed.	11-12-2009
20090327237	WEB FORUM CRAWLING USING SKELETAL LINKS - A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.	12-31-2009
20100205168	Thread-Based Incremental Web Forum Crawling - The incremental web forum crawling technique described herein employs a thread-wise strategy that takes into account thread-level statistics, for example, the number of replies and the frequency of replies, to estimate the activity trend of each thread. To extract such statistical information, the technique employs a simple yet very robust approach to extract the timestamp of each post in a discussion thread. It also employs a regression model to predict the time of the next post for each thread.	08-12-2010
20100211533	EXTRACTING STRUCTURED DATA FROM WEB FORUMS - The web forum data extraction technique is designed for the structured data extraction of data on web forums using both page-level information and site-level knowledge. To do this, the technique finds the kinds of page objects a forum site has, which object a page belongs to, and how different page objects are connected with each other. This information can be obtained by re-constructing the sitemap of the target forum which is based on a Data Object Model of the target forum. The web forum data extraction technique collects three kinds of evidence for data extraction: 1) inner-page features which cover both semantic and layout information on an individual page; 2) inter-vertex features which describe linkage-related observations; and 3) inner-vertex features which characterize interrelationships among pages in one vertex. The technique employs Markov Logic Networks to combine the types of evidence statistically for inference and thereby can extract the desired structures.	08-19-2010
20100211927	WEBSITE DESIGN PATTERN MODELING - Website design pattern modeling technique embodiments are presented that model a website's design patterns. This can be based on the website's layout elements, its URL tokens, or both. When based on both, the design patterns can be modeled separately using first the layout elements and then the URL tokens, or vice versa. Alternately, the modeling can be based on coupled layout and URL token patterns. In operation, the modeling involves first identifying layout elements and/or URL tokens found on at least some of the pages of the website. The website design patterns are then modeled based on the occurrences of the identified layout elements and/or URL tokens in pages of the website. In cases where a coupled modeling scheme is employed, a modeling technique that exploits the correlations between the layout elements and URL tokens is used.	08-19-2010
20100250597	MODELING SEMANTIC AND STRUCTURE OF THREADED DISCUSSIONS - A simultaneous semantic and structure threaded discussion modeling system and method for generating a model of a discussion thread and using the model to mine data from the discussion thread. Embodiments of the system and method generate a model that contains both semantic terms and structure terms. The model simultaneously models both semantics and structure of the discussion thread. A model generator includes a semantic module that generates two semantic terms for the model and a structure module that generates two structure terms for the model. The generator combines the two semantic terms and the two structure terms to generate the simultaneous semantic and structure model. Embodiments of the system and method include an applications module, which contains three applications that use the model to reconstruct reply relations among posts in the discussion thread, identify junk posts in the discussion thread, and find experts in each sub-board of web forums.	09-30-2010
20110078159	Long-Query Retrieval - Described herein is a technology that facilitates efficient large-scale similarity-based retrieval. In several embodiments, documents, images, and/or other multimedia files are compactly represented and efficiently indexed to enable robust search using a long-query in a large-scale corpus. As described herein, these techniques include performing decomposition of a file, e.g., a document or document-like representation. The techniques use dimension reduction to obtain three parts: topic-related words (major semantics), document-specific words (minor semantics), and background words, representing the major semantics in a feature vector and the minor semantics as keywords. Using the techniques described, file vectors are matched in a topic model and the results ranked based on the keywords.	03-31-2011
20110289182	AUTOMATIC ONLINE VIDEO DISCOVERY AND INDEXING - A classifier may be integrated into a pipeline of a general web crawler. The classifier may classify crawled webpages as either video pages or non-video pages. Video pages and information regarding domain importance may be aggregated. Ones of the domains of the video pages may be selected based on domain importance rankings. Webpages of the selected domains may be randomly sampled. The sampled webpages may be structurally analyzed and hint information may be generated with respect to each of the selected domains. The hint information may guide a deep crawling operation for discovering all video pages within the selected domains. Video links within the video pages may be found, one or more videos may be downloaded, and one or more representations of the one or more videos may be indexed.	11-24-2011
20110302124	Mining Topic-Related Aspects From User Generated Content - Described herein is a technology that facilitates efficient automated mining of topic-related aspects of user generated content based on automated analysis of the user generated content. Locations are automatically learned based on dividing documents into document segments, and decomposing the segments into local topics and global topics. Techniques described herein include, for example, computer annotating travelogues with learned tags, performing topic learning to obtain an interest model, and performing location matching based on the interest model.	12-08-2011
20110302162	Snippet Extraction and Ranking - Described herein is a technology that facilitates efficient automated mining of topic-related aspects of user-generated content based on automated analysis of the user-generated content. Locations are automatically learned based on dividing documents into document segments, and decomposing the segments into local topics and global topics. Techniques are described that facilitate automatically extracting snippets. These techniques include, for example, computer annotating travelogues with learned tags and images, performing topic learning to obtain an interest model, performing location matching based on the interest model, calculating geographic and semantic relevance scores, ranking snippets based on the geographic and semantic relevance scores, and searching snippets with a “location+context term” query.	12-08-2011
20110307436	PATTERN TREE-BASED RULE LEARNING - A pattern tree is constructed based on a plurality of key-value pairs representing portions of a data set. In some implementations, the pattern tree may be used for learning one or more rules for interacting with a source of the data set.	12-15-2011
20120117052	WEB FORUM CRAWLING USING SKELETAL LINKS - A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.	05-10-2012
20120125178	SCALABLE MUSIC RECOMMENDATION BY SEARCH - An exemplary method includes providing a music collection of a particular scale, determining a distance parameter for locality sensitive hashing based at least in part on the scale of the music collection and constructing an index for the music collection. Another exemplary method includes providing a song, extracting snippets from the song, analyzing time-varying timbre characteristics of the snippets and constructing one or more queries based on the analyzing. Such exemplary methods may be implemented by a portable device configured to maintain an index, to perform searches based on selected songs or portions of songs and to generate playlists from search results. Other exemplary methods, devices, systems, etc., are also disclosed.	05-24-2012
20120290577	IDENTIFYING VISUAL CONTEXTUAL SYNONYMS - Tools and techniques for identifying visual contextual synonyms are described herein. The described operations use visual words having similar contextual distributions as contextual synonyms to identify and describe visual objects that share semantic meaning. The contextual distribution of a visual word is described using the statistics of co-occurrence and spatial information averaged over image patches that share the visual word. In various implementations, the techniques are employed to construct a visual contextual synonym dictionary for a large visual vocabulary. In various implementations, the visual contextual synonym dictionary narrows the semantic gap for large-scale visual search.	11-15-2012
20120301014	LEARNING TO RANK LOCAL INTEREST POINTS - Tools and techniques for learning to rank local interest points from images using a data-driven scale-invariant feature transform (SIFT) approach termed “Rank-SIFT” are described herein. Rank-SIFT provides a flexible framework to select stable local interest points using supervised learning. A Rank-SIFT application detects interest points, learns differential features, and implements ranking model training in the Gaussian scale space (GSS). In various implementations a stability score is calculated for ranking the local interest points by extracting features from the GSS and characterizing the local interest points based on the features being extracted from the GSS across images containing the same visual objects.	11-29-2012
20120303606	Resource Download Policies Based On User Browsing Statistics - Web crawling policies are generated based on user web browsing statistics. User browsing statistics are aggregated at the granularity of resource identifier patterns (such as URL patterns) that denote groups of resources within a particular domain or website that share syntax at a certain level of granularity. The web crawl policies rank the resource identifier patterns according to their associated aggregated user browsing statistics. A crawl ordering defined by the web crawl policies is used to download and discover new resources within a domain or website.	11-29-2012
20120330922	ANCHOR IMAGE IDENTIFICATION FOR VERTICAL VIDEO SEARCH - Anchor images and information associated therewith are accumulated during a Web crawling operation. One or more rules are applied to the accumulated candidate anchor images to filter out candidate anchor images that are not appropriate for use as the anchor image for a particular target video. The remaining candidate anchor image is then selected as the anchor image for the particular video.	12-27-2012
20120330952	SCALABLE METADATA EXTRACTION FOR VIDEO SEARCH - Video entity templates defining common features that relate to various metadata types shared among a group of video Web pages are generated for target Web sites. Metadata associated with videos contained within Web pages belonging to a particular target Web site can then be automatically and accurately extracted using a video entity template generated for the particular target Web site. This metadata can then be indexed for use by video search applications in providing video search results.	12-27-2012
20130073514	FLEXIBLE AND SCALABLE STRUCTURED WEB DATA EXTRACTION - This document describes techniques that label text nodes of a seed site for each of a plurality of verticals. Once a seed site is labeled for a given vertical, the techniques extract features from the labeled text nodes of the seed site. The techniques learn vertical knowledge for the seed site based on the human labels and the extracted features, and adapt the learned vertical knowledge to a new web site to automatically and accurately identify attributes and extract attribute values targeted within a given vertical for structured web data extraction.	03-21-2013
20130346416	Long-Query Retrieval - Described herein is a technology that facilitates efficient large-scale similarity-based retrieval. In several embodiments, documents, images, and/or other multimedia files are compactly represented and efficiently indexed to enable robust search using a long-query in a large-scale corpus. As described herein, these techniques include performing decomposition of a file, e.g., an image, a document containing an image, or a document-like representation of an image. The techniques use dimension reduction to obtain three parts: low-dimensional representations (major semantics), file-specific terms (minor semantics), and background words, representing the major semantics in a feature vector and the minor semantics as keywords. Using the techniques described, file vectors are matched in a topic model and the results ranked based on the keywords.	12-26-2013
20140029856	THREE-DIMENSIONAL VISUAL PHRASES FOR OBJECT RECOGNITION - The techniques discussed herein discover three-dimensional (3-D) visual phrases for an object based on a 3-D model of the object. The techniques then describe the 3-D visual phrases. Once described, the techniques use the 3-D visual phrases to detect the object in an image (e.g., object recognition).	01-30-2014
20140122458	Anchor Image Identification for Vertical Video Search - Anchor images and information associated therewith are accumulated during a Web crawling operation. One or more rules are applied to the accumulated candidate anchor images to filter out candidate anchor images that are not appropriate for use as the anchor image for a particular target video. The remaining candidate anchor image is then selected as the anchor image for the particular video.	05-01-2014
20140368620	USER INTERFACE FOR THREE-DIMENSIONAL MODELING - A method of acquiring a set of images useable to 3D model a physical object includes imaging the physical object with a camera, and displaying with the camera a current view of the physical object as imaged by the camera from a current perspective. The method further includes displaying with the camera a visual cue overlaying the current view and indicating perspectives from which the physical object is to be imaged to acquire the set of images.	12-18-2014
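
The first entry above (20090265363) describes comparing the pattern-frequency feature set of one page against that of another via a feature-space distance evaluated against a threshold. The following is only a minimal sketch of that general idea, not the method claimed in the application. It assumes pattern frequency is the sole feature, Euclidean distance is the metric, and pages are assigned greedily in a single pass. All function names, pattern labels, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def feature_vector(patterns, vocabulary):
    """Relative frequency of each repetitive-region pattern on one page."""
    counts = Counter(patterns)
    total = sum(counts.values()) or 1  # guard against pages with no patterns
    return [counts[p] / total for p in vocabulary]


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_pages(pages, threshold):
    """Greedy clustering: a page joins the first cluster whose
    representative vector lies within `threshold`, else starts a new one."""
    vocabulary = sorted({p for pats in pages.values() for p in pats})
    clusters = []  # list of (representative vector, [page names])
    for name, pats in pages.items():
        vec = feature_vector(pats, vocabulary)
        for rep, members in clusters:
            if distance(vec, rep) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


# Hypothetical pages, each reduced to the list of patterns found on it.
pages = {
    "thread_1": ["post", "post", "post", "nav"],
    "thread_2": ["post", "post", "nav"],
    "board_1": ["thread_row", "thread_row", "thread_row", "nav"],
}
clusters = cluster_pages(pages, threshold=0.3)
print(clusters)
```

With these made-up inputs, the two thread pages share a cluster and the board page forms its own, since their pattern-frequency vectors differ by far more than the assumed threshold.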
