| Patent application number | Description | Published |
| --- | --- | --- |
| 20090063538 | METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE - Techniques are described for normalizing dynamic URLs using a hierarchical organization of a web site. Given web pages associated with a web site, an information extraction method is used to generate data structures that represent the content or structure of each of the web pages. These data structures are appended to the corresponding dynamic URLs. The modified URLs with the data structures are tokenized with the resulting tokens clustered to create a hierarchical organization. Nodes of the hierarchical organization may be merged based upon occurrence or patterns of content and structure. The merged hierarchical organization may then be pruned to remove irrelevant information and to reduce the memory footprint of the hierarchical organization. When a new dynamic URL is received, the new dynamic URL is matched to the hierarchical organization. Important parameters are taken into account and irrelevant information may be removed. Based upon the matching to the hierarchical organization, a normalized URL is returned. | 03-05-2009 |
| 20090157597 | REDUCTION OF ANNOTATIONS TO EXTRACT STRUCTURED WEB DATA - Documents, such as web pages of a domain, are annotated to facilitate extracting structured information from the documents. The documents are clustered. Each cluster is such that the documents within that cluster are similar to each other at least with respect to a first threshold, such as according to a shingling metric, where the first threshold is an 8/8 shingling match. There is at least one overlap cluster, each overlap cluster including at least one of the plurality of clusters such that documents of the at least one cluster included in that overlap cluster are similar to each other at least with respect to a second threshold that is lower than the first threshold. A particular overlap cluster is designated, as is a particular cluster of the particular overlap cluster. For the particular designated cluster, an obtained annotation is transferred to other clusters included in the designated particular overlap cluster. | 06-18-2009 |
| 20090157607 | UNSUPERVISED DETECTION OF WEB PAGES CORRESPONDING TO A SIMILARITY CLASS - A method of detecting web pages belonging to at least one similarity class from a plurality of web pages includes determining clusters of the plurality of web pages based on characteristics of the content of the web pages. For each of the determined clusters, at least one metric is determined indicative of similarity among resource locators associated with the web pages of that cluster. A determination of web pages belonging to the at least one similarity class is based on the determined clusters and the determined similarity metrics. | 06-18-2009 |
| 20090171986 | TECHNIQUES FOR CONSTRUCTING SITEMAP OR HIERARCHICAL ORGANIZATION OF WEBPAGES OF A WEBSITE USING DECISION TREES - A decision tree may be determined that is a site map for a domain of web pages. A clustering of a plurality of web pages of a domain is determined, in an unsupervised fashion, based on content-related features of the plurality of web pages. Each determined cluster includes a plurality of web pages, each of the plurality of web pages characterized by a resource locator and each of the resource locators being characterized by at least one resource locator token. The clustering is processed to organize indications of the content-related features of the plurality of web pages into a decision tree characterized by a plurality of nodes, each node characterized by a feature and a value, the feature being at least one of the resource locator tokens and the value being a value of that resource locator token. | 07-02-2009 |
| 20090313127 | SYSTEM AND METHOD FOR USING CONTEXTUAL SECTIONS OF WEB PAGE CONTENT FOR SERVING ADVERTISEMENTS IN ONLINE ADVERTISING - An improved system and method for using contextual sections of web page content for serving advertisements in online advertising is provided. A publisher may use a tool to identify sections of a web page that represent content to be used in contextual advertising. When rendered by a web browser, content from marked sections may be extracted from the web page and sent to an advertisement server for selectively matching advertisements for display to a user. Features may be identified from the content sections and used to select advertisements matching the extracted content of the web page. In particular, the features identified from the content sections may be matched with features designated by advertisers for advertisements. Web page placements may be allocated for advertisements matching the extracted content, and the advertisements may be served for display with the web page. | 12-17-2009 |
| 20090319481 | FRAMEWORK FOR AGGREGATING INFORMATION OF WEB PAGES FROM A WEBSITE - The present invention is directed towards systems and methods for extending media annotations using collective knowledge. The method according to one embodiment of the present invention comprises receiving a plurality of content items and associated annotations. The method further normalizes the plurality of associated annotations and calculates pair frequencies for the plurality of associated annotations. The method then retrieves a plurality of alternative annotations and provides the plurality of alternative annotations. | 12-24-2009 |
| 20100161588 | UNSUPERVISED DETECTION OF WEB PAGES CORRESPONDING TO A SIMILARITY CLASS - A method of detecting web pages belonging to at least one similarity class from a plurality of web pages includes determining clusters of the plurality of web pages based on characteristics of the content of the web pages. For each of the determined clusters, at least one metric is determined indicative of similarity among resource locators associated with the web pages of that cluster. A determination of web pages belonging to the at least one similarity class is based on the determined clusters and the determined similarity metrics. | 06-24-2010 |
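
The two-stage idea in the "UNSUPERVISED DETECTION OF WEB PAGES CORRESPONDING TO A SIMILARITY CLASS" entries (cluster pages by content characteristics, then score each cluster by how similar its resource locators are) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the applications: the tag-sequence signature used to group pages, the Jaccard measure over URL tokens, the `0.5` threshold, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def tag_signature(html):
    """Hypothetical content characteristic: the sequence of opening tag names."""
    return tuple(re.findall(r"<(\w+)", html))


def url_tokens(url):
    """Split a URL into a set of lowercase tokens on common delimiters."""
    return {t for t in re.split(r"[/?&=.:#_-]+", url.lower()) if t}


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_content(pages, signature):
    """Group URLs whose page content yields the same signature."""
    clusters = defaultdict(list)
    for url, content in pages:
        clusters[signature(content)].append(url)
    return list(clusters.values())


def url_similarity(urls):
    """Mean pairwise Jaccard similarity of URL tokens; 0.0 for a single URL."""
    pairs = list(combinations(urls, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(url_tokens(a), url_tokens(b)) for a, b in pairs)
    return total / len(pairs)


def similarity_class(pages, signature=tag_signature, threshold=0.5):
    """Return URLs from clusters whose URL-similarity metric meets the threshold."""
    selected = []
    for urls in cluster_by_content(pages, signature):
        if url_similarity(urls) >= threshold:
            selected.extend(urls)
    return selected
```

Calling `similarity_class` on a list of `(url, html)` pairs keeps only the pages whose content cluster also has consistent URLs. A real system would likely use a more robust content signature and a tuned threshold; exact-match grouping is used here only to keep the sketch short.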