Patent application number | Description | Published |
20090125529 | EXTRACTING INFORMATION BASED ON DOCUMENT STRUCTURE AND CHARACTERISTICS OF ATTRIBUTES - Techniques are disclosed herein for extracting attributes from documents such as web pages. A structure of a training document is compared with a structure of a template to determine a template node that structurally corresponds to a training-document node that has been annotated with an attribute. Filters can be learned by analyzing characteristics that the attribute possesses in the training document. To extract information for the attribute from a new document, a set of candidate nodes is first determined by identifying which nodes in the new document structurally map to the template node. The filters are then applied to eliminate false positives from the candidate nodes. Information can then be extracted from the new document based on the remaining candidate nodes. Even if incremental changes are made to the structure of new documents, nodes that possess the attributes can still be reliably identified. | 05-14-2009 |
20090276506 | GENERATING DOCUMENT TEMPLATES THAT ARE ROBUST TO STRUCTURAL VARIATIONS - A template or wrapper tree for a document such as a web page is generalized from the bottom up (from leaf toward root of a logical tree structure of the template). At a given level in the tree, sub-trees are clustered and the clustered sub-trees are generalized, and the process is repeated at a next higher level in the tree, resulting in a generalized template or wrapper tree. This can be done by generating a nested pattern regular expression based on the sub-tree clusters, merging sub-trees based on the nested pattern regular expression, and then replacing sub-trees in a tree-based regular expression of the template or wrapper at the given level with the merged sub-trees. This process is repeated at a next higher level of the tree (progressing from leaf towards root) until the wrapper or tree-based regular expression that represents the template is fully generalized. | 11-05-2009 |
20100174715 | GENERATING DOCUMENT TEMPLATES THAT ARE ROBUST TO STRUCTURAL VARIATIONS - A template or wrapper tree for a document such as a web page is generalized from the bottom up (from leaf toward root of a logical tree structure of the template). At a given level in the tree, sub-trees are clustered and the clustered sub-trees are generalized, and the process is repeated at a next higher level in the tree, resulting in a generalized template or wrapper tree. This can be done by generating a nested pattern regular expression based on the sub-tree clusters, merging sub-trees based on the nested pattern regular expression, and then replacing sub-trees in a tree-based regular expression of the template or wrapper at the given level with the merged sub-trees. This process is repeated at a next higher level of the tree (progressing from leaf towards root) until the wrapper or tree-based regular expression that represents the template is fully generalized. | 07-08-2010 |
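
The bottom-up generalization described in the two template-generation entries above can be illustrated with a minimal Python sketch. It is not the method claimed in the applications: the `Node` class and the `shape`, `signature`, and `generalize` functions are hypothetical names introduced here, and the clustering step is simplified to merging adjacent sibling sub-trees whose generalized shapes are exactly equal, marking the survivor as "one or more" (`+`).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical template node: a tag, child sub-trees, and a repeat flag."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    repeated: bool = False  # True means "one or more of this sub-tree"


def shape(node: Node) -> str:
    """Structure of a sub-tree, ignoring the node's own repeat flag."""
    inner = ",".join(signature(child) for child in node.children)
    return f"{node.tag}({inner})" if inner else node.tag


def signature(node: Node) -> str:
    """Regex-like rendering of a sub-tree, with '+' on repeated sub-trees."""
    return shape(node) + ("+" if node.repeated else "")


def generalize(node: Node) -> Node:
    """Generalize children first (leaf toward root), then merge at this level."""
    generalized = [generalize(child) for child in node.children]
    merged: List[Node] = []
    for child in generalized:
        # Simplified clustering (an assumption of this sketch): adjacent
        # siblings with identical generalized shapes form one cluster.
        if merged and shape(merged[-1]) == shape(child):
            merged[-1].repeated = True
        else:
            merged.append(child)
    return Node(node.tag, merged, node.repeated)


def make_list_page() -> Node:
    """Example input: a list with three structurally identical items."""
    return Node("ul", [Node("li", [Node("a")]) for _ in range(3)])


if __name__ == "__main__":
    print(signature(generalize(make_list_page())))  # prints ul(li(a)+)
```

Because merging requires exact shape equality, this sketch does not unify, for example, `li(a)` with `li(a+)`; a similarity-based clustering step, as the abstracts describe, would be needed for that.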