Patent application number | Description | Published |
20080235201 | Consistent weighted sampling of multisets and distributions - Techniques are provided that identify near-duplicate items in large collections of items. A list of (value, frequency) pairs is received, and a sample (value, instance) is returned. The value is chosen from the values of the first list, and the instance is a value less than frequency, in such a way that the probability of selecting the same sample from two lists is equal to the similarity of the two lists. | 09-25-2008 |
20090132571 | EFFICIENT USE OF RANDOMNESS IN MIN-HASHING - Documents that are near-duplicates may be determined using techniques such as min-hashing. Randomness that is used in these techniques may be based on sequences of bits. The sequences of bits may be generated from a string of bits, with the sequences determined by parsing the string at each occurrence of a particular value, such as the value “1”. | 05-21-2009 |
20100070511 | REDUCING USE OF RANDOMNESS IN CONSISTENT UNIFORM HASHING - Documents that are near-duplicates may be determined using techniques involving consistent uniform hashing. A biased bit may be placed in the leading position of a sequence of bits that may be generated and subsequently used in comparison techniques to determine near-duplicate documents. Unbiased bits may be used in subsequent positions of the sequence of bits, after the biased bit, for use in comparison techniques. Samples may be used collectively, as opposed to individually, in the generation of biased bits. Sequences of bits may thus be produced not on a single sample basis, but for multiple samples, thereby amortizing the cost of generating randomness for the samples. Less than one bit of randomness per sample may be used. | 03-18-2010 |
20110064221 | DIFFERENTIAL PRIVACY PRESERVING RECOMMENDATION - User rating data may be received at a correlation engine through a network. The user rating data may include ratings generated by a plurality of users for a plurality of items. Correlation data may be generated from the received user rating data by the correlation engine. The correlation data may identify correlations between the items based on the user generated ratings. Noise may be generated by the correlation engine, and the generated noise may be added to the generated correlation data by the correlation engine to provide differential privacy protection to the user rating data. | 03-17-2011 |
20110208763 | DIFFERENTIALLY PRIVATE DATA RELEASE - A query log includes a list of queries and a count for each query representing the number of times that the query was received by a search engine. In order to provide differential privacy protection to the queries, noise is generated and added to each count, and queries that have counts that fall below a threshold are removed from the query log. A distribution associated with a function used to generate the noise is referenced to determine a distribution of a number of times that a hypothetical query having a zero count would have its count exceed the threshold after the addition of noise. Random queries of an amount equal to a sample from the distribution of number of times are added to the query log with a count that is greater than the threshold count. | 08-25-2011 |
20110238611 | PROBABILISTIC INFERENCE IN DIFFERENTIALLY PRIVATE SYSTEMS - Given that a differentially private mechanism has a known conditional distribution, probabilistic inference techniques may be used along with the known conditional distribution, and generated results from previously computed queries on private data, to generate a posterior distribution for the differentially private mechanism used by the system. The generated posterior distribution may be used to describe the probability of every possible result being the correct result. The probability may then be used to qualify conclusions or calculations that may depend on the returned result. | 09-29-2011 |
20130304744 | DIFFERENTIAL DATAFLOW - The techniques discussed herein efficiently perform data-parallel computations on collections of data by implementing a differential dataflow model that performs computations on differences in the collections of data. The techniques discussed herein describe defined operators for use in a data-parallel program that performs the computations on the determined differences between the collections of data by creating a lattice and indexing the differences in the collection of data according to the lattice. | 11-14-2013 |
20140172939 | Reachability-Based Coordination for Cyclic Dataflow - Various embodiments provide techniques for working with large-scale collections of data pertaining to real world systems, such as a social network, a roadmap/GPS system, etc. The techniques perform incremental, iterative, and interactive parallel computation using a coordination clock protocol, which applies to scheduling computations and managing resources such as memory and network resources, etc., in cyclic graphs including those resulting from a differential dataflow model that performs computations on differences in the collections of data. | 06-19-2014 |