Patent application title: Method and Apparatus for Correlating Multiple Cookies as Having Originated from the Same Device Using Device Fingerprinting
Peter H. Horadan (Redmond, WI, US)
Matthew R. Shanahan (Seattle, WA, US)
Mark B. Upson (Mercer Island, WA, US)
IPC8 Class: AG06Q3000FI
Publication date: 2011-11-24
Patent application number: 20110288940
Information that is useful to distinguish between two or more computer
devices (a "device fingerprint") is collected and stored in a database
with corresponding state-management tokens such as HTTP cookies. The
database is searched for a fingerprint, and if the fingerprint is found,
the corresponding stored token is delivered to a computer device for the
device's use in making subsequent requests for resources or services.
1. A method for recognizing repeat visitors to a website among a
plurality of visitors to the website, comprising: if a visitor to the
website fails to present a cookie, issuing a new cookie and collecting
device fingerprint information about the visitor's computer; storing the
new cookie and the device fingerprint information in a database; tracking
activities of each tracked visitor of the plurality of visitors by the
tracked visitor's cookie; and computing a number of unique visitors to
the website by reducing a count of tracked visitors to the website by a
number of cookie-clearing visitors having different cookies but similar device fingerprints.
2. The method of claim 1, further comprising: offering advertising impressions on the website to an advertiser at a price computed based on the number of unique visitors to the website.
3. The method of claim 1, further comprising: soliciting a discounted advertising rate based on the number of cookie-clearing visitors.
4. The method of claim 1 wherein collecting device fingerprint information comprises: transmitting an executable program to the visitor's computer; and receiving data collected by the executable program about the visitor's computer.
5. The method of claim 1 wherein reducing the count of tracked visitors comprises: selecting records from the database having different cookies and similar device fingerprints; and reducing the count of tracked visitors by a count of the selected records having distinct device fingerprints.
6. The method of claim 1, further comprising: analyzing the database to estimate a time at which each cookie-clearing visitor lost its cookie; and computing an average lifetime of a cookie based on the estimated times.
7. The method of claim 1, further comprising: identifying unique visitors who experienced at least one cookie-clearing event; and filtering the tracked activities of the tracked visitors to remove tracked activities of unique visitors who were not identified as having experienced at least one cookie-clearing event.
8. A method comprising: transmitting an executable program to a web browser at a client computer, the executable program to cause the web browser to collect information about the client computer; receiving identifying information about the client computer that was collected by the executable program; correlating the identifying information about the client computer with previously-collected identifying information about a plurality of computers; and associating a first browser activity sequence linked with a first persistent activity token with a second browser activity sequence linked with a second, different persistent activity token.
9. The method of claim 8, further comprising: before the transmitting operation, receiving a request from the web browser at the client computer to retrieve a resource, the request lacking a persistent activity token; and after the receiving operation, transmitting a message to cause the web browser to associate the second, different persistent activity token with a subsequent request from the web browser.
11. The method of claim 8 wherein the identifying information comprises at least one of an operating system version, a browser software version, a browser plugin list, or a font list.
12. The method of claim 8 wherein the associating operation produces a plurality of tentative associations, the method further comprising: collecting distinguishing information about a plurality of browser activities associated with the second, different persistent activity token; comparing the distinguishing information with the first browser activity sequence; and selecting one of the plurality of tentative associations based on similarity between the distinguishing information and the first browser activity sequence.
13. A system comprising: a web server to receive requests from clients and deliver requested digital content to the clients; a database to record information about the requests and the clients; and client correlation means to collect distinguishing information from the clients and assign unique identifiers to the clients.
14. The system of claim 13 wherein the client correlation means is to cause a client to collect and transmit information about the client to the web server.
15. The system of claim 13 wherein the client correlation means is to transmit an executable program to the client, said executable program to cause the client to report one of an operating system of the client, a browser software version of the client, a list of browser plugins of the client or a list of display fonts of the client.
16. The system of claim 13, further comprising: an analysis server to report a synthetic history of client activities, wherein the synthetic history of at least one client is constructed by combining a first history of requests associated with a first unique identifier and a second history of requests associated with a second, different unique identifier.
17. A computer-readable medium containing instructions to cause a programmable processor to perform operations comprising: receiving a device fingerprint from a client computer; locating a similar device fingerprint in a database; extracting a persistent token corresponding to the similar device fingerprint in the database; and transmitting a message to cause the client computer to adopt the persistent token for a future sequence of requests for digital resources.
18. The computer-readable medium of claim 17 wherein the device fingerprint comprises information about the client computer.
19. The computer-readable medium of claim 17 wherein the device fingerprint comprises information about an Internet Protocol ("IP") address of the client computer.
20. The computer-readable medium of claim 17 wherein the device fingerprint comprises an approximate geographic location of the client computer.
21. The computer-readable medium of claim 17 containing additional instructions to cause the programmable processor to perform operations comprising: receiving a request from the client computer; preparing a response to the client computer based on the request and on historical data in the database keyed to the persistent token; and transmitting the response to the client computer.
22. The computer-readable medium of claim 17 containing additional instructions to cause the programmable processor to perform operations comprising: comparing a pair of device fingerprints to estimate a likelihood that the device fingerprints were received from the same client computer.
23. The computer-readable medium of claim 17 containing additional instructions to cause the programmable processor to perform operations comprising: searching the database of device fingerprints to locate a device fingerprint that is most similar to the device fingerprint received from the client computer.
CLAIM OF PRIORITY
 This application claims the benefit of U.S. provisional patent application No. 61/347,734, filed 24 May 2010.
 The invention relates to user tracking in online services. More specifically, the invention relates to techniques for improving the accuracy of cookie-based tracking schemes.
 Those who deliver products or services (or, more generally, information) over the Internet have a strong interest--financially and otherwise--in tracking and analyzing visitors, visits, page views, browsing histories and other characteristics of their customers. For example, a publisher may provide a content site and wish to analyze the reach and frequency of advertising delivered to individual visitors. To do this, the publisher must have a reliable and long-lasting way to recognize repeat visitors. Providers of digital products or services primarily use HTTP cookies as a tracking mechanism to determine whether the current visitor is the same visitor that was seen before, or is a new visitor. (HTTP cookies are described in detail in Internet Engineering Task Force ("IETF") Request for Comments ("RFC") document RFC2965, published October 2000.) A publisher's web infrastructure may build up a significant amount of interesting information about a visitor over the course of his many page views. This information is tracked and correlated to that visitor by means of a cookie issued to the visitor's device. A publisher gets great value from the information it is able to collect about visitors--for example, in estimating user counts, or in selling advertisements to a targeted market, and so on--and thus there is considerable value in being able to build a lasting record of a visitor.
 Unfortunately, cookies are easily and often deleted. When this happens, all of the collected information about a visitor may be lost. After cookie deletion, a new cookie will be issued to that visitor on his next visit, and the process of collecting information starts again. The system no longer has any way to know that the current visitor is the same as the previous visitor, because the original cookie was deleted. Any analysis system relying on cookies may mistakenly believe that there are two different visitors (one from before the cookie deletion, and a new visitor after the cookie deletion)--when in fact these are the same visitor. This causes errors in analysis--for example in this case an analytics system would report two unique users, when in fact there was only one. Significantly increased accuracy of analysis would be achieved if the system were able to "stitch together" those two cookies and understand that they both represent the same visitor.
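 The double-counting error described above, and the correction recited in the claims (reducing the count of tracked visitors by the number of surplus cookies that share a fingerprint), can be illustrated with a minimal sketch. The function name `count_unique_visitors` and the record layout are hypothetical, and exact fingerprint equality stands in here for the "similar" fingerprints of the claims; each cookie is assumed to appear with a single fingerprint.

```python
from collections import defaultdict


def count_unique_visitors(records):
    """Estimate unique visitors from (cookie, fingerprint) pairs.

    Assumption: two cookies belong to one device when their fingerprints
    are exactly equal. A real system would use a similarity test instead.
    """
    cookies_by_fingerprint = defaultdict(set)
    for cookie, fingerprint in records:
        cookies_by_fingerprint[fingerprint].add(cookie)

    # Count of tracked visitors as a cookie-only system would report it.
    tracked = sum(len(cookies) for cookies in cookies_by_fingerprint.values())
    # A fingerprint seen with k cookies contributes k - 1 surplus cookies.
    surplus = sum(len(cookies) - 1 for cookies in cookies_by_fingerprint.values())
    return tracked - surplus


records = [("c1", "fpA"), ("c2", "fpA"), ("c3", "fpB")]
print(count_unique_visitors(records))  # 3 cookies, 1 surplus cookie -> 2
```

 In this example a cookie-only system would report three visitors, while the corrected count is two.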
BRIEF DESCRIPTION OF DRAWINGS
 Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to "an" or "one" embodiment in this disclosure are not necessarily to the same embodiment, and such references mean "at least one."
 FIG. 1 shows a distributed computing environment where an embodiment of the invention may be deployed.
 FIG. 2 shows how cookies may be used in a traditional sequence of Hypertext Transfer Protocol ("HTTP") requests.
 FIG. 3 shows how an embodiment of the invention collects and applies device fingerprint data.
 FIG. 4 shows how an embodiment of the invention operates when a previously-seen client issues a request without an HTTP cookie.
 FIG. 5 shows another distributed computing environment where multiple servers cooperate to preserve information for correlating client identities.
 Embodiments of the present invention are believed to be superior to prior-art cookie-based content-tracking systems for several reasons, including:
  Not requiring any specialized software to be deployed to users of the content.
  Not relying on a file that can be deleted by the visitor.
  Not requiring product or service providers to switch to a new form of tracking (providers can use all of their current cookie-based tools with a slight modification to call the cookie stitching algorithm at certain times).
  Not requiring product or service providers to change their data storage schema.
  Capable of detecting when a previous cookie was deleted--which is itself interesting analytical information.
  Capable of correlating the previous cookie(s) to a newly generated cookie.
 One embodiment of our invention involves the analysis of users of Internet-based web publication using HTTP cookies. FIG. 1 shows some of the entities and interactions involved in such publication: a user 100 operates a web browser 110 executing on a computer 120. The user directs the browser to retrieve some desired information from a web server 130 executing at a remote computer 140. Communication between computers 120 and 140 may occur over a distributed data network 150 such as the Internet. As described below, according to an embodiment of the invention, web server 130 may send an executable program 190 to operate within web browser 110; program 190 may send additional information to web server 130, or otherwise interact with it.
 Web server 130 also interacts with an analysis server 160 which (in the environment depicted here) is executing on another computer 170. A database 180 is provided for storing information used by analysis server 160 to perform its role in the operations detailed below.
 FIG. 2 is a flow chart outlining some important aspects of the communication between browser 110 and web server 130 in an ordinary Hypertext Transfer Protocol ("HTTP") interaction. The interaction is quite a bit more complicated than this flow chart suggests, but the details are well known in the art, and are clearly explained in various IETF RFCs, including RFC2616 and RFC2965.
 At 210, a browser sends a request to a web server to cause the server to provide information. If this is the first request from the browser to the server, the request will not include a cookie. The server receives the request and prepares an appropriate response (220). If the request does not include a cookie (230), then a "Set-Cookie" header will be added to the response (250). The response is sent back to the browser (260) and presented to the user (270). The user may cause the browser to make another request for information (280). This request (and subsequent requests) will include the cookie, so the server will skip step 250 in preparing subsequent responses.
 However, on occasion, the browser's cookie may be cleared or deleted (290), so a subsequent request is again made without a cookie (210), and the server will prepare a response (220) including a new Set-Cookie header (250).
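 The FIG. 2 sequence can be sketched as follows. This is only an illustration under stated assumptions: the cookie name `visitor_id`, the function `handle_request`, and the use of a random UUID as the cookie value are hypothetical and are not taken from the application. The numbers in the comments refer to the steps of FIG. 2.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def handle_request(request_headers):
    """Prepare response headers for one request; return (headers, cookie)."""
    response_headers = {"Content-Type": "text/html"}      # 220: prepare response
    jar = SimpleCookie(request_headers.get("Cookie", ""))
    morsel = jar.get(COOKIE_NAME)
    if morsel is None:                                     # 230: no cookie presented
        cookie = uuid.uuid4().hex
        response_headers["Set-Cookie"] = f"{COOKIE_NAME}={cookie}"  # 250
    else:
        cookie = morsel.value                              # returning browser
    return response_headers, cookie                        # 260: send response


# First visit: no cookie, so the server issues one.
first_headers, first_cookie = handle_request({})
# Repeat visit: the browser presents its cookie, so step 250 is skipped.
second_headers, second_cookie = handle_request(
    {"Cookie": f"{COOKIE_NAME}={first_cookie}"})
# After a cookie-clearing event (290) the server issues a new, unrelated cookie.
third_headers, third_cookie = handle_request({})
```

 The third request is indistinguishable from a first visit, which is the limitation the following paragraphs address.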
 A web server operating according to known, prior-art HTTP state-maintenance protocols (e.g., cookies) may be unable to distinguish between two series of requests from two different browsers that have never visited the server before, and a single series of requests from one browser, where the single series of requests is interrupted by a cookie-clearing event.
 Next, the browser transmits the cookie and device information to the web server (340). The web server forwards the fingerprint data, and other information about the HTTP request (such as the HTTP headers and source IP address) to an analysis server (350). The analysis server stores the information for future use (360) and may reply to the web server that the browser has not been seen previously (370).
 Subsequent requests from the browser proceed as usual: the browser sends its assigned cookie, and the web server can correlate these requests with previous requests, collecting information of interest to the publisher about the resources the browser references. The web server may continue to transmit fingerprint-data collection code, and changes to the device's fingerprint can be detected and monitored. For example, the user may install a new display device, so the device fingerprint might show a different screen resolution or color depth. All this information can be kept exclusively within the web server, or shared with the analysis server.
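 Because a fingerprint may drift over time (for example, when the screen resolution changes), a comparison that tolerates partial mismatches is useful. The sketch below scores two fingerprints by the fraction of attributes on which they agree. The attribute names and the scoring rule are assumptions chosen for illustration; the application does not prescribe a particular comparison method.

```python
def fingerprint_similarity(a, b):
    """Return the fraction of attributes (over the union of keys) that match."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


# Hypothetical fingerprints of one device before and after a display change.
before = {
    "os": "Windows 7",
    "browser": "Firefox 3.6",
    "fonts": ("Arial", "Courier New", "Verdana"),
    "screen": "1280x1024x24",
}
after = dict(before, screen="1920x1080x32")

print(fingerprint_similarity(before, after))  # 3 of 4 attributes match -> 0.75
```

 A deployment would likely weight attributes differently (a font list is more distinctive than an operating system version), but the unweighted form is enough to show the idea.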
 FIG. 4 shows what happens according to an embodiment of the invention if the browser loses its cookie: steps 310 through 350 are identical to those described in reference to FIG. 3 (only steps 310 and 350 are shown in FIG. 4; for 320, 330 and 340, refer to the preceding Figure). However, after the fingerprint, HTTP header and other data are sent to the analysis server, the earlier-saved record is located and the "old" cookie is recovered (410). The analysis server responds to the web server that this is an existing client (420) (i.e., that it has been seen previously, and has already had a cookie assigned). The web server may issue a new response to change the client's cookie to the old value (430), or may simply note internally that the two different cookies are associated with only one client (440). In either scenario, the web server and analysis server continue to collect and analyze data, and the analysis can include both the information that the cookie was lost or cleared, and that the two otherwise apparently unrelated request sequences were issued by a single client.
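 The store-then-look-up behavior of the analysis server in FIGS. 3 and 4 can be sketched as below. The class `AnalysisServer`, its method `observe`, and the in-memory dictionaries are hypothetical stand-ins for the analysis server and database; exact fingerprint equality is assumed in place of a similarity search. Step numbers in the comments refer to the Figures.

```python
class AnalysisServer:
    """Toy stand-in for the analysis server and its database."""

    def __init__(self):
        self.cookie_by_fingerprint = {}  # fingerprint key -> first cookie seen
        self.aliases = {}                # later cookie -> original cookie

    def observe(self, cookie, fingerprint):
        """Return ("new", cookie) or ("existing", original_cookie)."""
        key = tuple(sorted(fingerprint.items()))
        original = self.cookie_by_fingerprint.get(key)
        if original is None:
            self.cookie_by_fingerprint[key] = cookie   # 360: store for future use
            return "new", cookie                       # 370: not seen previously
        if original != cookie:
            self.aliases[cookie] = original            # 440: note the association
        return "existing", original                    # 410, 420: old cookie recovered


server = AnalysisServer()
fingerprint = {"os": "Windows 7", "browser": "Firefox 3.6"}

first = server.observe("cookie-A", fingerprint)    # first visit
second = server.observe("cookie-B", fingerprint)   # same device after cookie loss
```

 With the returned original cookie, the web server can either re-issue it to the client (430) or keep the alias internally (440).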
 It is appreciated that the dynamic pattern of interaction between a browser and a web server also yields identifying information, and, to the extent that this information is captured and stored, it may be available to the analysis server to assist in identifying a client without a cookie, or with a newly-assigned cookie, as a previously-seen client whose old cookie was associated with an earlier browsing history. Thus, in some embodiments, the determination that a client with a newly-initialized browsing history is actually the same as an earlier client may be emergent, developing over a series of browsing interactions. Roughly speaking, the bare fingerprint data may suggest that a "new" client is the same as an earlier client, but continued browsing activity may provide an increased level of confidence in the identification. Alternatively, continued browsing may show that the browser tentatively identified as the same as an earlier client based on the fingerprint data, is in fact more likely to be a new client after all.
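 One way to picture this emergent identification is to compare the resources requested under a new cookie with the histories of each tentatively matched earlier cookie, and to prefer the candidate whose history overlaps most. The sketch below uses Jaccard overlap of requested URLs as the behavioral signal; that choice, and the names `jaccard` and `select_association`, are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Overlap of two collections: |intersection| / |union| (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_association(candidates, new_urls):
    """Pick the earlier cookie whose browsing history best matches new_urls.

    candidates maps each tentatively matched earlier cookie to the URLs
    requested under that cookie.
    """
    return max(candidates, key=lambda cookie: jaccard(candidates[cookie], new_urls))


# Two earlier cookies have fingerprints similar to the "new" client's.
candidates = {
    "cookie-A": ["/news", "/sports"],
    "cookie-C": ["/cars"],
}
new_urls = ["/news", "/weather"]

best = select_association(candidates, new_urls)
```

 As more requests arrive under the new cookie, the overlap scores can be recomputed, so a tentative association may be strengthened or abandoned over time.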
 Embodiments of the invention may incorporate a number of variants to accomplish the goal of correlating multiple series of web interactions:
  The cookie is not an HTTP cookie, but is a Flash Local Shared Object, or an HTML 5 locally stored object, or any other technology for storing data in a browser.
  The cookie is not issued by the web server, but is instead issued by an outside analytics system. The cookie that is put back in place is the cookie originally issued by the outside analytics system.
  Flexible, selectable and/or plug-in module-based methods for doing device fingerprint comparison.
  The client device described in the foregoing as a "web browser" is not a traditional web browser, and the interaction does not happen over the Internet, but instead it is a series of loosely-correlated interactions between a client and a server on any distributed system that attempts to track users.
 Other alternate embodiments include:
  A system to correlate cookies in an advertising-delivery network.
  A system where cookies can be correlated among different providers (publishers) (see discussion of FIG. 5 below).
  A system where cookies can be correlated across data sets.
  A system to correlate cookies for a Software As A Service ("SaaS") provider.
  A system where cookies can be correlated across digital products and services.
  A system where the original cookie is provided in real-time rather than issuing a new cookie, followed by a cookie replacement.
  A system to create an alternate persistent ID based on a collection of cookies.
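 The last variant, an alternate persistent ID derived from a collection of cookies, can be sketched with a union-find (disjoint-set) structure: each time two cookies are found to belong to one device they are linked, and the representative of the resulting group serves as the persistent ID. The class `CookieGroups` and the rule of keeping the lexicographically smaller cookie as the representative are assumptions for illustration, not details from the application.

```python
class CookieGroups:
    """Union-find over cookie values; a group's root acts as the persistent ID."""

    def __init__(self):
        self.parent = {}

    def find(self, cookie):
        """Return the persistent ID (group root) for a cookie."""
        self.parent.setdefault(cookie, cookie)
        while self.parent[cookie] != cookie:
            # Path halving: point this node at its grandparent as we walk up.
            self.parent[cookie] = self.parent[self.parent[cookie]]
            cookie = self.parent[cookie]
        return cookie

    def link(self, a, b):
        """Record that cookies a and b were issued to the same device."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller root so the resulting ID is deterministic.
            low, high = sorted((root_a, root_b))
            self.parent[high] = low


groups = CookieGroups()
groups.link("c1", "c2")   # c2 was issued after c1 was cleared
groups.link("c2", "c3")   # c3 was issued after c2 was cleared
```

 After these two links, all three cookies resolve to the same persistent ID, while an unrelated cookie resolves to itself.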
 FIG. 5 shows an environment similar to FIG. 1, but here, two data publishers cooperate to stitch cookies together into a unified, reliable client identity. The web browser on client computer 120 displays information in an on-screen window 110. An enlarged sample window is shown at 530. The window contains a main text document 533 and a graphic image 536. The main document was retrieved from a remote server 140, while the graphic was retrieved from a different remote server, 550. Servers 140 and 550 exchange information 560 so that, according to an embodiment of the invention, even if client computer 120 loses or deletes a cookie that server 140 had been using to identify the computer, it may not have lost or deleted a cookie that identified computer 120 to server 550. Thus, information reported over inter-server channel 560 can be used by server 140 to stitch pre-cookie-loss history for client 120 together with post-cookie-loss history. In this case, the main text document 533 from server 140 causes client 120 to report information about itself to server 550 when it requests image 536. Server 550 then reports identical or related information to server 140. In some embodiments, server 550 may be part of an advertisement delivery network, where ads are delivered for inclusion with web-page resources from a variety of publishers. The ad delivery network functions (in part) as a repository of information about clients, and can provide information to the publishers that allows the publishers to identify return visitors who--due to cookie loss or deletion--appear to be unrelated to prior visitors.
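 The cooperation over inter-server channel 560 can be pictured with the following sketch, in which the second server (550) remembers which publisher cookies it has seen alongside each of its own cookies and reports earlier ones back to the publisher. The class `AdServer`, the method `report`, and the idea that the publisher cookie is passed along with the image request are hypothetical simplifications; the application does not specify the message format.

```python
from collections import defaultdict


class AdServer:
    """Toy stand-in for server 550, which still holds its own cookie for the client."""

    def __init__(self):
        # own (ad) cookie -> publisher cookies seen with it, in order of arrival
        self.publisher_cookies = defaultdict(list)

    def report(self, ad_cookie, publisher_cookie):
        """Record the pairing; return earlier publisher cookies for this client."""
        seen = self.publisher_cookies[ad_cookie]
        earlier = [c for c in seen if c != publisher_cookie]
        if publisher_cookie not in seen:
            seen.append(publisher_cookie)
        return earlier


ad_server = AdServer()
# Before cookie loss: the client presents publisher cookie "p1" at server 140.
before_loss = ad_server.report("ad-42", "p1")
# After losing "p1", the client receives "p2" from server 140 but still has "ad-42".
after_loss = ad_server.report("ad-42", "p2")
```

 The non-empty reply after the cookie loss tells the publisher that the histories recorded under the two publisher cookies belong to one client.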
 An embodiment of the invention may be a machine-readable medium having stored thereon data and instructions to cause a programmable processor to perform operations as described above. In other embodiments, the operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmed computer components and custom hardware components.
 Instructions for a programmable processor may be stored in a form that is directly executable by the processor ("object" or "executable" form), or the instructions may be stored in a human-readable text form called "source code" that can be automatically processed by a development tool commonly known as a "compiler" to produce executable code. Instructions may also be specified as a difference or "delta" from a predetermined version of a basic source code. The delta (also called a "patch") can be used to prepare instructions to implement an embodiment of the invention, starting with a commonly-available source code package that does not contain an embodiment.
 In some embodiments, the instructions for a programmable processor may be treated as data and used to modulate a carrier signal, which can subsequently be sent to a remote receiver, where the signal is demodulated to recover the instructions, and the instructions are executed to implement the methods of an embodiment at the remote receiver. In the vernacular, such modulation and transmission are known as "serving" the instructions, while receiving and demodulating are often called "downloading." In other words, one embodiment "serves" (i.e., encodes and sends) the instructions of an embodiment to a client, often over a distributed data network like the Internet. The instructions thus transmitted can be saved on a hard disk or other data storage device at the receiver to create another embodiment of the invention, meeting the description of a machine-readable medium storing data and instructions to perform some of the operations discussed above. Compiling (if necessary) and executing such an embodiment at the receiver may result in the receiver performing operations according to a third embodiment.
 In the preceding description, numerous details were set forth. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without some of these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
 Some portions of the detailed descriptions may have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
 It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the preceding discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
 The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, including without limitation any type of disk including floppy disks, optical disks, compact disc read-only memory ("CD-ROM"), and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable, programmable read-only memories ("EPROMs"), electrically-erasable read-only memories ("EEPROMs"), magnetic or optical cards, or any type of media suitable for storing computer instructions.
 The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will be recited in the claims below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
 The applications of the present invention have been described largely by reference to specific examples and in terms of particular allocations of functionality to certain hardware and/or software components. However, those of skill in the art will recognize that client correlation based on device fingerprints can also be produced by software and hardware that distribute the functions of embodiments of this invention differently than herein described. Such variations and implementations are understood to be captured according to the following claims.