$Id: cookies.html,v 1.11 2003/02/17 23:49:10 dean Exp $
DO NOT EMAIL ME WITH ANY QUESTIONS ABOUT COOKIES, I AM NOT A RESOURCE REGARDING COOKIES, THIS IS AN OP-ED PIECE, WHICH IS MOSTLY OUT OF DATE BY NOW (it was written in 1996).
This page was written (while working at HotWired) as a generic response to the significant number of user complaints about HotWired tracking cookies that have been forwarded to me. I am generally the contact person for cookie questions at HotWired. I believe the users' complaints are misplaced -- instead of complaining to the sites that use tracking cookies, they should be asking the browser developers for more features to control cookies. I am not going to touch on the privacy issues of tracking, but I will mention that cookies are just the tip of the iceberg. Nor am I going to examine HotWired's motivation for using tracking cookies; that's not my job at all.
I'm hoping to provide enough information for the technical and non-technical readers... and I'm sure I'm failing for both.
HTTP is a stateless protocol. This means that an HTTP server has no information in a request to tie it to any other request. The data in a response is based only on the information the client sends in the request. It's like doing a math problem in high school -- you are only allowed to use the facts given in the problem plus mathematical logic to derive an answer.
HTTP stands out from the other protocols you're probably familiar with. Those protocols are all "stateful": information divulged in one request can be used to modify future requests. In fact these protocols have a concept of a "session" wherein a batch of requests is sent and responses are received. FTP (file transfer protocol) has many states, including "the current directory". SMTP (simple mail transfer protocol) and POP (post office protocol) both include a concept of "who you are" which is used for all requests. NNTP (network news transfer protocol) allows you to "change usenet groups", which determines where future requests for articles will be retrieved from.
Stateless protocols generally have the advantage that they require fewer resources on the server -- the burden is pushed onto the client. But the disadvantage is that the client needs to send the server enough information on each request for the server to produce the proper answer. Cookies are a method for a server to ask the client to store arbitrary data for use in future connections. The server is asking the client to keep state information.
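To make the exchange concrete, here is a minimal Python sketch of the two halves of it: the server building a Set-Cookie response header, and the server later reading the Cookie request header the client sends back. It uses the standard library's http.cookies module; the function names and the cookie name "visitor" are made up for illustration, and this is only one way such a thing could look.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(name, value):
    """Build the value of a Set-Cookie response header.

    This is the server asking the client to store name=value.
    """
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"  # assumption: the cookie applies to the whole site
    return cookie[name].OutputString()


def parse_cookie_header(header_value):
    """Parse the Cookie request header sent back by the client into a dict."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {key: morsel.value for key, morsel in cookie.items()}


if __name__ == "__main__":
    # Response from the server: "please remember this for me".
    print("Set-Cookie:", make_set_cookie_header("visitor", "abc123"))
    # A later request from the client: the state comes back to the server.
    print(parse_cookie_header("visitor=abc123"))
```

The server keeps nothing between the two calls; whatever state exists travels in the headers, which is the point of the paragraph above.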
Cookies are not part of the HTTP/1.0 specification. They are an optional extension designed by Netscape. For this reason not all clients support cookies. The standard does not specify any method for a client to tell a server that it supports or doesn't support cookies. A server essentially has to guess if a browser supports cookies. One guess is to use the User-Agent string (this is a piece of text that identifies your browser -- "Mozilla/3.01" would indicate Netscape 3.01). But testing for that indicates whether the browser supports cookies, not whether the user wants their browser to support cookies.
In typical tracking cookie implementations, an attempt is made to send a cookie on every hit that didn't have a cookie in the request. Here are potential server-side modifications to tracking cookies which I don't feel are satisfactory solutions. I'd be interested in hearing more.
The primary case where this fails is when a site wishes to use <IMG> tags that reference pages on other sites. You've probably seen this on some ads -- the ad is served from a different site (riddler.com's advertisements are like this everywhere I've seen them).
One thing I didn't mention above is that cookies are tied to the domain that issued them -- if you get a cookie from .hotwired.com it will not be sent to any other domain. Some advertisers want to be able to track image hits. So the server would have to know which images need tracking cookies and which don't. Still pretty easy to do, but adds complexity to an already complex server configuration (I'm referring to HotWired's servers).
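The domain rule can be sketched as a simple tail match. This is a simplified Python illustration, assuming that a cookie domain with a leading dot covers any host ending in that domain; the function name and every host other than .hotwired.com are hypothetical, and real browsers apply further rules that are not modelled here.

```python
def cookie_applies(cookie_domain, request_host):
    """Return True if a cookie issued for cookie_domain would be sent to request_host.

    Simplified tail matching: ".hotwired.com" covers "www.hotwired.com"
    (and, as an assumption of this sketch, the bare "hotwired.com").
    """
    cookie_domain = cookie_domain.lower()
    request_host = request_host.lower()
    if cookie_domain.startswith("."):
        return (request_host.endswith(cookie_domain)
                or request_host == cookie_domain[1:])
    return request_host == cookie_domain


if __name__ == "__main__":
    print(cookie_applies(".hotwired.com", "www.hotwired.com"))   # same domain
    print(cookie_applies(".hotwired.com", "ads.example.com"))    # a different site
```

An image served from another site therefore never sees the first site's cookie, which is why the image server has to issue a tracking cookie of its own.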
Add a new cookie named "choice" and give it the value "accept this cookie and no further cookies will be sent". If a request contains a tracking or a choice cookie, then don't generate a tracking cookie. Otherwise generate both a tracking and a choice cookie.
Note that users that don't have the "Accept Cookies" dialogue enabled will implicitly accept both the tracking and the choice cookie. Users with the "Accept Cookies" dialogue enabled can optionally refuse the tracking cookie and accept the choice cookie thereby stopping the server from sending further tracking cookies.
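The server-side decision described above might look roughly like this in Python. The cookie names "tracking" and "choice" follow the text; the function name and the use of a random hex string as the tracking value are assumptions made for the sake of the sketch.

```python
import uuid
from http.cookies import SimpleCookie

TRACKING = "tracking"
CHOICE = "choice"
CHOICE_VALUE = "accept this cookie and no further cookies will be sent"


def cookies_to_send(cookie_header):
    """Decide which cookies to set, given the request's Cookie header value.

    If the request already carries a tracking or a choice cookie, send
    nothing. Otherwise send both a tracking and a choice cookie.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    if TRACKING in jar or CHOICE in jar:
        return {}
    # Assumption: any unique token will do as a tracking value.
    return {TRACKING: uuid.uuid4().hex, CHOICE: CHOICE_VALUE}


if __name__ == "__main__":
    print(cookies_to_send(""))                # first visit: both cookies
    print(cookies_to_send("choice=whatever")) # user kept only the choice cookie
```

A user who refuses the tracking cookie but accepts the choice cookie ends up in the second case on every later request, so the server stops asking.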
This breaks down because the interface is confusing, and undefined. There's no control over the order in which a browser will present the two cookie dialogues. If the "choice" cookie is presented first, the browser will still show the dialogue for the "tracking" cookie that came in the same response. This is confusing to the user because they probably just accepted the "choice" cookie and still a single tracking cookie appears.
I have two feeble arguments against this technique. One is that it's hard to know exactly when a URL is offsite. However, it would take a fair bit of explanation to convince you of that (essentially hostname aliases, CNAMEs, and domain-search rules confuse the issue). The other is a problem that arises due to caching. It is in our best interest to make our pages as cacheable as possible, and so we issue an HTTP/1.1 header to control the caching of the Set-Cookie header. (Um, for the non-techies, this just means we tell a cache not to cache the cookie part of the response.) If a user going through a cache goes to a cached page before any other page on the site then they will never be issued a tracking cookie. (All subsequent pages they visit will have an onsite Referer.)
The solution which deals with these problems in the best way is to add the following easy to implement features to the "accept cookie" dialogue:
HotWired is one of thousands of sites using tracking cookies. Apache, the most commonly used server, has included a module that implements tracking cookies since at least revision 1.0. Even if you convinced HotWired to tweak our tracking cookie system, are you going to convince the rest of the sites? Whereas (based on HotWired user-agent statistics) there's an 80% chance that you are using either Netscape Navigator or Microsoft Internet Explorer, so there are only two companies to convince to add functionality to their browsers.
I completely understand that Netscape and Microsoft are large companies, and are not likely to respond at all to any single request from a single user. However, if all the people who have complained to HotWired would write to Netscape and Microsoft, those companies would be more likely to listen. I'm sure the original "accept cookie" dialogue was prompted by users annoyed at the privacy issues of the cookie spec.
The HTTP State Management Mechanism draft proposal also requires browsers and servers to provide more information to the user for controlling cookies. It also deals with many of the other (gross) problems that the current cookie spec has.
I have a more recent document which expands on these methods.
So you're busy avoiding tracking cookies, thinking you're protecting your privacy. Allow me to outline a basic technique that you cannot control that will allow your session to be tracked. Your IP address is the key to this. Your IP address identifies you uniquely during your session (modulo firewall considerations -- but for the vast majority of dialup and university users, this statement is true).
The heuristics are clear. Consider any hits from an IP address within 10 minutes of each other to be part of the same session. Use the Referer header to track the progress of that user through your site.
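As a rough sketch, the heuristic fits in a few lines of Python. The input format (a list of tuples holding a timestamp in seconds, an IP address, a URL, and a Referer) is an assumption for illustration, as is the function name; a real server would pull these fields from its access log.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 10 * 60  # the 10 minute window from the text


def group_sessions(hits):
    """Group hits into sessions by IP address.

    hits: iterable of (timestamp_seconds, ip, url, referer) tuples.
    Two consecutive hits from the same IP belong to the same session
    when they are at most SESSION_GAP_SECONDS apart.
    Returns a list of sessions, each a list of hits in time order.
    """
    by_ip = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h[0]):
        by_ip[hit[1]].append(hit)

    sessions = []
    for ip_hits in by_ip.values():
        current = [ip_hits[0]]
        for hit in ip_hits[1:]:
            if hit[0] - current[-1][0] <= SESSION_GAP_SECONDS:
                current.append(hit)
            else:
                sessions.append(current)
                current = [hit]
        sessions.append(current)
    return sessions


if __name__ == "__main__":
    log = [
        (0, "192.0.2.1", "/index.html", ""),
        (300, "192.0.2.1", "/news.html", "/index.html"),
        (2000, "192.0.2.1", "/index.html", ""),
        (100, "192.0.2.2", "/index.html", ""),
    ]
    for session in group_sessions(log):
        # The url/referer pairs give the path the visitor took.
        print(session[0][1], [(url, referer) for _, _, url, referer in session])
```

No cookie is involved anywhere; everything used here arrives with an ordinary request.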
There are even more tricks that can glean tracking information. Suppose you go to a page with frames. Unless you use "view document info" or "view document source" you have no idea of the URLs used to load the frame components. It would only be moderately difficult to insert tracking information into those URLs, something similar to what Pathfinder does, but without the URL being obvious. If you've got JavaScript enabled then the links on the pages can show any message they want in the message window at the bottom -- so even if the link goes to some butt-ugly URL with a tracking id embedded in it, the message you see looks all nice and pretty.
Then there are sites like Network Fusion which access all their information from databases and use entirely cryptic URLs. The URLs certainly contain tracking information -- you can't read the site without registering, and have to "log in" in order to read. Unfortunately they haven't done it very well because there's absolutely no way to direct someone to documents on the site with a URL. They use "docids" to refer people to parts of their site in email. But this problem can be solved (similar to how you can give URLs to parts of Pathfinder's site by removing the cookie).
Here's a method that uses an extension to HTTP/1.0 (a part of HTTP/1.1) called keepalive. In HTTP/1.0 each HTTP request requires a new TCP/IP connection to be initiated and then torn down after the response. This unfortunately wastes a significant amount of bandwidth on the bookkeeping for each TCP/IP connection. Keepalive addresses this by defining how to issue multiple requests and receive responses using a single TCP/IP connection. With appropriate care in the server implementation it can be ensured that your client will open essentially only 4 connections (4, or whatever your "simultaneous network connections" setting is) to the server for your entire session. The server knows when you've left because your client closes (one or more of) the connections. So not only can you be tracked through the entire session, but the server knows how long you've been visiting. You won't even know it's happening. (I don't think any site presently does this, but I'd be interested in hearing of any that do.)
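To illustrate what a server could learn this way, here is a small Python sketch that summarizes a list of per-connection events. The event format (timestamp, connection id, event kind, URL) is entirely hypothetical and does not correspond to any real server's log; the point is only that the connection itself can act as the session identifier.

```python
def summarize_connections(events):
    """Summarize persistent connections from a list of events.

    events: iterable of (timestamp_seconds, conn_id, kind, url) tuples,
    where kind is "open", "request", or "close" (a made-up format).
    Returns {conn_id: {"opened": ts, "closed": ts, "urls": [...]}}.
    """
    summary = {}
    for ts, conn_id, kind, url in events:
        entry = summary.setdefault(
            conn_id, {"opened": None, "closed": None, "urls": []}
        )
        if kind == "open":
            entry["opened"] = ts
        elif kind == "request":
            entry["urls"].append(url)
        elif kind == "close":
            entry["closed"] = ts
    return summary


def visit_duration(entry):
    """Seconds between open and close, or None if the connection is still open."""
    if entry["opened"] is None or entry["closed"] is None:
        return None
    return entry["closed"] - entry["opened"]


if __name__ == "__main__":
    events = [
        (0, "conn-1", "open", None),
        (1, "conn-1", "request", "/index.html"),
        (40, "conn-1", "request", "/news.html"),
        (95, "conn-1", "close", None),
    ]
    for conn_id, entry in summarize_connections(events).items():
        print(conn_id, entry["urls"], visit_duration(entry))
```

Every request on a connection is tied to the others by the connection alone, and the close event supplies the length of the visit.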
In short I'm saying that cookies are only one form of tracking.
With the additions in the draft proposal mentioned above plus the dialogue additions I'm asking for, cookies are quite manageable. You know the cookies are there, and you can see the site trying to track you. Isn't that much better than trying to dumb down your browser enough that there's no known way to track you?
It's an arms race.