This section is about what happens when somebody connects to your web site, and
what statistics you can and can't calculate. There is a lot of confusion about
this. It's not helped by statistics programs which claim to calculate things
which cannot really be calculated, only estimated. The simple fact is that certain
data which we would like to know and which we expect to know are simply not available.
And the estimates used by other programs are not just a bit off, but can be very,
very wrong. For example (you'll see why below), if your home page has 10
graphics on it, and an AOL user visits it, most programs will count that as 11
different visitors!
This section is fairly long, but it's worth reading carefully. If you understand
the basics of how the web works, you will understand what your web statistics
are really telling you.
1. The basic model. Let's suppose I visit your web site. I follow a link
from somewhere else to your front page, read some pages, and then follow one
of your links out of your site.
So, what do you know about it? First, I make one request for your front page.
You know the date and time of the request and which page I asked for (of course),
and the internet address of my computer (my host). I also usually tell
you which page referred me to your site, and the make and model of my browser.
I do not tell you my username or my email address.
Next, I look at the page (or rather my browser does) to see if it's got any
graphics on it. If so, and if I've got image loading turned on in my browser,
I make a separate connection to retrieve each of these graphics. I never log
into your site: I just make a sequence of requests, one for each new file I
want to download. The referring page for each of these graphics is your front
page. Maybe there are 10 graphics on your front page. Then so far I've made
11 requests to your server.
After that, I go and visit some of your other pages, making a new request
for each page and graphic that I want. Finally, I follow a link out of your
site. You never know about that at all. I just connect to the next site without
telling you.
2. Caches. It's not always quite as simple as that. One major problem
is caching. There are two major types of caching. First, my browser automatically
caches files when I download them. This means that if I visit them again, the
next day say, I don't need to download the whole page again. Depending on the
settings on my browser, I might check with you that the page hasn't changed:
in that case, you do know about it, and analog will count it as a new request
for the page. But I might set my browser not to check with you: then I will read
the page again without you ever knowing about it.
The other sort of cache is on a larger scale. Almost all ISPs now have their
own cache. This means that if I try to look at one of your pages and anyone
else from the same ISP has looked at that page recently, the cache will
have saved it, and will give it out to me without ever telling you about it.
(This applies whatever my browser settings.) So hundreds of people could read
one of your pages, even though you'd only sent it out once.
3. What you can know. The only things you can know for certain are the
number of requests made to your server, when they were made, which files were
asked for, and which host asked you for them.
You can also know what people told you their browsers were, and what the referring
pages were. You should be aware, though, that many browsers lie deliberately
about what sort of browser they are, or even let users configure the browser
name. Also, a few browsers send incorrect referrers, telling you the last page
that the user was on even if they weren't referred by that page. And some people
use "anonymizers" which deliberately send false browsers and referrers.
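To make this concrete, here is a minimal sketch of pulling exactly those fields out of one logfile line. It assumes the server writes the NCSA "combined" log format, which is an assumption on my part: your server may be configured differently. The sample line, the hostnames and the name parse_line are invented for illustration.

```python
import re
from datetime import datetime

# Assumed layout: NCSA "combined" log format. Adjust if your server differs.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The timestamp looks like 10/Oct/2000:13:55:36 -0700
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# A made-up line: one request for the front page.
sample = ('host1.example.com - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.org/links.html" "Mozilla/4.08 [en] (Win98)"')

record = parse_line(sample)
print(record["host"], record["time"], record["file"])
# The next two are only what the browser chose to tell us.
print(record["referrer"], record["browser"])
```

Note that nothing in the record identifies a person: there is a host, a time, a file, and two strings that the browser supplied and that may not be true.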
4. What you can't know.
- You can't tell the identity of your readers. Unless you explicitly
require users to provide a password, you don't know who connected or what
their email addresses are.
- You can't tell how many visitors you've had. You can guess by looking
at the number of distinct hosts that have requested things from you. Indeed
this is what many programs mean when they report "visitors". But this is
not always a good estimate for three reasons. First, if users get your pages
from a local cache server, you will never know about it. Secondly, sometimes
many users appear to connect from the same host: either users from the same
company or ISP, or users using the same cache server. Finally, sometimes
one user appears to connect from many different hosts. AOL now allocates
users a different hostname for every request. So if your home page has 10
graphics on it, and an AOL user visits it, most programs will count that as
11 different visitors!
- You can't tell how many visits you've had. Many programs, under
pressure from advertisers' organisations, define a "visit" (or "session")
as a sequence of requests from the same host until there is a half-hour gap.
This is an unsound method for several reasons. First, it assumes that each
host corresponds to a separate person and vice versa. This is simply not
true in the real world, as discussed in the previous point. Secondly, it
assumes that there is never a half-hour gap in a genuine visit. This is also
untrue. I quite often follow a link out of a site, then step back in my browser
and continue with the first site from where I left off. Should it really
matter whether I do this 29 or 31 minutes later? Finally, to make the computation
tractable, such programs also need to assume that your logfile is in chronological
order: it isn't always, and analog will produce the same results however
you jumble the lines up.
- Cookies don't solve these problems. Some sites try to count their
visitors by using cookies. But this can only work if you refuse to let people
read your pages who can't or won't take a cookie. And you still have to assume
that your visitors will use the same cookie for their next request.
- You can't follow a person's path through your site. Even if you
assume that each person corresponds one-to-one to a host, you don't know
their path through your site. It's very common for people to go back to pages
they've downloaded before. You never know about these subsequent visits to
those pages, because their browser has cached them. So you can't track their
path through your site accurately.
- You often can't tell where they entered your site, or where they found
out about you from. If they are using a cache server, they will often
be able to retrieve your home page from their cache, but not all of the
subsequent pages they want to read. Then the first page you know about
them requesting will be one in the middle of their true visit.
- You can't tell how they left your site, or where they went next.
They never tell you about their connection to another site, so there's no
way for you to know about it.
- You can't tell how long people spent reading each page. Once again,
you can't tell which pages they are reading between successive requests for
pages. They might be reading some pages they downloaded earlier. They might
have followed a link out of your site, and then come back later. They might
have interrupted their reading for a quick game of Minesweeper. You just
don't know.
- You can't tell how long people spent on your site. Apart from the
problems in the previous point, there is one other complete show-stopper.
Programs which report the time on the site count the time between the first
and the last request. But they don't count the time spent on the final page,
and this is often the majority of the whole visit.
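The "visitor" and "visit" estimates criticised above can be sketched in a few lines, which makes it easy to see how they go wrong. This is only an illustration of the general method (distinct hosts, and a half-hour gap), not the code of any particular program. The function names, hostnames and times are all invented, and unlike the programs described above it sorts the requests first instead of assuming the logfile is in order.

```python
from datetime import datetime, timedelta

def count_hosts(requests):
    """Number of distinct hosts: what many programs report as "visitors"."""
    return len({host for host, _ in requests})

def count_visits(requests, gap=timedelta(minutes=30)):
    """Count "visits": a host starts a new one after being silent for more than gap."""
    last_seen = {}
    visits = 0
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > gap:
            visits += 1
        last_seen[host] = when
    return visits

start = datetime(2000, 1, 1, 12, 0, 0)

# One person whose provider gives every request a different hostname:
# 1 page + 10 graphics = 11 requests. The hostnames are made up.
rotating = [(f"proxy-{i}.example.net", start + timedelta(seconds=i))
            for i in range(11)]

# One person on a fixed host who carries on reading after a pause.
def returning(pause_minutes):
    return [("host.example.com", start),
            ("host.example.com", start + timedelta(minutes=pause_minutes))]

print(count_hosts(rotating))         # 11 "visitors" for one person
print(count_visits(rotating))        # 11 "visits" for one page view
print(count_visits(returning(29)))   # 1
print(count_visits(returning(31)))   # 2
print(count_visits(returning(31), gap=timedelta(minutes=60)))  # 1
```

The last three lines are the same person doing the same thing: the answer depends only on where the pause happens to fall relative to an arbitrary cut-off.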
5. Real data. Of course, the important question is how much difference
these theoretical difficulties make. In a recent paper, Peter Pirolli and
James Pitkow of Xerox Palo Alto Research Center examined this question using
a ten-day logfile from the xerox.com web site. One of their most striking
conclusions is that different commonly-used methods can give very different
results. For example, when trying to measure the median length of a visit,
they got results from 137 seconds to 629 seconds, depending on exactly what
you count as a new visitor or a new visit. As they were looking at a fixed
logfile, they didn't consider the effect of server configuration changes such
as refusing caching, which would change the results still more.
6. Conclusion. The bottom line is that HTTP is a stateless protocol. That
means that people don't log in and retrieve several documents: they make a separate
connection for each file they want. And a lot of the time they don't even
behave as if they were logged into one site. The world is a lot messier
than this naïve view implies. That's why analog reports requests, i.e. what is
going on at your server, which you know, rather than guessing what the users
are doing.
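By contrast, counting requests needs no guesswork at all. The following is a minimal sketch of a per-file request count; it is an illustration of the idea, not analog's own implementation, and the (host, file) pairs and the name requests_per_file are invented.

```python
from collections import Counter

def requests_per_file(requests):
    """Tally how many times each file was requested."""
    return Counter(file for _, file in requests)

# Hypothetical (host, file) pairs taken from a logfile.
log = [
    ("a.example.com", "/index.html"),
    ("a.example.com", "/logo.gif"),
    ("b.example.net", "/index.html"),
    ("c.example.org", "/index.html"),
]

report = requests_per_file(log)
for file, count in report.most_common():
    print(f"{count:3d}  {file}")
```

The result is the same however the lines are ordered and whichever hosts the requests came from, because it makes no assumption about who is behind each request.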
Defenders of counting visits etc. claim that these are just small approximations.
I disagree. For example, almost everyone is now accessing the web through a
cache. If the proportion of requests retrieved from the cache is 50% (a not
unrealistic figure) then half of the users' requests aren't being seen by the
servers.
Other defenders of these methods claim that they're still useful because they
measure something which you can use to compare sites. But this assumes
that the approximations involved are comparable for different sites, and there's
no reason to suppose that this is true. Pirolli & Pitkow's results show
that the figures you get depend very much on how you count them, as well as
on your server configuration. And even once you've agreed on methodology, different
users on different sites have different patterns of behaviour, which affect
the approximations in different ways: for example, Pirolli & Pitkow found
different characteristics of weekday and weekend users at their site.
I've presented a somewhat negative view here, emphasising what you can't find
out. Web statistics are still informative: it's just important not to slip
from "this page has received 30,000 requests" to "30,000 people have read this
page." In some sense these problems are not really new to the web -- they are
present just as much in print media too. For example, you only know how many
magazines you've sold, not how many people have read them. In print media we
have learnt to live with these issues, using the data which are available,
and it would be better if we did on the web too, rather than making up spurious
numbers.
7. Acknowledgements and further reading. Many other people have made these
points too. While originally writing this section, I benefited from three earlier
expositions: Interpreting WWW Statistics by Doug Linder; Getting Real about
Usage Statistics by Tim Stehle; and Making Sense of Web Usage Statistics by
Dana Noonan (which doesn't seem to be available on the web any more).
Another, extremely well-written document on these ideas is Measuring Web Site
Usage: Log File Analysis by Susan Haigh and Janette Megarity. Being on a
Canadian government site, it's available in both English and French. Or for an
even more negative point of view, you could read Why Web Usage Statistics are
(Worse Than) Meaningless by Jeff Goldberg.
Visit the analog home page for more information
about Analog.