April 25, 2007

The Truth about Web Site Statistics

Inherent Problems with Web Stats, and What to Do About Them

Web analytics are growing more sophisticated. We're developing methods to measure media, understand customers, predict trends and assess return on investment (ROI). What no one is telling you is that all these systems and numbers are built on flawed statistics.

Web analysis is based on counting a very limited number of things. People visit web sites and read pages, so we can count people, visits and page views. That's all. In counting people, visits and page views, it's important to understand how accurate we can be. The bad news is that we can't measure any of these with perfect accuracy. The inaccuracies are unavoidable, caused by the nature of Internet technology itself.

We Can't Precisely Count Visitors

It's not possible to count people on the Web. People don't visit web sites; their computers do. So web statistics count the number of visits from a computer, not from a person. How does Web analytics software determine visits and visitors?

Every computer uses an operating system and a browser. The combination is the "User Agent". Every computer also has an Internet protocol (IP) address, expressed in a format that looks like this: 63.236.64.164. In Web analytics software, the standard method for identifying a unique visitor is to combine the User Agent and the IP address. In theory, the combination produces a unique identity. In practice, this identification method is far from accurate. For example, every single person inside Ford has the same IP address. They all reach the web through the same gateway in Detroit (even the 88,000 employees in Europe).
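
To make the method concrete, here is a minimal sketch in Python of the User-Agent-plus-IP heuristic (the function name and sample values are invented for illustration). Note how two different people behind the same corporate gateway produce the same ID:

    import hashlib

    def visitor_id(ip, user_agent):
        """Derive the standard pseudo-unique visitor ID from IP + User Agent."""
        return hashlib.md5(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()

    # Two employees behind the same gateway, on the same standard
    # workstation build, collapse into a single "unique visitor":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    print(visitor_id("63.236.64.164", ua))  # employee A
    print(visitor_id("63.236.64.164", ua))  # employee B: identical ID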

Corporations hide internal IP addresses for valid security reasons. Most people at Ford also have the same browser and operating system (what Ford calls the Global Client). Thus, according to standard identification procedures, more than 320,000 people count as the same unique visitor. The same holds true for any corporation with shared Internet access and a common workstation standard. Conversely, many Internet service providers assign a different IP address every time a home user or small business connects to the web, so the same visitor looks like a different person on every visit, inflating counts of unique visitors.

"Cookies", an identifying file placed on your computer by Web sites you visit, can help improve the accuracy of visitor identification. However, multiple family members often use the home computer and some people block or remove cookies from their computers. Studies indicate that between three and five percent of all visitors block session cookies and many more delete stored cookie files. The more "techie" the visitor, the more likely they'll avoid being counted. What all of this means is that you're probably only getting about 90 percent accuracy with identification of unique visitors. Not bad, but not perfect, and certainly more valuable than no Web analytics at all.

So, in making business decisions based on Web analytics, you must always allow for a potential error of around 10 percent in your statistics.

We Can't Precisely Count Duration

Most Web analytics are also inaccurate about visit duration. When someone visits your site, they click a link to retrieve a page, then later click another link to retrieve the next one. Web analytics software measures the interval between the two clicks as the time spent reading the first page. Add up all these intervals and you've got the total duration of the visit.

This creates a problem for one-page visits. Since there is no second click, no page duration can be calculated. Officially, a one-page visit is not a visit at all; it takes two pages to count as a visit. Some Web analytics packages exclude these zero-duration one-page visits when calculating average visit duration, but you'd be surprised how many include them, producing flawed duration statistics.
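
The arithmetic is easy to see in a short Python sketch (the timestamps are invented sample data):

    def page_durations(click_times):
        """Seconds spent on each page except the last, which is unmeasurable."""
        return [t2 - t1 for t1, t2 in zip(click_times, click_times[1:])]

    visit = [0.0, 45.0, 160.0]        # three page requests in one visit
    print(page_durations(visit))      # [45.0, 115.0]; the last page is unknown
    print(page_durations([0.0]))      # []; a one-page visit has no duration at all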

In most cases, no duration can be calculated for the last page a visitor reads, because there is no second click. As a result, Web analytics software under-reports the time people spend on your site: it can't tell how long someone spends on the final page. Conversely, if someone starts reading a page, minimizes it for 10 minutes to work on something else, then maximizes it again to finish reading, the clock keeps counting the whole time the page is open, overstating actual eyes-on-page duration.

We Can't Precisely Count Visits

A Web visit is usually defined as a series of page requests with a gap of no more than 30 minutes between each one. If someone requests a page 31 minutes after the preceding one, it is usually counted as a new visit. But page views often exceed 30 minutes, especially on pages covering complex products like mortgages, insurance and other financial products.

On the other hand, what if someone views your site, goes off to compare it with a competitor, then returns after 20 minutes? That still counts as part of the same visit. Technically it constitutes a single visit of two sessions, but almost no one differentiates sessions and visits.

These examples illustrate the inherent inaccuracy of visit counts based on the arbitrary choice of 30 minutes as the magic number. For most purposes this is fine, so long as you accept it as a reasonably accurate, workable but flawed number, not a precise measurement of visits.
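
Here is a minimal sketch of that sessionization rule in Python (the timestamps are invented; the 30-minute cutoff is the convention described above):

    VISIT_TIMEOUT = 30 * 60  # the conventional 30-minute gap, in seconds

    def split_into_visits(request_times):
        """Group request timestamps into visits separated by gaps over the timeout."""
        visits = []
        for t in sorted(request_times):
            if visits and t - visits[-1][-1] <= VISIT_TIMEOUT:
                visits[-1].append(t)   # gap of 30 minutes or less: same visit
            else:
                visits.append([t])     # longer gap: a new visit begins
        return visits

    # A 31-minute pause becomes two "visits"; a 20-minute trip to a
    # competitor's site still counts as one.
    print(len(split_into_visits([0, 31 * 60])))   # 2
    print(len(split_into_visits([0, 20 * 60])))   # 1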

Log Analysis Issues

Many Web site owners use log analysis to get their stats. Log analysis is much less accurate than page-based tracking. Here's why.

Spiders and Robots

Search engine spiders automatically read your site, and so do performance-monitoring software packages. Automated search engine spiders typically read every page of a Web site in rapid succession, dramatically inflating the number of page views. Since spiders step through pages at a rate of about one per second, their rapid-fire "reading" also drags down average visit duration and average page read time. Most log analysis software doesn't distinguish between page requests by humans and page requests by automated robots. If you don't account for spider activity within your Web site, you are not getting an accurate picture of usage by human visitors: you probably have fewer human visitors than you think, and the average visit duration is longer than your Web analytics software reports.
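
A crude filter, sketched in Python, shows the idea. Real spider detection is harder, since some robots misrepresent themselves, and the markers and log lines here are only illustrative examples:

    BOT_MARKERS = ("googlebot", "slurp", "msnbot", "spider", "crawler")

    def is_robot(user_agent):
        """Flag a request whose User Agent contains a known robot marker."""
        ua = user_agent.lower()
        return any(marker in ua for marker in BOT_MARKERS)

    log_lines = [
        ("63.236.64.164", "Mozilla/4.0 (compatible; MSIE 6.0)"),
        ("66.249.66.1", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
    ]
    human_hits = [(ip, ua) for ip, ua in log_lines if not is_robot(ua)]
    print(len(human_hits))  # 1: the spider's page view is excluded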

Most Web site owners believe the average visit duration to a web site is three to four minutes and the average page duration is about 30 seconds. In reality, both are about twice that long.

SWF Files

SWF files are Flash files. Without getting into the details, Flash files are a problem for log analysis. If you've got Flash files both as full pages and as page elements, it's unlikely you're getting accurate stats from log analysis.

Caching and Cache Busting

Most browsers store a copy of each page you read. If you hit the back button, the browser serves that stored page instead of bothering to ask the server for another copy. Log analysis misses this, because the server never saw the second viewing. Saving pages like this is called "caching."

It isn't just browsers that cache. Corporate gateways cache commonly requested pages to save time and bandwidth, and Internet service providers (ISPs) may cache for the same reasons. It is estimated that uncounted cached pages reduce the reported number of page views and advertising impressions by about 30 percent. So, if you're using log analysis for your stats, you're missing about one-third of your activity. Page-based tracking tools work around this through "cache busting": making each tracking request unique so that no cache can serve a stale copy.
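
A minimal sketch of cache busting follows; the URL and parameter name are hypothetical, and real trackers vary in how they generate the unique value:

    import random
    import time

    def bust_cache(url):
        """Append a throwaway query parameter that makes every request unique."""
        token = f"{int(time.time() * 1000)}{random.randint(0, 9999):04d}"
        separator = "&" if "?" in url else "?"
        return f"{url}{separator}cb={token}"

    # Each call produces a URL no cache has seen before, so the request
    # always reaches the server and gets counted.
    print(bust_cache("http://www.example.com/tracker.gif"))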

Wake Turbulence

Many people exit a site by repeatedly clicking the back button. Log analysis doesn't pick this up (because it doesn't record cached pages), but page-based tracking does. As a result, many visits end with a series of one- or two-second page views in the reverse order of the first half of the visit. This activity inflates the average number of page views per visitor and drags down the average page duration. There's no official term for this, but I call it "wake turbulence." Most analysis tools don't even recognize the problem, let alone deal with it, and at present there seems to be no practical way to compensate for it.
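
If you wanted at least to measure the effect, one crude heuristic (sketched below with invented thresholds and sample data, not a method from any real product) is to count trailing page views that are both very short and retrace pages already seen in the visit:

    def wake_turbulence(pages, durations, max_secs=2.0):
        """Count trailing short views of pages already seen earlier in the visit."""
        count = 0
        for i in range(len(pages) - 1, 0, -1):
            if durations[i] <= max_secs and pages[i] in pages[:i]:
                count += 1          # a fast, repeated view at the end of the visit
            else:
                break
        return count

    # A visit A -> B -> C, then back-button exits through B and A:
    print(wake_turbulence(["A", "B", "C", "B", "A"], [40, 65, 30, 1, 1]))  # 2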

User Resistance

Some of your visitors don't trust you. Some major-name Web analytics and tracking systems are listed as spyware and blocked. Some people block cookies; some clean out their cookies regularly. If you track repeat-visitor behavior with cookies, you have to accept some degree of inaccuracy as people block or remove them.

Transversal Losses

Transversal is what you do when you click a hyperlink: you transverse from one page to the next. Sometimes people click a link but never arrive at the other end. Browsers crash, people change their minds, and so on. This is becoming a source of contention in pay-per-click (PPC) advertising: the user clicks on the ad but doesn't get through. Because of this phenomenon, Google often charges for more visits than Web logs show, sometimes by as much as 25 percent or so. Google believes this is a minor and rare problem, but many PPC advertisers are not so sure. The problem is not unique to Google; it occurs to a greater or lesser degree with all forms of inter-site link activity. This means that return-on-investment calculations for PPC advertising and affiliate marketing cannot be perfectly accurate, and need to permit a margin of error.
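
Carrying that margin of error into the arithmetic is straightforward. Here is a minimal sketch with invented figures, treating measured revenue as accurate only to plus or minus 10 percent:

    def roi_range(revenue, cost, error=0.10):
        """Return (low, high) ROI given a symmetric error band on measured revenue."""
        low = (revenue * (1 - error) - cost) / cost
        high = (revenue * (1 + error) - cost) / cost
        return low, high

    low, high = roi_range(revenue=12000.0, cost=10000.0, error=0.10)
    print(f"ROI is somewhere between {low:.0%} and {high:.0%}")
    # ROI is somewhere between 8% and 32%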

Conclusions

At the present time, absolute precision is impossible in Web analytics. You have to accept a degree of fuzziness in your stats for visit duration, number of pages read and average page read time. The inaccuracies are an inevitable consequence of the nature of Internet technology, not a sign that analytics software is shoddy. This level of inaccuracy is acceptable for the time being, as long as users of analytics software don't make business decisions based on small statistical differences. It is important to understand and accept that visitor stats are accurate only to plus or minus five or even 10 percent. In general, people are probably spending a little longer on your site, or maybe a little less, depending on the content of your pages. To protect against these inaccuracies, it's important to add a margin of error to financial and ROI calculations. In fact, exact numbers shouldn't matter too much. Trends do.

Effective Web analysis, therefore, should focus not on the raw numbers but on the trends over time. Individual numbers may be inaccurate, but trends tell the story.

We have to accept that Web analytics software is in its infancy. Compared with five years ago, we can do great things with it today, but we have only just begun. Life is full of uncertainties, and Web analytics is no different. Somehow we all manage to get by.

Article source: http://www.cyberalert.com/webanalytics.html, Brandt Dainow, Think Metrics