Using a Web Crawler to Determine Usage Share of Web Browsers

Author: Josh Kupershmidt, schmiddy *at* gmail *dot* com
Date: October 22, 2007

Abstract

Traditionally, one has been forced to rely on dubious statistics provided by tracking companies in order to make a rough guess at the current usage share of different web browsers, such as what percent of web users are browsing with Internet Explorer or Mozilla Firefox. Besides reeking of statistical bias, such tracking companies do not fully explain their methodology, and their numbers are difficult or impossible to verify independently. The few large sites, such as w3schools.com, that publish their own browser share statistics directly are of less dubious quality, but they provide insight only into the types of users who patronize such sites.

In an effort to combat this lack of reliable statistics, I have constructed a web crawling robot specifically designed to crawl and parse 'Webalizer' website statistics pages covering the six-month period from April to September 2007, and to compile some aggregate statistics. I am happy to provide all the raw data used in this experiment, as well as the source code behind my bot.

Introduction

The usage share of web browsers has traditionally been both a black art and a fiercely debated topic on the web, with serious ramifications riding on the answers. Web developers depend upon a general understanding of the major web browsers that are likely to parse their pages in order to write HTML that will be rendered acceptably for the greatest proportion of their visitors. More broadly, the question of what proportion of web visitors are not using Internet Explorer is vital to understanding the extent of Microsoft's lock on the desktop.

Unfortunately, the current methods for determining web browser breakouts are highly suspect. The two main sources of data on this topic have been:

1. Tracking companies, which aggregate statistics across their client sites, but do not fully explain their methodology and publish numbers that are difficult or impossible to verify independently.

2. Large individual sites, such as w3schools.com, which publish their own browser share statistics, but provide insight only into the types of users who patronize those particular sites.

Method

In an attempt to combat this lack of reliable data, I have constructed a web crawler tailored to parse Webalizer pages, which are usually generated by system administrators to analyze traffic to their websites. These pages are often made public, either knowingly or unknowingly, and thus provide a usable base of data from which to proceed.

The crawling proceeded in several phases. First, the bot performs a Google search for the query:

intitle:"usage statistics for" "generated by webalizer"

Currently, Google returns around 688,000 results for this query. The bot goes through as many of these Google results pages as possible, gathering up the URLs returned. For each URL, the bot downloads the page, guesses at the top-level Webalizer page, which should contain links to each month's statistics for the past year, and puts these monthly URLs into a database table for future processing.
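
For concreteness, here is a minimal sketch of this first phase in Python. It assumes Google's classic results URL (with 'q' and 'start' parameters) and Webalizer's usual usage_YYYYMM.html naming convention; the function names and page-size value are illustrative, not my bot's actual code.

    import re
    import urllib.parse
    import urllib.request

    QUERY = 'intitle:"usage statistics for" "generated by webalizer"'

    def google_result_urls(max_results=1000, page_size=100):
        """Yield result URLs by paging through Google via the 'start' parameter."""
        for start in range(0, max_results, page_size):
            url = ('http://www.google.com/search?' +
                   urllib.parse.urlencode({'q': QUERY, 'num': page_size,
                                           'start': start}))
            req = urllib.request.Request(url,
                                         headers={'User-Agent': 'webalizer-survey-bot'})
            html = urllib.request.urlopen(req).read().decode('utf-8', 'replace')
            # Crude link extraction; a real bot would use a proper HTML parser.
            for match in re.finditer(r'href="(http[^"]+)"', html):
                yield match.group(1)

    def monthly_urls(result_url, year=2007, months=range(4, 10)):
        """Guess the Webalizer directory behind a hit and enumerate its monthly pages."""
        # A hit on .../webalizer/usage_200709.html implies that
        # .../webalizer/usage_200704.html and friends live alongside it.
        base = re.sub(r'(usage_\d{6}|index)\.html?$', '', result_url)
        return ['%susage_%04d%02d.html' % (base, year, m) for m in months]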

During this crawling, I discovered that Google limits users to 1000 search results per query. When one attempts to access a Google result URL with a 'start' parameter of 1000 or greater, the user is presented with a message like:

Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 1000.)

This was a setback for this small project, but hopefully future work can bypass this restriction (e.g. using alternate search engines or alternate Google queries). After collecting as many links as possible from the Google results pages, each of these links was used to seed a table with the individual monthly Webalizer statistics pages to be parsed and analyzed.
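
One hedged sketch of the 'alternate Google queries' idea: partition the base query, for instance by top-level domain, so that each variant gets its own 1000-result allotment. The TLD list here is purely illustrative.

    BASE_QUERY = 'intitle:"usage statistics for" "generated by webalizer"'
    TLDS = ['com', 'net', 'org', 'edu', 'de', 'uk', 'jp']

    def partitioned_queries():
        # Each site:-restricted variant counts as a separate query to Google,
        # so each can return up to 1000 results of its own.
        return ['%s site:%s' % (BASE_QUERY, tld) for tld in TLDS]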

Some simple regular expressions were used to parse the 'Total Hits' row of the aggregate 'Monthly Statistics for (Monthname) 2007' table, and to parse the User Agents and Hits from the 'Top (Number) of (Number) Total User Agents' table present on each Webalizer monthly statistics page. The user agents were broken down into: IE6, IE7, Old IE Versions, Mozilla/Firefox, Opera, Bots, and Others.

Here the 'Bots' column contained all the easily-identifiable bots I knew about (sixteen in total), such as googlebot, Yahoo, msnbot, et al. The 'Others' column consisted of all other user agents that did not neatly fall into any other category. Typically, these were composed of some more obscure bots that I didn't have regular expressions for (e.g. PhpDig), as well as some uncommon browsers such as Konqueror and Safari.
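
A minimal sketch of this parsing step, assuming typical Webalizer HTML in which the user agent table rows read rank, hits, percent, agent; the regexes and the handful of bot patterns below are illustrative stand-ins for the sixteen patterns described above, not my bot's exact expressions.

    import re

    # Order matters: Opera often carries an 'MSIE' token in its user agent
    # string, so test it before the Internet Explorer patterns. 'Gecko/'
    # (with the slash) avoids matching Safari's 'like Gecko' token.
    CATEGORIES = [
        ('Bots',            re.compile(r'googlebot|yahoo|msnbot|slurp', re.I)),
        ('Opera',           re.compile(r'Opera')),
        ('IE7',             re.compile(r'MSIE 7\.')),
        ('IE6',             re.compile(r'MSIE 6\.')),
        ('Old IE Versions', re.compile(r'MSIE [2-5]\.')),
        ('Mozilla/Firefox', re.compile(r'Firefox|Gecko/')),
    ]

    def classify(agent):
        for name, pattern in CATEGORIES:
            if pattern.search(agent):
                return name
        return 'Others'

    def parse_monthly_page(html):
        """Return (total_hits, hits-per-category) for one Webalizer monthly page."""
        text = re.sub(r'<[^>]+>', ' ', html)  # crude de-HTML
        total = int(re.search(r'Total Hits\s+([\d,]+)', text)
                    .group(1).replace(',', ''))
        counts = {}
        # Assumes the 'Top N of M Total User Agents' table is present; each
        # of its rows reads: rank, hits, percent, user agent string.
        ua_section = text.split('Total User Agents', 1)[1]
        for hits, agent in re.findall(r'\d+\s+([\d,]+)\s+[\d.]+%\s+(\S[^\n]*)',
                                      ua_section):
            cat = classify(agent)
            counts[cat] = counts.get(cat, 0) + int(hits.replace(',', ''))
        return total, counts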

Results

In total, 245 unique domains were successfully analyzed, for a total of 1,293 monthly results pages from April 2007 to September 2007 (a few domains did not have all months' data available). Again, this is a much smaller number of domains than I would have liked to analyze, but I gave up attempting to find more domains after I hit the 1000-result limit from Google. Hopefully future work by myself or others can add more data and work around this limit.

Aggregate Data

Here is a table summarizing the aggregate browser share data compiled.

Total Hits       1,779,614,588
IE6                578,432,048
IE7                207,978,593
Old IE Versions     29,562,152
Mozilla/Firefox    233,770,232
Opera                8,806,263
Bots               168,774,680
Others              93,204,824

Notice that the sum of all the individual user agents (1.32 billion) does not quite add up to the 'Total Hits' figure (1.78 billion). This is because the Webalizer statistics pages generally only showed the top ten or fifteen user agents. Hence, the remaining hits (1,779,614,588 - 1,320,528,792 = 459,085,796, or roughly 460 million) must be from the less-popular browsers and bots that did not show up in the Webalizer summary.

Monthly Data

Here's a breakdown, month-by-month, of the aggregate results above.

Month      Total Hits   IE6          IE7         Old IE Versions  Mozilla/Firefox  Opera      Bots        Others
April      339,440,961  123,431,793  40,608,046  7,010,872        45,884,778       1,623,053  28,631,154  12,181,722
May        345,265,347  121,075,413  43,887,195  6,808,978        49,295,648       1,991,012  27,835,367  13,425,210
June       283,263,963  80,383,509   32,316,752  4,377,386        38,352,435       1,264,405  31,997,023  12,756,534
July       309,895,505  91,543,951   33,433,960  4,038,553        37,192,190       1,382,956  36,938,817  19,779,026
August     250,896,641  80,419,400   29,339,623  3,351,207        33,339,528       1,235,500  27,149,792  18,363,503
September  250,852,171  81,577,982   28,393,017  3,975,156        29,705,653       1,309,337  16,222,527  16,698,829

Here are the same monthly statistics expressed as percentages, with each browser's percentage computed not against the 'Total Hits' field, but against the sum of the browser hits that were picked up; hence these percentages should add up to 100%.

Month      IE6    IE7    Old IE Versions  Mozilla/Firefox  Opera  Bots   Others
April      47.6%  15.7%  2.7%             17.7%            0.63%  11.0%  4.7%
May        45.8%  16.6%  2.6%             18.7%            0.75%  10.5%  5.1%
June       39.9%  16.0%  2.2%             19.0%            0.63%  15.9%  6.3%
July       40.8%  14.9%  1.8%             16.6%            0.62%  16.5%  8.8%
August     41.6%  15.2%  1.7%             17.3%            0.64%  14.1%  9.5%
September  45.9%  16.0%  2.2%             16.7%            0.74%  9.1%   9.4%
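
As a concrete check on these figures: April's seven user-agent columns sum to 259,371,418 hits, so the April IE6 share works out to 123,431,793 / 259,371,418 ≈ 47.6%, matching the first cell above. Note that the denominator is this sum, not April's 'Total Hits' of 339,440,961.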

For those who prefer pretty pictures to hard numbers, here is a graph of the above percentages.

[Graph: Breakdown of Browser Share Percentages by Month]

Aggregate Statistics

Leaving out the bots, we can look at the aggregate data over the whole six-month period in the following table. The first row gives the total hits for IE6, IE7, Mozilla, Opera, and 'Others', not including the easily-identifiable bots (the 'Others' category is partially bots and partially marginalized browsers such as Konqueror and Safari). The second row gives each browser's percentage of the aggregate browser-only share.

IE6+IE7+Firefox+Opera+Others  IE6          IE7          Firefox      Opera      Others
1,122,191,960                 578,432,048  207,978,593  233,770,232  8,806,263  93,204,824
100%                          51.5%        18.5%        20.8%        0.8%       8.3%

Raw Data

I invite any interested parties to look at the raw data I've used in this experiment, and would enjoy hearing from anyone who is able to build on these meager results. A full SQL dump of the three tables I used ('UnearthedPages' holds the URLs with the corresponding month of the results page, 'Completed' has the breakdowns by user agent for URLs that were successfully crawled and parsed, and 'Incomplete' holds some URLs that the bot was able to download but not parse) is available here. For convenience, you can download these same tables in CSV format: Completed, Incomplete, and UnearthedPages.
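
As a starting point for anyone poking at the dump, here is a sketch that re-derives aggregate totals from Completed.csv. It assumes only that the file has a header row and that per-agent hit counts appear as plain integer columns; the actual column names should be checked against the dump itself.

    import csv

    def aggregate(path='Completed.csv'):
        """Sum every integer-valued column across all parsed monthly pages."""
        sums = {}
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                for col, value in row.items():
                    if value and value.isdigit():
                        sums[col] = sums.get(col, 0) + int(value)
        return sums

    if __name__ == '__main__':
        for col, total in aggregate().items():
            print(col, total)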

Some Conclusions

The 1000 Google result limit proved to be the limiting factor in my quest to crawl a significant chunk of the tantalizing 688,000 possible results. On the bright side, the deeper-buried Google results pages turn out to be the significantly less-popular pages, with far fewer total hits. Still, it would be nice to see future work expand on this study and include a broader basket of domains.

Pitfalls

I noticed at least one domain (namely 'http://meneame.net/webalizer/usage_200709.html') with what looks to be a completely bogus entry for 'Total Hits' -- in this case, the 'Total Hits' field is about 6,000 times greater than the sum of the hits from its top 10 user agents. I had to manually remove this entry from the 'Completed' table upon which the aggregate statistics are based. Note that this bogus 'Total Hits' value wouldn't actually have affected the monthly percentages table, since those calculations are based on the aggregated user-agent hits rather than the 'Total Hits' reported directly by the individual websites.
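
This pitfall suggests an automated sanity check that future runs could apply before aggregating. A sketch, with an arbitrary illustrative threshold:

    def looks_bogus(total_hits, agent_hits, factor=50):
        """Flag a page whose 'Total Hits' dwarfs the sum of its visible top agents."""
        return total_hits > factor * max(sum(agent_hits), 1)

    # The meneame.net page above would trip this easily: its 'Total Hits' was
    # roughly 6,000 times the sum of its top-10 user agent hits.
    assert looks_bogus(6_000_000_000, [500_000, 300_000, 200_000])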

Bias

Although I undertook this project to eliminate the bias and uncertainty surrounding the other measures of web browser usage share, statistical bias in these types of indirect studies is quite difficult to avoid. One topic I've postponed until now is the bias inherent in using Webalizer pages as a proxy for the traffic of an average website. Although there appear to be ways to use Webalizer on Windows/IIS webservers, it remains a tool best suited to Apache running on various *nix servers.

Although Apache is still the most popular webserver powering the web, one could argue that *nix/Apache-powered websites are probably biased towards non-IE, non-Windows users. Indeed, a naive look at the first page of Google results returned by the search query used shows somewhat atypical websites such as "common-lisp.net" among the results. Arguably, such tech-oriented sites are correlated with the use of Webalizer for generating website statistics, and hence the set of domains used in this study constitutes a biased sample.

Having said that, I suspect that increasing the sample size significantly will reduce the effect of these few biased samples to mere statistical noise. As noted earlier, the deeper-buried Google results pages tend to be the smaller, more "normal" pages. Including enough of these normal pages might improve the current data considerably.

