Traditionally, one has been forced to rely on dubious statistics provided by tracking companies in order to make a rough guess as to the current usage share of different web browsers, such as what percentage of web users are browsing with Internet Explorer or Mozilla Firefox. Besides reeking of statistical bias, these numbers come from companies that do not fully explain their methodology, and they are difficult or impossible to verify independently. The few large sites, such as w3schools.com, that provide their own browser share statistics directly to curious users are of less dubious quality, but they provide insight only into the types of users that patronize such sites.
In an effort to combat this lack of reliable statistics, I have constructed a web crawling robot specifically designed to crawl and parse 'Webalizer' website statistics pages covering a 6-month period — April to September of 2007 — and come up with some aggregate statistics. I am happy to provide all the raw data used in this experiment, as well as the source code behind my bot.
The usage share of web browsers has traditionally been both a black art and a fiercely debated topic on the web, with serious ramifications behind the answers. Web developers depend upon a general understanding of the major web browsers that are likely to parse their pages in order to write HTML that will be rendered acceptably by the greatest proportion of their visitors. More broadly, the question of what proportion of web visitors are not using Internet Explorer is vital to understanding the extent of Microsoft's lock on the desktop.
Unfortunately, the current methods for determining web browser breakdowns are highly suspect. The two main sources of data on this topic have been:
In an attempt to combat this lack of reliable data, I have constructed a web crawler tailored to parse Webalizer pages, which are usually generated by system administrators to analyze traffic to their websites. These pages are often made public, either knowingly or unknowingly, and thus provide a usable base of data from which to proceed.
The crawling proceeded in several phases. First, the bot performed a Google search for the query:
intitle:"usage statistics for" "generated by webalizer"
Currently, Google returns around 688,000 results for this query. The bot went through as many of these Google results pages as possible, gathering up the URLs returned. For each URL, the bot downloaded the page, guessed at the top-level Webalizer page, which should contain links to each month's statistics for the past year, and put these monthly URLs into a database table for future processing.
During this crawling, I discovered that Google limits users to 1000 search results per query. When one attempts to access a Google result URL with a 'start' parameter with value greater than 1000 (example Google query), one is presented with a message like:
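The step of turning one discovered URL into a set of monthly pages can be sketched as follows. This is a minimal sketch, not the bot's actual code: instead of reading the links off the top-level page, it assumes the site keeps Webalizer's default `usage_YYYYMM.html` file names in the same directory as the discovered page. The function name `monthly_urls` and its parameters are hypothetical.

```python
from urllib.parse import urljoin


def monthly_urls(page_url, year=2007, months=range(4, 10)):
    """Guess the monthly Webalizer pages that sit next to page_url.

    Assumption: the site uses the default file naming, usage_YYYYMM.html,
    and keeps all monthly pages in the same directory as page_url.
    """
    # Resolving "." against the page URL yields its containing directory.
    base = urljoin(page_url, ".")
    return [urljoin(base, f"usage_{year}{month:02d}.html") for month in months]


if __name__ == "__main__":
    for url in monthly_urls("http://example.org/stats/index.html"):
        print(url)
```

A site that renamed its report files would need the link-following approach described above instead.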
Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 1000.)
This was a setback for this small project, but hopefully future work can bypass this restriction (e.g. using alternate search engines or alternate Google queries). After collecting all the links from the above Google results pages possible, each of these links was used to seed a table with the individual monthly Webalizer statistics pages that would be parsed and analyzed.
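To make the effect of the limit concrete, here is a hedged sketch of how the result-page URLs for the query could be enumerated. It assumes ten results per page and the `q` and `start` URL parameters; the endpoint, the page size, and the function name `result_page_urls` are assumptions for illustration, not a description of the bot itself.

```python
from urllib.parse import urlencode

QUERY = 'intitle:"usage statistics for" "generated by webalizer"'
MAX_RESULTS = 1000  # the ceiling described above


def result_page_urls(query=QUERY, per_page=10):
    """Build one search URL per results page, stopping at the 1000-result cap.

    Assumption: results are paged with a 'start' offset and per_page
    results are returned on each page.
    """
    return [
        "https://www.google.com/search?" + urlencode({"q": query, "start": start})
        for start in range(0, MAX_RESULTS, per_page)
    ]


if __name__ == "__main__":
    urls = result_page_urls()
    print(len(urls), "result pages; last one:", urls[-1])
```

Under these assumptions only 100 result pages are reachable, which is why so few of the 688,000 reported results could be collected.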
Some simple regular expressions were used to parse the 'Total Hits' row of the aggregate 'Monthly Statistics for (Monthname) 2007' table, and to parse the User Agents and Hits from the 'Top (Number) of (Number) Total User Agents' table present on each Webalizer monthly statistics page. The user agents were broken down into:
The 'Bots' column contained all the easily identifiable bots I knew about, sixteen in total, such as googlebot, Yahoo, and msnbot. The 'Others' column consisted of all other user agents that did not fall neatly into any other category. Typically, these were some of the more obscure bots that I didn't have regular expressions for (e.g. PhpDig) as well as some uncommon browsers such as Konqueror and Safari.
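The bucketing described above can be sketched roughly as follows. The regular expressions here are my own guesses for illustration, not the ones the bot used: only three of the sixteen bots are listed, and the patterns assume full, unmangled user agent strings. The names `classify` and `tally` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny subset of the bot patterns; the real list had sixteen.
BOT_RE = re.compile(r"googlebot|yahoo|msnbot", re.IGNORECASE)
MSIE_RE = re.compile(r"MSIE (\d+)")
# "Gecko/<digits>" avoids matching the "like Gecko" token sent by
# Konqueror and Safari, which belong in 'Others'.
GECKO_RE = re.compile(r"Firefox|Gecko/\d")


def classify(user_agent):
    """Map a raw user agent string to one of the table's columns."""
    if BOT_RE.search(user_agent):
        return "Bots"
    # Opera can masquerade as MSIE, so it is checked before MSIE.
    if "Opera" in user_agent:
        return "Opera"
    match = MSIE_RE.search(user_agent)
    if match:
        major = int(match.group(1))
        if major == 6:
            return "IE6"
        if major == 7:
            return "IE7"
        if major < 6:
            return "Old IE Versions"
        return "Others"
    if GECKO_RE.search(user_agent):
        return "Mozilla/Firefox"
    return "Others"


def tally(rows):
    """Sum hits per category from (user_agent, hits) pairs."""
    totals = Counter()
    for user_agent, hits in rows:
        totals[classify(user_agent)] += hits
    return totals
```

The order of the checks matters: a string is assigned to the first category it matches, so the more specific patterns come first.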
In total, 245 unique domains were successfully analyzed, for a total of 1,293 monthly results pages from April 2007 to September 2007 (a few domains did not have all months' data from April to September available). Again, this is a much smaller number of domains than I would have liked to analyze, but I gave up attempting to find more domains after I hit the 1000 result limit from Google. Hopefully future work by myself or others can add more data and work around this limit.
Here is a table summarizing the aggregate browser share data compiled.
|Total Hits||IE6||IE7||Old IE Versions||Mozilla/Firefox||Opera||Bots||Others|
Notice that the sum of all the individual user agents (1.32 billion) does not quite add up to the 'Total Hits' column (1.78 billion). This is because the Webalizer statistics pages generally showed only the top ten or fifteen user agents. Hence, the remaining 460 million hits must be from the less popular browsers and bots that did not show up in the Webalizer summary.
Here's a breakdown, month-by-month, of the aggregate results above.
|Month||Total Hits||IE6||IE7||Old IE Versions||Mozilla/Firefox||Opera||Bots||Others|
Here are the same monthly statistics expressed as percentages. The percentage for each browser is computed not on the 'Total Hits' field but on the sum of the browser hits that were picked up, so each row's percentages should add up to 100%.
|Month||IE6||IE7||Old IE Versions||Mozilla/Firefox||Opera||Bots||Others|
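The percentage calculation behind this table can be sketched as below. The function name `browser_shares` is hypothetical, and the hit counts in the example are invented for illustration; they are not figures from the study.

```python
def browser_shares(hits_by_category):
    """Convert per-category hit counts to percentages of their own sum.

    The denominator is the sum of the categorized hits, not the
    'Total Hits' figure reported by each site.
    """
    total = sum(hits_by_category.values())
    if total == 0:
        return {name: 0.0 for name in hits_by_category}
    return {name: 100.0 * hits / total for name, hits in hits_by_category.items()}


if __name__ == "__main__":
    # Illustrative numbers only.
    example = {"IE6": 50, "IE7": 25, "Mozilla/Firefox": 25}
    print(browser_shares(example))
```

Because the denominator is the sum of the categorized hits, the shares always total 100% regardless of how many hits a site's summary left unlisted.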
For those that prefer pretty pictures to hard numbers, here's a graph of the above percentages.
Leaving out the bots, we can look at the aggregate data over the whole six-month period with the following table. In the first row are the total hits for IE6, IE7, Mozilla, Opera, and 'Others', not including the easily identifiable bots (the 'Others' category is partially bots and partially marginalized browsers such as Konqueror and Safari). The second row has each browser's percentage of the aggregate browser-only share.
I invite any interested parties to look at the raw data I've used in this experiment, and would enjoy hearing from anyone who is able to build on these meager results. A full SQL dump of the three tables I used ('UnearthedPages' holds the URLs with the corresponding month of the results page, 'Completed' has the breakdowns by user agent of these unearthed pages for URLs that were both successfully crawled and parsed, and 'Incomplete' holds some URLs that the bot was able to download but not parse) is available here. For convenience, you can download these same tables in CSV format: Completed, Incomplete, and UnearthedPages.
The 1000 Google result limit proved to be the limiting factor in my quest to crawl a significant chunk of the tantalizing 688,000 possible results. On the bright side, the deeper-buried Google results pages turn out to be the significantly less-popular pages, with far fewer total hits. Still, it would be nice to see future work expand on this study and include a broader basket of domains.
I noticed at least one domain ('http://meneame.net/webalizer/usage_200709.html') with what look to be completely bogus entries for 'Total Hits'; in this case, the Total Hits field is about 6,000 times greater than the sum of the hits from its top 10 user agents. I had to manually remove this entry from the 'Completed' table upon which the aggregate statistics are based. Note that this bogus 'Total Hits' value wouldn't actually have affected the monthly percentages table, since those calculations are based on the total aggregated number of hits instead of the 'Total Hits' reported directly by the individual websites.
Although I undertook this project to eliminate the bias and uncertainty surrounding the other measures of web browser usage share, the statistical bias in these types of indirect studies is quite difficult to avoid. One topic I've postponed talking about until now is the bias inherent in using Webalizer pages as a proxy for the traffic of an average website. Although there appear to be ways to use Webalizer on Windows/IIS webservers, it remains a tool best suited for Apache running on various *nix servers.
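A check like the following could flag such entries automatically instead of relying on manual inspection. It is only a sketch: the function name `looks_bogus` and the cutoff ratio of 100 are assumptions of mine, chosen to sit well above the roughly 1.35 ratio seen in the aggregate data (1.78 billion total hits against 1.32 billion listed) and well below the factor of about 6,000 seen here.

```python
def looks_bogus(total_hits, agent_hits, max_ratio=100):
    """Flag a page whose 'Total Hits' dwarfs the hits of its listed user agents.

    Assumption: max_ratio is an arbitrary cutoff, not a value from the study.
    """
    listed = sum(agent_hits)
    if listed == 0:
        # Nothing was parsed from the user agent table, so the page is unusable.
        return True
    return total_hits / listed > max_ratio


if __name__ == "__main__":
    print(looks_bogus(6_000_000, [600, 400]))  # ratio of 6000
    print(looks_bogus(1_350, [600, 400]))      # ratio of 1.35
```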
Although Apache is still the most popular [1] webserver powering the web, one could argue that *nix/Apache powered websites are probably biased towards non-IE, non-Windows users. Indeed, a naive look at the first page [2] of Google results returned by the search query used shows somewhat atypical websites such as "common-lisp.net" among the results. Arguably, tech-oriented sites like these are more likely than average to use Webalizer for generating website statistics, and hence the set of domains used in this study constitutes a biased sample.
Having said that, I suspect that increasing the sample size significantly will reduce the effect of these few biased samples to mere statistical noise. As noted earlier, the deeper-buried Google results pages tend to be the smaller, more "normal" pages. Including enough of these normal pages might improve the current data considerably.