Google Bot mystery

This was no less than a mystery for the first timers. As I mentioned in my previous post that we shifted the server as our site was consuming over 100 GB of bandwidth a month and over few 4 GB of hard disk. The growth rate was the factor which made us take this decision. It was growing in terms of GB every month if not week.

As usual after the shift you are suppose to keep a check on the spiders esp the Google bot. Last time I faced a strange problem and lost almost all the cache. This time our team who were checking the raw log file directly and with log analyzer (sawmill, awstats) told me that Google bot is not visiting our site. I took it lightly and took it as their mistake as I could see the latest cache with Google. When the team forced me to look at the raw log file I found them with no guilt, they reported the truth. I did a grep and found no trace of Google bot. It certainly worried me.

I knew that without Google visiting our site it cant create the cache, I decided to check the log creation section. I also asked prabhat to check it. I saw that the log format is common. What does that mean? I started investigating more and found few documents :-

Format for common
LogFormat "%h %l %u %t \"%r\" %>s %b" common

Here

%…h: Remote host
%…u: Remote user (from auth; may be bogus if return status (%s) is 401)
%…l: Remote logname (from identd, if supplied)
%…t: Time, in common log format time format (standard english format)
%…r: First line of request
%…s: Status. For requests that got internally redirected, this is the status of the *original* request —
%…>s for the last.
%…b: Bytes sent, excluding HTTP headers. In CLF format It was not logging the user agent which keeps a track of google bot and other user agents.
Then i decided to go for NCSA extended/combined log format
“%h %l %u %t \”%r\” %>s %b \”%i\” \”%{User-agent}i\””
Here \”%i\” keeps a track of referral URLs and \”%{User-agent}i\”” of user agents.
For many of us it was no less that a mystery.