Google Bot and Cache

As I promised in my previous post, I am writing about the Google bot and its cache. Before getting into it, let's understand how a search engine works and the role spiders/bots play in it.
(Figure: the simplest diagram of a search engine cache)
Here the spiders/bots/robots crawl the web pages and store them in the page repository (huge databases. If you have used CVS, SVN or any other version control app you will understand the word repo better; in simpler terms, a storehouse). Then the ranking algorithm is applied to the cached pages to produce the SERPs (Search Engine Results Pages). So ranking depends directly on the cached pages, not on what you currently have on your pages. The crawling logic is also redefined for the spider, both per site and in general.

Sometimes you will see from your log files that Google is visiting your pages (if you think Google is not visiting your pages, do check your log format. Also check the robots.txt) but not caching them. There can be various reasons for it (filters, bans etc., although with filters and bans I doubt whether Google visits the pages at all). One of the reasons is “no modification since last visit”.

With SVN we use svn diff to find the modifications; in Linux we simply run diff. Similarly, Google checks whether the page has been modified since its last visit. IMO it would be a criminal offense to repeat what the gurus and gods of search engines have already documented so excellently.

I commented on Matt’s blog but have had no answers yet:-

As usual, great post Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, the cache systems, ranking algos etc. This post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.

  1. Say a page has not been updated for the last few days: will the frequency of Google visits be adjusted accordingly (fewer visits)?
  2. Is a small change, like a date update or feeds, enough of a change to avoid a 304 response to Google bot?

According to me,

  • Answer 1: Yes, the frequency will change; in the diagram, see how the central logic redefines the bot's logic.
  • Answer 2: So far I do not think Google takes the number of changed bytes into consideration for “If-Modified-Since”. As a programmer you can always create a file with the modified content and check the size of the modification.

Sometime in the future we could surely see,

// Pseudocode: CatchTheDiff() and SizeInBytesForFile() are hypothetical helpers; Y is a byte threshold.
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > Y) return true;
    return false;
}

The current function might be

// Pseudocode: any change at all, even a single byte, counts as a modification.
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > 0) return true;
    return false;
}

As I have mentioned, add feeds, dates and some dynamic content to your pages to get fresh cache dates. I have always learned that search engines like pages with fresh content, and a search engine considers a page fresh if it has been modified since the last visit. Also, if you care about bandwidth, you can save some of what Google bots consume by sending proper HTTP 304 responses. If you have any doubts, ask and I will try to answer within my limits :).
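
To make the 304 idea concrete, here is a minimal Python sketch of how a server could decide between a full response and "304 Not Modified". It is only an illustration under my own assumptions: the function name conditional_status and the sample dates are made up, and it looks at the If-Modified-Since header alone, which is not necessarily everything a real server or Google bot takes into account.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 when the page is unchanged since the crawler's copy, else 200."""
    if not if_modified_since:
        return 200  # no If-Modified-Since header sent, so serve the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return 304 if last_modified.replace(microsecond=0) <= since else 200


# Hypothetical example: the page last changed on 1 Aug, the bot's copy is from 2 Aug.
page_changed = datetime(2006, 8, 1, 12, 0, tzinfo=timezone.utc)
header = format_datetime(datetime(2006, 8, 2, 9, 30, tzinfo=timezone.utc), usegmt=True)
print(conditional_status(page_changed, header))  # 304
print(conditional_status(page_changed, None))    # 200
```

The 304 response carries no body, which is where the bandwidth saving comes from.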


Remembering Good old days

I will carry on with my posts about Google bot activities. I will be writing about how Google bot visits your blog without making a cache and how you can force it to make one. For now, let’s enjoy a lighter part of life.

Do you know this fellow?
Dinesh Upreti
This is my friend Dinesh, who recently updated his orkut avatar. It reminded us of the last World Cup (2002). Dinesh, Mehta and I stayed back during our vacations to study and to work on idealog. Plans remained plans and we adopted a different timetable.

Dinesh says
I can’t forget that incident. “Brazil fans.. good”… Man, that stay at the hostel is still in my memory. Mehta, me and you.. watching movies, playing TT, basketball, and Mehta and me playing FIFA.. heavy breakfast.. Pramod’s single 4 kg mango.. Where is Pramod.. are you listening to this.. Get me that mango

Mehta and I decided to support Brazil.
Aji Issac

I wrote back in Dinesh’s scrapbook (too lazy to rewrite it)
Yes man, those were the days, and the nights too. Sleeping till 11, then tea and then back to the TV room for football matches. Lunch and back to the TV room, mango juice in the evening, the basketball court and then movies. After the movie, sometimes to GP at night. We stayed back to study, but I wonder if we even dared to do that :).

I remember the notices telling us to vacate the rooms, and how after accepting the Brazilian protocol we were respected as fans :). Pramod’s stories, breakfast with Warden Ma’am. Anyway man, it’s already 12:00; I am off for the day and for the night.

Those were some good days of life. Remembering all those times.

Google Bot mystery

This was no less than a mystery for first-timers. As I mentioned in my previous post, we shifted servers because our site was consuming over 100 GB of bandwidth a month and over 4 GB of hard disk. The growth rate was the factor that made us take this decision: usage was growing by gigabytes every month, if not every week.

As usual, after the shift you are supposed to keep a check on the spiders, especially the Google bot. Last time I faced a strange problem and lost almost all the cache. This time our team, who were checking the raw log files both directly and with log analyzers (Sawmill, AWStats), told me that Google bot was not visiting our site. I took it lightly and assumed it was their mistake, as I could see the latest cache on Google. When the team pushed me to look at the raw log file I found they were not at fault; they had reported the truth. I did a grep and found no trace of Google bot. It certainly worried me.

I knew that Google can’t create the cache without visiting our site, so I decided to check the log creation section. I also asked Prabhat to check it. I saw that the log format was “common”. What does that mean? I investigated further and found a few documents :-

Format for common
LogFormat "%h %l %u %t \"%r\" %>s %b" common

Here

%…h: Remote host
%…u: Remote user (from auth; may be bogus if return status (%s) is 401)
%…l: Remote logname (from identd, if supplied)
%…t: Time, in common log format time format (standard english format)
%…r: First line of request
%…s: Status. For requests that got internally redirected, this is the status of the *original* request; %…>s for the last.
%…b: Bytes sent, excluding HTTP headers.

The common log format (CLF) was not logging the user agent, which is what lets you keep track of Google bot and other user agents.
Then I decided to go for the NCSA extended/combined log format
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
Here \"%{Referer}i\" keeps track of referral URLs and \"%{User-agent}i\" of user agents.
For many of us it was no less than a mystery.
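
Once the combined format is in place, the grep for Google bot can be sketched in a few lines of Python. This is only a rough illustration: the regular expression follows the combined LogFormat string, the sample log lines are invented for the example, and a request line containing escaped quotes would need a more careful pattern.

```python
import re

# Pattern for the NCSA combined format: the common fields plus referer and user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Yield (time, request, status) for entries whose user agent mentions Googlebot."""
    for line in lines:
        match = COMBINED.match(line)
        if match and "googlebot" in match.group("agent").lower():
            yield match.group("time"), match.group("request"), int(match.group("status"))


# Invented sample lines; the last one is in the common format, so it carries no
# user agent and can never be recognised as a Google bot visit.
sample = [
    '66.249.66.1 - - [10/Aug/2006:06:25:14 +0000] "GET /index.php HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.7 - - [10/Aug/2006:06:26:02 +0000] "GET /about.php HTTP/1.1" 304 - '
    '"http://example.com/" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '66.249.66.1 - - [10/Aug/2006:06:27:40 +0000] "GET /robots.txt HTTP/1.1" 200 68',
]
hits = list(googlebot_hits(sample))
print(hits)
```

The third sample line shows exactly what bit us: with the common format the visit is logged, but nothing in the line says who the visitor was.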

Google bot went unhappy

This happened when we shifted servers the last time (some 6 months earlier). After the shift we were keeping a watch on the bots. We started facing THE PROBLEM with a few bots: “cache loss”.

I checked everything possible: robots.txt, .htaccess, the PHP programs, frames and so on. I validated robots.txt and ran XHTML validation on all the pages to make sure I was not doing anything wrong.

It did no good. The number of cached pages kept going down, from over 20,000 to 10,000 and from 10,000 to 5,000. It started worrying me and my team, as search engines contribute a large share of your traffic (almost 60% in our case).

Then I started investigating:-

  • Investigation part 1:
    I changed my user agent to Google bot's to browse the site the way Google bot does. I was still able to access the pages.
  • Investigation part 2:
    Checking the log files manually. I could find no trace of Google bot.
  • Investigation part 3:
    Making sure that Google was having no problems at its end. I read almost all the recent search engine postings at webmasterworld, search engine watch, digg.com, webproworld, hedir.com, and blogs like mattcutts.com. I found nothing. Our other sites were not losing their cache either.
  • Investigation part 4 to 100:
    Did all possible checks.

No way out – Last shot
When we saw that there was no way out, we decided to shift the servers back. Then, while testing with the HTTP live headers, I saw that the txt files were being served with content type “text/html”.
Our servers were not passing content type “text/plain” for the txt files. I asked the question at various forums and everyone said it shouldn’t make any difference. I had no other options, so I thought of passing the right content type, “text/plain”. I configured it and left it to God.

It was the Eureka moment, as Google started visiting us again and soon cached all the pages. Believe it or not, the header matters for Google bot. They may correct it later, but it certainly did matter for us at that time.
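
A check like this is easy to script. Below is a small Python sketch that compares the Content-Type header a server sends with the type expected for the file extension; the function name content_type_ok is hypothetical, and the expected type simply comes from Python's standard mimetypes table, which is my assumption about what "right" means here.

```python
import mimetypes


def content_type_ok(path, content_type_header):
    """Compare a response's Content-Type with the type expected for the file extension."""
    expected, _ = mimetypes.guess_type(path)
    if expected is None:
        return True  # unknown extension: nothing to compare against
    # Ignore parameters such as "; charset=utf-8" and letter case.
    actual = content_type_header.split(";", 1)[0].strip().lower()
    return actual == expected


print(content_type_ok("/robots.txt", "text/html"))                  # False
print(content_type_ok("/robots.txt", "text/plain; charset=utf-8"))  # True
```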

mysqld dead but subsys locked

Recently we shifted our site to our own server with a better configuration. We did almost everything right and it worked for weeks without trouble, but then a problem came up.

“mysqld dead but subsys locked”

I searched the web and found only unrelated discussions. In such cases the true friend of a programmer is the log file. It explained everything: “no disk space available”. This happened because we were logging a few queries to check query speed. Once that was corrected, it started working gracefully again.
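
A simple free-space check would have warned us before MySQL died. Here is a minimal Python sketch; the function name disk_is_low and the 10% threshold are my own assumptions, so pick whatever limit suits the partition that holds your logs.

```python
import shutil


def disk_is_low(path="/", min_free_fraction=0.10):
    """Return True when the filesystem holding `path` has less free space than the threshold."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return usage.free / usage.total < min_free_fraction


if disk_is_low("."):
    print("Warning: disk space is running low; check the query logs")
```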

HtAccess difficult problems #1

I have spent months working with htaccess, doing almost everything possible with it: checking cookie variables, redirecting the non-www domain to www (easy), redirecting www subdomains to non-www subdomains (a little tough) etc. The best part was that we had one htaccess for as many as five sites (plus their alpha and beta sites) and it worked fine with all the regular expressions (everything was variable, including the domain name).

Some of the difficult problems we faced with htaccess:
Problem 1: Comparison of variables

Solution 1:

According to JDMorgan of webmasterworld.com

There is no ‘native’ support in Apache for comparing two variables, although some operating systems support ‘atomic back-references’ which can be used to emulate a compare. This depends on the regex library bundled with the OS. Specifically, POSIX 1003.2 atomic back-references can be used to do a compare by using the fact that if A+A = A+B, then A=B.

RewriteCond %{HTTP_REFERER} ^http://([^/]+)
RewriteCond %{HTTP_HOST}<>%1 ^([^<]+)<>\1$ [NC]
RewriteRule ^uploads/[^.]+\..{3,4}$ - [L]

Note that the “<>” string is entirely arbitrary and has no special meaning to regular-expressions; It is used here only to demarcate the boundary between the two concatenated variables. The actual ‘compare’ is done in the second RewriteCond, using the atomic back-reference “\1” to ‘copy’ the value of the string matching the parenthesized pattern directly to its left.

Therefore
if %{HTTP_HOST}<>%{HTTP_REFERER}(partial) == %{HTTP_HOST}<>%{HTTP_HOST},
then %{HTTP_REFERER}(partial) == %{HTTP_HOST}

This may need some tweaking to fit your actual referrers, since the match between hostname and the partial referrer substring saved in %1 must be exact. And as noted, it will only work on servers which support POSIX 1003.2 regular expressions (FreeBSD is one, and there are others.) I know of no way to support variable-to-variable compares in mod_rewrite without this POSIX 1003.2 trick.
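
The same back-reference trick can be tried outside Apache. Here is a small Python sketch of the idea, not a mod_rewrite implementation: the function name same_host and the example hostnames are made up, and the first pattern captures only the hostname part of the referrer so that it can be compared with the Host value.

```python
import re

# Same idea as the second RewriteCond: "<>" is an arbitrary separator, and \1 must
# repeat exactly what the first group captured, so the match succeeds only if A == B.
SAME = re.compile(r"^([^<]+)<>\1$", re.IGNORECASE)


def same_host(http_host, referer):
    """Return True when the Referer's hostname equals the requested Host."""
    found = re.match(r"^http://([^/]+)", referer, re.IGNORECASE)
    if not found:
        return False
    # Concatenate the two values around the separator and let the back-reference compare them.
    return SAME.match(f"{http_host}<>{found.group(1)}") is not None


print(same_host("www.example.com", "http://www.example.com/page.html"))  # True
print(same_host("www.example.com", "http://other.example.net/"))         # False
```

Because the pattern is anchored at both ends, a referrer host that merely starts with the Host value does not pass the compare.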

Solution 2:

Set the variable first
SetEnvIfNoCase Referer ^http://([a-zA-Z]{2,3})\.idealwebtools\.com.* HostNameAndReferrerNameAreFromSameDomain=True

And then use it in the logic
RewriteCond %{ENV:HostNameAndReferrerNameAreFromSameDomain} !^True$ [NC]
RewriteRule (.*) redirection [R=301,L]

Bought myblogplanner.com

I am sure that one day myblogplanner will become a must for all blogs. So, keeping the faith, I decided to buy the domain myblogplanner.com. I bought it from GoDaddy for $9.2 per annum.
Currently on myblogplanner.com I am using three blog planner boxes to arrange the content. I need to spend some time on it tomorrow. This is also a good opportunity to earn $1 per referral. I am looking forward to earning some good money through the referral system this month. Valerie was working it out with the AmPmInsure community Innovation center. Hopefully you can get the first-phase product tomorrow, with a working referral system.

WordPress upgrade

Now I have upgraded WordPress from 1.5.1.3 to 2.0.4, the latest version. I also added the post count to each of the categories and reshuffled a few posts into their proper categories. If you notice, I have added another blog planner message box right at the top of the template to highlight the latest important topics.

I will find some time tomorrow for a better post.

Evangelistic marketing – A social approach

We often associate the word “evangelist” with Christianity, and rightly so, as Christianity defines the word most appropriately. As a Christian I have understood the word and its in-depth meaning in its simplest form.

Who is an evangelist?
An evangelist has two qualities: –

  1. He is sure of his own faith (faith is stronger than plain belief). Belief is about understanding and accepting a fact, but faith starts with believing and then following the belief. Christian faith is not only knowing that Jesus is God but accepting Jesus Christ as God and your personal saviour, along with following His commandments. His faith is never shaken, even if the current circumstances seem to prove it wrong. He persists with his faith.
  2. He does everything to transfer his faith to others, as he is sure of it. He not only shares his faith with others but also transfers it.

Social space, nodes, networks and cables
Before starting with evangelistic marketing, let’s understand social space. I consider the whole targeted and untargeted segment the “social space”, each person a “node” in the social space, and the relations between them the social network. Now look at the social space: it looks cluttered with nodes of different colors.

Now zoom in and look carefully, you will see nodes connected to each other.

The surprising factor is that each node is connected to every other node of the social space, directly or indirectly (directly if it is at one-hop distance, otherwise indirectly at n-hop distance). Look at the orkut (best social network so far) stats,

As I explained, each node is connected to others through a network of relations of some type (some strong, some weak). I started calling such a connection a social network cable.

Establishing Social Network cables
Now this needs serious attention. How do you establish a connection between two nodes? A reliable connection needs approval from both parties. If Node A wants to establish a connection with Node C, there are two ways to do it:-

  • Way 1: Node A approaches Node C and explains the desire. If A convinces C, the connection can be established. It is a very costly operation in terms of social space.
  • Way 2: Node A is connected to Node B and Node B is connected to Node C. Transitivity applies (Node A just needs to make Node B active with the theme). Node B will then connect to Node C and introduce Node A. The acceptance probability is much higher in this case. Once A and C agree, the cable is established.

Let’s take a social node graph

Now look at the blue node and the orange node. If the blue node (maybe a marketer, or a company team as a whole) needs to reach the orange node (maybe a probable consumer), there is no direct way. It either needs to establish a new network cable in the social space or use the existing ones. As I explained, creating a new network cable needs approval from both parties and is an expensive social operation (just like making friends directly). Using the existing indirect connections for approvals is better, as the chances of acceptance are higher (friends introducing you to another friend).
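
For the programmers among you, finding the chain of introductions between two nodes is a plain breadth-first search. Here is a minimal Python sketch; the graph below is a made-up example (only “blue” and “orange” come from the picture described above), and the function name introduction_path is my own.

```python
from collections import deque


def introduction_path(cables, start, target):
    """Breadth-first search: the shortest chain of existing cables from start to target."""
    previous = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back through the introductions
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in cables.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # no chain of introductions exists


# Hypothetical graph: each key lists the nodes it already shares a cable with.
cables = {
    "blue": ["B"],
    "B": ["blue", "C"],
    "C": ["B", "orange"],
    "orange": ["C"],
}
path = introduction_path(cables, "blue", "orange")
print(path)           # ['blue', 'B', 'C', 'orange']
print(len(path) - 1)  # 3
```

The number of hops is the number of introductions the blue node has to go through before it can talk to the orange node.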

What is evangelistic marketing?
Evangelistic marketing is about accepting your products yourself and then convincing other social nodes that they are good. It is not about pushing sales; it is about laying the networks (for sales). A company that believes in evangelistic marketing should get its first bunch of customers (nodes) from its own known nodes (employees, partners etc.). In other words, your marketers should be your customers first (always allow a channel through which your customers can ask for required changes). Once the marketers become customers, they will turn other nodes into customers and customers into marketers.

Evangelistic marketing is all about making nodes active with your theme and establishing the thematic social network. Also active nodes will create other active nodes (you can call it viral marketing). Web 2.0 is empowering evangelistic marketing.

Instant Brands

I have started a wiki page for instant brand. I think in the current web 2.0 it is a possibility. I will keep on developing the term. But someone deleted the page; I think they accept only pages about the present, not the future. Anyway, I was prepared for this, so here is the starting article on Instant Brand.

What is it?
Instant brand is about web 2.0 and innovation. Web 2.0 is pushing innovation to become a brand in no time. Blogs, forums, wikis, diggs, bookmarks and orkuts are the channels for transmission. We have always thought of brand recognition as a long-term process, but in the current Web 2.0 world an instant brand is also possible. “Word of mouth” was never so simple and the mouth was never so open as it is today. One innovation (as information, news) can spread with a speed no one can imagine.

Is instant a proper word?
When I say instant, it is very much comparative. It takes very little time compared with normal brand recognition.

What is making this possible?
It has to be the blogosphere, orkuts, wikis, diggs and forums, which are viral in themselves. They can spread a brand (individuals, companies, products, sites) with lightning speed. In other words, web 2.0 is making it possible.

Will it last long?
Yes, it will last if the strategies are made properly. According to web 2.0, the community and the platform are the power. The propagation of the innovation (as news, information) can be backed with a retention plan that creates active nodes within the market matrix.

Some examples
I think web 2.0 is still in an early stage of growth or development, so instant branding is yet to gain more speed. But look at the following traffico meter and the growth.


Look at the instant steep growth and the reorganization that followed. Also look at


The initial growth and then the constant growth say that it was recognized as a brand instantly and, due to its quality, it grew stronger as a brand.

It can also be done under an existing brand. Google’s recent Image Labeler is a good example: it takes advantage of the existing Google brand for recognition. It can also be developed instantly as a different brand.