As I promised in my previous post, I am writing about Googlebot and the cache. Before getting into it, let's understand how a search engine works and the role spiders/bots play in it.
(This is the simplest diagram for Search engine cache)
Here the spiders/bots/robots crawl the web pages and store them in the page repository (huge databases. If you have used CVS, SVN or any version control app, you will understand the word repo better. In simpler terms, a storehouse). Then the algorithm is applied to the cached pages to produce the SERPs (Search Engine Results Pages). So ranking depends directly on the cached pages, not on what you currently have on your pages. The crawl logic for the spider is also redefined, both for individual sites and in general.
Sometimes you will see from your log files that Google is visiting your pages (if you think Google is not visiting your pages, check your log format and also your robots.txt) but not caching them. There can be various reasons for it (filters, bans, etc., though with filters and bans I doubt whether Google visits the pages at all). One of the reasons is “no modification since last visit”.
With SVN we use svn diff to find the modifications; in Linux we simply run diff. Similarly, Google checks whether the page has been modified since its last visit. IMO it would be a criminal offense to repeat what the gurus and gods of search engines have already documented in their own excellence.
I commented on Matt’s blog, but there are no answers yet:
As usual, great post Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, the cache systems, ranking algos, etc. This post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.
- Say a page has not been updated for the last few days: will the frequency of Google's visits be adjusted accordingly (fewer visits)?
- Is a small change, like a date update or feeds, enough of a change to avoid a Google 304 message?
In my opinion,
- Answer 1: Yes, the frequency will change. In the diagram, see how the central logic redefines the bot's logic.
- Answer 2: So far I do not think Google takes the number of changed bytes into consideration for “If modified since”. As a programmer you can always create a file with the modified content and check the size of the modification.
Sometime in the future we may well see,
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent) {
    // Y is a minimum change size in bytes; CatchTheDiff and SizeInBytesForFile are imagined helpers
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > Y) return true;
    return false;
}
The current function might be
function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent) {
    // Any change at all, even a single byte, counts as a modification
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > 0) return true;
    return false;
}
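For anyone who wants to play with this idea, here is a minimal runnable sketch in Python. It is only my guess at the logic, not anything Google has documented. The names changed_bytes and is_modified_since and the threshold parameter are my own, and the standard difflib module stands in for the imagined CatchTheDiff helper.

```python
import difflib


def changed_bytes(last_content: str, current_content: str) -> int:
    """Count the UTF-8 bytes of the lines added or removed between two versions."""
    diff = difflib.ndiff(last_content.splitlines(), current_content.splitlines())
    # ndiff marks removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are skipped.
    changed = [line[2:] for line in diff if line.startswith(("+ ", "- "))]
    return sum(len(line.encode("utf-8")) for line in changed)


def is_modified_since(last_content: str, current_content: str, threshold: int = 0) -> bool:
    """Return True when the changed content is larger than `threshold` bytes.

    threshold=0 mirrors the "current" function above (any change counts);
    a larger value mirrors the imagined future function with a limit Y.
    """
    return changed_bytes(last_content, current_content) > threshold
```

With the default threshold, changing only the date line of a page is enough to count as modified. With a larger threshold, the same small change would be ignored.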
As I have mentioned, add feeds, dates and some dynamic content to your pages to get fresh cache dates. I have always learned that search engines like pages with fresh content, and a search engine considers a page fresh if it has been modified since the last visit. Also, if you care about bandwidth, you can save some of what Googlebot consumes by returning proper HTTP 304 (Not Modified) responses. If you have any doubts, just ask and I will try to answer within my limits :).
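To make the 304 part concrete, here is a small sketch in Python of how a server could decide between a full 200 response and an empty 304 response when a crawler sends an If-Modified-Since header. The function names conditional_status and last_modified_header are my own, and the sketch assumes you already know the page's last modification time as a timezone-aware UTC datetime. Real web servers and frameworks usually handle this for you.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_status(if_modified_since: Optional[str], last_modified: datetime) -> int:
    """Return 304 if the page is unchanged since the date the client sent, else 200.

    `last_modified` is assumed to be a timezone-aware UTC datetime.
    """
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


def last_modified_header(last_modified: datetime) -> str:
    """Format the value for a Last-Modified response header (requires a UTC datetime)."""
    return format_datetime(last_modified, usegmt=True)
```

The idea is that the server sends a Last-Modified header with every full response. On the next visit the bot sends that date back in If-Modified-Since, and the server answers 304 with no body if nothing has changed.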
Hi, I want to know more about SEO.
The article is useful.
Thanks, great to hear that you found it helpful.
How do I force Google to crawl my site? My webmaster tools always say Google last crawled the site on August 24. The content has changed since then, but Google has not stepped there since. I have done the normal pings through Technorati and stuff, but no change.
Is there a way I can notify them apart from pings?
Hi Farouk,
Welcome to the blog. You can’t force Google to crawl more, but you can inspire the bot 🙂 by doing the following things:
1) Keep updating the content of the page. This can be done by adding dynamic sections like the latest blog posts, comments or anything along similar lines.
2) Keep your site updated with new content frequently.
3) Most important: Get more links to the website. More links === site is more important === should be crawled more frequently.
Hope it helps.
Aji
Will try that. Thanks for your help.
You are welcome, Farouk. For newer pages you can try submitting a sitemap as well.