How to authenticate Googlebot

Who should read it: Anyone interested in getting traffic from search engines, especially Google.

Googlebot is a welcome visitor to any website, and recently I have covered a few topics on it. Here is the list:

  1. Google Bot went Unhappy
  2. Google Bot Mystery
  3. Google Bot and cache

Thanks to Matt Cutts for helping me continue the series. As I explained earlier, you can fake Googlebot simply by changing the user agent; Firefox extensions make this easy, for example http://chrispederick.com/work/useragentswitcher/.

Matt and his team say,

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don’t think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
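That two-step check is easy to script. Below is a minimal sketch of my own (not Matt's code), assuming Node.js and its dns promises API; the function name isRealGooglebot is just my own label.

// Sketch of the reverse-then-forward DNS check described above (illustrative only).
const dns = require('dns').promises;

async function isRealGooglebot(ip) {
  // Step 1: reverse lookup -- the PTR name must sit in the googlebot.com domain.
  const names = await dns.reverse(ip);
  const host = names.find(name => name.endsWith('.googlebot.com'));
  if (!host) return false;

  // Step 2: the forward lookup on that name must resolve back to the same IP,
  // otherwise a spoofer could simply fake the reverse record.
  const addresses = await dns.resolve4(host);
  return addresses.includes(ip);
}

// isRealGooglebot('66.249.66.1').then(ok => console.log(ok)); // true for a genuine crawler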

Read more at http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

AdSense and SSL

Who should read it: AdSense does not support SSL. If you run the same site over both HTTP and HTTPS and you want AdSense on it, then this article is certainly for you.

Recently we enabled Secure Sockets Layer (SSL) for one of our sites. How does SSL help? It encrypts the communication between the client (the browser) and the web server. It helps stop eavesdropping, and your customers feel more secure because of the trust they place in the lock icon. It is, however, only security for the transport layer.

So we decided to offer both versions of the site: one with SSL (https://, HTTP over SSL) and the normal one with no encryption (http://). The site's identity was also verified by VeriSign to avoid browser warnings (browsers will warn if the certificate is not signed by a trusted third party). If encryption still feels foreign to you, you certainly need to read more; I remember my classes on data encryption and data communication with our great professors Dr Pinaki Mitra and Dr A. K. Laha. (Yes, I remember that I promised to write more about HTTP and HTTPS, and I will, tonight if possible, otherwise in the coming weeks.)

Even after completing everything, the browser kept warning that the page contained some non-encrypted items. After some investigation we saw that the HTTPS pages were pulling in some items over plain HTTP, such as images added with absolute http:// URLs. We made all the image paths relative so that every image is served over HTTPS when the page is, and we changed all the included JS paths to relative ones as well.
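If you are hunting for the same problem, a quick check along these lines will list the offending URLs; this is only a rough sketch of the kind of inspection we did by hand (run it from the browser's script console), not a tool we actually used.

// Rough sketch: list resources still loaded over plain http on an https page.
var insecure = [];
var nodes = document.getElementsByTagName('*');
for (var i = 0; i < nodes.length; i++) {
  var el = nodes[i];
  // Look at embedded resources (images, scripts, stylesheets), not ordinary links.
  var url = el.src || (el.tagName === 'LINK' ? el.href : '');
  if (url && url.indexOf('http://') === 0) {
    insecure.push(url);
  }
}
// 'insecure' now holds every URL that will trigger the mixed-content warning.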

The problem remained because of Google AdSense and Google Analytics. I searched the web for a solution and got a big “No” from the AdSense help pages:

Do you offer an SSL version of your ad code?

Although you may place the Adsense ad code on a page using Secure Socket Layers, we do not currently offer an https version of the Adsense ad code at this time. Therefore, you may see a message asking for confirmation to load all items on the page when placing the Adsense ad code on secure https pages.

All our pages were static. AdSense was our bread and butter, so removing it wasn't feasible, and we desperately wanted HTTPS across all pages because many of our customers were asking for it. The only options left were:

  1. Make the pages dynamic and switch AdSense off based on the port number (HTTPS runs on port 443, whereas HTTP works on 80). That would make the pages slower, which was again unacceptable.
  2. Keep two copies of every page, one for HTTP and another for HTTPS. A maintenance problem, so this too was rejected.

We then decided to use JavaScript to check whether the page is served over https:// and skip the ad code when it is:

if (!document.location.href.match(/^https:\/\//)) {
    // document.write the AdSense ad code here (plain HTTP pages only)
}

And similarly for Google Analytics

if (!document.location.href.match(/^https:\/\//)) {
    // Google Analytics code here (plain HTTP pages only)
}

That was my quick fix, but Google Analytics actually offers an SSL version of its script:

<script src="https://ssl.google-analytics.com/urchin.js" type="text/javascript"></script>

<script type="text/javascript">
_uacct = "UA-XXXXX-X";
urchinTracker();
</script>
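Another way to write this, a sketch of my own rather than anything from the Google help pages, is to pick the script host from the page's protocol so that a single template serves both versions; the _uacct/urchinTracker block above stays the same.

// Sketch only: choose the Analytics host based on the current protocol.
var gaHost = (document.location.protocol === 'https:')
  ? 'https://ssl.google-analytics.com'
  : 'http://www.google-analytics.com';
document.write('<scr' + 'ipt src="' + gaHost + '/urchin.js" type="text/javascript"></scr' + 'ipt>');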

More details are available in the Google help pages. There is no way to make AdSense work over HTTPS, so the only option is to switch it off on the HTTPS pages.

All work and no fun is not good

Who should read it: Anyone who wants to have fun with work :). Everyone.
Here is a set of games you can play while at work. Earlier I mentioned the Google Image Labeler; that one is fun, and a few of us are already addicted to it.

Here is another one, Peekaboom, which is cool as well.

The Basics: Peeking and Booming
You and a random partner take turns “peeking” and “booming.” While one of you is peeking, the other is booming. The booming player (Boom) gets an image along with a word related to the image, and the peeking player (Peek) gets no image (see Figure below). Booming consists of clicking parts of the image; when Boom clicks a part of the image, it is revealed to Peek. The object of the game is for Peek to type the word associated to the image — from their perspective, the game consists of a slowly revealing image, which has to be named. From Boom’s perspective, the game consists of clicking on areas of the image so that Peek can guess the word associated to it. Once Peek guesses the correct word, the two of you move on to the next image and switch roles.

So have fun at work. But be careful, don't let your boss ban those IPs on the office network :). These days people are crazy about Orkut too.

Desktop Blogging tool

Who should read this: Anyone serious about blogging who wants to try a desktop tool for it.

These are the tools I tried with a few of my friends some time back. These desktop tools definitely make your life easier. Check them out.

  1. Ecto – a copy of Ecto costs $17.95 (not used yet).
  2. MarsEdit – Mac only, $24.95 for a single license.
  3. wbloggar.com – free (used); it is simple and usable.
  4. Qumana – free (used); it is cool and comes with an integrated ad system.
  5. BlogJet – $39.95 for a single license (not used yet).
  6. Elicit – $59.00 with lifetime updates; a 30-day free trial is available.

Firefox extension
Firefox users can try http://performancing.com/firefox. It started as a small project, but these days they offer everything needed for blogging. I used to use it.

My Current Desktop Blogging Pet
I like to explore new things, so these days I am trying out Windows Live Writer (beta).

I will write about my experiences with Live Writer. I am liking it because it makes drafting easy and checks spelling in a simpler way. MS is trying to support free systems; a list can be found at http://ideas.live.com/. There are also many plugins available for Windows Live Writer.

One problem I face is with the slug: I am not able to set the slug with this tool, so either I need to explore more or MS needs to add it.

Hedir releases web 2.0 bookmarking tool

Who should read it: Anyone who wants an easy social bookmarking solution for their blog/site that takes up minimal space. And all the Hedir fans :).

USP of this tool: Simplicity. Visible web space is expensive and Hedir saves it for you.

I have been a big fan of Hedir from the time it started challenging DMOZ. It is as young as a one-year-old baby, yet growing fast, like a lover boy desperate to marry his little lover girl. Hedir is chasing its lover girl, a.k.a. success, covering milestones with every hop.

(my favorite avatar, by Google Junky)
It is growing fast, with so many add-on features. Why do I support it over DMOZ? The answer is simple:

  • the growth rate compared to DMOZ (or any other directory, for that matter)
  • the concept: it works exactly on Web 2.0 principles
  • the community: the people are so great. Initially I was inspired by Winterfrost, and now the list goes on and on with ADAM, Baggeroli, Google Junky, Francesco, Brett, writergrrrl, Anthony, Bret, Josh, Norah, Lakhya, … (the list is too big to fit in one post). Great community.

Also read Norah’s post on how hedir is different from other directories at http://norah.hedir.com/2006/01/17/how-hedir-is-dif…

Last week, as part of their Friday releases, we saw another great tool: Web 2.0 bookmarking. You can see it in action on this blog post: http://www.idealwebtools.com/blog/googlebot-cache/. It lets readers bookmark the page in their browser (the typical, unavoidable Web 1.0 way) and on all the social bookmarking (social sharing) sites.

Bookmarking tool by Hedir

The best part is the management; it is easy to set up:

  1. Go to http://www.hedir.com/web2.0/,
  2. choose the sites of your preference and generate the code,
  3. paste the code into your website/blog.

It then starts working for you: it detects different browsers and adjusts its bookmarking accordingly. Great tool. Keep watching for the next Fridays :).

I have also decided to donate my WordPress work to the Hedir community. I am talking to a few people at Hedir; I will donate the code and let the community generate money for further development. I will be honored if they accept it. Love you all at Hedir.

Google needs writers

Google Jobs
If you are a fan of Google like me, here is a chance to be at Google. If you are a Web 2.0 Shakespeare, try http://services.google.com/events/wordmasters. I have recommended it to some of my friends who are good at writing.

Why join Google?

  1. Because it has two “o”s in its name. It also has two “g”s. In short, because it is Google.
  2. It lets you work on your own area of interest 20% of the time. I think this is unique.
  3. Google play

And there are plenty more reasons; find them all at http://www.google.com/jobs/reasons.html

Asking spammers to work harder

For the last two months my blog has been gaining popularity. I am not lying, and I can prove it: spam. Every day I get close to 10 to 20 spam comments. Today they crossed the limit with over 100. I decided to make them work harder for every comment.

I have added two things:

  1. My favourite, Akismet – when a new comment, trackback, or pingback arrives at your blog, it is submitted to the Akismet web service, which runs hundreds of tests on it and returns a thumbs up or thumbs down. All you need to do is download it from http://akismet.com/download and copy it into the plugins folder (wp-content/plugins/). Copy the akismet folder inside wp-content/plugins/, so it ends up as wp-content/plugins/akismet/akismet.php. Once that is done, register at http://wordpress.com/signup/ and then visit http://wordpress.com/profile/ for the API key. Enter the API key in your admin section and it will start working.
  2. This is one I liked from Matt Cutts' blog: a spam controller to stop some more spammers (the ones who are not good at maths 🙂). Complete details at Math spam plugin.

Sorry for the inconvenience, and I am sure my blog will help you recapitulate those mathematics lessons. Have fun. Do you also use a spam controller on your blog?

Google Bot and Cache

As I promised in my previous post, I am writing about Googlebot and caching. Before getting into it, let's understand how a search engine works and the role spiders/bots play in it.
(This is the simplest diagram of how a search engine caches pages.)
Here the spiders/bots/robots crawl web pages and store them in the page repository (huge databases; if you have used CVS, SVN or any other version-control application you will understand the word "repo" better; in simpler terms, a store house). The ranking algorithm is then applied to the cached pages to produce the SERPs (Search Engine Results Pages). So ranking depends directly on the cached pages, not on whatever is on your pages right now. The crawling logic is also continually redefined, both per site and in general.
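Just to make the flow concrete, here is a toy sketch of my own (nothing like Google's real system): pages go from the crawler into a repository keyed by URL, and ranking only ever reads those stored copies.

// Toy sketch only: crawl -> page repository -> ranking over the cached copies.
var pageRepository = {};   // the "store house"

function crawl(url, fetchPage) {
  // Store whatever the page looked like at crawl time, with a cache date.
  pageRepository[url] = { content: fetchPage(url), cachedAt: new Date() };
}

function rank(query) {
  // Ranking sees only the cached content, never the live page.
  return Object.keys(pageRepository).filter(function (url) {
    return pageRepository[url].content.indexOf(query) !== -1;
  });
}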

Sometimes you will see from your log files that Google is visiting your pages (if you think Google is not visiting them, check your log format, and also check robots.txt) but not caching them. There can be various reasons for this (filters, bans, etc., although with filters and bans I doubt Google visits the pages at all). One of the reasons is "no modification since the last visit".

With SVN we use svn diff to find the modifications; on Linux we simply run diff. Similarly, Google checks whether the page has been modified since its last visit. IMO it would be a criminal offence to repeat what the gurus and gods of search engines have already documented in their own excellent way.

I commented on Matt's blog, but with no answer yet:

As usual, great post, Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, their cache systems, ranking algorithms, etc., and this post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.

  1. If a page has not been updated for the last few days, will the frequency of Google's visits be adjusted accordingly (fewer visits)?
  2. Is a small change like a date update or a feed a big enough change to avoid a Google 304 response?

According to me,

  • Answer 1: Yes, the frequency will change; in the diagram, see how the central logic redefines the bots' logic.
  • Answer 2: So far I do not think Google takes the number of changed bytes into consideration for "If-Modified-Since". As a programmer, you could always write the changed content to a file and check the size of the modification.

Sometime in the future we may well see:

function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > Y) return true;
    return false;
}

The current function might be:

function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > 0) return true;
    return false;
}

As I have mentioned, add feeds, dates and some dynamic content to your pages to get fresh cache dates. I have always read that search engines like pages with fresh content, and a search engine considers a page fresh if it has been modified since the last visit. Also, if you care about bandwidth, you can save some of what Googlebot consumes by returning proper HTTP 304 responses. If you have any doubts, ask and I will try to answer within my limits :).
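For the 304 part, here is a minimal sketch of my own of how a server can honour If-Modified-Since, assuming Node.js; lastModified and pageBody are made-up placeholders, not anything from my real setup.

// Sketch: answer conditional requests with 304 so the bot does not re-download the page.
const http = require('http');

const lastModified = new Date('2006-09-01T00:00:00Z'); // when the page really changed
const pageBody = '<html>...</html>';                   // the static page

http.createServer((req, res) => {
  const since = req.headers['if-modified-since'];
  if (since && new Date(since) >= lastModified) {
    // Nothing changed since the bot's last visit: headers only, no body, no bandwidth.
    res.writeHead(304);
    res.end();
    return;
  }
  res.writeHead(200, { 'Last-Modified': lastModified.toUTCString(), 'Content-Type': 'text/html' });
  res.end(pageBody);
}).listen(8080);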


Remembering good old days

I will carry on with my posts on Googlebot activity; I will be writing about how Googlebot can visit your blog without creating a cache and how you can force it to create one. For now, let's enjoy the lighter side of life.

Do you know this fellow?
Dinesh Upreti
This is my friend Dinesh, who recently updated his Orkut avatar. It reminded us of the last World Cup (2002). Dinesh, Mehta and I stayed back during our vacations to study and to work on idealog. Plans remained plans, and we adopted a rather different timetable.

Dinesh says
I can't forget that incident. "Brazil fans... good"... Yaar, that stay at the hostel is still fresh in memory. Mehta, me and you... watching movies, playing TT and basketball, Mehta and me playing FIFA... heavy breakfast... Pramod's one 4 kg mango... Where is Pramod? Are you listening? Get me that mango.

Mehta and I decided to support Brazil.
Aji Issac

I wrote back in Dinesh's scrapbook (too lazy to rewrite it here):
Yes yaar, what days those were, and what nights. Sleeping till 11, then tea, and then off to the TV room for the football matches. Lunch and back to the TV room; in the evening mango juice, the basketball court and then movies. After the movie, sometimes out to GP at night. We stayed back to study, but I wonder if we even dared to do that :).

I remember the notices asking us to vacate the rooms, and how, after accepting the Brazilian protocol, we were respected as fans :). Pramod's stories, breakfast with Warden Ma'am. Anyway yaar, it's already 12:00; I am off for the day and for the night.

Those were some good days of life. Remembering all those times.

Google Bot mystery

This was no less than a mystery for first-timers. As I mentioned in my previous post, we moved the site to a new server because it was consuming over 100 GB of bandwidth a month and over 4 GB of disk space. The growth rate was the factor that made us take this decision; it was growing by gigabytes every month, if not every week.

As usual, after such a shift you are supposed to keep an eye on the spiders, especially Googlebot. Last time I faced a strange problem and lost almost all of the cache. This time, the team checking the raw log files, both directly and with log analyzers (Sawmill, AWStats), told me that Googlebot was not visiting our site. I took it lightly and assumed it was their mistake, since I could see fresh cached copies on Google. When the team forced me to look at the raw log file, I found they were not at fault; they had reported the truth. I did a grep and found no trace of Googlebot. It certainly worried me.

I knew that without Google visiting our site there could be no fresh cache, so I decided to check the logging configuration. I also asked Prabhat to check it. I saw that the log format was set to common. What does that mean? I investigated further and found a few documents:

Format for common
LogFormat "%h %l %u %t \"%r\" %>s %b" common

Here:

%h: Remote host
%l: Remote logname (from identd, if supplied)
%u: Remote user (from auth; may be bogus if return status (%s) is 401)
%t: Time, in common log format time format (standard English format)
%r: First line of request
%s: Status. For requests that got internally redirected, this is the status of the original request; use %>s for the last.
%b: Bytes sent, excluding HTTP headers

So the common (CLF) format was not logging the user agent, and the user agent is what identifies Googlebot and the other crawlers.
Then I decided to go for the NCSA extended/combined log format:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
Here "%{Referer}i" keeps track of referring URLs and "%{User-agent}i" of user agents.
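With the combined format in place, a simple grep for Googlebot in the raw log shows the visits again. As a hedged illustration only, here is roughly the same check scripted in JavaScript; it assumes well-formed combined-format lines and is not something we actually ran at the time.

// Rough sketch: the user agent is the last quoted field of a combined-format line.
function isGooglebotHit(logLine) {
  var quoted = logLine.match(/"[^"]*"/g) || [];
  if (quoted.length === 0) return false;
  var userAgent = quoted[quoted.length - 1];
  return userAgent.indexOf('Googlebot') !== -1;
}

// isGooglebotHit('66.249.66.1 - - [20/Sep/2006:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"');  // true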
For many of us it was no less than a mystery.