How to authenticate Googlebot

Who should read it: Anyone interested in getting traffic from search engines, especially Google.

Googlebot is a welcome visitor to every website, and recently I have covered a few topics on it. Here is the list:

  1. Google Bot went Unhappy
  2. Google Bot Mystery
  3. Google Bot and cache

Thanks to Matt Cutts for helping me continue the series. As I explained earlier, you can fake Googlebot by changing the user agent. Firefox extensions let you do this easily; try http://chrispederick.com/work/useragentswitcher/.

Matt and his team say,

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don’t think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

Read more at http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
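Here is a minimal sketch of that check in PHP (the function name is my own); it simply chains the reverse and forward lookups described above:

<?php
// Minimal sketch: verify that a visitor claiming to be Googlebot really is.
// The function name is my own; gethostbyaddr()/gethostbyname() do the lookups.
function isRealGooglebot($ip)
{
    // Step 1: reverse DNS lookup on the visitor's IP.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false; // no PTR record at all
    }

    // Step 2: the name must be in the googlebot.com (or google.com) domain.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: the forward lookup of that name must point back to the same IP.
    return gethostbyname($host) === $ip;
}

var_dump(isRealGooglebot('66.249.66.1')); // bool(true) for a genuine crawler
?>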

AdSense and SSL

Who should read it: AdSense does not support SSL. If you have the same site running over both HTTP and HTTPS and you want AdSense on it, then this article is certainly for you.

Recently we enabled Secure Sockets Layer (SSL) for one of our sites. How does SSL help? It encrypts the communication between the client (the browser) and the web server. It stops eavesdropping, and your customers feel more secure because of the lock icon they see in the browser. Note that this is security at the transport layer only.

So we decided to offer both versions of the site: one with SSL (https://, HTTP over SSL) and the normal one with no encryption (http://). The identity was also verified by VeriSign to avoid browser warnings (browsers will issue a warning if the certificate is not signed by a trusted third party). If encryption is new to you, you certainly need to read more. I remember my classes on data encryption and data communication with our great professors Dr Pinaki Mitra and Dr A.K. Laha. (Yes, I remember that I promised to write more about HTTP and HTTPS; I will do that, tonight if possible, otherwise in the coming weeks.)

Even after completing everything, the browser kept warning that the page contains some non-encrypted items. After some investigation we saw that the HTTPS pages were including some items over plain HTTP, such as images referenced with http:// URLs. We made all the image paths relative so that they are fetched over HTTPS when the page is served over HTTPS, and we did the same for all the included JavaScript paths.

The remaining problem was Google AdSense and Google Analytics. I searched the web for a solution and got a big "No" from the Google AdSense help pages.

Do you offer an SSL version of your ad code?

Although you may place the AdSense ad code on a page using Secure Sockets Layer, we do not currently offer an https version of the AdSense ad code at this time. Therefore, you may see a message asking for confirmation to load all items on the page when placing the AdSense ad code on secure https pages.

All our pages were static. AdSense was our bread and butter, so removing it was not a feasible option, and we desperately wanted HTTPS across all pages because many of our customers were asking for it. The only options left were:

  1. Make the pages dynamic and switch AdSense off based on the port number (HTTPS uses port 443, whereas HTTP works on port 80). This would make the pages slower, which was again unacceptable. (A server-side sketch of this idea appears at the end of this section.)
  2. Keep two copies of every page, one for HTTP and another for HTTPS. A maintenance headache, so this was rejected too.

We then decided to use JavaScript to check whether the page is being served over https://:

if (!document.location.href.match(/^https:\/\//)) {
    // AdSense code goes here
}

And similarly for Google Analytics

if (!document.location.href.match(/^https:\/\//)) {
    // Google Analytics code goes here
}

This is what I did as a quick fix, but Google Analytics actually has an SSL option:

<script src="https://ssl.google-analytics.com/urchin.js" type="text/javascript"></script>

<script type="text/javascript">
_uacct = "UA-XXXXX-X";
urchinTracker();
</script>

More details are available on the Google help pages. There is no way to make AdSense work with HTTPS, so the only way out is to switch it off on HTTPS pages.
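For completeness, the port-based idea from option 1 above would look roughly like this in PHP if the pages were dynamic; the adsense-block.php include is a hypothetical file holding the ad code:

<?php
// Sketch: only print the AdSense block when the request did not arrive over SSL.
// This only works if the page is already dynamic; ours were static, hence the JS.
$isHttps = (isset($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off')
        || (isset($_SERVER['SERVER_PORT']) && (int) $_SERVER['SERVER_PORT'] === 443);

if (!$isHttps) {
    include 'adsense-block.php';   // hypothetical file containing the ad code
}
?>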

Asking spammers to work harder

For the last two months my blog has been gaining popularity. I am not lying, and I can prove it: spam. Every day I get 10 to 20 spam comments, and today they crossed the limit with over 100. I decided to make spammers work harder for every comment.

I have added two things:

  1. My favourite, Akismet – When a new comment, trackback, or pingback comes to your blog, it is submitted to the Akismet web service, which runs hundreds of tests on it and returns a thumbs up or thumbs down. All you need to do is download it from http://akismet.com/download and copy it into the plugins folder (wp-content/plugins/), so it ends up as wp-content/plugins/akismet/akismet.php. Once this is done, register at http://wordpress.com/signup/ and then visit http://wordpress.com/profile/ for the API key. Enter the API key in your admin section and it will start working.
  2. This one I liked from Matt Cutts' blog: a spam controller that stops some more spammers (the ones who are not good at maths 🙂 ). Complete details at Math spam plugin; a minimal sketch of the idea follows this list.
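Here is a rough sketch of the maths-challenge idea (this is not the actual plugin's code; the function and field names are my own): show the commenter a simple sum and drop the comment if the answer is wrong.

<?php
// Sketch of a maths challenge for a comment form. Not the real plugin's code.
session_start();

// Print the challenge inside the comment form.
function renderMathChallenge()
{
    $a = rand(1, 9);
    $b = rand(1, 9);
    $_SESSION['math_answer'] = $a + $b;
    echo "What is $a + $b? <input type=\"text\" name=\"math_answer\" />";
}

// Call this before saving the comment; reject it when the sum is wrong.
function isMathAnswerCorrect()
{
    return isset($_SESSION['math_answer'], $_POST['math_answer'])
        && (int) $_POST['math_answer'] === $_SESSION['math_answer'];
}
?>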

Sorry for the inconvenience, and I am sure my blog will help you recapitulate those maths lessons. Have fun. Do you also use a spam controller on your blog?

Google Bot and Cache

As I promised in my previous post, I am writing about Googlebot and the cache. Before getting into it, let us understand how a search engine works and the role spiders/bots play in it.
(Diagram: the simplest view of a search engine's crawl-and-cache pipeline)
Here the spiders/bots/robots crawl web pages and store them in the page repository (huge databases; if you have used CVS, SVN or any other version control application you will understand the word "repo" better; in simpler terms, a storehouse). The ranking algorithm is then applied to these cached pages to produce the SERPs (Search Engine Results Pages). So ranking depends directly on the cached copy of a page, not on what the page contains right now. The central logic also keeps refining the crawling logic for the spiders, both per site and in general.

Sometimes your log files will show that Google is visiting your pages but not caching them (if you think Google is not visiting your pages at all, check your log format and your robots.txt). There can be various reasons for this (filters, bans, etc., though with filters and bans I doubt Google would visit the pages at all). One of the reasons is "no modification since the last visit".

With SVN we use svn diff to find the modifications; in Linux we simply run diff. Similarly, Google checks whether the page has been modified since its last visit. IMO it would be a criminal offence to repeat what the gurus and gods of search engines have already documented in their own excellent way.

I commented on Matt's blog, but with no answers yet:

As usual, great post Matt. I will be checking the video soon. I did read some of the university research papers on how search engines work, their cache systems, ranking algorithms, etc. This post just made it clearer with an illustration. One solution to this is adding the current date or some feeds. I have two questions.

  1. If a page has not been updated for the last few days, will the frequency of Google's visits be adjusted accordingly (fewer visits)?
  2. Is a small change like a date update or a feed enough of a change to avoid a Google 304 message?

According to me,

  • Answer 1: Yes, the frequency will change; in the diagram, see how the central logic redefines the bots' logic.
  • Answer 2: So far I do not think Google takes the number of changed bytes into consideration for "If-Modified-Since". As a programmer you could always write the modified content to a file and check the size of the modification.

Sometime in the future we may well see,

function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > $Y) return true;   // $Y = some threshold
    return false;
}

The current function might be:

function GoogleIfModifiedSince($LastPageContent, $CurrentPageContent)
{
    $ChangedContentFile = CatchTheDiff($LastPageContent, $CurrentPageContent);
    if (SizeInBytesForFile($ChangedContentFile) > 0) return true;
    return false;
}

As I have mentioned, add feeds, dates and some dynamic content to your pages to get fresh cache dates. I have always learned that search engines like pages with fresh content, and a search engine considers a page fresh if it has been modified since the last visit. Also, if you care about bandwidth, you can save some of what the Google bots consume by returning a proper HTTP 304 (Not Modified) response. If you have doubts, ask and I will try to answer within my limits :).
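A minimal sketch of that bandwidth saving in PHP (using the script file's own mtime is just an illustration): send a Last-Modified header, and answer a conditional request with a bare 304 when nothing has changed.

<?php
// Sketch: honour If-Modified-Since so bots get a cheap "304 Not Modified"
// instead of the full page. Using this file's mtime is only an illustration.
$lastModified = filemtime(__FILE__);

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');   // nothing new since the last crawl
    exit;                                  // no body sent, bandwidth saved
}

// ...otherwise render the full page as usual...
?>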


Google Bot mystery

This was no less than a mystery for first-timers. As I mentioned in my previous post, we shifted servers because our site was consuming over 100 GB of bandwidth a month and a few GB of hard disk space. The growth rate was the factor that made us take this decision: it was growing by gigabytes every month, if not every week.

As usual, after a shift you are supposed to keep a check on the spiders, especially Googlebot. Last time I had faced a strange problem and lost almost all of the cache. This time our team, who were checking the raw log files directly and with log analyzers (Sawmill, AWStats), told me that Googlebot was not visiting our site. I took it lightly and assumed it was their mistake, since I could see a recent cache in Google. When the team pushed me to look at the raw log file, I found they were not at fault; they had reported the truth. I did a grep and found no trace of Googlebot. It certainly worried me.

I knew that without Google visiting our site it cannot create the cache, so I decided to check the log configuration. I also asked Prabhat to check it. I saw that the log format was "common". What does that mean? I investigated further and found a few documents:

Format for common:
LogFormat "%h %l %u %t \"%r\" %>s %b" common

Here:

%h: Remote host
%l: Remote logname (from identd, if supplied)
%u: Remote user (from auth; may be bogus if the return status (%>s) is 401)
%t: Time, in common log format
%r: First line of the request
%>s: Status. For requests that were internally redirected, %s is the status of the original request; %>s gives the status of the final one.
%b: Bytes sent, excluding HTTP headers.

So the common (CLF) format was not logging the user agent, which is what identifies Googlebot and the other bots. I therefore switched to the NCSA extended/combined log format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

Here \"%{Referer}i\" logs the referring URL and \"%{User-agent}i\" logs the user agent. For many of us it was no less than a mystery.
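Once the combined format is in place, a quick check like the following (the log path is a placeholder) is enough to confirm whether Googlebot is really visiting:

<?php
// Quick sketch: count Googlebot requests in a combined-format access log.
// The log path below is a placeholder; adjust it for your server.
$log  = '/var/log/apache2/access.log';
$hits = 0;

foreach (file($log) as $line) {
    // A simple substring match is enough for a sanity check; in the
    // combined format the user agent appears in the last quoted field.
    if (stripos($line, 'Googlebot') !== false) {
        $hits++;
    }
}

echo "Googlebot requests found: $hits\n";
?>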

Google bot went unhappy

This happened when we shifted servers the previous time (about six months earlier). After the shift we kept a watch over the bots, and we started facing THE PROBLEM with a few of them: cache loss.

I checked everything: robots.txt, .htaccess, the PHP programs, frames, everything possible. I validated robots.txt and ran XHTML validation on all the pages to make sure I was not doing anything wrong.

It did no good. The number kept going down and down, from over 20,000 to 10,000 and from 10,000 to 5,000. It started worrying me and my team, since search engines contribute a big share of traffic (almost 60% in our case).

Then I started investigating:

  • Investigation part 1:
    I changed my user agent to Googlebot so I could see the site the way Googlebot does. I was still able to access the pages.
  • Investigation part 2:
    Checking the log files manually. I could find no trace of Googlebot.
  • Investigation part 3:
    Making sure that Google had no problems at its end. I read almost all the recent search engine postings at WebmasterWorld, Search Engine Watch, digg.com, WebProWorld, hedir.com, and blogs like mattcutts.com. I found nothing. Our other sites were not losing their cache either.
  • Investigation parts 4 to 100:
    Did every other check possible.

No way out – the last shot
When we saw that there was no way out, we decided to shift the servers back. Then, while testing with the Live HTTP Headers extension, I saw that the Content-Type header being sent for our txt files was "text/html"; the server was not sending "text/plain" for them. I asked the question on various forums and everyone said it shouldn't make any difference. I had no other options, so I configured the server to send the correct Content-Type, "text/plain", and left the rest to God.

It was a Eureka moment: Google started visiting us again and soon cached all the pages. Believe it or not, that header mattered for Googlebot. They may have corrected it later, but it certainly mattered for us at the time.
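If you want to see what your own server sends for robots.txt (the file where this bit us), a sketch like this will do; the URL is a placeholder:

<?php
// Sketch: inspect the Content-Type your server returns for robots.txt.
// The URL below is a placeholder; point it at your own site.
$headers = get_headers('http://www.example.com/robots.txt', 1);

echo $headers[0] . "\n";                                   // e.g. HTTP/1.1 200 OK
echo 'Content-Type: ' . $headers['Content-Type'] . "\n";   // should be text/plain
?>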

mysqld dead but subsys locked

Recently we moved our site to our own server with a better configuration. We did almost everything right and it worked for weeks without problems, but then came an error:

“mysqld dead but subsys locked”

I searched the web and found only some unrelated discussions. In such cases the true friend of a programmer is the log file, and it explained everything: "no disk space available". This happened because we had been logging a few queries to check query speed. Once the disk space issue was corrected, MySQL started working gracefully again.

HtAccess difficult problems #1

I have spent months working with htaccess, doing almost everything possible with it: checking cookie variables, non-www to www domain redirection (easy), www subdomain to non-www redirection (a little tougher), etc. The best part was that we had one htaccess file serving as many as five sites (plus their alpha and beta sites), and it worked fine with all the regular expressions (everything was a variable, including the domain name).

Some of the difficult problems we faced with htaccess:
Problem 1: Comparison of variables

Solution 1:

According to JDMorgan of webmasterworld.com

There is no 'native' support in Apache for comparing two variables, although some operating systems support 'atomic back-references' which can be used to emulate a compare. This depends on the regex library bundled with the OS. Specifically, POSIX 1003.2 atomic back-references can be used to do a compare by using the fact that if A+A = A+B, then A=B.

RewriteCond %{HTTP_REFERER} ^http://([^/]+)
RewriteCond %{HTTP_HOST}<>%1 ^([^<]+)<>\1$ [NC]
RewriteRule ^uploads/[^.]+\..{3,4}$ - [L]

Note that the "<>" string is entirely arbitrary and has no special meaning in regular expressions; it is used here only to mark the boundary between the two concatenated variables. The actual 'compare' is done in the second RewriteCond, using the atomic back-reference "\1" to 'copy' the value of the string matched by the parenthesised pattern directly to its left.

Therefore,
if %{HTTP_HOST}<>%1 matches ^([^<]+)<>\1$,
then %{HTTP_HOST} equals %1 (the partial referrer).

This may need some tweaking to fit your actual referrers, since the match between hostname and the partial referrer substring saved in %1 must be exact. And as noted, it will only work on servers which support POSIX 1003.2 regular expressions (FreeBSD is one, and there are others.) I know of no way to support variable-to-variable compares in mod_rewrite without this POSIX 1003.2 trick.

Solution 2:

Set an environment variable first:
SetEnvIfNoCase Referer ^http://([a-zA-Z]{2,3})\.idealwebtools\.com.* HostNameAndReferrerNameAreFromSameDomain=True

And then use it in the logic
RewriteCond %{ENV:HostNameAndReferrerNameAreFromSameDomain} !^True$ [NC]
RewriteRule (.*) redirection [R=301,L]
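If you control the application layer, the same host-versus-referrer comparison is much simpler there. A rough PHP sketch (the redirect target is a placeholder):

<?php
// Sketch: compare the Host header with the referrer's host at the application
// level, instead of relying on the POSIX back-reference trick in .htaccess.
$host    = isset($_SERVER['HTTP_HOST']) ? $_SERVER['HTTP_HOST'] : '';
$referer = isset($_SERVER['HTTP_REFERER'])
    ? parse_url($_SERVER['HTTP_REFERER'], PHP_URL_HOST)
    : '';

if (strcasecmp($host, (string) $referer) !== 0) {
    header('Location: http://www.example.com/', true, 301);   // placeholder target
    exit;
}
?>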

PayPal hack – how do PayPal accounts get hacked?

Are the bad guys smarter? If so, blame the good guys, because most good guys are ignorant and expect everything to be good. As you might know, I work on web technologies, and I get many complaints about probable PayPal hacks (and hacks of important accounts in general). PayPal is a very safe site, and in most cases the hacking happens at the user level (the user's PC); we call this level-0 hacking. Most of us started learning at this zeroth level before moving to hacks based on complex tools and algorithms.

Different levels of PayPal hacks

  • Computer-level PayPal hack: Using various keyloggers (where every key press is stored on the computer). I myself once had the opportunity to work on such a product. It is advisable not to use your important accounts from public machines such as cyber cafes, where people install keyloggers. One famous such application is Back Orifice. Other spyware can be just as deadly, so I advise you to run Spybot regularly.
  • DNS-level PayPal hack: Every site is associated with an IP address that is resolved via DNS. DNS has various cache levels, and some people can manipulate them. Also check your computer's hosts file; it may be sending you to a different server. Let me know if more explanation is needed. Some local DNS servers can also be used for such hacks.
  • Interception: Someone at a proxy reading all your details. HTTPS takes care of this by encrypting the communication. Also try to read the certificate; that takes care of a lot of issues. If needed, I can explain this in detail.
  • Server-level PayPal hacks: These need a higher level of hacking expertise. Good server admins take care of this, and PayPal surely spends a lot of effort securing its servers, so don't worry much about it.
  • User ignorance: This is the major issue in PayPal hacking and leads to the maximum damage. Let me explain it in detail below; keep reading the comments as well, as I will keep adding things to watch out for.

User ignorance can be deadly for PayPal hacking

Here is a simple case of a PayPal hack. Earlier I used to ignore all mails from PayPal, but these days, since I have a PayPal account, I can't. This is the most common (and cheapest) way of hacking; we call it zeroth-level (level-0) hacking. Do not forget to send this to all your friends, who might one day fall prey to this simple, cheap trick. I got a mail which said,
(Screenshot: the phishing email pretending to be from PayPal)

Everything was so perfect. I checked the spelling of the URL for the usual phishing trick; sometimes it is payapal.com or paypaal.com. This time it was perfect, but I still wasn't sure. I moused over the image and saw:

(Screenshot: the real link target revealed on mouse-over)

If I were a little naive about technical concepts, I might have ended up entering my PayPal username and password. The website looks exactly like PayPal; try http://www.oscormerce.dk/images/www.paypal.com/webscr/update.do=profile/index.html. Enter some fake details and you will find that it keeps asking for more. Be careful.

Some may say you should look for the secure lock. That's good advice, but it doesn't make you safe either, because we still end up in trouble through our own ignorance. https:// and the secure lock merely encrypt the communication between the web server (Apache, say) and the browser (and change the communication port), which stops one kind of hacking known as interception. Enabling HTTPS is a plain piece of cake, a five-minute task. Be alert and be safe.

Be careful about browser hijacking too

Some useful links: http://video.google.com/videoplay?docid=9076288729387457440. Do not install anything that you are not sure of.

Help your friends by sending this post to everyone you think should know this. Keep reading my blog for other articles on Orkut, security and marketing.

How to make your websites secure

Here are some of the new methods being used by most banks:

  1. Identifying image tied to the customer id: When you enter your customer id, the site shows an image that you uploaded earlier (I generally suggest using your own photo). So the moment you enter your customer id, you see your image and can verify the website. This helps against the DNS-level hacks described above. (A small sketch of this flow follows the list.)
  2. Virtual keyboards: Instead of your computer keyboard, banks now provide an on-screen keyboard. Keyloggers will only see mouse clicks and will never get your password. This avoids the keylogger trap.
  3. The secure lock (HTTPS) generally takes care of interception-level hacks.
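A minimal sketch of the identifying-image step in PHP; the database, table and column names are entirely my own illustration:

<?php
// Sketch: after the customer id is submitted, show the image that customer
// chose at signup, so the user can verify the site before typing a password.
// Database, table and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'dbuser', 'dbpass');

$stmt = $pdo->prepare('SELECT security_image FROM customers WHERE customer_id = ?');
$stmt->execute(array($_POST['customer_id']));
$image = $stmt->fetchColumn();

if ($image !== false) {
    echo '<img src="/security-images/' . htmlspecialchars($image) . '" alt="Your chosen image" />';
    echo '<p>If this is not the image you chose, do not enter your password.</p>';
}
?>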

Do you have questions about PayPal hacking?

Please ask them here and let us answer. We don't hack into PayPal accounts, but we do help people secure theirs.