Video: Http Vs Https – Duplicate content issues


The solution using htaccess


# Serve a different robots.txt over HTTPS (port 443) to stop https crawling and avoid duplicate content
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

There are other ways of solving it too, like

<?php
if ($_SERVER["SERVER_PORT"] == 443) {
    echo '<meta name="robots" content="noindex,nofollow">';
}
?>

You can even do canonicalization, but it is not always possible, as you may want both URLs to be accessible.

Why is a Link called a Hyperlink?

I was writing a bigger post, “SEO is a curse for content writers”, when I remembered a very interesting question. Recently, while training new recruits, I was asked: “Why is a Link called a Hyperlink?”. I did not have a proper answer when she asked me this interesting question; I had to go through some documents to give her one.

  • Waiting for your new blog post.
    Sundays generally keep me busy with Bible classes. These days we are studying Hell, Heaven, Angels etc. Very interesting topics. I hope to complete the post on SEO and Content writers by tomorrow night.
  • “Why is a Link called a Hyperlink?”
    Let’s look at the definition itself, which says, “blah blah **$#&^%”. Sorry, I could not get a good definition, so let me define it myself. A hyperlink is the reference point for a hypertext. This leads us to another question: “what is hypertext?”. Hyper implies excess, and thus hypertext implies excess content. A hypertext is a super text which can create another layer of content on top of the existing content. In other words, a text (hypertext or anchor text) which refers to another document is of more value than traditional text.
  • Interesting!
    Yup, it is basically because of the power of the text which acts as a text with excessive hidden power. Hidden, as it remains like a normal text, with just an indicator (generally underline and blue color), unless demanded with a click or mouse action.
  • Cool!
    To keep a short story short, a hyperlink is a reference point for an excessive, hidden and powerful text known as hypertext. And don’t forget to read about link titles and how users can benefit from them. I call it a meta for hyperlinks.

The web without hyperlinks is beyond our imagination. The world’s best websites (be it Yahoo or Google) are more or less collections of hyperlinks. In Search Engine algorithms these hyperlinks play a very important role, as they build a layer of additional content for visitors. I think this post will help us answer one of the most debated SEO topics, “Does an outbound link help in SE ranking?” (Wow! this WMW thread has my opinion too, so do check the hyperdoc).

All about SVN – subversion

I can’t imagine life without Subversion. If I look back, I can still see us maintaining different “change log files” to track changes and doing daily local backups to preserve each day’s changes. Every other week we used to have an (almost unpleasant) discussion: “Who made this change?”, “What happened to the previous copy, do you have a local copy?”.

What is SVN?

The definition says, “an open source version control system using a central repository, with great unique features like directory versioning, true versioning, atomic commits, versioned metadata, plugging into Apache for new network mechanisms, consistent data handling, efficient branching and tagging, and hackability”.

According to me, “It is my personal programming assistant; I do the logic and the program, and it takes care of the rest”. I can’t imagine programming without SVN.

SVN functional cycle

Let the image do all the explanation, I have tried to keep it simple. Do let me know if it is not clear.

SVN - Subversion cycle
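In case the image does not load, here is the same cycle in commands. This is a session sketch, not a tested recipe; the repository URL and file names are hypothetical placeholders.

```shell
$ svn checkout http://example.com/svn/project/trunk project   # get a private working copy
$ cd project
$ svn update                        # sync with the latest revision in the repo
$ vi index.php                      # ...make your changes...
$ svn status                        # review what changed locally
$ svn diff                          # inspect the changes line by line
$ svn commit -m "describe change"   # publish back to the central repository
```

The cycle then repeats: update, edit, review, commit.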

Summarizing the red-bean svnbook

Some of the commands that will take care of 99% of your svn work (for details read the book).

  1. SVN Checkout – (shortcut svn co) This will create a private copy/local copy of the project for you.
  2. SVN Commit – (shortcut svn ci) Publishing your changes to the main repository, making them a part of the main work. You can commit any number of files in one go. Use atomic commits as much as possible.
  3. SVN update – (shortcut svn up) Update your local copy with the latest changes done to the main repo. Remember you can also do svn update -r445 (445 is a revision number, a previous snapshot of the project).
  4. SVN diff – compare two revisions (revision is like a snapshot of the project). See SVN cat also.
  5. SVN log – this shows the log messages recorded with SVN commits. Always write verbose comments with svn commits, for YOURSELF.
  6. SVN add – This is to schedule a file/directory to be added with next commit. It is not added to the center/main repo till next commit.
  7. SVN delete – schedule a file/directory to be deleted with next commit
  8. SVN copy – creates a duplicate copy but maintains the copy history. The best part is certainly the copy history, which is very useful for branching and tagging.
  9. SVN move – makes life simpler; it is just a rename (I used to do a cp, del, svn add instead of this one command).
  10. SVN status – I think I use this command more often than any other command, as it doesn’t change anything. It shows you all the changes that have happened to your local copy. (If you are interested in understanding this, read how SVN maintains a pristine copy of the checked-out version to track changes.) You will see these labels for changed files: A (scheduled for addition with svn add), C (conflict: you are changing something that was already changed), D (scheduled for deletion with svn del), M (modified locally), X (part of an externals definition; I have never seen this), ? (not a part of SVN), ! (SVN expects the file but it is missing), ~ (the object’s kind changed, e.g. a file replaced by a directory), L (locked files) and I (ignored).
  11. SVN revert – undo the changes done to the local copy. Works well with svn diff.
  12. SVN cleanup – Subversion writes to a log file before doing the final task and removes it when done; when an operation is interrupted (stays unfinished) in between, the log files are not deleted, and svn cleanup clears them.
  13. SVN propedit – Ignoring some cache folders can be helpful: svn propedit svn:ignore cache/temp/ --editor-cmd emacs. This will open the emacs editor; enter * (or any rule) to ignore everything and save.
  14. SVN info – Print information about paths.
  15. SVN Merge – It applies the changes, unlike svn diff which only shows them. Useful when you do branching or tagging. I haven’t used it in a convincing way, as branches for our projects did not make much sense to me.
  16. SVN Switch
  17. SVN ignore
  18. SVN mkdir
  19. SVN blame
  20. SVN propdel
  21. SVN propget
  22. SVN proplist
  23. SVN resolved

Some admin help

  1. SVNlook – (subcommands) author, cat, changed, date, diff, dirs-changed, history, info, log, propget, proplist, tree, uuid and youngest.
  2. SVNadmin – As an admin you can’t live without it. (subcommands) create (to create the project), deltify, dump (when you need to get a part of the repository as a new repository; very useful when some revisions are corrupted, and you can use svndumpfilter with it), hotcopy, list-dblogs, list-unused-dblogs, load, lstxns, recover (very helpful when nothing else works), rmtxns, setlog and verify (most of the troubleshooting starts and ends with verify).
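A typical corrupted-repository recovery with these subcommands might look like this. It is a sketch only; the repository paths are hypothetical and you should test against a copy first.

```shell
$ svnadmin verify /var/svn/repo               # walk every revision and locate the corruption
$ svnadmin dump /var/svn/repo > repo.dump     # dump the repository history to a file
$ svnadmin create /var/svn/repo-new           # create a fresh repository
$ svnadmin load /var/svn/repo-new < repo.dump # replay the history into it
```

If some revisions are beyond saving, svndumpfilter can trim them out of repo.dump before the load.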

Also read about FSFS (the name of a Subversion filesystem implementation) and its advantages over the Berkeley DB-based implementation. Then there can be a 2 GB Apache problem (which is solved in the latest Apache releases).

Hope this will give you an overall summary of SVN – subversion.

Common Blog API Access URLs

If you are a programmer and work on blog products, then this post makes a lot of sense to you. I am (and was) working on a few products for the blogosphere, and our programmers had mixed up API types for the different blog systems. I found a cool document by Google, and here it is for all our new programmers; it will act as a reference doc for me. Taking a day off tomorrow, so expect a lot of changes and posts from me.

Blog System | API | URL
Blogsome | Blogger | http://YOURBLOG.blogsome.com/xmlrpc.php
Conversant | MovableType | http://YOURBLOG/RPC2
Drupal 4.4+ | MovableType | http://YOURBLOG/PATH/TO/xmlrpc.php
GeekLog | Blogger | http://YOURBLOG/blog/
JRoller | MetaWeblog | http://www.jroller.com/xmlrpc
Manila | MetaWeblog | http://YOURBLOG/RPC2
MovableType | MovableType | http://YOURBLOG/PATH/TO/mt-xmlrpc.cgi
Nucleus < 2.5 | MetaWeblog | http://YOURBLOG/PATH/TO/nucleus/xmlrpc/server.php
Nucleus 2.5+ | MovableType | http://YOURBLOG/PATH/TO/nucleus/xmlrpc/server.php
PLog | MetaWeblog | http://YOURBLOG/xmlrpc.php
pyblosxom | MetaWeblog | http://YOURBLOG/PATH/TO/cgi-bin/pyblosxom.cgi/RPC
pMachine | Blogger | http://YOURBLOG/pm/pmserver.php
Quick Blog | MetaWeblog | http://YOURBLOG/MetaWeblog.aspx
Roller | MetaWeblog | http://YOURBLOG/xmlrpc or http://YOURSITE/root/xmlrpc
Serendipity | MovableType | http://YOURBLOG/serendipity/serendipity_xmlrpc.php
TextPattern | MetaWeblog | http://YOURBLOG/PATH/TO/textpattern/xmlrpcs.php
TypePad | Blogger | http://www.typepad.com/t/api/xmlrpc.php
Typo | MetaWeblog or MovableType | http://YOURBLOG/backend/xmlrpc
WordPress | MovableType | http://YOURBLOG/PATH/TO/xmlrpc.php
Xaraya | MovableType | http://YOURBLOG/PATH/TO/ws.php?type=xmlrpc

WordPress database error: [Got error 127 from storage Engine]

I know I have not been blogging for the last few days; I got busy with new recruits. After almost a year (last time was when we hired (mass hiring) a few programmers and marketing guys from the Army Institute of Management and other B-Schools) I revamped the whole mentoring program. I am liking this mentorship program, as it has the right blend of tech and non-tech dimensions. I will post more about it.

Meanwhile I was getting this error:
WordPress database error: [Got error 127 from storage engine]
SELECT COUNT(comment_ID) FROM wp_comments WHERE comment_approved = 'spam'

Error 127

Initially I thought it could be an Akismet-related error and gave it a day. Today, after a good 8-hour sleep, I did a search on Error 127 and found that “Error 127 indicates a record has crashed”. When there is a crash you need to do a repair.

How to repair WordPress database error

If you are a non-techy guy (or want the simpler solution),

  1. Go to cPanel
  2. MySQL Databases
  3. Find your database
  4. You will see a button named Repair under your blog database; click it and it will take care of everything.

If you are a little techy,

  1. Use phpMyAdmin
  2. In the main panel, you should see a list of your database tables. Check the boxes next to the tables that need repair.
  3. At the bottom of the window, just below the list of tables, there is a drop-down menu. Choose “Repair Table”.

You can even do it manually using REPAIR TABLE `wp_useronline` etc. I did a search for similar errors and saw many going through such a phase. Hope this article helps.
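If you prefer the command line, the same check and repair can be run through the mysql client. The database name and user are placeholders; wp_comments is the table from the error above, so adjust to whichever table is crashed.

```shell
$ mysql -u USER -p WORDPRESS_DB -e "CHECK TABLE wp_comments; REPAIR TABLE wp_comments;"
```

CHECK TABLE first tells you whether the table is actually corrupted before you repair it.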

Below the Top command

The top command is certainly in every sysadmin’s frequently used command list. As the man page says, “The top program provides a dynamic real-time view of a running system. It can display system summary information as well as a list of tasks currently being managed by the Linux kernel.” This is simple but still not well explained. I was learning to install munin-node when I decided to read more about every field top displays.


Top command
(This is a screenshot of top run on idealwebtools.com, a shared server)

Let’s look at each section.

The first section – Uptime

top - 13:46:02 up 1 day, 14:27,
Starting from the left, 13:46:02 is the current time, which you can get like this:
aji@sawyer [~]# date
Wed May 23 13:46:23 EDT 2007

The next part shows the server uptime; it matters, as servers can run for many hundreds of days without a restart. You can also check it with:
aji@sawyer [~]# uptime
13:48:09 up 1 day, 14:29, 1 user, load average: 2.38, 1.63, 1.62

The second section – Active User

So we have 1 active user, nothing more to say here.

The Third section – Load Average

This is a very important piece of information.
load average: 2.38, 1.63, 1.62
The three values represent the processor load averaged over the last 1, 5 and 15 minutes respectively. Roughly, it is the average number of processes queued and waiting for CPU time over the given window. Many feel that fewer than 1 waiting process per processor is good; some feel a processor can handle 10. I still recommend keeping it as low as 1 per processor, and that is achievable. A single reading does not give you the complete picture, as you need to poll it again and again to see the trend. Various applications are available that can run a cron job to poll it at a specific interval, or if you have some time you can write a small script and run it every 5 minutes using cron. You can read more about load average at http://www.teamquest.com/resources/gunther/display/5/index.htm. This load average is very important data to have.
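If you do write that polling script, the three values are easy to pull out of the uptime output with awk; a minimal sketch, using the sample line above as fixed input:

```shell
# extract the 1/5/15-minute load averages from an uptime line
line='13:48:09 up 1 day, 14:29, 1 user, load average: 2.38, 1.63, 1.62'
echo "$line" | awk -F'load average: ' '{print $2}'   # prints: 2.38, 1.63, 1.62
```

On Linux you can skip uptime entirely and read /proc/loadavg, which holds the same three numbers.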

Fourth section – Tasks

The next section shows the task details:
Tasks: 194 total, 2 running, 192 sleeping, 0 stopped, 0 zombie
If you have a lot of tasks in the running state, do a good analysis to check them. Tasks shown as running are more properly thought of as ‘ready to run’. If you want to read more about zombie tasks, please visit http://www.ussg.iu.edu/hypermail/linux/kernel/0212.1/0864.html. The rest is quite obvious; also look at the processes that are running and kill unwanted ones.

Fifth section – CPUs

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
What do these things mean? Here is a small explanation for each field:

  1. us -> User CPU time: the time the CPU has spent running users’ processes that are not niced.
  2. sy -> System CPU time: the time the CPU has spent running the kernel and its processes.
  3. ni -> Nice CPU time: the time the CPU has spent running users’ processes that have been niced.
  4. wa -> iowait: the amount of time the CPU has been waiting for I/O to complete.
  5. hi -> Hardware IRQ: the amount of time the CPU has been servicing hardware interrupts.
  6. si -> Software interrupts: the amount of time the CPU has been servicing software interrupts.
  7. id -> Idle: the time the CPU has spent doing nothing.
  8. st -> Steal time: time stolen from this virtual machine by the hypervisor (reported as unknown prior to Linux 2.6.11).

This shows a breakup of CPU usage; depending on your server’s role, you need to optimize it. If you do a lot of disk writing, keep a watch on iowait. You might be wondering what “the time the CPU has spent running users’ processes that are not niced” means. If you do a “man nice”, it will say “nice – run a program with modified scheduling priority”. It is called “nice” because the number given to a process determines how willing the task is to step aside and let other tasks monopolize the processor. The number varies from -20 to 19. The default value is 0; higher values lower the priority and lower values increase it. If you want to read more about nice, visit http://wiki.linuxquestions.org/wiki/Nice.
When you run top, it shows the NI value for each process:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
17578 root 15 0 13456 13M 9020 S 18.5 1.3 26:35 1 rhn-applet-gu
19154 root 20 0 1176 1176 892 R 0.9 0.1 0:00 1 top
1 root 15 0 168 160 108 S 0.0 0.0 0:09 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
5 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
6 root 35 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 15 0 0 0 0 SW 0.0 0.0 0:07 1 bdflush
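For monitoring scripts, the percentages in the Cpu(s) line shown earlier can be pulled out with standard text tools; a small sketch that extracts the idle figure from that sample line:

```shell
# extract the idle percentage from a sample top Cpu(s) line
line='Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st'
echo "$line" | grep -oE '[0-9.]+%id' | sed 's/%id//'   # prints: 100.0
```

Swap %id for %wa in the pattern to watch iowait the same way.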

Sixth section – Memory

Mem: 1536000k total, 1437272k used, 98728k free, 234212k buffers
Swap: 1020116k total, 72k used, 1020044k free, 567208k cached

This is very much self-explanatory. You can also run free -m to get a different view:

free -m
             total   used   free  shared  buffers  cached
Mem:          1500   1403     96       0      228     553
-/+ buffers/cache:    620    879
Swap:          996      0    996

This is RAM and SWAP. If I recall the memory classes we had during post graduation, there are different types of physical memory:

  • CPU Registers – the fastest; like your hands, used to do tasks in the fastest way, but very limited.
  • CPU Cache – like your office desk, a quickly accessible location.
  • RAM – Random Access Memory – like your office; you have to walk around to get the work done.
  • Disk – like a different location altogether, so you have to do a lot of traveling to get the work done. SWAP is basically a location on the disk used when RAM itself is not sufficient. Swap partitions are kept separate (not necessarily; you can use a swap file instead) so that the OS can make access as fast as possible.

If your server is using a lot of SWAP often, you need to look into it, as it will make your server slow. We try not to use SWAP as much as possible. “Swap cached” means written to swap but still in memory: the OS anticipates memory needs and pre-swaps inactive data while keeping it in memory.

(SwapTotal – SwapFree – SwapCached) is the actual swapping (memory that will need to be read back from disk).
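On Linux those three numbers live in /proc/meminfo, so actual swapping can be computed with a one-liner; a sketch fed with hypothetical sample values (in kB), not real server figures:

```shell
# actual swapping = SwapTotal - SwapFree - SwapCached (sample values in kB)
awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} /SwapCached/ {c=$2} END {print t - f - c}' <<'EOF'
SwapTotal:     1020116 kB
SwapFree:      1020044 kB
SwapCached:         40 kB
EOF
```

This prints 32 for the sample values; on a live server, point awk at /proc/meminfo instead of the here-document.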

Few more commands and reference for help

  1. Look at vmstat (do a man vmstat)
     procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
      r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
      0  0     72 291196 236744 561308    0    0    15    23    6   42  2  0 97  0  0
  2. You can also try the Sysstat suite of resource monitoring tools.

A very big post for the day. Enjoy, bottoms up for top.

Google helps in Simple Hack to other’s server – Munin

Want to know how much CPU or memory other servers utilize? For your own server you can always run a few Linux commands to get it.

Use top, uptime, free -m. But how do you see another server’s details? Use one command: Google (this query will show you different people using Munin for server monitoring).

Some of the examples :-

  1. http://munin.linefeed.org/
  2. http://monitoring.medias-cite.org/munin/index.html
  3. https://munin.sioban.net/

It is advisable to make the munin directory password protected (as you do not want to give out any information about your server).
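A minimal .htaccess sketch for that protection, assuming your host allows Basic auth (the AuthUserFile path is a hypothetical example):

```
AuthType Basic
AuthName "Munin statistics"
AuthUserFile /home/USER/.htpasswd
Require valid-user
```

Create the password file once with htpasswd -c /home/USER/.htpasswd username, and keep it outside the web root.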

SVN error due to htaccess

The interview spree continues, but I am hoping to finalize one tech analyst today; he is at least able to solve all the FizzBuzz problems. For the last 2 days we have been facing some problems with SVN. There is no problem with svn update or svn checkout, but committing files shows an error.

sh-3.00$ svn commit test.php -m "Aji: adding a test file to check svn problem"
Adding test.php
svn: Commit failed (details follow):
svn: PROPFIND request failed on '/svn/blahblah/trunk/test.php'
svn: PROPFIND of '/svn/blahblah/trunk/test.php': 200 OK (http://www.blahblah.com)

We earlier presumed it to be a typical SVN permission problem, but everything was OK. Also, none of the SVN folders were corrupted. We checked the Apache configuration, which had

<Location /blah-blah/svn>
DAV svn
SVNParentPath /blahblah/svn
AuthType Basic
AuthName grmtech.com
AuthUserFile /blahblah/etc/svn.basic.passwd
Require valid-user
AuthzSVNAccessFile /blahblah/etc/svn-access.conf
</Location>

We also realized that the problem started occurring after adding the .htaccess to http://www.blahblah.com. As anyone would, we keep the SVN repos in a different folder from public_html (the document root of the website used for SVN).

After testing a few things we came to the conclusion that it is happening because of the customized 404 catcher:
ErrorDocument 404 /sys/common/tools/404handler.php
We were quite worried, as the document root was not supposed to be accessed during these calls; they should be served from the SVN folder. After reading some Apache docs I understood the order of processing:

The <Location> directive provides for access control by URL. It is similar to the <Directory> directive, and starts a subsection which is terminated with a </Location> directive. <Location> sections are processed in the order they appear in the configuration file, after the <Directory> sections and .htaccess files are read, and after the <Files> sections.

So .htaccess is processed before the <Location> directive. Now the issue was how to make the <Location> directive take effect before .htaccess. After some tryouts we were not able to do it, so we took the help of Alias. We added

<VirtualHost 1.1.1.1:80>
--
--
Alias /blahalah/svn "/complete-path/svn"
<Directory "/full document root path for the website/">
--
--
</Directory>
</VirtualHost>

It is working fine. Hope it helps someone in a similar situation. Going back to work; planning to stay at the company guest house for server auditing.

Google toolbar/Algo Vs ICANN definition

Recently, while discussing canonicalization at WMW, a member (bcrbcr) pointed out an issue where the Google Toolbar conflicts with the ICANN definition of domain names.

ICANN says,

EXAMpLE.com
EXAMPlE.com
EXAMPLe.com
etc.

In the languages that utilize Latin characters (e.g., English, Finnish, German, Italian, etc.), each letter has two variants: upper case and lower case. The Internet’s basic DNS and hostname specifications provide that the upper-case and lower-case variants of each letter are considered to be equivalent. Thus, all the variant domain names in the above list are treated as the same domain name.

Now, does it apply to example.COM (COM in upper case) too?
In my opinion yes: “as example is to .com, com is to dot” (the fully qualified domain name is example.com. with a trailing dot, not just example.com; the silent dot says a lot about Web architecture). So the case of the TLD (Top Level Domain) should not matter.

Issue with Google toolbar
Google Page rank error
Google Page Rank for google.com

  • google.com – PR 10 (all lower case)
  • GOOGLE.com – PR 10 (the TLD is lower case, the rest is upper case; even a mix of upper and lower case gives the same PR)
  • google.COM – No PR (When the TLD is in uppercase it doesn’t work)

Can this be a problem?
Yes, but only for Google’s ranking. Google gives a lot of value to links, and PR is just a small indication of its link juice. If Google is treating these domains as different at the PR level, then there is a chance it happens while counting link value as well. If some links are coming to example.COM, then those links may not be counted for the lower-case-TLD domain (which is technically the same domain). I will need to do some experiments to confirm it.

Other observations
This problem does not exist for .co.uk or .co.in, but it does exist for .org. Other TLDs I will have to check.

As I mentioned earlier, we are not able to solve it using any canonicalization. Are we missing anything? Btw, the WMW discussion is now labeled as a Featured Home Page Discussion.

Linking to http://www.mattcutts.COM as a test case.

Canonicalization Series 2: Domain name to lower case

I will continue the series on canonicalization (I have a big post in draft, still adding points to it). I had started a thread at WMW about canonicalization, where I saw a very interesting query today. It was about converting an upper case domain name to lower case.

The query says,

The following is my current (simple) Mod rewrite, and I am still confused as to why the capitalisation in the domain doesn’t get forced to lower case.

I assumed that www.EXAMPLE.COM would be forced to www.example.com – doesn’t seem to work that way.

Is making upper case domain name to lower case part of canonicalization?
Certainly not. Let the authority docs do the explanation,

example.com
Example.com
eXample.com
exaMple.com
examPle.com
exampLe.com
examplE.com
EXAMPLE.com
ExAMPLE.com
EXaMPLE.com
EXAmPLE.com
EXAMpLE.com
EXAMPlE.com
EXAMPLe.com
etc.

In the languages that utilize Latin characters (e.g., English, Finnish, German, Italian, etc.), each letter has two variants: upper case and lower case. The Internet’s basic DNS and hostname specifications provide that the upper-case and lower-case variants of each letter are considered to be equivalent. Thus, all the variant domain names in the above list are treated as the same domain name.

Since lower case and upper case domain names are technically the same, we do not need to do canonicalization (we are also helpless here; we are not able to do anything, as the server variable is always in lower case). Canonicalization is needed for URLs which can be technically different but are the same for your domain. For example, www.idealwebtools.com can be technically different from idealwebtools.com (without www) but currently represents the same document.
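For that www vs non-www case, a common .htaccess sketch for the 301 redirect looks like this (it assumes mod_rewrite is enabled; replace idealwebtools.com with your own domain):

```
RewriteEngine On
RewriteCond %{HTTP_HOST} ^idealwebtools\.com$ [NC]
RewriteRule ^(.*)$ http://www.idealwebtools.com/$1 [R=301,L]
```

The [R=301] makes it a permanent redirect, so search engines consolidate the two hostnames onto the www version.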