Communication Breakdown - Need To Keep Your Blogs or Webpages From The Search Engines?
Several weeks ago, while I was toying around with the WordPress blogging platform for a test blog, I set the software up live for evaluation. Within an hour, four search engines had crawled and indexed my WordPress directory. I had no intention of making that directory public at all. So how did the engines crawl it all, and so fast at that?
The answer is twofold. Firstly, WordPress itself automatically sends out a ping to various search engines and blog directories whenever you publish a new blog entry. I believe that you can turn this off, but I haven't yet explored all the nuances of WordPress.
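For the curious, a ping of this kind is typically a small XML-RPC call named weblogUpdates.ping that carries the blog's name and URL. The sketch below only builds the request body to show what goes over the wire; nothing is sent, and the blog name and URL are made-up placeholders.

```python
import xmlrpc.client


def build_ping_payload(blog_name: str, blog_url: str) -> str:
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    # The ping takes two string parameters: the blog's name and its URL.
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


# Hypothetical blog details, for illustration only.
payload = build_ping_payload("Test Blog", "http://example.com/wordpress/")
print(payload)
```

Any service that receives a message like this learns the blog's URL immediately, which goes a long way toward explaining how a brand-new directory gets crawled within the hour.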
Many search engines also love platforms such as WordPress, and come a-crawling faster if your URL includes "/wordpress" in it. (That's apparently also true for URLs with "/blog" in them.)
So how do you stop search engines from crawling a directory you don't really want public? That brings us to the second part of our answer. Removing the "robots.txt" file at the top of your web server will not do it: with no robots.txt, crawlers assume the whole site is fair game. The quick solution is to keep (or create) a robots.txt file there and use it to tell crawlers which directories to stay out of.
If you have other directories that are live, tweak the robots.txt file to "disallow" only the directories you want left alone. The big engines will respect any instructions to "disallow" certain directories.
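As a rough illustration, a robots.txt that disallows a single directory can be checked with Python's standard urllib.robotparser module before you upload it. The "/wordpress/" path and the example.com URLs are assumptions chosen for this sketch, not values from my own server.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: every crawler is asked to stay out of /wordpress/,
# while the rest of the site stays open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wordpress/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/wordpress/post-1"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/about.html"))
```

The first check should come back False and the second True, which is how a well-behaved crawler would read the same file.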
If you are just testing WordPress (or any blogging platform), it's best to test the software locally on your computer before going live with it.
Unfortunately, a robots.txt file is only a request, and anyone can read it. Spammers often use the robots.txt to see what directories you are trying to "hide". If someone wants to index your site and knows where your directories are, they can still index your private directories, and a robots.txt file basically hands them this info. Spammers often send their own spiders out across the net to scrape content or harvest email addresses for spamming.
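To see how little effort this takes, here is a minimal sketch of how a script could pull the "hidden" directories out of a robots.txt file. The sample file contents are invented for the example.

```python
from typing import List


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return every non-empty Disallow path listed in robots_txt."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt, for illustration only.
sample = "User-agent: *\nDisallow: /wordpress/\nDisallow: /private/  # staging\n"
print(disallowed_paths(sample))
```

A dozen lines is all a badly behaved spider needs to turn your list of off-limits directories into its list of places to visit.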
A more reliable way to stop such people is to implement a site-wide script (PHP, Perl, or whatever) that blocks any visitor whose IP address is on a block list. This requires some technical knowledge of webmastering, or hiring someone to do it for you.
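As a sketch of what such a script involves, here is the idea in Python, written as a small WSGI wrapper around a site. The blocked addresses come from ranges reserved for documentation, and the names BLOCKED_NETWORKS, is_blocked, and block_ips are made up for this example; a real block list would come from your own server logs.

```python
import ipaddress

# Hypothetical block list (documentation-only address ranges).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: this sketch lets it through.
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)


def block_ips(app):
    """Wrap a WSGI app so that blocked visitors get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("REMOTE_ADDR", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Wrapping the whole site this way is what makes the check site-wide: every request passes through the block list before any page is served.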
I don't want to get into a lengthy discussion about the details of the robots.txt file or IP-blocking just yet, but if you need to know, please feel free to contact me at rdash001-at-yahoo-dot-ca (email address mangled to fool spambots). I will eventually post a "resource" page for both robots.txt and IP-blocking details.
Links: WordPress.
(c) Copyright: 2005-present, Raj Kumar Dash, http://blogspinner.countwordula.com/