Increasing Site Performance By Slowing Down Search Bots

For the more advanced user, this post may raise some awareness where website speed and security are concerned. Many site owners do not understand exactly how their content is found by search engines and then cached (also called indexing). While the algorithms that decide which information is important, duplicate, or otherwise credible are far more complex, the basic act of searching or “crawling” for the information is quite simple.

I have owned a web hosting company now for about one year. To this day, I still learn something every day that helps me keep the servers clean, secure, and at peak performance. I am not a reseller, nor do I lease my servers. Every one of them was built with the same two hands typing this post. Yes, I am quite proud of them.

I was experiencing extreme bog-downs at various times during the day, and sometimes the servers would actually lock up. Having done extensive research on the correct, highest-performing combination of CPUs, motherboards, memory (RAM), and other components…even down to the data transfer rate of each SATA cable I planned to use, I was stunned by this…actually, puzzled is more fitting. Yes, I knew every server that I built was higher performing than many of the big hosting companies. So why was this happening?

75% of my server knowledge is directly from my favorite server forum, How To Forge, and the two incredible super moderators over there – Falko Timme and Till. They have gotten me out of plenty of hot water, saved me countless hours of research, and more. “Thank you” is an understatement. (Even though I can be fussy – Linux will make you crazy!)

Anyhow, back to the server bog-down. I’ll try to keep this really simple, because the bottom line for the end-user here is quite simple. Search engines like Google, Bing, Baidu, and Yandex access the information on a server with a “GET” request. Just how it sounds really…they’re coming to get something from your site and your server…information.
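For example, a crawler’s visit shows up in an Apache access log as a plain GET request. The IP address, date, and path below are made up for illustration; the user-agent string is the one Googlebot identifies itself with:

```
66.249.66.1 - - [12/Mar/2013:06:25:17 +0000] "GET /my-latest-post/ HTTP/1.1" 200 8512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

Multiply that by every page on the site and every bot that comes knocking, and the logs fill up fast.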

Since I am in the blogging and content-rich industry, my natural circle of users — my target market for hosting — are bloggers. Blogs provide much of the fresh content on the web. A brick-and-mortar business’s site or, say, a bank’s website doesn’t generally post new content very frequently. Bloggers may post once or twice a day. A multi-author blog may post even more.

As most know, WordPress is probably the most widely used content management system for bloggers. Built into WordPress core is the ping process…basically a way of telling Google, Bing, and Yahoo!, “Hey, stop by, I just published something new!” Some folks will even invite more bots by adding several more, up to several dozen, additional ping services. Here comes my dilemma. I have a WordPress site that publishes about 30 new pieces of content each day. Ping, ping! My access logs were overwhelmed, my server was locking up, and PHP was devouring my memory. I adjusted this and that until it hit me…the bots were jamming the data transfer. This had to stop.
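Under the hood, a ping is just a small XML-RPC call that WordPress sends to its configured update services. The blog name and URL below are placeholders, but the method shape is the standard weblogUpdates.ping call:

```
<?xml version="1.0"?>
<methodCall>
  <methodName>weblogUpdates.ping</methodName>
  <params>
    <param><value>My Example Blog</value></param>
    <param><value>http://example.com/</value></param>
  </params>
</methodCall>
```

Each service you add to the list receives one of these for every publish, and each one may in turn send a crawler your way.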

Good Bots and Bad Bots

We obviously don’t want to stop the bots, just slow them down. In good conscience, I could not do this server-wide; that wouldn’t be fair to the others using the server. The robots.txt file is probably the simplest option to implement. It can block particular folders or pages from indexing with the “Disallow” directive, and it can set a time delay between crawls with “Crawl-delay”. However, while most search engines will respect the delay, the almighty Google simply disregards it. To set your crawl rate with Google, you must have a Google account and verify your site in Google’s Webmaster Tools, where you can then set the pace. You’ll be making a return visit in 90 days, though, when that setting expires.
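A minimal robots.txt along those lines might look like this. The 10-second delay and the disallowed folder are just examples; Crawl-delay is honored by the likes of Bing and Yandex but, as noted, ignored by Google:

```
User-agent: *
# Wait 10 seconds between requests (most bots honor this; Googlebot does not)
Crawl-delay: 10
# Keep bots out of this folder entirely
Disallow: /wp-admin/
```

Remember that robots.txt is purely advisory — well-behaved crawlers obey it, but bad bots ignore it completely, which is where .htaccess comes in.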

Using the .htaccess file is far more powerful. Here you can stop bad bots outright (email-scraping bots, spam bots, virus bots, and other bad-news robots). I implemented both, and the servers are running like lightning. Here is the code to add to your .htaccess file to keep out the “bad news”. Tried and trusted, compliments of the .htaccess Guide. (Note that spaces inside a user-agent name must be escaped with a backslash, or Apache will misread the pattern.)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:[email protected] [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Best of luck!