Search Engine Spiders .... can we shoot them???

pmhoran

New member
I do basic "behind the scenes" maintenance on a forum owned by a friend of mine.

I have always been used to going onto the site & seeing 1, 2 or even 3 spiders at a time from various search engines.

Lately though ... Inktomi/Yahoo have had dozens of spiders skittering all over the site just about any time of the day or night I have gone into the forum.

The fewest I have seen at one time is 27 (today) ... and I have seen up to 80+ ... just from Inktomi/Yahoo.

In my mind, that is being just plain rude. One or even a few I can see. But GEEZ ... wouldn't 30 or 80 of the suckers there at one time be sucking up a bunch of unnecessary bandwidth???

I sent them an email about 10 days ago asking them to explain WHY they felt they needed so many spiders at one time on the site ... but so far have received no response.

I am soooooo tempted just to totally BAN access to any part of the site (not just the forum) by Inktomi/Yahoo. At least until they give me an explanation or something. But ... don't want the site to disappear from their search engine either.

So ... anyone have any thoughts, input, suggestions or solutions????

Many thanks
Peter
 
You could try banning those search engine spiders from the forums only, and then lift the ban for a few days each month - just to keep the rest of the site in Yahoo while slowing down the hammering on your server. It may even get the forum posts into Yahoo if the occasional spider gets through.
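If it helps, a robots.txt along these lines would do the "forums only" part. This is just a sketch - it assumes the forum lives under a /forum/ directory (use whatever the real path is) and that the spiders actually honor robots.txt, which Yahoo's are supposed to:

  User-agent: Slurp
  Disallow: /forum/

  User-agent: *
  Disallow:

To lift the ban for a few days each month, you would just take out the Disallow: /forum/ line and put it back when you're done.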

I don't know how good Yahoo is with explaining themselves. I don't know that they care that much. I know that DMOZ has forums where people can ask questions about that directory's policies. I don't know that Yahoo has any similar mechanism.
 
Peter, I don't think those spiders use that much bandwidth at all.
HostingDiscussion gets lots of search spiders at all times, and yet the bandwidth is within reason. They do not consume much of your forum's resources, and they are good to have around, from my perspective.
 
Thanks for the responses ... much appreciated.

Lesli ... definitely a good suggestion. One I may try as a last resort :)

Artashes ... I know spiders in smaller numbers don't use up that much bandwidth ... but it's hard to believe that when they get up in the 60 to 80 range they don't have a visible impact on it. Would LOVE to have it so the spiders don't register or show as Guests. The forum software I use has an add-on mod to show or hide Googlebots, but all the others still show. Google isn't a problem though. Do you think that many spiders at one time would slow down the server or forum response time??? I get some of the users (dialup ... never high speed) complaining about the forums being slow ... and possibly coincidentally ... it seems to be on days Yahoo has the 60 to 80 spiders running around the site.

Blue ... thanks for reminding me about the robots.txt file. Guess that falls into the "Daaaaaaah" category for me. I had TOTALLY forgotten about being able to use that. And I know from experience that Yahoo at least does pay attention to it ... because I've used a robots.txt file on other sites. I will definitely have to refresh my memory on everything I can do with a file like that. :)
 
UPDATE ... for those interested

The other day I checked the stats on these Inktomi Slurp Yahoo spiders on the site I've been talking about ...

I just about croaked when I saw what I saw ... For the month of November, Inktomi Slurp made 116,409 spider visits to the site and the total bandwidth used up was 1.01 GB
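(If my math is right, 1.01 GB spread over 116,409 visits works out to only about 9 KB per visit ... so it's not that each hit is heavy, it's the sheer number of them that adds up.)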

CAN YOU BELIEVE IT???? No wonder my friend's web host (who gives him the space N/C because it's a support forum for people with a chronic illness) sent him an email asking why the bandwidth usage has been dramatically increasing these past months.

When I went on the site today ... there was me, one Googlebot & 56 spiders from Inktomi Slurp Yahoo.

So ... I created (finally) the robots.txt file. But ... because I wasn't sure if the info I was using to identify the spiders was correct ... I also went into cPanel & used its IP Ban option to ban 2 entire blocks of IPs that their spiders have been using. Then I went into the forum software and did an IP ban on those same blocks.
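For anyone curious, the robots.txt part is just two lines:

  User-agent: Slurp
  Disallow: /

And as far as I can tell, the cPanel IP ban basically just drops "deny from" lines into the site's .htaccess file ... something like this (those are example address blocks only, not the real ones ... check your own logs for the ranges the spiders actually come from):

  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.0/24
  Deny from 198.51.100.0/24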

Within a few minutes of finishing all that ... there was just me & the Googlebot left on the site. So ... guess it's working.

The owner of the site ... after seeing how much bandwidth the Inktomi Slurp Yahoo spiders used last month decided he doesn't want them on his site at all. Not even for just a few days each month.

The stupid thing about it ... we could go to any of the other search engines whose spiders visit, punch in some keywords, & his site would come up listed. But no site that I could find that I figure uses the Inktomi Slurp Yahoo spiders for its searches even listed the site within the first 10 or so pages. Sheesh !!!

Anyway ... guess the problem is solved now.

Also ... kind of lets everyone know just how much bandwidth SOME of these spiders CAN use. Which is ... obviously ... significant in our case.

Thanks again for the suggestions offered.
Peter

PS ... I never did receive a response from Inktomi Slurp Yahoo for the email I sent them.
 
Well, spiders are a good thing in my opinion, the more the better! They don't use much bandwidth because they don't view your website as a human would. Your website does not have to load the images for the spiders to see; they only read your source code to get the information. So they probably only use about 5% of what a normal page view would be.
Basically, I wouldn't worry about it. As I said above, the more the better, this just means that more people are finding your website!
 
Most of these spiders are worthless, and they do use bandwidth and system resources.
They are not giving you any more "people finding your website"; they are just repetitively accessing your pages.
If, for instance, you have a custom 404 page that is fairly large, this can use up a ton of resources.
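One cheap fix on that front, if it applies: point Apache at a small static error page so every bad request from a bot only costs a few hundred bytes instead of a full dynamic page. For example (assuming you have a lightweight 404.html sitting in your web root):

  ErrorDocument 404 /404.html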
 
pmhoran said:
UPDATE ... for those interested
I just about croaked when I saw what I saw ... For the month of November, Inktomi Slurp made 116,409 spider visits to the site and the total bandwidth used up was 1.01 GB

EQWebHost .... maybe you misunderstood ... but this 1.01 GB was bandwidth used JUST by the Inktomi Slurp spiders. In one month.

Or maybe my idea of "significant bandwidth usage" is different than yours.

The next closest bandwidth usage for the month by a spider/robot was for Googlebot ... and it used about 65 MB for the month. THAT I can deal with.

But 1.01 gigabytes??? That is just plain ridiculous.

Peter
 
Another quick UPDATE ...

Just checked the daily stats for the site in question. Since I banned the IP's for the Inktomi Slurp spiders ... the daily bandwidth usage for the site has been cut in half. Literally.

So I guess the problem has definitely been resolved.

Peter
 
the daily bandwidth usage for the site has been cut in half. Literally.
Glad to hear you're satisfied with the results.

1 GB of monthly usage was a bit excessive, but things should always be looked at from all perspectives. For my sites, Google sends 99% of search engine traffic, and it's quite a decent amount. As such, I could cut off the other robots without much loss. However, even if Google used 3 GB of data transfer while visiting my sites, I would still let it do its job, since the traffic it sends is so much more valuable than the data transfer expense. :)

From another POV, hosting packages tend to be quite generous. I've never been even remotely close to using mine fully. Is 1 GB really that much for your friend?
 
MOST of the site's traffic from search engines comes from Google too. But the Googlebots have only used about 65 MB of bandwidth per month to "do their thing". Granted ... if my problems had been with Google & not Inktomi Slurp ... I would likely have had to think a lot harder about my decision to ban them.

By comparison ... Inktomi Slurp sends very little traffic our way and uses that huge amount of bandwidth ... while Ask Jeeves (whose bots use less than 1 MB a month) sends us more traffic than Yahoo and Inktomi Slurp combined. Crazy.

My friend's web space is "donated" to him ... because it's a support site for people disabled with the same illness I have. It only has 1.5 GB of bandwidth allotted per month. On the last day of November, the site went offline because the bandwidth allotment was exceeded. And my friend really didn't want to approach his web host about increasing the bandwidth for the site ... just in case he decided it was using too much. With the Inktomi Slurp robots/spiders banned now ... the bandwidth usage for the entire site should only be 500 to 600 MB per month.

I have offered to host the site on my own web space ... my reseller account still has a fair bit of room available to host it. But ... he is comfortable with where it is & doesn't want to change unless he absolutely has to. If it had been on my space ... I would not have thought twice about just increasing his bandwidth & forgetting about it.
 
You can also manipulate things in your favor.
You can write your robots.txt file to mark only certain folders as off-limits to the spiders.
This will cut down on a lot of bandwidth while still allowing your site to be spidered.
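For example, something like this - the folder names are just placeholders, and note that Yahoo's Slurp also understands a Crawl-delay line, which tells it to wait that many seconds between requests:

  User-agent: Slurp
  Crawl-delay: 10
  Disallow: /forum/
  Disallow: /images/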
 
Yeah ... I sort of tried to do that. But the owner ... when he found out how much bandwidth the Inktomi Slurp robots were actually using ... decided he didn't want their spiders on the site at all.

That's what led to the complete IP bans.

Can't say as I blame him.

I don't know why those robots went nuts on his site. I checked a number of my sites & the sites I host ... and none seem to have this "out of control" spidering going on by Inktomi Slurp. The bandwidth used is right in line with the other major robots/spiders.

Mind you ... IF Inktomi Slurp Yahoo had responded to my initial email about the problem ... I doubt the owner would have felt this total IP ban was necessary. So ... it's their own fault if they don't like our solution :)
 
I've seen spiders use up gigs of bandwidth. One thing you can do is add iplimitconn, which limits the number of simultaneous downloads allowed per IP address. Running a video site, I found there were about 15 large video downloads going at the same time from the same IP. iplimitconn throttles that down to whatever you configure (I put the limit at 3).

There's also mod_evasive, which can block repeated connections from an IP that might otherwise bring down your whole server.
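For anyone who wants to try it, the mod_evasive settings go in the Apache config. The directive names below are for the Apache 2 version of the module, and the numbers are just a starting point to tune for your own traffic:

  <IfModule mod_evasive20.c>
      # block an IP that requests the same page more than 5 times in 1 second
      DOSPageCount        5
      DOSPageInterval     1
      # or more than 50 requests to the whole site in 1 second
      DOSSiteCount        50
      DOSSiteInterval     1
      # keep the block in place for 60 seconds
      DOSBlockingPeriod   60
      DOSHashTableSize    3097
  </IfModule>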
 