I’d been noticing a lot of hits in my website logs from Sosospider. Lots of traffic. I’ve let this one spider my pages for quite some time, but when I took a look at the search engine to see what my webpages actually looked like in the search results, I saw that the pages returned were mostly commercial webpages. Shopping pages. I’d had no visitors from soso.com anyway, so I decided to block it in robots.txt.

I watched the results for a day or so and saw that Sosospider, while it seemed to be fetching my robots.txt file, ignored it and spidered my pages anyway. Huh? I had taken a look at soso.com’s own page about how to block it in robots.txt and entered into my robots.txt file exactly what they said to use. It didn’t work. Spidering continued unabated. Since they lied about honoring robots.txt, I decided to block them with .htaccess instead. I don’t like doing this because I think it adds to the server load, but my commands must be obeyed, so you gotta do what you gotta do.

They use a lot of IP addresses in the 124.115.*.* range, so I went ahead and added 124.115 to my .htaccess file’s deny list. After watching how this worked for a while, I saw that Sosospider was trying to be sneaky and changing some things to try to bust thru. A few times it even used a completely different IP address range, 114.80, which did get thru until I saw what was going on and added that range to the deny list as well. The 61.135 range was also used; that one identified itself as TencentTraveler, which I believe comes from Tencent, the company behind Sosospider. It’s like playing whack-a-mole. Conk one on the head and more show up. I don’t like playing this game but I’ve got plenty of time and I don’t like to lose.
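
One way to spot a new range early is to tally log hits by the first two octets of the client address. The following is only a minimal Python sketch: it assumes log lines that start with the client IP and contain the user-agent string somewhere on the line, and the function name `count_prefixes` and the sample lines are made up for illustration.

```python
from collections import Counter

# Made-up sample lines in a common Apache log layout (illustrative only).
SAMPLE_LOG = [
    '124.115.0.21 - - [01/Jan/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Sosospider"',
    '124.115.4.189 - - [01/Jan/2011:10:00:05 +0000] "GET /a.html HTTP/1.1" 200 734 "-" "Sosospider"',
    '114.80.93.55 - - [01/Jan/2011:10:00:09 +0000] "GET /b.html HTTP/1.1" 200 220 "-" "Sosospider"',
    '192.0.2.10 - - [01/Jan/2011:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot"',
]


def count_prefixes(log_lines, agent_substring="sosospider"):
    """Count hits per 'a.b' address prefix for lines mentioning the agent."""
    counts = Counter()
    needle = agent_substring.lower()
    for line in log_lines:
        if needle not in line.lower():
            continue  # some other visitor
        ip = line.split(" ", 1)[0]  # first field is the client address
        octets = ip.split(".")
        if len(octets) == 4:  # skip anything that is not a dotted IPv4 address
            counts[".".join(octets[:2])] += 1
    return counts


if __name__ == "__main__":
    for prefix, hits in count_prefixes(SAMPLE_LOG).most_common():
        print(prefix, hits)
```

An open log file can be passed in place of `SAMPLE_LOG`, since the function only needs something it can loop over line by line. Prefixes that show up here but not in the deny list are the candidates to look at.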

So, the point of this story is that Soso search engine’s spider does not play nice and I saw nothing wrong on my end to enable this bad behavior. The top two lines of my robots.txt file are:

User-agent: Sosospider
Disallow: /

and I don’t know how I can be any more correct than that and that’s what doesn’t work.
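
For what it’s worth, a standards-following parser does read those two lines as a complete ban. Here is a quick sanity check sketched with Python’s standard `urllib.robotparser`; the helper name `is_allowed` and the example.com URL are placeholders made up for illustration. It only shows how a well-behaved robot would interpret the rules, not what Sosospider actually does.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted above.
ROBOTS_LINES = [
    "User-agent: Sosospider",
    "Disallow: /",
]


def is_allowed(user_agent, url, lines=ROBOTS_LINES):
    """Return True if a compliant robot with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(lines)  # parse the rules directly, no network access needed
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Sosospider", "http://example.com/page.html"))
    print(is_allowed("Googlebot", "http://example.com/page.html"))
```

Sosospider is refused everywhere on the site, while a robot with no matching entry, such as Googlebot here, is still allowed.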

.htaccess is a file used by the Apache web server. I don’t know how Microsoft servers do it, but my webpages are on a fatcow.com server, which uses Apache. I made a file called .htaccess and put it in the root directory of my webpages on the fatcow server. Fatcow allows this. Other providers may not. You can put all kinds of directives into this .htaccess file, but I only pay attention to the part that lets you block by IP address. Here’s what mine looks like at the present time:
# Begin IP blocking #
Order Allow,Deny
# ratepoint
Deny from 75.125.229.136/29
# websense
Deny from 208.80.193
# radian6
Deny from 142.166.170
Deny from 142.166.3
# CYVEILLANCE
Deny from 38.100
Deny from 38.105.83.11
# relevantnoise
Deny from 64.21.98.192/27
Deny from 64.247.18.155
# moreover
Deny from 64.94.67.203
# soso
Deny from 124.115
Deny from 114.80
Deny from 58.61
# yandex (not blocked at the moment)
#Deny from 77.88
# TencentTraveler
Deny from 61.135
Allow from all
# End IP blocking #
and I show this only as an example. I make changes to this file all the time depending on what is annoying me. I especially don’t like robots which don’t obey the robots.txt rules.
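
A partial address such as `124.115` in a `Deny from` line matches on the leading octets, so it covers the same addresses as 124.115.0.0/16. To check whether a particular address from the logs is already covered, something like the following Python sketch could help. It uses the standard `ipaddress` module; the helper names `to_network` and `is_denied` and the shortened list are assumptions for illustration, not part of Apache or of the file above.

```python
import ipaddress


def to_network(spec):
    """Turn a 'Deny from' address spec into an ip_network.

    Handles full IPs, CIDR blocks, and partial addresses such as '124.115'.
    """
    if "/" in spec:
        return ipaddress.ip_network(spec)
    octets = spec.rstrip(".").split(".")
    prefix_len = 8 * len(octets)  # each octet given fixes 8 bits
    padded = octets + ["0"] * (4 - len(octets))  # fill the rest with zeros
    return ipaddress.ip_network(".".join(padded) + "/" + str(prefix_len))


# A few entries copied from the deny list, enough to show the idea.
DENY_SPECS = ["124.115", "114.80", "58.61", "61.135",
              "75.125.229.136/29", "38.105.83.11"]
DENY_NETWORKS = [to_network(spec) for spec in DENY_SPECS]


def is_denied(ip):
    """Return True if ip falls inside any network on the deny list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DENY_NETWORKS)


if __name__ == "__main__":
    for candidate in ["124.115.4.189", "75.125.229.137", "192.0.2.10"]:
        print(candidate, is_denied(candidate))
```

This only mirrors the address matching. It does not model how Apache combines `Order`, `Allow`, and `Deny`.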

Your own provider may have a web interface for you to edit this file. Mine does, but I just edit it on my own computer and then upload it thru FTP. The pound signs (#) in the file mark comments and are a reminder for me. Apache’s documentation says a comment has to be on a line that begins with #, not on the same line as a directive, so notes are safest on lines of their own. To know more about .htaccess, a Google search provides lots of information and howtos about how to block unwanted access and do other things with this file.

Terry Coats


Category: musings

About the Author

Vietnam veteran. Single. Amateur musician. Liberal. Straight. Shocked at U.S. right-wing behavior.

12 Responses to Sosospider doesn’t behave

  1. Wil Warren says:

    Thanks for sharing. I face a similar barrage of scans from 124.115.x.x and 114.80.x.x and have monitored their exploits. One of my sites got hammered every few seconds by hundreds of changing IP sub-addresses, all in those two blocks and all from SosoSpider. When I got a notice from my hosting provider that my account CPU time was being throttled because of these accesses, I finally had to block the whole two ranges, and of course writing to the ISP achieves nothing at all.

    My only two lines of code are …

    # Block only the most aggressive Chinese Spammers
    deny from 114.0.0.0/8
    deny from 124.0.0.0/8

    … which seems to take care of the worst. They still keep trying every day though but now instead get custom modified 403 Error Pages, which contain my own advertising – hope they enjoy it.

    The rest of the robots seem to be behaving themselves scanning maybe once a day which I think is civil no matter where they come from.

  2. Wiliam says:

    Copied to my .htaccess file, I am getting hit every day

    I also went to their website and their disallow agent, but if they followed their own rules, why so many different IP addresses?

  3. Terry says:

    I don’t understand it either. Either they are being dishonest or they do not know what they are doing.
    Terry

  4. i-eBooks says:

    The Sosospiders are having a field day on http://www.i-eBooks.com. Ever since they arrived in the last couple of weeks, Googlebot doesn’t seem to crawl. Could this keep the website from getting indexed by Google? Used to be Baiduspiders, now it’s the Sosospiders, lots of them…

  5. Terry says:

    @i-eBooks
    Googlebot keeps crawling me just the same.

  6. Thanks for the htaccess code. This spider has been hitting my site pretty hard for a couple hours a day for the past week or so, and it only hits the homepage too, none of the deep links, which seems odd. But whatever, it won’t be hitting much of anything now.

    Again, thank you!

  7. i-eBooks says:

    Used to be Googlebot, now Sosospider (over 300 crawls this week), very few Googlebots

  8. Terry says:

    @i-eBooks
    I don’t mind googlebot but some of the others are too aggressive and don’t contribute any
    viewers for me. There’s no reason for Chinese search engines to be bothering my webpages
    anyway.

  9. Louise says:

    The only fix is to block _all of_ chinanet via htaccess. I’ve yet to find a master list, so I have to get them by trial and error. I investigate anyone that lands on my front page without a referrer. So far (this is a direct cut & paste):

    Deny from 58.60.0.0/14
    Deny from 114.80.0.0/12
    Deny from 119.147.6.0/24
    Deny from 121.32.0.0/13
    Deny from 123.125.71.0/24
    Deny from 124.114.0.0/15
    Deny from 222.216.0.0/14

  10. Terry says:

    @Louise
    So far my list works pretty well for me. I get some from Baidu which I think are China but it’s not crazy. It’s the ones that malfunction or get my pages several times a day that drive me crazy.
    Thanks for your comment.
    Terry

  11. Steve says:

    Baidu and Sosospider have been hammering my site (http://www.listburn.com). I tried the robots approach but saw it wasn’t being honored so now I will try the htaccess approach! Will this actually ban their access or will it just deliver a 403 error? I don’t even want them showing up in my logs!

  12. Terry says:

    I don’t even see them in the log when they’re listed in htaccess.
    Well, I don’t think they do. Might be wrong. I filter 403s when
    I take a look at the log so I don’t see them even if they are there.
    I forgot about that.
