After reading a couple of articles on the internet about Talk-Talk's spider following their customers around the net, I decided to try an experiment to see it for myself.
First off though, the background. Back in 2010 Talk-Talk introduced an anti-malware system within their network in conjunction with Huawei. The idea was that the platform would block subscribers from reaching any site on which malware had been detected. In theory this looks like a great idea, but in reality let's look at how it builds its database of sites.
I created a brand new site, http://testsite.richardallen.co.uk, at 12:38 on 9th July 2013. I then posted the link to Facebook and asked a friend who is with Talk-Talk to click on it, while I tailed the access log the whole time.
The first hit was as expected: the user-agent of Facebook's spider, which fetches any link posted to Facebook;
188.8.131.52 - - [09/Jul/2013:12:38:20 -0400] "GET / HTTP/1.1" 206 5650 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
The second hit was my friend, a Talk-Talk customer;
184.108.40.206 - - [09/Jul/2013:14:27:31 -0400] "GET /assets/js/jquery.js HTTP/1.1" 200 247823 "http://testsite.richardallen.co.uk/" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B350 [FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone5,2;FBMD/iPhone;FBSN/iPhone OS;FBSV/6.1.4;FBSS/2; FBCR/EE;FBID/phone;FBLC/en_US;FBOP/1]"
|Location||GB, United Kingdom|
|City||Plymouth, K4 -|
|AS Number||AS13285 TalkTalk Communications Limited|
|Distance||2251.93 km (1399.28 miles)|
|IP Address||220.127.116.11|
Within 30 seconds of them accessing my site, I had two more hits. Nobody else knew the site existed and nobody else had attempted to access it, so only that visit could have triggered them;
18.104.22.168 - - [09/Jul/2013:14:28:01 -0400] "GET /robots.txt HTTP/1.0" 500 688 "http://testsite.richardallen.co.uk/robots.txt" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)"
22.214.171.124 - - [09/Jul/2013:14:28:03 -0400] "GET /robots.txt HTTP/1.0" 500 688 "http://testsite.richardallen.co.uk/robots.txt" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)"
I won't paste all of my log, but both of these IPs went on to GET every page under this subdomain, each starting with /robots.txt.
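For anyone wanting to repeat the timing check on their own access log, here is a minimal Python sketch. It assumes the log is in the Apache combined format shown above; the names `LOG_RE`, `parse_line` and `seconds_between` are mine, and the two sample lines are my log entries with the referer and user-agent trimmed to "-" for brevity.

```python
import re
from datetime import datetime

# Combined log format: ip ident user [time] "request" status size ...
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    hit = m.groupdict()
    # Timestamps look like 09/Jul/2013:14:27:31 -0400
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit

def seconds_between(first_line, second_line):
    """Seconds elapsed from the first log line to the second."""
    a, b = parse_line(first_line), parse_line(second_line)
    return (b["time"] - a["time"]).total_seconds()

# Sample lines: referer and user-agent shortened to "-" (trimmed, not real values)
visitor = '184.108.40.206 - - [09/Jul/2013:14:27:31 -0400] "GET /assets/js/jquery.js HTTP/1.1" 200 247823 "-" "-"'
crawler = '18.104.22.168 - - [09/Jul/2013:14:28:01 -0400] "GET /robots.txt HTTP/1.0" 500 688 "-" "-"'

print(seconds_between(visitor, crawler))  # 30.0
```

The month name is parsed with `%b`, so this assumes an English locale on the machine running it.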
To start, we can tell these are bots because the first thing each one does is GET the robots.txt file. If we then look up those IP addresses, 126.96.36.199 & 188.8.131.52, we can see that they also fall within the Talk-Talk IP range - a quick Google search for them also throws up loads of discussions about this.
I then went back to my friend to ask whether they had Talk-Talk's HomeSafe or Malware Protection turned on, and they confirmed that both were off.
With this in mind, I started to look around on the web and discovered that Talk-Talk were actually referred to the ICO back in 2010 over their malware trial. Within that document is a brief swimlane diagram of how their anti-malware system works;
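The two checks in that paragraph (does a client ask for robots.txt first, and does its address sit inside a given network range) can be sketched roughly as below. This is only an illustration: the addresses are reserved documentation addresses rather than anything from my log, `ISP_RANGES` is a made-up placeholder and not Talk-Talk's real allocation, and the function names are my own. Asking for robots.txt first is a heuristic, not proof that something is a bot.

```python
import ipaddress

def first_paths_by_ip(hits):
    """Map each client IP to the first path it requested.

    `hits` is an iterable of (ip, path) tuples in log order.
    """
    first = {}
    for ip, path in hits:
        first.setdefault(ip, path)  # keep only the earliest path per IP
    return first

def likely_crawlers(hits):
    """IPs whose very first request was for /robots.txt (heuristic only)."""
    firsts = first_paths_by_ip(hits)
    return sorted(ip for ip, path in firsts.items() if path == "/robots.txt")

def in_any_network(ip, cidrs):
    """True if `ip` falls inside any of the CIDR ranges in `cidrs`."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

# Hypothetical data: documentation addresses, not real log entries
hits = [
    ("192.0.2.10", "/"),
    ("198.51.100.7", "/robots.txt"),
    ("198.51.100.7", "/"),
    ("203.0.113.9", "/robots.txt"),
    ("192.0.2.10", "/robots.txt"),
]

# Placeholder range standing in for whatever a whois/ASN lookup returns
ISP_RANGES = ["198.51.100.0/24"]

for ip in likely_crawlers(hits):
    print(ip, in_any_network(ip, ISP_RANGES))
```

In practice the ranges would come from a whois or AS number lookup on the addresses in question, which is what I did by hand above.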