$_$_BEGIN_HTML
$_$_END_HTML
$_$_TITLE Search engine robots
$_$_DESCRIPTION This page lists the search engine robots known to JafSoft Limited
$_$_KEYWORDS scooter, gulliver, slurp, googlebot, netmind, alexa, ia_archiver, architectspider,
$_$_KEYWORDS ultraseek, lycos_spider, diibot, nttdirectory_robot, Linkwalker,
$_$_KEYWORDS linkalarm, linklint, linkscan, linkchecker, linkverify, linkbot,
$_$_KEYWORDS xenu's link sleuth, go!zilla, getright, getsmart, download wonder,
$_$_KEYWORDS netzip download, ecatch, MSIECrawler, MSProxy, CNET_Snoop, search engine robots
$_$_TABLE_HEADER_ROWS 1
$_$_TABLE_MIN_COLUMN_SEPARATION 2
$_$_CHANGE_POLICY column merging factor : 0
$_$_CHANGE_POLICY default table width : 75%
$_$_CHANGE_POLICY search for definitions : no
$_$_RESET_HTML_FRAGMENT HTML_HEADER
$_$_BEGIN_HTML
Search engine robots that visit your web site
$_$_END_HTML
*Contents of this page*
$_$_CONTENTS_LIST
Search engines and other sites send robots to read and index your pages. This page reverses that
process and indexes the robots. This information has been gleaned by looking at the server logs
for www.jafsoft.com. You can read a detailed description of [[HYPERLINK URL,http://www.jafsoft.com/searchengines/spider_hunting.html,"how we hunt spiders"]].
Whenever a page is read from a web site, the log file records a number of
details including the time, the IP address and usually the referrer page and the
user agent. You can see this in our [[HYPERLINK URL,http://www.jafsoft.com/searchengines/log_sample.html,"analysis of a server log sample"]].
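As a rough illustration of the details a log line carries, here is a minimal Python sketch that pulls those fields out of a single entry. It assumes the server writes the common Apache/NCSA "combined" log format, which may not match your own server's configuration. The sample line, its address and the agent name "ExampleBot" are made up for the example.

```python
import re

# Assumption: Apache/NCSA "combined" log format. Adjust the pattern
# if your server is configured to log differently.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of the logged fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample entry (documentation IP address, hypothetical agent name).
sample = ('192.0.2.10 - - [10/Oct/2000:13:55:36 +0000] '
          '"GET /robots.txt HTTP/1.0" 200 120 "-" "ExampleBot/1.0"')

entry = parse_log_line(sample)
print(entry["time"], entry["ip"], entry["referrer"], entry["agent"])
```

A referrer of "-" means the visitor supplied none, which is common for robots.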
Unlike many pages that list web robots, this page actually tries to
visit the robots themselves. Where possible, links are provided to the
robots' home pages, and descriptions are given of what they're up to. This page is
updated regularly as more information is found (the last update was on *[[TIMESTAMP]]*).
Well-behaved robots will identify themselves, often supplying web or email
addresses you can contact. In any case, the pattern of pages being read and the
IP addresses being used soon sorts the men from the robots.
Good robots will read robots.txt to see what your site policy is, but there are
other ways of spotting robots. In addition to the search engine robots, other
"user agents" will visit your site, e.g. to validate links to your site from
other people's pages. Often these will just access the HEAD of the file, rather
than doing a GET on the whole file.
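The two clues mentioned above (reading robots.txt, and using HEAD rather than GET) are easy to look for once the log has been parsed. The following Python sketch is only one possible approach. It assumes each hit has already been reduced to an (ip, method, path, agent) tuple, and the function name `robot_clues` and all the sample agents are hypothetical.

```python
from collections import defaultdict


def robot_clues(hits):
    """Map each user agent to the set of robot-like clues seen for it.

    `hits` is an iterable of (ip, method, path, agent) tuples.
    Agents showing neither clue are left out of the result.
    """
    clues = defaultdict(set)
    for ip, method, path, agent in hits:
        if path == "/robots.txt":
            clues[agent].add("read robots.txt")
        if method == "HEAD":
            clues[agent].add("HEAD request")
    return dict(clues)


# Made-up hits for illustration only.
hits = [
    ("192.0.2.10", "GET", "/robots.txt", "ExampleBot/1.0"),
    ("192.0.2.10", "GET", "/index.html", "ExampleBot/1.0"),
    ("192.0.2.20", "HEAD", "/index.html", "ExampleLinkChecker/2.0"),
    ("192.0.2.30", "GET", "/index.html", "Mozilla/4.0 (compatible)"),
]

for agent, found in sorted(robot_clues(hits).items()):
    print(agent, sorted(found))
```

Neither clue is proof on its own. As noted above, the pattern of pages read and the IP addresses used still need a look.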
You can also visit our page [search_engines].
_*This page is regularly converted from this [[SOURCE_FILE "text file"]] by the author's
own text to HTML converter [AscToHTM_abs]. The last update was on [[TIMESTAMP]]. This
software is available as shareware (cost $30)*_
Search engine robots and others
===============================
The following table lists the search engines that spider the web, the IP
addresses that they use, and the robot names they send out to visit your site.
Version numbers are usually included in the robot names, but are omitted here
except where they imply a visit from a different IP address or (as with Inktomi)
a different search engine.
Often multiple IP addresses are used, in which case we just give a flavour of the
names or numbers. Inktomi is a company that offers search engine technology and
is used by a number of sites (e.g. www.snap.com and www.hotbot.com).
Where part of an address has been left out, this indicates that a number of different digits may be used.
$_$_BEGIN_TABLE
$_$_TABLE_MAY_BE_SPARSE
$_$_TABLE_ALIGN CENTER
$_$_TABLE_LAYOUT 3,33,73,132
Home page/search engine | Robot identifier | IP address(es)
----------------------------------------------------------------------------------------
www.abacho.com | AbachoBOT | srv-ze-robot1.tricus.com
| |
www.abcdatos.com | abcdatos_botlink | 217.126.39.167
| http://www.abcdatos.com/botlink/ |
| |
www.aesop.com | AESOP_com_SpiderMan | 209.189.115.49
| |
www.ah-ha.com | ah-ha.com crawler (crawler@ah-ha.com) | c7pub-216-250-141-186.center7.com
| |
www.alexa.com | ia_archiver | green.alexa.com
| | sarah.alexa.com
| |
www.altavista.com | Scooter | test-scooter.pa.alta-vista.net
| | brillo.pa.alta-vista.net
| | av-dev4.pa.alta-vista.net
| | scooter.aveurope.co.uk
| | bigip1-snat.sv.av.com
| Mercator | mercator.pa-x.dec.com
| | scooter.pa.alta-vista.net
| | election2000crawl-complaints-to-admin.webresearch.pa-x.dec.com
| Scooter2_Mercator_3-1.0 | scooter.sv.av.com
| roach.smo.av.com-1.0 | avfwclient.sv.av.com
| Tv_Merc_resh_26_1_D-1.0 | tv.sv.av.com
| |
www.altavista.co.uk | AltaVista-Intranet | host-119.altavista.se
| jan.gelin@av.com |
| |
www.alltheweb.com | FAST-WebCrawler | 209.67.247.154
| crawler@fast.no |
| www.fast.no/faq/faqfastwebsearch/faqfastwebcrawler.html
| |
| Wget | ext-gw.trd.fast.no
| |
www.acoon.de | Acoon Robot | 194.231.42.178
| |
www.antisearch.net | antibot | 62.210.155.
| |
www.atomz.com | Atomz | router-sc.atomz.com
| | index.atomz.com
| |
www.axmo.com | AxmoRobot | 194.248.208.82
| |
www.buscaplus.com | Buscaplus Robi |
| http://www.buscaplus.com/robi/ |
| |
www.canseek.ca | CanSeek/ | 216.168.111.111
| support@canseek.ca |
| |
www.christcrawler.com/search.cfm| ChristCRAWLER | 207.191.111.231
| http://www.christcrawler.com/ |
| |
www.clush.com | Clushbot | 209.249.80.242
| http://www.clush.com/bot.html |
| |
www.crawler.de | Crawler | crawlit.crawler.de
| admin@crawler.de |
| |
www.daadle.com | DaAdLe.com ROBOT/ | 216.12.213.32
| |
www.daum.net | RaBot | 210.183.28.46
| Agent-admin/ phortse@hanmail.net |
| contact/jylee@kies.co.kr | 211.50.57.6
| |
| RaBot | 202.30.94.34
| Agent-admin/ webmaster@kisco.go.kr |
| |
www.en.deepindex.com | DeepIndex | deepindex.net1.nerim.net
| |
www.ditto.com | DittoSpyder | 65.169.94.188
| |
domanova.co.uk | Jack |
| |
www.earthcom.info | EARTHCOM.info | 194.108.39.74
| |
www.entireweb.com | Speedy Spider | 62.13.25.209
| |
www.excite.com | ArchitextSpider | Musical instruments are used
| | in the name, such as viola.excite.com
| | cello.excite.com
| | piano.excite.com
| | kazoo.excite.com
| | ride.excite.com
| | sabian.excite.com
| | sax.excite.com
| | bugle.excite.com
| | snare.excite.com
| | ziljian.excite.com
| | bongos.excite.com
| | maturana.excite.com
| | mandolin.excite.com
| | piccolo.excite.com
| | kettle.excite.com
| | ichiban.excite.com
| | (and the rest of the band)
| | more recently first names are being
| | used like philip.excite.com
| | peter.excite.com
| | perdita.excite.com
| | macduff.excite.com
| | agouti.excite.com
| |
| |
| |
(excite) | ArchitextSpider | crimpshrine.atext.com
| | ichiban.atext.com
| |
www.eurip.com | EuripBot | 81.169.172.30
| |
www.euroseek.net | Arachnoidea | 212.209.54.134
| arachnoidea@euroseek.net |
| |
www.ezresults.com | EZResult | 216.28.23.59
| |
www.fastsearch.net | Fast PartnerSite Crawler | psprdcrw001.sac2.fastsearch.net
| FAST Data Search Crawler | 65.198.110.185
| FAST Data Search Document Retriever | 69.38.159.128
| |
www.fireball.de | KIT-Fireball | ????
| |
http://france.misesajour.com/ | france.misesajour.com | 66.98.210.71
| |
www.fybersearch.com | FyberSearch | 69.49.241.9
| |
www.galaxy.com | GalaxyBot | 63.121.41.175
| http://www.galaxy.com/galaxybot.html |
| |
www.geckobot.com | geckobot | ???.rdc1.az.coxatwork.com
| |
www.gendoor.com | GenCrawler | ????
(Genealogical Search Engine) | |
| |
www.geona.com | GeonaBot | 69.59.142.17
| |
www.getrax.com | getRAX | 81.169.156.246
| |
www.google.com | Googlebot | c.googlebot.com
| googlebot@googlebot.com |
| http://googlebot.com/ |
| |
www.goo.ne.jp | moget/2.0 | 202.229.31.13
| moget@goo.ne.jp |
| |
www.girafa.com | Aranha | Aranha.girafa.com
| |
(inktomi) | Slurp.so/1.0 | q2004.inktomisearch.com
| slurp@inktomi.com | j5006.inktomisearch.com
| |
(inktomi) | Slurp/2.0j | 202.212.5.34
| slurp@inktomi.com | goo313.goo.ne.jp
| www.inktomisearch.com |
| |
(inktomi) | Slurp/2.0-KiteHourly | y400.inktomi.com
| slurp@inktomi.com; |
| www.inktomi.com/slurp.html |
| |
(inktomi) | Slurp/2.0-OwlWeekly | 209.185.143.198
| spider@aeneid.com |
| www.inktomi.com/slurp.html |
| |
(inktomi) | Slurp/3.0-AU | j6000.inktomi.com
| slurp@inktomi.com |
| |
http://hoppa.com/ | Toutatis 2.5-2 | tisnix.xs4all.nl
(need V5 browsers to view) | |
| |
www.hubat.com | Hubater | 209.114.176.250
| |
www.almaden.ibm.com | http://www.almaden.ibm.com/cs/crawler | wfp2.almaden.ibm.com
(research centre) | |
| |
www.iltrovatore.it | IlTrovatore-Setaccio | 213.26.21.8
| |
www.incywincy.com | IncyWincy | 64.81.243.66
| |
www.infoseek.com | UltraSeek | cde2c923.infoseek.com
| | cde2c91f.infoseek.com
| InfoSeek Sidewinder | cca26215.infoseek.com
| |
www.intags.de | Mole2/1.0 | 217.160.75.10
| webmaster@intags.de |
| |
http://mp3bot.de/ | MP3Bot | <..>
| |
www.ip3000.com | C-PBWF-ip3000.com-crawler | www.ip3000.com
| ip3000.com-crawler |
| |
www.ipselon.com | Ipselonbot | 80.36.101.108
| |
www.istarthere.com | http://www.istarthere.com | 66.220.24.80
| spider@istarthere.com |
| |
www.knowledge.com | Knowledge.com/ | 213.170.2.69
| |
www.kuloko.com | kuloko-bot/0.2 | 66.90.81.41
| |
www.lexis-nexis.com | LNSpiderguy | firewall5.lexis-nexis.com
| |
www.lapozz.com | LapozzBot/ | 82.131.195.52
| |
www.linknz.co.nz | Linknzbot | 202.191.32.67
| |
www.look.com | lookbot | magma.com
| |
www.looksmart.com | MantraAgent | fjupiter.looksmart.com
| |
www.loopimprovements.com | NetResearchServer | leg-64-133-109-250-STK.sprinthome.com
(see also www.incywincy.com) | www.loopimprovements.com/robot.html |
| |
www.lycos.com | Lycos_Spider_(T-Rex) | bos-spider.bos.lycos.com
| | 216.35.194.188
| |
www.joocer.com | JoocerBot | 80.46.38.169
| |
www.mirago.co.uk | HenryTheMiragoRobot | 194.202.39.46
| |
www.mojeek.com | MojeekBot | ???
| |
www.mozdex.com | mozDex/ | (within comcast.net)
| |
http://search.msn.com/ | MSNBOT/0.1 | 131.107.163.47
| http://search.msn.com/msnbot.htm |
| |
www.navadoo.com | Navadoo Crawler | ???
| |
www.northernlight.com | Gulliver | marvin.northernlight.com
| | taz.northernlight.com
| |
www.objectssearch.com | ObjectsSearch/0.01 | 68.88.244.177
| |
http://szukaj.onet.pl/ | OnetSzukaj/ | ???
| |
www.picosearch.com | PicoSearch/ | pipe.picosearch.com
| |
www.portaljuice.com | PJspider | timber.nextopia.com
| |
www.powerinter.net | DIIbot | node-d8e93393.powerinter.net
but it won't let us in :-( | |
| |
http://search.privacybird.com/ | PrivacyFinder | 128.2.220.167
| |
http://navi.ocn.ne.jp/ | nttdirectory_robot | lilis00.navi.ocn.ne.jp
| super-robot@super.navi.ocn.ne.jp |
| griffon | lilis04.navi.ocn.ne.jp
| griffon@super.navi.ocn.ne.jp |
| |
http://search.Market-UK.com | ScollSpider | 82.43.129.240
| |
www.maxbot.com | Spider/maxbot.com | search.wport.com
| admin@maxbot.com |
| |
??? | various (fakes agent on each access) | pool0058.cvx2-bradley.dialup.earthlink.net
| |
??? | gazz/1.0 | deleuze.infobee.ne.jp
| gazz@nttrd.com | derrida.infobee.ne.jp
| |
??? | ??? | search-8.xift.com
| |
www.mousefish.com | MouseBOT/ | 66.65.133.195
| |
www.nationaldirectory.com | NationalDirectory-SuperSpider | spider.nationaldirectory.com
| | 209.116.58.143
| |
www.naver.com | dloader(NaverRobot)/ | 211.218.151.209
| dumrobo(NaverRobot)/ |
| |
www.noxtrum.com | noxtrumbot/ | 194.224.199.52
| |
www.openfind.com | Openfind piranha,Shark | ???
(Chinese language) | robot-response@openfind.com.tw |
| Openbot/ | abovenet4.openfind.com
| |
www.picsearch.org | psbot | 217.75.104.26
| www.picsearch.org/bot.html |
| |
www.pinpoint.com | CrawlerBoy Pinpoint.com | nitrogen.pinpoint.com
| |
www.petersnews.com | user.ip3000.com | news.petersnews.com
| |
www.qweery.nl | QweeryBot | 84.82.133.41
| http://qweerybot.qweery.com |
| |
www.vestris.com/alkaline | AlkalineBOT | host130.uv-ray.com
| |
www.rambler.ru | StackRambler/ | 81.222.64.10
| |
www.seznam.cz | SeznamBot | 212.80.76.87
| |
www.search-10.com | Search-10 | 82.41.144.99
| |
www.searchhippo.com | Fluffy the spider | 208.148.122.27
| info@searchhippo.com |
| |
www.scrubtheweb.com | Scrubby/ | 208.145.190.254
| |
www.singingfish.com | asterias | grouper.singingfish.com
| |
www.speedfind.de | speedfind ramBot xtreme | BWEB.highway.telekom.at
| |
www.s.u-tokyo.ac.jp | Kototoi/0.1 | crawler-red3.is.s.u-tokyo.ac.jp
| |
www.searchbyusa.com | SearchByUsa | ???
| |
www.searchspider.com | Searchspider/ | 24.90.243.203
| |
www.sightquest.com | SightQuestBot/ | 64.49.245.212
| http://www.sightquest.com/bot.htm |
| |
www.spidermonkey.ca | Spider_Monkey/ | 66.163.18.197
| |
www.surfnomore.com | Surfnomore Spider v1.1 | 165.90.194.245
| |
www.supersnooper.com | Robot@SuperSnooper.Com | 207.8.212.162
| |
www.teoma.com | teoma_agent1 | 63.236.92.148
| teoma_admin@hawkholdings.com |
| |
http://mapper.teradex.com | Teradex_Mapper | 65.110.6.26
| mapper@teradex.com |
| |
www.travel-finder.com | ESISmartSpider | 202.46.33.15
| |
www.traficdublu.ro | Spider TraficDublu | 81.196.*.*, 193.16.218.66
| |
www.tutorgig.com | Tutorial Crawler | 216.40.225.75
| http://www.tutorgig.com/crawler |
| |
www.updated.com | updated/0.1beta | 38.119.96.107
| crawler@updated.com |
| |
www.uksearcher.co.uk | UK Searcher Spider | -
| |
www.vivante.com | Vivante Link Checker | 216.93.167.106
(coming soon) | |
| |
www.walhello.com | appie | uses an address at planet.nl, a Dutch ISP
| |
www.websmostlinked.com | Nazilla | -
| |
www.webwombat.com.au | www.WebWombat.com.au | 202.139.99.131
| |
www.webseek.de | marvin/infoseek | arthur4.sda.t-online.de
| marvin-team@webseek.de |
| |
www.webtop.com | MuscatFerret | ferret.webtop.com
| |
www.whizbanglabs.com | WhizBang! Lab | 216.250.143.108
| |
| |
www.wisenut.com | ZyBorg | -
| (info@WISEnut.com) |
| |
www.wire.co.uk | WIRE WebRefiner: | brighton.wire.co.uk
| webrefiner@wire.co.uk |
| |
www.worldsearchcenter.com | WSCbot | ???
| |
www.yandex.com | Yandex | ya.yandex.ru
| |
www.yellowpet.com | Yellopet-Spider | 212-82-36-23.ip.zeitraum.com
pet-based search engine | |
| |
www.yelo.no | Findexa Crawler | ???
| |
www.yourbettersearch.com | YBSbot search engine indexer | 12.25.90.3
| |
| libwww-perl | www.linpro.no/lwp/
| |
http://verno.ueda.info.waseda.ac.jp/ |
| Iron33 | 207.18.183.251
$_$_END_TABLE
Browsers
========
Most browsers identify themselves with a string that begins "Mozilla...".
I've chosen not to document those (as yet). Here are a few of the rarer
browser identifiers that I've seen.
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
Browser identifier  Information
-------------------------------------------
AmigaVoyager        http://v3.vapor.com/
                    Voyager browser for the Amiga
xChaos_Arachne      http://browser.arachne.cz/
                    (DOS-compatible browser. Linux version under development)
ELinks              http://elinks.or.cz/
                    Open source text browser (probably Linux-only)
IBrowse             www.hisoft.co.uk (search for IBrowse)
                    Amiga-based browser
ICab                www.icab.de/index.html
                    (Macintosh-only)
JustView            http://www3.justsystem.co.jp/download/justview/3.01win1a.html
                    (I *think* this is a browser. Site is in Japanese)
KMeleon             http://kmeleon.sourceforge.net/
                    (Light browser based on the Mozilla code base)
Konqueror           www.konqueror.org/konq-browser.html
                    (Linux KDE browser)
Lynx                http://lynx.browser.org/
                    (Cross-platform text-based browser)
OmniWeb             www.omnigroup.com/products/omniweb/
                    (Macintosh-only)
Opera               www.opera.com
                    (Cross-platform, small, efficient and standards-led browser)
Plucker             www.plkr.org/index.pl/faq#1.1
                    (Palm handhelds. Written in Python)
pwWebSpeak          www.prodworks.com/issound/catalog/catalog_pwwebspeak.html
                    Audio Browser
QWeb                http://sunsite.auc.dk/qweb/ (Linux browser)
                    (see also http://browswerwatch.internet.com/news/story/qweb8.html)
retawq              http://retawq.sourceforge.net/
                    Text-based browser for text terminals. Runs under Linux
SlimBrowser         www.flashpeak.com/sbrowser/sbrowser.htm
                    Freeware tabbed browser
Sleipnir            http://sleipnir.pos.to/software/sleipnir/index.html (Japanese)
                    Japanese browser, apparently with an English version available.
VMS_Mosaic          http://vaxa.wvnet.edu/vmswww/vms_mosaic.html
                    (OpenVMS-only version of Mosaic, a pre-Netscape browser)
WannaBe             http://mindstory.com/wb2/
                    (Macintosh text-only browser)
w3m                 http://w3m.sourceforge.net/
                    (text-based browser)
$_$_END_TABLE
Link Checkers, Link monitors and bookmark managers
==================================================
Link checkers and bookmark managers are run by people wanting to keep their
pages and bookmarks up to date. Being visited by a link checker is good news
as it means that someone has linked to you, and cares that you're still alive.
Link monitors regularly check your pages for changes, usually because someone
has selected your page as "one to watch".
(pause for warm glow :-)
If you have access to the server log, check the referrer page to try and get
the URL from which you are linked. Sometimes these URLs are inside password
protected parts of sites, so you won't be able to view the page.
If you build up a list of sites that link to you, these are the guys you should
tell when you move (moral - never move)
It's also quite common for a link checker to give no indication of which URL
it's coming from. Some link checkers always come from the same IP address, but
more usually they come from the client's site. It depends on whether the site
owner has purchased a copy of the link checking software, or signed up to some
centralized link checking service. If the referrer URL field is blank but you
have the client's IP address, you can always try visiting that address and
surfing their site.
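Building the list of sites that link to you can be partly automated. The Python sketch below assumes you have already extracted the referrer field from each log entry. It drops blank referrers and pages on your own site, and returns what is left. The function name `external_referrers` and the example.org URL are made up for the illustration.

```python
from urllib.parse import urlparse


def external_referrers(referrers, own_host):
    """Return the sorted, unique referrer URLs that are not on own_host.

    Blank referrers (empty or "-") are skipped, since they say
    nothing about where the visitor came from.
    """
    found = set()
    for ref in referrers:
        if not ref or ref == "-":
            continue
        host = urlparse(ref).hostname
        if host and host != own_host:
            found.add(ref)
    return sorted(found)


# Made-up referrer values for illustration only.
referrers = [
    "-",
    "http://www.example.org/links.html",
    "http://www.jafsoft.com/index.html",
    "http://www.example.org/links.html",
]

for url in external_referrers(referrers, "www.jafsoft.com"):
    print(url)
```

Some of the URLs found this way will be inside password protected areas, so not all of them can be visited.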
Some of these tools appear to imply they're extracting email addresses
(e.g. emailSiphon). As such they're probably unwelcome visitors
since these addresses are probably being collected for spammers.
A page listing various link checkers (and other tools) can be found at
www.softwareqatest.com/qatweb1.html#LINK
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
$_$_TABLE_MIN_COLUMN_SEPARATION 2
Robot identifier IP address(es) Link Checker home page
--------------------------------------------------------------------------------------
ActiveBookmark http://libmaster.com/software.php
ALink http://www.info-pack.com/alink/
Reciprocal Link Checker, Manager and Page Generator.
AMeta http://www.info-pack.com/ameta/
Meta Tag Generator
ASPSearch URL Checker http://search.santry.com/downloads/
a site search engine/index maintenance tool
BlogBot http://sourceforge.net/projects/blogbot/
BMChecker www.fureai.or.jp/~yoichi37/soft/bmchecker.html
(Japanese Bookmark Checker)
Bookmark Buddy www.bookmarkbuddy.net/about.shtml
Check&Get www.checkget.com
CheckWeb www.checkweb.com
CNET_Snoop www.download.com
(only if you have software listed at that site)
Cocoal.icio.us A Mac client for the del.icio.us bookmark sharing site
www.scifihifi.com/cocoalicious
CSE HTML Validator www.htmlvalidator.com
HTML page validator that includes a link checker
amongst its functions.
DRKSpider www.drk.com.ar/spider/ (An Open Source project)
DISCo Watchman www.t-guild.com/gamesite/Software/Disco_w/Disco_w.htm
DoctorHTML draco.imagiware.com http://www2.imagiware.com/RxHTML/
Email Extractor We don't list links to
email collectors on this site
EmailSiphon We don't list links to
email collectors on this site
EmailWolf www.pixeltech.com.au/~msw/ewolf/index.html
FavOrg http://www.pcmag.com/article2/0,1759,1558477,00.asp
A utility written by PC Magazine to fetch icons files
(favicon.ico) for your IE favorites
Favorites Sweeper www.manitoolssoftware.cjb.net
Another "favorites" tidy-up utility
FreshLinks.exe www.resqpc.com/features.html
Funnel Web Profiler www.quest.com/funnel_web/profiler/
Profiles your site, including links to/from it
Html Link Validator www.lithopssoft.com/hlv/index.html
HTMLParser http://htmlparser.sourceforge.net/ an open source
HTML parser, that is probably exercising its
link-checking features.
The Informant cosmo.dartmouth.edu http://informant.dartmouth.edu/
The Intraformant
InternetLinkAgent http://www1.odn.ne.jp/freeware/rank/ineternet/internetlinkagent.html
(in Japanese)
InternetPeriscope www.lokboxsoftware.com/internetperiscope.asp
javElink salix.ingetech.com www.dailydiffs.com
jdwhatsnew.cgi www.jdrowell.com/projects/jdwhatsnew/view
JRTS Check Favorites Utility www.jrtwine.com/Products/CheckFavs/
Lambda LinkCheck 195.139.70.25 www.stud.ifi.uio.no/~lmariusg/download/python/LinkCheck.html
LinkLint-checkonly -- www.goldwarp.com/bowlin/linklint/
LinkAlarm linkalarm.com www.linkalarm.com
Linkbot www.tetranetsoftware.com/products/linkbot.htm
Linkman (Mozilla...) 66.89.128.242 http://www.outertech.com/product.php?product=5
LinkProver www.tafweb.com/linkprover.html
Links -- http://gossamer-threads.com/scripts/links/
(Link management cgi script)
LinkScan Server www.elsop.com
LinkSweeper www.lss.com.au/lss/windows/ls/linksweeper.htm
Link Valet Online 195.82.114.5 www.htmlhelp.com/tools/valet/
LinkVerify Spider frances.yourwebhost.com www.enduser.co.uk/linkverify/
LinkWalker lw.seventwentyfour.com www.seventwentyfour.com
209.167.50.23
Morning Paper www.boutell.com/morning/
MoveAnnouncer -- www.moveannouncer.com
(notifies webmasters when your pages have moved)
mylinkcheck -- www.mylinkcheck.de (German)
NetLookout -- [[TEXT www.frugalsoft.com]]
NetMechanic gamma.netmechanic2.com www.netmechanic.com
www.elsop.com
NetMind-Minder marvin.netmind.com (retired) www.netmind.com
gary.netmind.com
meg.netmind.com
inyanga.netmind.com
leo.netmind.com
gemini.netmind.com
NetMonitor -- www.modemwizard.com/netmonitor.html
Netprospector JavaCrawler www.actaddons.com/products/netprospector.asp
online link validator 216.93.171.138 www.dead-links.com
(online link checker - submit your URL)
Rational SiteCheck www.rational.com/products/teamtest/prodinfo/sitecheck.jtmpl
Robozilla h-206---.netscape.com http://dmoz.org/
(checks links in the dmoz directory)
RPT-HTTPClient www.purplefrog.com/~thoth/jchecklinks/
Java utility that uses the Java HTTPClient class library
SiteBar www.sitebar.org
SpurlBot ??? www.spurl.net Online bookmark agent
SurfMaster www.maskbit.com/surfmaster.htm
SyncIT www.bookmarksync.com
Watchfire WebXM www.watchfire.com/products/webxm.asp
WatzNew Agent www.watznew.com
WebSite-Watcher www.aignes.com
WebTrends Link Analyzer www.webtrends.com
Weblink Scanner www.iterix.com/products/WeblinkScanner/weblinkScanner.asp
Xenu's Link Sleuth www.snafu.de/~tilman/xenulink.html
Z-Add Link Checker http://w3.z-add.co.uk/linkcheck/
$_$_END_TABLE
Validators
==========
Validators check your web pages for HTML correctness and standards compliance.
Since other people are unlikely to send a validator to *your* site, you don't
usually see much of this. However, if you choose to validate your own site, then
the validation attempts will appear in your logs. Consequently the "list" below
is limited to the on-line validators I've used myself (and recommend) and a URL
submission service that I use.
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
Robot Identifier  IP address         Validator home page
-------------------------------------------------------------------
W3C_Validator     abyss.w3.org       http://validator.w3.org/
WDG_Validator/    64.29.16.182       www.htmlhelp.com/tools/validator/
Tooter            selfpromotion.com  www.selfpromotion.com. This is
                                     used as part of a link submission
                                     agent (trebor@animeigo.com)
$_$_END_TABLE
FTP clients and download managers
=================================
If you offer files for download, then you'll start to be visited by various FTP
clients. Clients like Go!Zilla and GetRight are smart in that they can resume
downloads that have been interrupted. This relies on your web server supporting
the necessary protocol, but that's fairly standard these days.
If your download files are over 1MB in size (or if your server is slow), you'll
often see the same IP address make multiple partial downloads of your file (look
at the file size). In the case of clients like Go!Zilla and GetRight, if these
add up to the right number of bytes, then chances are the download succeeded.
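The adding up can be done by a script rather than by eye. This Python sketch assumes each logged download has been reduced to an (ip, path, bytes_sent) tuple and that you know the true size of each file. The function name `completed_downloads`, the file name and the byte counts are all invented for the example. It tests for an exact match, as described above. A client that re-fetches overlapping ranges would need a looser check.

```python
from collections import defaultdict


def completed_downloads(hits, file_sizes):
    """Return sorted (ip, path) pairs whose partial downloads add up
    exactly to the known size of the file.

    hits:       iterable of (ip, path, bytes_sent) tuples
    file_sizes: dict mapping path -> full file size in bytes
    """
    totals = defaultdict(int)
    for ip, path, sent in hits:
        totals[(ip, path)] += sent
    return sorted(
        key for key, total in totals.items()
        if file_sizes.get(key[1]) == total
    )


# Made-up log extract: one visitor resumes a download, another gives up.
hits = [
    ("192.0.2.40", "/files/setup.zip", 400000),
    ("192.0.2.40", "/files/setup.zip", 600000),
    ("192.0.2.50", "/files/setup.zip", 250000),
]
file_sizes = {"/files/setup.zip": 1000000}

for ip, path in completed_downloads(hits, file_sizes):
    print(ip, "probably completed", path)
```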
$_$_BEGIN_TABLE
$_$_TABLE_LAYOUT 2,"31","255"
$_$_TABLE_ALIGN CENTER
Client Identifier FTP Client home page
----------------------------------------------------
Alligator www.nearsoftware.com/alligator/maininfo/
BatchFTP www.dynamicnet.net/products/batchftp.htm
ChinaClaw http://download.pchome.net/internet/download/860.html (Chinese)
(Chinese download utility)
DA www.lidan.com
www.downloadaccelerator.com
DLExpert www.yanew.com (English and Chinese versions available)
Download Demon www.netzip.com
Download Master www.one.com.ua/dm/ (Russian)
Download Ninja www.h-fd.org/~mkro/mt/archives/000585.html (Japanese)
Download Wonder www.forty.com
Ez Auto Downloader www.anatari.com/ezad/index.html
Downloads all files of a given type from a site, so it's
more like a site grabber
FreshDownload www.freshdevices.com/freshdown.html
Go!Zilla www.gozilla.com
GetRight www.getright.com
MyGetRight
GetSmart http://getsmart.hypermart.net/
HiDownload www.hidownload.com
JetCar (or FlashGet) www.amazesoft.com
Kapere www.kapere.com/menu.php?lang=english
Kontiki Client www.kontiki.com/products/index.html
LeechFTP http://stud.fh-heilbronn.de/~jdebis/leechftp/
LeechGet www.leechget.de
LightningDownload www.lightningdownload.com
Mass Downloader www.geocities.com/SiliconValley/Vista/2865/md.htm
MetaProducts Download Express www.metaproducts.com/DE.html
NetZip Downloader www.netzip.com
SmartDownload
NetAnts www.netants.com
NetButler www.webcelerator.com/netbutler/
NetPumper www.netpumper.com
Net Vampire www.netvampire.com
Nitro Downloader www.klsofttools.com/nitro.html
Octopus http://moskalyuk.com/octopus/
PuxaRapido www.puxarapido.com.br
RealDownload http://service.real.com/help/faq/rdown4/rdownfaqa01.html
SpeedDownload www.yazsoft.com (for Macintosh)
WebDownloader for X 1.30 www.krasu.ru/soft/chuchelo/features.php3
(Linux web downloader with X GUI)
WebLeacher www.webleacher.dk (down last time I tried it)
more details at www.davecentral.com/projects/thewebleacher/
WebPictures Downloader www.fullstrong.com
Locates and downloads pictures
X-Uploader Can't find the home page, but it's described (in Russian)
on www.compulenta.ru/2002/1/17/24333/
$_$_END_TABLE
Research projects
=================
These agents come from research projects. Of course that's how Google started...
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
citenikbot/ http://www.citenik.co.uk/bot.html. One-man project due
for release in 2004.
CLIPS-index http://clips-index.imag.fr/ (French)
French research robot from a linguistics project (?)
Computer_and_Automation_Research_Institute_Crawler
Robot from the research centre at Hungarian Academy
of Sciences at www.sztaki.hu Crawls from IP 195.111.1.93
cosmos Spider from www.xyleme.com which is a project to locate
robot@xyleme.com and index XML content on the web. The company is a spin off
from project at INRIA in France, a frequent source of
web robots. The word "xyleme" apparently relates to the
vascular system in plants, but cleverly must be one of
the very few words to contain the letters "X", "M" and "L"
(although not in that order ;-)
D2KWebCrawler http://archive.ncsa.uiuc.edu/TechFocus/Projects/NCSA/D2K_-_Data_To_Knowledge.html
"Data to Knowledge" data miner. Crawls from 141.142.15.21
DiaGem/ Experimental spider from Mitsubishi R&D division
www.skyrocket.gr.jp/diagem.html
Crawls from IP 203.178.88.244
Digimarc WebReader Digimarc searches images on the web looking for digital watermarks
More details at www.digimarc.com
EchO!/2.0 Spiders from 194.254.160.3, which would seem to be part
of www.voila.com, a French-based search engine.
FinaleRobot The www.expressus.com site describes an Interactive Natural
robot-master@expressus.com Language encyclopedia that will become a search engine
at www.final-e.com. Good name, but at present it just
maps back onto the ExpressUs site (not such a good name).
Crawls from IP address 64.114.34.115
Ideare - SignSite www.ideare.com. Spiders from spider3.tiscalinet.it. Ideare are
a research company producing search engine technology, and are
part owned by Tiscali in Italy, who seem to use their various
tools for different search engines (mp3, images etc).
GentleSpider Some sort of spider that usually visits using
an IP address from within www.research.att.com or
crawler.tivra.com
Gulper Web Bot www.ecsl.cs.sunysb.edu/~maxim/cgi-bin/Link/GulperBot
(Open research project to produce opinion-based search engine)
larbin And from the people that brought you xyro (see below),
sebastien.ailleret@inria.fr comes another, newer bot. This one seems to crawl from
ghi@lcs.mit.edu the IP address cremant.inria.fr. *Update* more recently
it's also been seen coming from barracutta.lcs.mit.edu
cosmos And then there was "cosmos", crawling from pomelos.inria.fr
Seems these people are a webbot factory. Cosmos doesn't
offer an email address.
IRLbot http://irl.cs.tamu.edu/crawler. Crawls from 128.194.135.80
crawls randomly to determine the topology of the web.
KnowItAll www.cs.washington.edu/research/knowitall/ a project that
"extracts massive amounts of information from the Web in
an autonomous, scalable manner". Don't they know that
everyone hates a know-it-all? :-)
MJ12bot www.majestic12.co.uk/projects/dsearch/ A distributed search
engine project
MultiText Research project to index the last weeks' news items
http://canola1.uwaterloo.ca/
NEC Research Agent http://heavenly.nj.nec.com/
Research "Inquirus" (meta?) search engine
OntoSpider http://ontospider.i-n.info
Dutch robot for a research project. Crawls from 195.11.244.52
sherlock_spider www.sherlock.com.cn. A course project from
http://burrowww.cs.indiana.edu:15003/b659/
Crawls from 129.79.245.98
S.T.A.L.K.E.R. www.seo-tools.net/en/bot.aspx. "My first robot" :-)
Crawls from 195.71.117.89
Steeler www.tkl.iis.u-tokyo.ac.jp/~crawler/crawler.html.en
Japanese research robot.
ru-robot Unable to find details on this, but I'm guessing it's
0.1_hseo(at)cs.rutgers.edu a research spider from www.rutgers.edu. Crawls using
the IP teal.rutgers.edu
USyd-NLP-Spider www.it.usyd.edu.au/~vinci/webcorpus.html research into Natural
Language Processing at University of Sydney, Australia
WebGather http://pccms.pku.edu.cn:8000/
Chinese search project
xyro Seems to be a spider associated with a French
xcrawler@inria.fr research institute. Usually crawls using the IP
address vamos.inria.fr
Zao/0.2 www.kototoi.org/zao/ Another Japanese research robot
Crawls from 133.11.36.41.
Zao-Crawler Same as above, but crawled from 133.11.36.40
$_$_END_TABLE
Software packages
=================
These agents are the default identifiers for various software packages.
Software developers use these packages to add Internet functionality to
their own applications. As such it's impossible to say, without looking
at the pattern of access, what these agents are being used for, as the same
agent name may be used by different developers to achieve different results.
While many of these packages allow you to change the user agent, some do
not, and many developers are too lazy to change the agent string.
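To show how little effort changing the agent string takes, here is a small Python sketch using the standard urllib.request module. It only builds the request and inspects the headers, so nothing is actually fetched. The agent name "ExampleFetcher/0.1", the contact address and the helper `build_request` are made up for the illustration.

```python
import urllib.request


def build_request(url, agent=None):
    """Build a request, optionally replacing the default user agent."""
    headers = {"User-Agent": agent} if agent else {}
    return urllib.request.Request(url, headers=headers)


# The agent string urllib sends when the developer does nothing.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print("default:", default_agent)

# A hypothetical, more informative agent string with a contact address.
req = build_request("http://www.example.com/",
                    "ExampleFetcher/0.1 (admin@example.com)")
# urllib stores header names capitalised as "User-agent".
print("custom: ", req.get_header("User-agent"))
```

The default value starts with "Python-urllib/", which is the sort of identifier listed in the table below.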
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
GT::WWW Apparently some form of web-accessing perl module. Possibly
included in the Links SQL product produced by
www.gossamer-threads.com/scripts/index.htm.
HTTPClient Default agent name used by the Java HTTPClient class.
www.innovation.ch/java/HTTPClient/ (See also RPT-HTTPClient below)
HTTP::Lite Default identifier for a set of light-weight perl modules
for retrieving web documents. See www.toybox.ca/http-lite/
IP*Works! Set of TCP/IP components used in cross-platform development
of internet tools www.nsoftware.com/products/ipworks.aspx
libwww-perl The PERL programming language comes with a number of
routines for constructing web-aware scripts. This and
related strings are the default user agent identifiers,
although it's perfectly easy to change this to be whatever
you want.
libghttp The GNOME http library. A Linux software library
that offers connectivity to the web. Found in many
places on the web. There is a description at
www.fifi.org/doc/libghttp-dev/html/ghttp.html
Macromedia Flash Player Flash movies can contain scripts that can fetch content
from the web (such as other Flash movies or images)
MFC_Tear_Sample Agent name used in the sample code supplied with
Visual C++ for accessing the web. This may therefore
be someone running a program they've written based on
that code.
PEAR HTTP_Request class PEAR is a framework and distribution system for reusable PHP
components http://pear.php.net/
Python-urllib Presumably the default identifier for the urllib module
in the Python programming language
www.lib.uchicago.edu/keith/courses/python/class/7/
RPT-HTTPClient The Java HTTPClient class library
TeamSoft WinInet Component www.winsoft.sk/wininet.htm (menus require Java)
Internet software component suite
wget www.gnu.org/software/wget/wget.html
Free Unix/Linux package for retrieving web pages
WinScripter iNet Tools www.winscripter.com/wsh/tools/wsInetTools.asp
COM/DLL object that supports the SMTP and HTTP protocols
W3CRobot/ A fast web-spidering robot included with the libwww
package (?). See www.w3.org/Robot/
W3C-WebCon/ www.w3.org/ComLine a command-line toolkit that allows you
to perform HTTP operations
wxWidgets www.wxwidgets.org cross-platform open source C++ GUI builder
which includes "HTML viewing" and much, much more.
Zeus Webster Pro www.homepagesw.com/webster_overview.htm
$_$_END_TABLE
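As the libwww-perl and Python-urllib entries above suggest, most of these packages send a default identifier unless the developer overrides it. Here is a minimal sketch in Python, assuming the standard urllib.request module. The agent name "ExampleBot/0.1" and the example.com URLs are made up for illustration, and no request is actually sent.

```python
import urllib.request

# The opener that urllib builds by default advertises itself
# as "Python-urllib/<version>".
default_opener = urllib.request.build_opener()
default_agent = dict(default_opener.addheaders)["User-agent"]

# Hypothetical agent name and URL, used only for illustration.
CUSTOM_AGENT = "ExampleBot/0.1 (+http://www.example.com/bot.html)"
request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": CUSTOM_AGENT},
)

# urllib stores header names capitalised as "User-agent".
print("default:", default_agent)
print("custom: ", request.get_header("User-agent"))
```

A developer who skips the headers argument ends up in your logs under the package's default name, which is why so many different programs share the identifiers in the table above.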
Offline browsers and other agents
=================================
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
$_$_TABLE_MIN_COLUMN_SEPARATION 2
Agent Identifier Agent home page
-----------------------------------------------
DigOut4U www.arisem.com/Enu/
DISCoFinder www.ars.ru/eng/products/discof.asp
eCatch www.ecatch.com
EirGrabber http://www2p.biglobe.ne.jp/~eir/index.htm
(Japanese software from the "Eir Project")
ExtractorPro (Bulk email marketing tool. URL deliberately omitted)
FairAd Client [[TEXT www.hager.co.at/fordelka/fairad.htm]] (German)
A German pay-to-surf client
JoBo www.matuschek.net/software/jobo/index.html a site downloader
iSiloWeb www.isilo.com (for palm pilot)
Kenjin Spider www.autonomy.com
MSIECrawler (Microsoft IE4.0)
MSProxy
NexTools WebAgent www.vector.co.jp/soft/win95/net/se053030.html
Offline Explorer www.metaproducts.com/OE.html
NetAttache Offline browser and search engine agent
PageDown Details (in Japanese) at
http://www01.u-page.so-net.ne.jp/fa2/y_yutaka/share/pagedown.htm
ParaSite www.ianett.com/parasite/
Searchworks Spider www.nedesign.com/Phipps/products.html
SiteMapper www.trellian.com/mapper/index.html
SiteSnagger http://www.pcmag.com/article2/0,1759,1559896,00.asp
SuperBot www.sparkleware.com/superbot/index.html
Teleport Pro www.tenmax.com/teleport/pro/home.htm
URL2File www.chami.com/free/url2file_wincon.html
Web2Map www.web2map.com/us/index.htm
Web site copier. English/German versions available
WebAuto www.yanasoft.co.jp/webauto.html
I *think* this is an offline browser. Site is in Japanese
WebCopier www.maximumsoft.com
Webdup www.webdup.com
(Chinese software. Not 100% sure what it does)
WebFetch www.webfetch.com
WebReaper http://www.webreaper.net/
Webrobot www.multimania.com/dilletb/WebRobot/
Website eXtractor www.asona.org
WebSnatcher www.theronwelch.com/websnatcher/
WebStripper www.solentsoftware.com/webstripper/
WebTwin www.WebTwin.com
Convert websites into help files.
WebVCR www.netresultscorp.com/fs_webvcr_info.html
WebZIP www.spidersoft.com
WWWOFFLE www.gedanken.demon.co.uk/wwwoffle/
Xaldon WebSpider www.xaldon.de/produkte_webspider.html (German)
Offline browser
$_$_END_TABLE
Other miscellaneous agents
==========================
These agents are ones that we've seen, but been unable to get information
for, or which are slightly unusual in origin. If you have any additional
information on any of these, feel free to send it to *info@jafsoft.com*
[[IGNORE_THIS table is broken. highlighting * is lost]]
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
$_$_TABLE_MIN_COLUMN_SEPARATION 2
User Agent Information
-------------------------------------------------
Ad Muncher www.admuncher.com
Browser plug-in that monitors the pages as you view them,
and removes all adverts, popup windows etc.
ADSAComponent http://cnds.ucd.ie/adsa/
ADSARobot distributed search engine project
Contact postmaster@cnds.ucd.ie
browses from acropolis.ucd.ie (which doesn't make
sense for a *distributed* search engine :-)
Albert Indexer www.albert.com
Multi-lingual search technology
AnswerChase www.answerchase.com a personal search robot.
ASPSeek www.aspseek.org/about.html. An open source search engine project
ATA-Translation-Service Looks to be an online translation tool, much like
Babelfish. Possibly related to www.atanet.org/
AVSearch Seems to be the AltaVista personal search agent. The
crawling site is sometimes referred to in the agent name
Avant Browser www.avantbrowser.com Browser add-on for Internet Explorer
Beamer www.pagebeamer.org/fr/index.php (French). A browser accelerator
that requires sites to create a "pagebeamer.txt" file that is
fetched by this agent to do predictive downloads.
beholder or www.vigiltech.com/esensedisclaim.html
e-sense www.vigiltech.com/esensedisclaim.html
BravoBrian http://bstop.bravobrian.it/ (may require IE). A content filtering
service that offers protection from pornography and
other unwanted content for children. Comes from IP 213.215.133.19
bumblebee@relevare.com Software used to build "Vortals" (vertical portals).
Details (requires Flash) can be found at
www.relevare.com/site/
Checkbot Seems to come from www.oxxfordinfo.com who offer B2B
services
contype Possibly Adobe Acrobat or Reader or Adobe Acrobat Reader
used with MSIE (I have been unable to confirm this)
Convera Internet Spider A "RetrievalWare" product which claims to be a multimedia
web crawler. www.convera.com/Products/rw_ancillis.asp
ConveraCrawler Probably related to the above
ccubee Crawler technology from http://empyreum.com/technologies/platforms/ccubee/
Custo Tool to map the structure of a web site
www.netwu.com/custo/
CyberNavi_WebGet UA points to www.cybertech-inc.co.jp, but there's not
much there. It crawls from 222.151.213.124 which
is http://bsearchtech.com/ (Japanese). Babelfish
suggests this is a Japanese company offering search products
DaviesBot www.wholeweb.net/web/
deepweb Also calls itself an "Intelligent Deep-Web Robotic Agent"
A search engine indexer that will index dynamic content.
www.deepweb.com. Indexes from IP 66.96.221.180
EbiNess http://sourceforge.net/projects/ebiness
An Open Source project to display Internet information
in a 3D format.
EmailWolf www.pixeltech.com.au/~msw/ewolf/
email program no longer available - that's the only reason I'm
prepared to list it on this page.
Excalibur Internet Spider www.excalib.com/products/ispi/index.shtml
Expired Domain Sleuth Hunts down popular, yet expired domain names with
a view to letting you purchase an already popular
domain name. www.expireddomainsleuth.com
Everest-Vulcan Inc./ http://everest.vulcan.com/crawlerhelp Next-generation
services technology (under development)
GigaBaz http://brainbot.com/
GigaBazVStheWeb
crawler@brainbot.com
Giskard www.oralco.com
(Trivia note: Giskard is probably named after the Isaac Asimov robot)
grub-client Grub is a distributed, open source web crawler. Users
download the client which then indexes the web as part
of a distributed effort www.grub.org/html/documents.php
heritrix Open-source, extensible web crawler project
http://crawler.archive.org/
htdig www.htdig.org
search engine software for companies and universities
http://webwarper.net A browser accelerator. The idea is that you browse "through"
their site, taking advantage of their faster Internet connection,
caching and - most importantly - compression (of the file sent
to your browser) in return for their adverts added to the viewed pages.
Such accesses give the webwarper URL as the User Agent, concealing
the true agent of the original user.
More details at http://webwarper.net/ww.pl/0/wwgz/about.htm?*
infoGIST www.infogist.com
InterGO www.teachersoft.com
http://browserwatch.internet.com/news/story/intergo1.html
This was a child-safe browser, but it seems no associated
page remains
InternetArchive Presumably www.internetarchive.com, but that's in "stealth mode"
Internet Ninja www.ifour.co.jp (Japanese Macintosh browser?)
InternetSeer A web monitoring service.
More details at www.internetseer.com/
ipiumBot www.laurion.com/ipium-analysis.html (French)
A tool that searches for copies of your documents on
the web. Crawls from petula.laurion.net
InternetAmi IOR www.internetami.se/ior.html robot gathering data for
an English/Swedish translation service.
InsumaScout/ www.insuma.de/insuma/de/SEscout.html
Searches data situated in open data sources.
Katriona Something to do with the European Regional Internet Registry (RIPE)
Browses using IP address 213.219.19.148
larbin http://pauillac.inria.fr/~ailleret/prog/larbin/index-eng.html
LEIA *Unable to find*
(Too many "Star Wars" references get in the way)
LexiBot www.lexibot.com
LimeBot www.cruiselime.com/LimeBot.php Robot searching for information
on cruises. Browses using IP address 24.42.113.89
logikabot www.logika.net
Mata Hari www.thewebtools.com
(Internet search agent)
metabot Geographical-based text search tool. Crawls from 66.28.23.147
www.metacarta.com/products.htm
Mister Pix II Picture finder www.mister-pix.com/en/home.htm
MOSES 2.0 Spider www.ideas2internet.com/products/moses2/
*NOTE* Site crashes my version of netscape 4.7
MonkeyCrawl www.monkeymethods.org. "Futuristic play".
NetCruiser www.netcruiser-software.com/products.html
It's not clear to me *which* of these products this might be,
but I'm assuming it's one of them.
NPBot www.nameprotect.com crawls from 12.148.209.196 (crawler1.crawler918.com)
A trademark protection service
NetZippy www.innerprise.net/usp-spider.asp
NutchCVS http://lucene.apache.org/nutch/bot.html. Open source web-search project
NZBot www.navigationzone.com
Offers "information management" tools
Opencola www.opencola.com
A search application, combining data from multiple sources
ORA_checksite www.oreilly.com/openbook/webclient/ch06.html
Identifier used in a sample perl program in the online
book "Web Client Programming with Perl". The program is
used to check links. Obviously people have tried it, and it works :-)
Onekit.com - PAD File Get. PAD file poller. PAD files describe software applications to
download sites.
Oxxbot1 www.oxxfordinfo.com
(Data mining bot on IP 216.0.86.75)
Pansophica http://homepage.mac.com/zigkit/Pansophica/index.html
A Web search agent with neural net intelligence which organizes
and personalizes Web sites and searches.
PCSoftLand PAD file poller. PAD files describe software applications to
download sites.
Phoaks www.phoaks.com/index.html. An index of web resources
listed in UseNet. See also
www.public.iastate.edu/~CYBERSTACKS/Aristotle.htm
phpMySearch-Crawler http://phpMySearch.web4.hm a search engine for individual
sites.
PICgrabber A free picture and movie locator
www.movies-free.net
PictureOfInternet www.malfunction.org/poi
erik@malfunction.org Seems to be a project to create a collage of images gathered
from the Internet.
PicSpider www.bildkiste.de.vu (German). Site offers a "picture crate"
according to babelfish, which seems to be some form of
repository. Not sure why it's spidering, but crawls
from 217-20-118-26 which is part of internetserviceteam.com
PintaSpider *Unable to find* But the spider came from www.cnet.fr
Pita (Chub.Stanford.EDU) --
PitSpyder Thread0 *Unable to find*
psbot www.picsearch.org/bot.html
A bot indexing pictures. Crawls from ps.direct2internet.com
PolyBot http://cis.poly.edu/polybot/
crawls from
weasel.poly.edu,
grampus.poly.edu,
bumblebee.poly.edu
PureSight www.puresight.com/Products/PureSightHomeDescription.htm
(child-safe content filtering)
Rumours-Agent Comes from IP 202.214.69.131, which a lookup
identifies as "Cross Lingual Info Research" in Japan.
RepoMonkey Bait & Tackle A bit of detective work here. Recent entries in the
log file link this to the site www.hungryhippo.com,
although the robot always appears to come from an IP
address at backflip.com (a bookmarking service).
Visiting www.hungryhippo.com reveals a "coming soon"
site. Looking at the HTML source leads to another page
at www.mezzaluna.net/hungryhippo.com/ (appears
identical).
The META tags for this page all appear to be references
to day trading, futures, training and the like, although
we did spot the word "fibonacci" (our favourite :-).
So... possibly a future search engine related to stock
trading?, or maybe the Monkey and Hippo are just feeding
me a red herring?
There's more. The picture on the Kenjin site at
[[TEXT www.kenjin.com/kenjin/info.html]] is currently the same as
that at HungryHippo. Kenjin is an Autonomy company.
Robot2.0(PingSoft) There are several "PingSoft"s around, but I suspect that
this belongs to one of the products listed at
http://www.pingsoft.net/ (e.g. SmartHunter)
since I was visited from a Chinese IP address.
SilentSurf www.silentsurf.com. A surf anonymizer service
SlySearch www.slysearch.com. A site that hunts down infringements
slysearch@slysearch.com of intellectual property rights.
SpaceBison http://www.proxomitron.org/
A web filter that is "ShonenWare", i.e. you should
purchase a Shonen Knife CD if you use it. Shonen Knife
are a great Japanese band, much loved by the late Kurt
Cobain. Sometimes this sets the referrer page to the
band's home page at www.mmjp.or.jp/knife/ (or maybe
the users just happen to go there themselves).
CrawlWave www.spiderwave.aueb.gr (Greek, and requires login)
Crawls from 195.251.252.44, which is part of the
Athens University of Economics and Business (www.aueb.gr)
SpotOn www.spoton.com
(IE add-on that organizes your browsing)
SQ Webscanner http://macinsearch.com/users/webscanner/
(on holiday last time I looked)
Squid www.squid-cache.org
An open-source web proxy cache for Unix systems
SquidClamAV_Redirector http://freshmeat.net/projects/scavr/?branch_id=54042&release_id=188491
An open-source anti-virus program that I saw accessing icons
on my site (!)
Sqworm Not 100% sure about this one. When it visited me it came
from the WebSense site 63.212.171.* (and a Google search shows
others seem to see the same). At the WebSense site you
can find WebCatcher, a product used to monitor
employees' web-surfing habits (as near as I can tell).
But as I say, I'm *not* 100% sure...
www.websense.com/products/about/webcatcher/index.cfm
Steganos Internet Anonym www.steganos.com/?layout=default&content=products_siapro&language=en
A surf anonymizer utility
SurfControl www.surfcontrol.com/products/web/default.aspx
content tracking product
Tagword Tool that surveys the links in the Open Directory
at http://dmoz.org, checking their status etc.
See http://tagword.com/dmoz_survey.php
TaWWWantula *Unable to find*
Tcl http client package The default identifier for any software built using
the Tcl HTTP package
http://tcl.activestate.com/software/tcltk/
http://tcl.activestate.com/man/tcl8.0/TclCmd/http.htm
TeraCrawl *Unable to find*
TurnitinBot www.turnitin.com
Plagiarism prevention system. Crawls from 64.140.48.25
UCmore www.ucmore.com
A browser plug-in (initially IE only) that searches for
related pages and categories. In my experience this
seems to entail accessing a favicon.ico file on a daily
basis (presumably to refresh the "favorites" list)
UdmSearch http://search.mnogo.ru/
Search engine technology, as used at sites such as
www.maplesearch.com. Now called mnoGoSearch.
unchaos_crawler www.unchaos.com. A search engine that offers a "hybrid"
of human and machine intelligence, but no search box
that I could see :-). Crawls from 192.115.134.201
unlostBot www.unlost.com is "under construction". The robot came
unlostBot@unlost.com from IP address 212.37.219.147 which is in France.
URLBlaze File/web search utility www.urlblaze.net
utopy Coming soon at www.utopy.com (requires flash). This
crawler@utopy.com venture-capital funded site is "running in stealth mode"
before launching the "new new thing" (is that a typo?).
One of the Flash pages defines Utopia (geddit?), and some
of the browsing is done by IP addresses at ...myutopy.com.
UtilMind HTTPGet A component intended for downloading pages from the web using
standard Microsoft Windows Internet library (winInet.dll)
Listed on www.utilmind.com/delphi2.html
UrlScope *Unable to find*
Vagabondo Appears to be a log analyzer for Russian BBS systems.
(I may have got that wrong). I found reference to
it being copyright John Gladkih 1998, but I've not found
any URL that gives a description (not even a Russian one).
VCI WebViewer Web browser object, that may be incorporated into software
www.homepagesw.com/webster_dl.htm
vspider www.verity.com/products/intspider/
A commercial spidering product.
WAVETools A set of Delphi components offered to build Internet
applications from www.transerve.com
Webbandit http://softwaresolutions.net/webbandit/index.htm
Collates search engine results
Webclipping.com www.Webclipping.com
News-gathering agent
webcollage Forms collage from randomly selected web images
www.jwz.org/webcollage/ pet project of one of
the authors of Netscape. Seems to come from
differing IP nodes.
WebCompass ??? (quarterdeck search engine software)
WebGenie www.webgenie.com/products.html. presumably one of
the CGI-based products available on this site. Possibly
the "Site Sleuth"
Web Hound *Unable to find*
Or rather, I found several different "web hounds", so can't tell
which this was.
Web Magnet www.webmagnet.com
this appears to be a tool used by this web consultancy.
WebMiner Either http://www.tribolic.com/webminer/
or
http://www.webminer.com/webminer/index.cfm?section=overview
A tool to track down and target visitors to your website
WebPix Tool to fetch all pictures from a web site
www.netwu.com/webpix/
Webpush www.webhauler.com/webpush.htm
WebSymmetrix Originates in Korea, and is possibly related to their
National Computerization Agency. Uses IP address
210.183.28.39
webrank www.webrank.com/features.asp
Search engine popularity meter.
webwasher www.webwasher.com/en/products/wwash/functions.htm
(browser filter)
WhosTalking http://softwaresolutions.net/whostalking/
Software that tracks Trademark usage
last time I saw it it was creating 404 errors by adding
&dg.. to each URL. Hopefully they'll fix this
www.MacroX.de www.macrox.de (German). Appears to be an interpreter
designed to help automate regular tasks on a Windows PC.
XupiterToolbar A toolbar that sets up www.xupiter.com as the default
search engine. There appears to be a lot of negative
press regarding this toolbar
yacy http://yacy.net/home.html. An open source and distributed
search engine project. The above URL seems to redirect
to an IP-based one
YottaShopping_Bot http://www-yottashopping-com/. User agent claims this is a
Shopping Search Engine, but the URL requires a login
so I was unable to verify (so I deliberately made
its URL non-clickable). Crawled from 64.62.175.133
$_$_END_TABLE
Sites that regularly visit
==========================
Some IP addresses or sites may regularly visit you, although the user agent
may be obscure, blank, or even change.
Here are a few that I've been able to work out.
$_$_BEGIN_TABLE
$_$_TABLE_ALIGN CENTER
$_$_TABLE_LAYOUT 2,32,132
Site address(es) Description
--------------------------------------------------------
proxy.netsetter.org This is a site that offers a speed-up
to your surfing, in return for being able to
monitor people's surfing habits. The speed-ups
are achieved through a variety of techniques,
and the monitoring info is sold on, although your
privacy is protected. Visit www.netsetter.org
for more details.
pwoshoes.transport.com *Not known*
...lightrealm.com This site daily reads any xml files submitted to
a shareware site in PAD format. PAD is a means for
describing shareware devised by the Association of
Shareware Professionals (www.asp-shareware.org). This site
is performing daily checks, looking to automatically
update its lists with any changes.
$_$_END_TABLE
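One way to spot regular visitors like these is to group log entries by visiting host and look at which user agents each host has presented. The following Python sketch assumes the host and user agent have already been extracted from the log. The host names and agent strings are invented sample data, and the function names and the min_hits threshold are illustrative choices, not anything taken from our own log analysis.

```python
from collections import defaultdict

# Hypothetical (host, user agent) pairs, as might be pulled from a server log.
SAMPLE_HITS = [
    ("proxy.example.org", ""),
    ("proxy.example.org", "Mozilla/4.0 (compatible; MSIE 5.0)"),
    ("visitor.example.net", "-"),
    ("crawler.example.com", "ExampleBot/0.1"),
    ("proxy.example.org", "-"),
]


def agents_by_host(hits):
    """Map each visiting host to (hit count, set of agent strings seen)."""
    summary = defaultdict(lambda: [0, set()])
    for host, agent in hits:
        entry = summary[host]
        entry[0] += 1
        # Treat an empty field and the log placeholder "-" as "no agent given".
        entry[1].add(agent if agent not in ("", "-") else "(blank)")
    return {host: (count, agents) for host, (count, agents) in summary.items()}


def suspicious_hosts(hits, min_hits=2):
    """Hosts seen at least min_hits times whose agent is blank or changes."""
    result = []
    for host, (count, agents) in agents_by_host(hits).items():
        if count >= min_hits and ("(blank)" in agents or len(agents) > 1):
            result.append(host)
    return sorted(result)


print(suspicious_hosts(SAMPLE_HITS))
```

Hosts flagged this way still have to be investigated by hand, which is how the entries in the table above were worked out.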
Other useful sites
==================
Here are links to other sites you might find useful when looking into
web robots.
$_$_BEGIN_TABLE
$_$_TABLE_MIN_COLUMN_SEPARATION 4
$_$_TABLE_ALIGN CENTER
$_$_TABLE_WIDTH 100%
www.botspot.com A Bot monitor site, with regular updates and links to
the bot's home pages.
www.htmlhelp.com/links/validators.htm A list of HTML validators
www.iplists.com A site that lists IP addresses of search engine
bots and others. More comprehensive (and probably
more up to date) than the IP addresses shown on this
page (which tends to record the first IP address seen)
http://tool.motoricerca.info/robots-checker.phtml An online syntax checker for robots.txt files.
Enter the URL of your robots.txt file to get it
checked and to see a summary of what effect it will
have.
www.mozilla.org/build/revised-user-agent-strings.html Mozilla web browser project. This page describes the
conventions used for formatting the User Agent in the
form "Mozilla..."
www.robotstxt.org/wc/robots.html A site dedicated to the robots.txt file. This page
gives some background to how robots work, although
their list of robots is quite small.
www.searchtools.com/robots/ A page collecting together a number of resources to
do with all aspects of web robots.
www.spiderhunter.com A site primarily about "cloaking" sites - the art of
making a site look different to different visitors.
Contains articles on how to detect spiders.
www.webcab.de/wapua.htm A site listing WAP user agent strings. These will
mostly be mobile phones
www.webmasterworld.com/forum11/index.htm This site contains a number of forums for topics of
interest to webmasters everywhere. This particular
forum actively discusses robots and search engines
that visit your site.
$_$_END_TABLE
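If you would rather check the effect of a robots.txt file locally than use an online checker, Python's standard urllib.robotparser module can do a rough version of the same job. This is only a sketch. The robots.txt content, the agent names "ExampleBot" and "OtherBot", the example.com URL and the allowed() helper are all made up for illustration, and real robots may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so nothing is fetched.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parser whether the named agent may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


for agent in ("ExampleBot", "OtherBot"):
    for path in ("/private/page.html", "/cgi-bin/search"):
        print(agent, path, allowed(agent, path))
```

Note that in this parser a robot named in its own section follows only that section, so the rules under "*" apply only to robots that are not named elsewhere in the file.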
...And finally, some fakers
===========================
Increasingly, security and privacy concerns mean that users and companies
are wary about giving away information to sites they visit through the user
agent and other fields that appear in server logs.
Some browsers will allow you to select the user agent you present when
visiting a site. The Opera browser does this, for example, to allow its
users to pretend to be either IE or Netscape when visiting web sites coded
in a way that forgets there are other browsers in use.
Also, as firewalls become more common, we will see more and more user agent
fields being blocked by the firewall, which will prevent this information
being transmitted to the outside world.
Just to prove that you can never rely on the user agent, here is a selection
of user agent strings I've seen in my log files that tell us nothing about the
software being used (although some of them speak volumes about the person
driving the software). I'm omitting any IP addresses I may have to protect
the identities of those concerned :-)
$_$_BEGIN_TABLE
$_$_TABLE_LAYOUT 2,56,132
$_$_TABLE_ALIGN CENTER
"user agent" seen Comments
------------------------------------------------------------------------------------------------
Bruciebot I'm assured this was created by a regular
in alt.www.webmasters :-)
------------------------------------------------------------------------------------------------
Blocked by Norton The agent has been blocked
Geblokkeerd door Norton by Norton Utilities. The referrer
Blockeriet von Norton is also withheld. The second version
is Dutch. No doubt other languages occur
------------------------------------------------------------------------------------------------
Don't Like AOL Oh dear. This could start a trend!
------------------------------------------------------------------------------------------------
Don't be so nosey ;-) Hey! you came to *my* site first, remember? :-)
------------------------------------------------------------------------------------------------
Don't you wish you knew. Obviously.
------------------------------------------------------------------------------------------------
Go Away A bit rich from someone who *came*
to *my* site! :-)
------------------------------------------------------------------------------------------------
Field blocked by AtGuard Surfer is behind the AtGuard firewall (now
part of Norton Internet Security 2000) which
prevents the true User Agent being transmitted.
http://home.pages.at/atguard/
------------------------------------------------------------------------------------------------
Field blocked by Outpost www.agnitum.com
Again field is withheld by the software
------------------------------------------------------------------------------------------------
Isch habe gar kein Browser ;-) German for "I have no browser" :-)
Or so I thought, until I received the following
from *Clemens Marschner*
_Actually it is German - with Italian accent!_
_The word refers to an advertisement of the Nescafe_
_coffee, where a smart Italian convinces a beautiful_
_lady to stay and drink coffee with her after she knocks_
_at his door to complain that his car is in the way_
_of hers. And after she stayed and listened to him_
_while he prepares the coffee with lots of gestures_
_and Italian speak, she again asks him to move his car_,
_and he goes *"Isch 'abe gar keine Auto, Signorina"* (I_
_don't even have a car, signorina). Since that_
_commercial was shown for years, presumably all German_
_web masters know it_...
------------------------------------------------------------------------------------------------
My Web browser is not of your business True, but no fun.
------------------------------------------------------------------------------------------------
multiBlocker browser www.multiblocker.com/home.html Although this
seems to mainly offer protection against visitors
to your site, they obviously also provide a
user agent blocker for people browsing
------------------------------------------------------------------------------------------------
Wabbit's don't use browsers Probably the proxy service at
http://rabbit-proxy.sourceforge.net/
------------------------------------------------------------------------------------------------
Wot, no browser? (Win67; X; SK) Win67 ?!? Ah... a dream come true!
------------------------------------------------------------------------------------------------
Who gives a shit? It's as least as good as Lynx Ah yes, but how do we *know* that?
------------------------------------------------------------------------------------------------
Who wants to know? I do. :-)
$_$_END_TABLE
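Strings like these can be filtered out of a log analysis with a few simple checks. The Python sketch below uses the blocker phrases from the table above. The "Name/version" test is an assumed heuristic of my own choosing, and classify_agent is a hypothetical helper name. A string that passes the test may still be faked, so "plausible" means no more than "looks like a normal agent".

```python
import re

# Phrases from the table above showing that a firewall or privacy tool
# has replaced the real user agent.
BLOCKED_PATTERNS = [
    r"blocked by norton",
    r"geblokkeerd door norton",
    r"field blocked by atguard",
    r"field blocked by outpost",
]
BLOCKED_RE = re.compile("|".join(BLOCKED_PATTERNS), re.IGNORECASE)

# Assumed heuristic: ordinary agents usually contain a "Name/version" token.
PRODUCT_TOKEN_RE = re.compile(r"[A-Za-z][\w.!-]*/\d")


def classify_agent(agent):
    """Return 'blank', 'blocked', 'plausible' or 'unrecognised'."""
    agent = agent.strip()
    if agent in ("", "-"):
        return "blank"
    if BLOCKED_RE.search(agent):
        return "blocked"
    if PRODUCT_TOKEN_RE.search(agent):
        return "plausible"
    return "unrecognised"


for sample in ("Field blocked by AtGuard", "Go Away", "Zao/0.2", "-"):
    print(repr(sample), "->", classify_agent(sample))
```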
Awards for this page
====================
$_$_BEGIN_HTML
I've been told this page is referenced
in the book Spidering Hacks
$_$_END_HTML
All awards gratefully received :-)
$_$_BEGIN_HTML
$_$_END_HTML
$_$_BEGIN_IGNORE
------- ignore this section. It's not in the HTML version of this page
------- These are the links reported by XENU as missing. They will
------- be checked and corrected or removed when I get time.
http://www.krasu.ru/soft/chuchelo/features.php3
error code: 404 (not found), linked from page(s):
http://www.lss.com.au/lss/windows/ls/linksweeper.htm
error code: 404 (not found), linked from page(s):
http://www.metacarta.com/products.htm
error code: 404 (not found), linked from page(s):
http://www.mezzaluna.net/hungryhippo.com/
error code: 404 (not found), linked from page(s):
http://www.netmind.com/
error code: 12007 (no such host), linked from page(s):
http://www.netresultscorp.com/fs_webvcr_info.html
error code: 403 (forbidden request), linked from page(s):
http://www.petersnews.com/
error code: 12007 (no such host), linked from page(s):
http://www.phoaks.com/index.html
error code: 404 (not found), linked from page(s):
http://www.pixeltech.com.au/~msw/ewolf/
error code: 404 (not found), linked from page(s):
http://www.pixeltech.com.au/~msw/ewolf/index.html
error code: 404 (not found), linked from page(s):
http://www.plkr.org/index.pl/faq
error code: 404 (not found), linked from page(s):
http://www.powerinter.net/
error code: 12007 (no such host), linked from page(s):
http://www.silentsurf.com/
error code: 12007 (no such host), linked from page(s):
http://www.skyrocket.gr.jp/diagem.html
error code: 12007 (no such host), linked from page(s):
http://www.spoton.com/
error code: 12002 (timeout), linked from page(s):
http://www.stud.ifi.uio.no/~lmariusg/download/python/LinkCheck.html
error code: 404 (not found), linked from page(s):
http://www.thewebtools.com/
error code: 12007 (no such host), linked from page(s):
http://www.transerve.com/
error code: 12002 (timeout), linked from page(s):
http://www.verity.com/products/intspider/
error code: 404 (not found), linked from page(s):
http://www.vigiltech.com/esensedisclaim.html
error code: 404 (not found), linked from page(s):
http://www.webhauler.com/webpush.htm
error code: 404 (not found), linked from page(s):
http://www.webleacher.dk/
error code: 12007 (no such host), linked from page(s):
http://www.webmasterworld.com/forum11/index.htm
error code: 403 (forbidden request), linked from page(s):
http://www.webseek.de/
error code: 12007 (no such host), linked from page(s):
http://www.webtop.com/
error code: 12007 (no such host), linked from page(s):
http://www.webtrends.com/
error code: 403 (forbidden request), linked from page(s):
http://www.wholeweb.net/web/
error code: 404 (not found), linked from page(s):
http://www.wire.co.uk/
error code: 403 (forbidden request), linked from page(s):
http://www.worldsearchcenter.com/
error code: 12029 (no connection), linked from page(s):
http://www01.u-page.so-net.ne.jp/fa2/y_yutaka/share/pagedown.htm
error code: 403 (forbidden request), linked from page(s):
http://www1.odn.ne.jp/freeware/rank/ineternet/internetlinkagent.html
error code: 404 (not found), linked from page(s):
$_$_END_IGNORE