What is a WWW robot?
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.
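To make the traversal concrete, here is a minimal sketch in Python of the fetch-parse-queue loop the definition describes. The names are invented for illustration, and a real robot would also space out its requests and honour the exclusion rules discussed later, both omitted here:

import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    # Collects the href values of <a> tags found in a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    # Breadth-first traversal: fetch a page, then queue the pages it references.
    queue, seen = deque([start_url]), set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # resolve relative references
    return seen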
What is an agent?
The word "agent" is used for lots of meanings in computing these days. Specifically:
Autonomous agents
are programs that do travel between sites, deciding themselves when to move and what to do. These can only travel between special servers and are currently not widespread on the Internet.
Intelligent agents
are programs that help users with tasks, such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.
User-agent
is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agents like Qualcomm Eudora.
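As an illustration of that last sense, every HTTP request carries a User-Agent header naming the client software; a robot conventionally sets it to its own name, often with a contact URL, so server operators can tell who is visiting. A small sketch in Python, with a made-up robot name and URL:

import urllib.request

# The User-Agent header identifies the program making the request.
# "ExampleRobot/1.0" and the contact URL are invented for illustration.
request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "ExampleRobot/1.0 (+http://www.example.com/about-robot.html)"},
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.headers.get("Content-Type"))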
What is a search engine?
A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.
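At its simplest, the database behind such a search form can be pictured as an inverted index mapping words to the URLs of the documents that contain them. A toy sketch in Python (the documents and URLs are invented):

from collections import defaultdict

# Toy document collection, as a robot might have gathered it.
documents = {
    "http://www.example.com/a.html": "web robots traverse the web",
    "http://www.example.com/b.html": "search engines index gathered documents",
}

# Build an inverted index: word -> set of URLs containing it.
index = defaultdict(set)
for url, text in documents.items():
    for word in text.split():
        index[word].add(url)

def search(word):
    # Return the URLs whose text contains the word.
    return sorted(index.get(word, set()))

print(search("web"))   # ['http://www.example.com/a.html']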
Robots can be used for a number of purposes:
* Indexing
* HTML validation
* Link validation
* "What's New" monitoring
* Mirroring
So what are Robots, Spiders, Web Crawlers, Worms, Ants?
They're all names for the same sort of thing, with slightly different connotations:
Robots
the generic name, see above.
Spiders
same as robots, but sounds cooler in the press.
Worms
same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers
same as robots, but note that WebCrawler is a specific robot.
WebAnts
distributed cooperating robots.
Aren't robots bad for the web?
There are a few reasons people believe robots are bad for the Web:
* Certain robot implementations can overload networks and servers (and some have in the past). This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
* Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
* Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.
But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.
So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.
How does a robot decide where to visit?
This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.
Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources of URLs are used, such as scanners of USENET postings, published mailing list archives, etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
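One way to picture this is as a "frontier": a queue seeded with the historical list, to which manually submitted URLs and newly parsed links are added as they are discovered. A small Python sketch, with invented names and URLs:

from collections import deque

class Frontier:
    # Queue of URLs still to visit, seeded with a starting list and
    # accepting further submissions (manual, or parsed from visited pages).
    def __init__(self, seeds):
        self._seen = set()
        self._queue = deque()
        for url in seeds:
            self.submit(url)

    def submit(self, url):
        if url not in self._seen:   # ignore URLs we already know about
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None

frontier = Frontier(["http://www.example.com/whats-new.html"])   # historical seed list
frontier.submit("http://www.example.com/submitted-page.html")    # manually submitted URL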
How does an indexing robot decide what to index?
If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: Some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.
We hope that as the Web evolves more facilities become available to efficiently associate metadata, such as indexing information, with a document. This is being worked on...
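As a concrete illustration of the simpler approaches mentioned above, the following sketch uses Python's standard html.parser to pull out just the title and named META tags; real indexers of course do far more:

from html.parser import HTMLParser

class IndexExtractor(HTMLParser):
    # Extracts the pieces a simple indexing robot might use:
    # the document title and any named META tags.
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

extractor = IndexExtractor()
extractor.feed('<html><head><title>Example</title>'
               '<meta name="description" content="A sample page"></head></html>')
print(extractor.title, extractor.meta)   # Example {'description': 'A sample page'}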
How do I register my page with a robot?
You guessed it, it depends on the service :-) Many services have a link to a URL submission form on their search page, or have more information in their help pages. For example, Google has Information for Webmasters.
This is referred to as "SEO" -- Search Engine Optimisation. Many web sites, forums, and companies exist that aim/claim to help with that.
But it basically comes down to this:
* In your site design, use text rather than images and Flash for important content
* Make your site work with JavaScript, Java and CSS disabled
* Organise your site such that you have pages that focus on a particular topic
* Avoid HTML frames and iframes
* Use normal URLs, avoiding links that look like form queries (http://www.example.com/engine?id)
* Market your site by having other relevant sites link to yours
* Don't try to cheat the system (by stuffing your pages with keywords, attempting to target specific content at search engines, or using link farms)
Can I use /robots.txt or meta tags to remove offensive content on some other site from a search engine?
No, because those tools can only be used by the person controlling the content on that site.
You will have to contact the site and ask them to remove the offensive content, and ask them to take steps to remove it from the search engine too. That usually involves using /robots.txt, and then using the search engine's tools to request the content to be removed.
How do I know if I've been visited by a robot?
You can check your server logs for sites that retrieve many documents, especially in a short time.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
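If you want to automate that last check, a few lines of Python can count, per client address, how often /robots.txt was requested. This assumes a Common Log Format access log; the file name is just an example:

import re
from collections import Counter

ROBOTS_TXT = re.compile(r'"GET /robots\.txt')

def robot_candidates(logfile):
    # Count, per client address (first field of a Common Log Format line),
    # how often /robots.txt was requested.
    hits = Counter()
    with open(logfile) as f:
        for line in f:
            if ROBOTS_TXT.search(line):
                hits[line.split()[0]] += 1
    return hits.most_common()

for host, count in robot_candidates("access_log"):   # example log file name
    print(host, count)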
A robot is traversing my whole site too fast!
This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.
First of all check if it is a problem by checking the load of your server, monitoring your server's error log, and watching concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a load of even several requests per second, especially if the visits are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps in investigating the problem later. Secondly, try to find out where the robot came from: what IP addresses or DNS domains. If you can identify a site this way, you can email the person responsible and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
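To help with that second step, a short script can rank the busiest client addresses in your access log and attempt a reverse DNS lookup so you know whom to contact. This again assumes a Common Log Format log, and the file name is just an example:

import socket
from collections import Counter

def busiest_clients(logfile, n=3):
    # Rank client addresses (first field of a Common Log Format line) by
    # request count, then try to resolve each to a DNS name you can contact.
    counts = Counter()
    with open(logfile) as f:
        for line in f:
            counts[line.split()[0]] += 1
    for addr, hits in counts.most_common(n):
        try:
            name = socket.gethostbyaddr(addr)[0]
        except OSError:
            name = "(no reverse DNS)"
        print(addr, name, hits)

busiest_clients("access_log")   # example log file name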
How do I prevent robots scanning my site?
The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:
User-agent: *
Disallow: /
but this only helps with well-behaved robots.
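For completeness, this is how a well-behaved robot consults that file before fetching anything, using Python's standard urllib.robotparser module (the robot name is invented). Robots that ignore /robots.txt simply never make this check:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# With "User-agent: *" / "Disallow: /" in place, this prints False for any path.
print(rp.can_fetch("ExampleRobot", "http://www.example.com/some/page.html"))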