One thing that is essential for any crawler-based search engine is a reasonable policy for your crawler.
Crawlers can consume a lot of resources from other people, so you have to be careful what you do. The most efficient way for a crawler to fetch documents would be to start with the robots.txt file, analyze it, and then download all documents from that host as fast as possible. That would minimize crawling time, because you only have to download the robots.txt file once and DNS requests are kept to a minimum.
Obviously the webmaster of the host wouldn’t be too happy about such a crawler policy. So I’m doing it like this: The crawler gets a chunk (about ten or twenty) of URLs from a single host in a row. It fetches the appropriate robots.txt file and downloads all of the ten or twenty URLs from that host with a minimum time lag of one second between requests. Once those URLs are downloaded, the crawler moves on to a different host.
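To make that concrete, here is a minimal Python sketch of the per-host chunk loop. The chunk size, the user-agent string "MyCrawler", and the use of urllib's robotparser are my own assumptions for illustration; my actual crawler may handle the details differently.

```python
import time
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

CHUNK_SIZE = 20       # URLs taken from one host in a row (assumed value)
REQUEST_DELAY = 1.0   # minimum time lag between requests to the same host

def crawl_host_chunk(urls):
    """Fetch a chunk of URLs that all belong to the same host,
    honouring robots.txt and waiting between requests."""
    parts = urlparse(urls[0])
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))

    # Download and parse robots.txt once for the whole chunk.
    robots = RobotFileParser()
    robots.set_url(robots_url)
    robots.read()

    documents = []
    for url in urls:
        if not robots.can_fetch("MyCrawler", url):
            continue                      # skip disallowed URLs
        with urlopen(url) as response:    # one fetch per allowed URL
            documents.append((url, response.read()))
        time.sleep(REQUEST_DELAY)         # be nice: at least one second between hits
    return documents
```

The crawl frontier would then hand the next chunk of ten or twenty URLs from a different host to this function, so no single server sees more than one request per second.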
I have found this policy to be a good compromise between keeping DNS and robots.txt traffic low and being nice to the web hosts out there.