Here’s a breakdown of the session:
Keith Hogan – Ask.com
- Less than 35% of servers have a robots.txt file
- Most are copied from one found online
- Typical length is 23 characters (100 is about the max)
- Format is not well understood
- May change to an XML format for better control in the near future
- Ask: you can find info about its crawler on the site’s About page
Eytan Seidman – MSN Live Search
Dan Crow – Google
- Also need to focus on robots exclusion – robots.txt + robots meta tags
- Exclusion protocol: tells search engines what not to index
- Compared to sitemaps, which tell search engines what to crawl
- Search engines have a lot of differences between them
- There is interest to standardize protocol for all search engines
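To make the exclusion idea concrete, here is a minimal sketch of how a crawler might check a URL against robots.txt rules, using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration; none of them come from the session.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/report.html"))  # False
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))  # True
```

The robots meta tag mentioned above covers the same ground at page level: it sits in a page's HTML head rather than in a site-wide file.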
Sean Suchter – Yahoo
- Yahoo! Slurp is Yahoo’s web crawling robot
- Yahoo! Slurp = user-agent identifier
- Supports all standard robots.txt commands (robotstxt.org)
- Custom extensions:
- Crawl delay
- Sitemap – specify location
- Wildcards – specify patterns of urls to disallow/allow
- Custom meta extensions – NOODP and NOYDIR – do not use Open Directory or Yahoo Directory titles
- Please address these only to the Yahoo robot, in its own section of the robots.txt file
- Currently supports crawl-delay
- Microformats.org/wiki/robots-exclusion
- Mark off sections of HTML you don’t want robots to use for matching
- Used to mark off useless template text, ad text, etc. that would bring irrelevant traffic
- Historically, exclusion was used to identify which pages not to show in search results
- But there is more beyond pages (CSS, images, inline text, iframes). Do we need a mechanism to exclude those in certain respects?
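The Slurp extensions listed above (crawl delay, sitemap location, wildcards) can be sketched as follows. This is an illustration under stated assumptions, not Yahoo's implementation: the rules and URLs are invented, and the wildcard matcher assumes `*` matches any run of characters and a trailing `$` anchors the end of the URL. Python's `urllib.robotparser` reads the crawl delay and sitemap lines, but it does not interpret wildcards, so `pattern_to_regex` and `is_blocked` are hypothetical helpers written for this example.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical section addressed only to Yahoo's crawler.
SLURP_RULES = """\
User-agent: Slurp
Crawl-delay: 5
Disallow: /*?sessionid=
Disallow: /*.pdf$
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SLURP_RULES.splitlines())

print(parser.crawl_delay("Slurp"))  # seconds to wait between requests
print(parser.site_maps())           # list of sitemap URLs (Python 3.8+)


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard Disallow pattern into a compiled regex.

    Assumed semantics: '*' matches any characters, trailing '$' means
    the URL must end there; otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, patterns: list) -> bool:
    """Return True if any wildcard pattern matches the URL path."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


# Pull the Disallow patterns straight out of the rules text.
patterns = [
    line.split(":", 1)[1].strip()
    for line in SLURP_RULES.splitlines()
    if line.lower().startswith("disallow:")
]

print(is_blocked("/docs/guide.pdf", patterns))
print(is_blocked("/cart?sessionid=abc", patterns))
print(is_blocked("/index.html", patterns))
```

Keeping these directives under a `User-agent: Slurp` line follows the request above to address them only to Yahoo's robot, since other crawlers may not understand them.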
Danny Sullivan – Host
- Check out WebmasterWorld’s robots.txt file – has lots of notes
- Wonders whether robots.txt should move to an XML format to make it easier
- Maybe combine the sitemap and robots to one file – check and prevent in one shot