
SES NYC 2007 – Robots.txt

Here’s a breakdown of the session, speaker by speaker:

Keith Hogan – Ask.com

  • Less than 35% of servers have a robots.txt file
  • Most are copied from one found online
  • Typically 23 characters long (100 is about the max)
  • Format is not well understood
  • May change to an XML format for better control in the near future
  • Ask: information about its crawler is available on the site's About page

Eytan Seidman – MSN Live Search

Dan Crow – Google

  • Also need to focus on robots exclusion – robots.txt + robots meta tags
  • Exclusion Protocol: tells search engines what not to index
  • Compared to sitemaps, which tell search engines what to crawl
  • Search engines differ a lot in how they handle these
  • There is interest in standardizing the protocol across all search engines
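
To make the "robots meta tags" half of that concrete, here is a minimal Python sketch of how a crawler might read the directives from a page's robots meta tag. The class name `RobotsMetaParser`, the helper `robots_directives`, and the sample HTML are my own illustration, not something shown in the session, and real engines differ in which directives they honor.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of lowercase robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks not to be indexed or followed.
SAMPLE_PAGE = """
<html><head>
<meta name="robots" content="NOINDEX, nofollow">
<title>Private notes</title>
</head><body>Hello</body></html>
"""

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE_PAGE)))
```

A crawler would check the returned set for `noindex` before adding the page to its index. A page has to be crawled before its meta tag can be read, which is one way the meta tag differs from robots.txt.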

Sean Suchter – Yahoo

  • Yahoo! Slurp is Yahoo's web crawling robot
  • Yahoo! Slurp = user-agent identifier
  • Supports all standard robots.txt commands (robotstxt.org)
  • Custom
    • Crawl delay
    • Sitemap – specify location
    • Wildcards – specify patterns of urls to disallow/allow
    • Custom meta extensions – NOODP and NOYDIR – do not use Yahoo directory titles
  • Please put Yahoo-specific directives only in the section of the robots.txt file that addresses the Yahoo robot
  • Currently supports crawl-delay
    • Often misused
  • Microformats.org/wiki/robots-exclusion
    • Mark off sections of HTML you don’t want robots to use for matching
    • Used to mark off useless template text, ad text, etc., which otherwise brings irrelevant traffic
  • Historically – robots exclusion identified which pages not to show in search results
  • But… there is more beyond whole pages (CSS, images, inline text, iframes); do we need a mechanism to exclude those in certain respects?
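
To picture the Yahoo-specific section, crawl-delay, sitemap and wildcard items above, here is a rough Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `www.example.com` URLs and the helper names are made up for illustration. The standard-library parser does plain prefix matching only, so the wildcard part is a separate hand-rolled function that assumes `*` matches any run of characters and a trailing `$` anchors the end of the URL; check each engine's documentation for its exact rules.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section just for Yahoo's crawler, one for
# everyone else, plus a sitemap location.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def wildcard_rule_matches(pattern, path):
    """Assumed wildcard semantics: '*' = any characters, trailing '$' = end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    # Slurp follows only its own section; other crawlers fall back to '*'.
    print(rules.can_fetch("Slurp", "http://www.example.com/private/page.html"))
    print(rules.can_fetch("Slurp", "http://www.example.com/tmp/cache.html"))
    print(rules.can_fetch("OtherBot", "http://www.example.com/tmp/cache.html"))
    print(rules.crawl_delay("Slurp"))   # seconds between requests
    print(rules.site_maps())            # needs Python 3.8 or newer
    print(wildcard_rule_matches("/*.pdf$", "/docs/report.pdf"))
```

Note that once a crawler has its own section it ignores the generic `*` section entirely, so any rule in `*` that should also apply to Slurp has to be repeated in the Slurp section.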

Danny Sullivan – Host

  • Check out WebmasterWorld’s robots.txt file – it has lots of notes
  • Wonders whether robots.txt should move to an XML format to make it easier
    • Maybe combine the sitemap and robots files into one – check and prevent in one shot
