<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Morpheus Media Mlog &#187; spiders</title>
	<atom:link href="http://www.morpheusmedia.com/mlog/tag/spiders/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.morpheusmedia.com/mlog</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Thu, 06 Jan 2011 22:20:36 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>SES NYC 2007 &#8211; Robots.txt</title>
		<link>http://www.morpheusmedia.com/mlog/archive/ses-nyc-2007-robotstxt/</link>
		<comments>http://www.morpheusmedia.com/mlog/archive/ses-nyc-2007-robotstxt/#comments</comments>
		<pubDate>Thu, 12 Apr 2007 14:36:43 +0000</pubDate>
		<dc:creator>Toby Evers</dc:creator>
				<category><![CDATA[Archive]]></category>
		<category><![CDATA[crawlers]]></category>
		<category><![CDATA[googlebot]]></category>
		<category><![CDATA[robots.txt]]></category>
		<category><![CDATA[SES NYC]]></category>
		<category><![CDATA[sitemaps]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[Web Development]]></category>

		<guid isPermaLink="false">http://www.morpheusmedia.com/mlog/?p=42</guid>
		<description><![CDATA[Here&#8217;s a breakdown of the session
Keith Hogan – Ask.com

Less than 35% of servers have a robots.txt file
Most are copied from one found online
Typically 23 characters long (100 is about the max)
Format is not well understood
May change to xml format for better control in the near future
Ask: can find info about a crawler on site in the [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Here&#8217;s a breakdown of the session</p>
<p class="MsoNormal"><strong>Keith Hogan</strong> – Ask.com</p>
<ul type="disc" style="margin-top: 0in">
<li class="MsoNormal">Less than 35% of servers have a robots.txt file</li>
<li class="MsoNormal">Most are copied from one found online</li>
<li class="MsoNormal">Typically 23 characters long (100 is about the max)</li>
<li class="MsoNormal">Format is not well understood</li>
<li class="MsoNormal">May change to xml format for better control in the near future</li>
<li class="MsoNormal">Ask: info about its crawler can be found on the site's About page.</li>
</ul>
<p class="MsoNormal"><strong>Eytan Seidman</strong> – MSN Live Search</p>
<ul type="disc" style="margin-top: 0in">
<li class="MsoNormal">Used the example <a href="http://www.hilton.com/robots.txt">www.hilton.com/robots.txt</a> (it tells crawlers not to visit during the day… funny)</li>
</ul>
<p class="MsoNormal"><strong>Dan Crow</strong> – Google</p>
<ul type="disc" style="margin-top: 0in">
<li class="MsoNormal">Also need to focus on robots exclusion – robots.txt + robots meta tags</li>
<li class="MsoNormal">Exclusion Protocol: tells search engines what not to index</li>
<li class="MsoNormal">Compare with sitemaps, which tell search engines what to crawl</li>
<li class="MsoNormal">Search engines differ a lot in how they handle these</li>
<li class="MsoNormal">There is interest in standardizing the protocol across all search engines</li>
</ul>
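<p class="MsoNormal">To illustrate the exclusion side of these points, here is a minimal sketch (not from the session) that uses Python's standard urllib.robotparser module to check whether a crawler may fetch a URL. The robots.txt content, the ExampleBot user agent, and the example.com URLs are all made up for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot and example.com are placeholders, not real crawler names or sites.
    for path in ("/private/page.html", "/public/page.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

<p class="MsoNormal">This only covers robots.txt; a robots meta tag lives in the page itself and would have to be read from the HTML separately.</p>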
<p class="MsoNormal"><strong>Sean Suchter</strong> – Yahoo</p>
<ul type="disc" style="margin-top: 0in">
<li class="MsoNormal">Yahoo! Slurp is Yahoo's web-crawling robot</li>
<li class="MsoNormal">Yahoo! Slurp = user-agent identifier</li>
<li class="MsoNormal">Supports all standard robots.txt commands (robotstxt.org)</li>
<li class="MsoNormal">Custom extensions</li>
<li style="list-style-type: none; list-style-image: none; list-style-position: outside">
<ul type="circle" style="margin-top: 0in">
<li class="MsoNormal">Crawl delay</li>
<li class="MsoNormal">Sitemap – specify location</li>
<li class="MsoNormal">Wildcards – specify patterns of URLs to disallow/allow</li>
<li class="MsoNormal">Custom meta extensions – NOODP and NOYDIR – do not use Open Directory or Yahoo Directory titles</li>
</ul>
</li>
<li class="MsoNormal">Please put Yahoo-specific directives only in the section of the robots.txt file addressed to the Yahoo robot</li>
<li class="MsoNormal">Currently supports crawl-delay</li>
<li style="list-style-type: none; list-style-image: none; list-style-position: outside">
<ul type="circle" style="margin-top: 0in">
<li class="MsoNormal">Often misused</li>
</ul>
</li>
<li class="MsoNormal">Microformats.org/wiki/robots-exclusion</li>
<li style="list-style-type: none; list-style-image: none; list-style-position: outside">
<ul type="circle" style="margin-top: 0in">
<li class="MsoNormal">Mark off sections of HTML you don’t want robots to use for matching</li>
<li class="MsoNormal">Used to mark off useless template text, ad text, etc., which bring irrelevant traffic</li>
</ul>
</li>
<li class="MsoNormal">Historically, exclusion was used to identify which pages not to show in search results</li>
<li class="MsoNormal">But there is more beyond pages (CSS, images, inline text, iframes); do we need a mechanism to exclude those in certain respects?</li>
</ul>
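<p class="MsoNormal">As a rough sketch of the crawl-delay and sitemap extensions mentioned above (again, not from the session), Python's standard urllib.robotparser can read both from a robots.txt file. The file content, the ExampleBot name, and the example.com URLs below are assumptions for illustration, and site_maps() needs Python 3.8 or later. Wildcard patterns are left out because, as far as I know, this parser only does simple prefix matching.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Slurp-only section, as the notes suggest.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
Disallow: /search/

User-agent: *
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawl-delay applies only to the section that declares it.
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("ExampleBot")  # placeholder crawler name

# Sitemap lines are global rather than tied to a user-agent section.
sitemaps = parser.site_maps()

# The Slurp section also blocks /search/ for that crawler only.
slurp_search_ok = parser.can_fetch("Slurp", "http://www.example.com/search/q")

if __name__ == "__main__":
    print("Slurp delay:", slurp_delay)
    print("Other delay:", other_delay)
    print("Sitemaps:", sitemaps)
    print("Slurp may fetch /search/q:", slurp_search_ok)
```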
<p class="MsoNormal"><strong>Danny Sullivan</strong> – Host</p>
<ul type="disc" style="margin-top: 0in">
<li class="MsoNormal">Check out WebmasterWorld’s robots.txt file – it has lots of notes</li>
<li class="MsoNormal">Wonders whether robots.txt should move to an XML format to make it easier</li>
<li style="list-style-type: none; list-style-image: none; list-style-position: outside">
<ul type="circle" style="margin-top: 0in">
<li class="MsoNormal">Maybe combine the sitemap and robots to one file – check and prevent in one shot</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.morpheusmedia.com/mlog/archive/ses-nyc-2007-robotstxt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

