
SES NYC 2007 – Duplicate Content & Multiple Site Issue – 04/11/07

Breakdown of this session…

Shari Thurow – Grantastic Designs Inc; Duplicate Content Issues

  • What is duplicate content? Clustering and Filtering
    • Exact duplicate: a replica of the same terms in the same sequence
    • A single change (CSS, word order, etc.) results in a unique fingerprint
    • Clustering: limiting each site to one or two listings
    • Search engines don’t want affiliates delivering similar content
    • Cluster: two listings from the same site shown together, the second indented under the first (different pages, similar results)
    • “repeat the search with omitted results”: the link shown when redundant content has been filtered out
    • Filters: template stripping / boilerplate stripping (boilerplate is a section of HTML that is common to many different documents). Search engines strip the repeated parts (top nav, left nav, footer) and use the rest as the fingerprint for the page
  • Steps a search engine takes
    • Collection, filtering and refining, indexing (last)
  • Filtering
    • Press releases: one copy on PRWeb and one on your own site (the filter sees the PRWeb copy as news and the copy on your own site as part of the site)
  • Content Evolution (only 0.8% of web content will change completely on a weekly basis).
  • High-quality content does not change much over time. A high mutation rate is sometimes a bad sign
  • Host Name Resolution: search engines look at how many domains resolve to the same IP address and deliver redundant content. They can detect that.
  • Shingle Comparison: the page is broken down into word sets (shingles), i.e. groups of adjacent words, which are compared for similarity. Example: if you have multiple products, each description is broken up into word sets. If these word sets repeat on a few pages, even in a different order, the pages will be considered redundant content.
    • To help with this, find your own pages that are similar and use robots exclusion (meta tag or robots.txt)
    • Exclude all but your best traffic page, and make sure to canonicalize this content
  • Hallway page: connects doorway pages (considered spam). It puts up lots of links with different words (more keywords) that point to similar content. The shingles all look the same to search engines.
  • Scraping (spam): people copy exact content off your site and use it to rank in search engines. This is copyright infringement; file a DMCA report when it happens.
  • If your CMS is delivering duplicates, use robots.txt exclusion and 301 redirects
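
To make the shingle comparison idea concrete, here is a minimal Python sketch. It is only an illustration of the general technique (word shingles compared by set overlap), not how any particular search engine does it. The function names, the shingle size `k`, and the sample product text are all my own assumptions, not something given in the session.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (runs of k adjacent words) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard overlap of the two shingle sets: 0.0 (nothing shared) to 1.0 (identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical product blurbs: same two descriptions, swapped order.
page_a = "Red running shoes with cushioned sole. Blue trail shoes with rugged grip."
page_b = "Blue trail shoes with rugged grip. Red running shoes with cushioned sole."

score = resemblance(page_a, page_b)
```

Because the shingles inside each product description survive the reordering, the two sample pages still share most of their shingles. That is the "even in a different order" point from the talk. A site owner could run something like this over their own pages to find candidates for exclusion.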

Mikkel deMib Svendsen – deMib.com

  • Common issues:
    • www vs. non-www
    • session IDs
    • sort order parameters
    • breadcrumb navigation
    • URL rewriting
  • WWW vs. non-WWW (site canonicalization): use a redirect to combine both into one
  • Session IDs
    • If cookies are not working and the session ID is appended to the end of the URL, the same page gets a different URL for every session. Example: 200,000 different indexed pages. Major problems
    • Dump all session info into a cookie for all users! Or identify spiders and strip the session ID for them only
  • Customize Permalink Structure
    • In WordPress you can do this: define how you want URLs to look
    • Ex. Post name/post ID
    • In WordPress, this rewrites the URL but doesn’t block the original one, so you end up with 2 identical pages.
    • Solution: 301 redirect the non-official URL to the official rewritten URL
    • WordPress also has a canonical URL plugin, which returns the appropriate server response code
    • Forums: you see the same problem. An extra “page=1&pp=20” parameter at the end of the URL gives you duplicate pages
    • Solution to forums: issue a 301 redirect
    • Sort Orders: sorting by different columns results in a different URL structure for similar shingles on the site. Same content, sorted differently, 2 different URLs, and all versions end up indexed
    • Solution to sort orders: identify spiders and issue a 301 for all non-official URLs
  • Breadcrumb Navigation
    • Creates alternative URLs: /shoes/running/… or running/shoes
    • Store the user’s navigation path in a cookie instead of the URL, so only users (not spiders) get the path-specific breadcrumb
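
Most of the fixes above boil down to the same move: work out the one official URL for a page and 301 everything else to it. Below is a rough Python sketch of that logic using only the standard library. The host name, the parameter names (`sid`, `phpsessid`, `sort`, `order`, `pp`) and the function names are hypothetical examples of mine, so swap in whatever your own site actually uses. Spider detection is left out; for sort orders the talk suggested doing this for spiders only, since real users still need the sort parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed values, for illustration only.
OFFICIAL_HOST = "www.example.com"
BARE_HOST = "example.com"
DROP_PARAMS = {"sid", "phpsessid", "sort", "order", "pp"}  # session, sort, page-size
DEFAULT_PARAMS = {("page", "1")}  # parameters that only restate the default view


def canonical_url(url):
    """Return the official form of url: www host, no session/sort/default parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = OFFICIAL_HOST
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROP_PARAMS and (key.lower(), value) not in DEFAULT_PARAMS
    ]
    # The fragment is dropped as well; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, host, parts.path or "/", urlencode(kept), ""))


def redirect_for(url):
    """Return (301, target) if url is a non-official variant, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None
```

For example, `redirect_for("http://example.com/forum/thread?page=1&pp=20")` would send the request to the www version without the extra forum parameters, while a URL that is already in its official form returns `None` and is served normally.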

Amit Kumar – Yahoo

  • Snippets are OK. If you take, say, some content from Wiki, it’s fine to take a bit of content; just give a link back to the source
  • Yahoo is focusing more on its sitemaster tool
