Breakdown of this session…
Shari Thurow – Grantastic Designs Inc; Duplicate Content Issues
- What is duplicate content? Clustering and Filtering
- Replica of exact syntactic terms and sequence of terms
- A single formatting change will result in a unique fingerprint (CSS, word order, etc.)
- Clustering: limiting each site to one or two listings
- Don’t want affiliates delivering similar content
- Cluster – when the two listings are tabbed (indented) under each other: different pages, similar results
- “repeat the search with omitted results” – filters redundant content
- Filters: template stripping / boilerplate stripping (boilerplate is a section of HTML that is common to many different docs). Search engines strip all duplicated sections (nav, left nav, footer) and grab the rest as the fingerprint for the page
- Steps a search engine takes
- Collection, filtering and refining, indexing (last)
- Filtering
- Press releases: one on PRWeb and one on your own site (the filter sees the PRWeb copy as news and the one on your own site as part of the site)
- Content Evolution (only 0.8% of web content will change completely on a weekly basis).
- High quality content will not change over time. Sometimes high mutation rate is bad
- Host Name Resolution: search engines look at how many domains come from the same IP address and deliver redundant content. They can detect that.
- Shingle Comparison – word sets. Breaks a page down into sets (groups of adjacent words) to compare for similarity. Ex. – if you have multiple products, it breaks them up into different word sets. If these word sets repeat on a few pages, even if in a different order, they will be considered redundant content.
- To help with this, find your own pages that are similar and use robots exclusion (meta tag or robots.txt)
- Exclude all but your best traffic page, and make sure to canonicalize this content
- Hallway page – connects doorway pages (considered spam): putting up lots of links with different words (more keywords) that connect to similar content. The shingles all look the same to search engines.
- Scraping – spam. People copy exact content off your site and use it to rank in search engines. This is copyright infringement; file a DMCA report when this happens.
- If the CMS is delivering duplicates, use robots.txt exclusion and 301 redirects
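As a rough illustration of the shingle comparison noted above, here is a minimal Python sketch. It assumes 3-word shingles and a Jaccard overlap score; the session did not say which shingle size or similarity measure search engines actually use. The names `shingles` and `resemblance` and the sample pages are made up for this example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (runs of k adjacent words) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard overlap of the two shingle sets (an assumed measure)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical pages: the same two product blurbs in a different order,
# plus one unrelated page.
page_a = "Red running shoes for men. Lightweight trail shoes for women."
page_b = "Lightweight trail shoes for women. Red running shoes for men."
page_c = "Our office is closed on public holidays and weekends."

print(resemblance(page_a, page_b))  # 0.6
print(resemblance(page_a, page_c))  # 0.0
```

The two product pages share most of their shingles even though the blurbs are swapped, so they score high, while the unrelated page scores zero.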
Mikkel deMib Svendsen – deMib.com
- Common issues:
- www to non-www
- session IDs
- sort order parameters
- breadcrumb navigation
- URL rewriting
- WWW to non-WWW – Site canonicalization – use a redirect to combine both
- Session IDs
- If cookies are not working, the session ID gets added to the end of the URL, which makes a different URL for the same page. That can mean 200,000 different indexed pages. Major problems
- Dump all session info into a cookie for all users!!! Or identify spiders and strip the session ID for them only
- Customize Permalink Structure
- In WordPress you can do this: define how you want URLs to look
- Ex. Post name/post ID
- In WordPress, this rewrites the URL but doesn’t block the original one, so you end up with 2 identical pages.
- Solution – 301 redirect the non-official URL to the official rewritten URL
- WordPress also has a canonical URL plugin – gives server response code
- Forums – you see the same problem: an extra “page=1&pp=20” page parameter at the end of the URL gives you duplicate pages
- Solution to forums: issue 301 redirect
- Sort Orders: sorting by different columns results in a different URL structure for similar shingles on the site. Same content, sorted differently, 2 different URLs, and all versions end up indexed
- Solution to sort orders: identify spiders and issue a 301 for all non-official URLs
- Breadcrumb Nav
- Creates alternative URLs: /shoes/running/… or running/shoes
- Store user navigation in a cookie, so only users (not spiders) gain navigation traits
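Most of the fixes above come down to mapping every URL variant to one official URL and issuing a 301 for the rest. Here is a minimal Python sketch, under stated assumptions: the parameter names in `NON_CONTENT_PARAMS` and `DEFAULT_PARAMS` are hypothetical, non-www is assumed to be the official host, and `canonical_url` and `redirect_for` are illustrative helpers, not part of WordPress or any forum package. Spider detection, which the notes suggest for session IDs and sort orders, is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names: parameters that change the URL but not the content.
NON_CONTENT_PARAMS = {"sid", "sessionid", "phpsessid", "sort", "order"}
# Hypothetical defaults: values that only repeat what the bare URL shows.
DEFAULT_PARAMS = {"page": "1", "pp": "20"}


def canonical_url(url):
    """Map a URL variant to the single official URL (assumed non-www)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        name = key.lower()
        if name in NON_CONTENT_PARAMS:
            continue  # session IDs and sort orders
        if DEFAULT_PARAMS.get(name) == value:
            continue  # e.g. page=1&pp=20 on a forum thread
        kept.append((key, value))
    kept.sort()  # stable parameter order, so a=1&b=2 equals b=2&a=1
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))


def redirect_for(url):
    """Return (301, target) for a non-official URL, else (200, url)."""
    target = canonical_url(url)
    return (301, target) if target != url else (200, url)


print(redirect_for(
    "http://www.example.com/forum/thread.php?t=42&page=1&pp=20&sid=abc123"
))
# (301, 'http://example.com/forum/thread.php?t=42')
```

Redirecting sorted views for every visitor would break sorting for real users, which is presumably why the notes limit that 301 to spiders.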
Amit Kumar – Yahoo
- Snippets are OK. If you take, say, some content from Wiki, it’s OK to take a bit of content; just give a link back to the source
- Yahoo is focusing more on its sitemaster tool