Duplicate Content – Double Standards for Trusted Sites
February 5th, 2009, by Christoph C. Cemper
Ye ole’ Duplicate Content Penalty
Duplicate content and the “duplicate content penalty” have been around for at least 4 years – especially since Google started weeding out gazillions of duplicate datafeed sites in 2004 (including some of mine that were REALLY duplicate before I took countermeasures…). Aaron Wall and Caveman, for example, wrote about this years ago.
Talk talk talk …
Duplicate content has been a topic on panels and in discussions at ALL of the dozens of search marketing conferences I’ve been to in the last 5 years …
And even Google finally published an explanation of dupe content penalties on their blog at the end of 2006 (wow – quick response) and a year later even in their Google Webmaster Guidelines … (please note how they rewrote their guideline post in a clever way so as not to trip their own dupe content filters).
So when it comes to dupe content, webmasters have been
- trained,
- warned
- and even threatened (by scraper sites or malicious competitors) with duplicates of their own website’s content.
Why?
Because once Google detects that your site or page contains duplicate content, it removes it from the results or at least pushes it back hundreds of positions.
So we’re urged to create unique and compelling content for the users and the search engines. Or be doomed.
All of us? No, not all of us… I found a nice exception…
Double Standard for Dupe Content?
During our recent research on highly trusted Canadian university domains we found at least one interesting example where the trust placed in completely duplicate domains seems to override what webmasters have been trained on for years…
The University of Toronto maintains both the utoronto.ca and the toronto.edu domains.
So dig this – they run the same content on both domains, don’t redirect, and yet
- both rank and get nice traffic
- according to Compete.com, the dupe domain toronto.edu still had 30k unique visitors last December – 10% of the real site!
No sign of a duplicate content penalty.
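If you want to check for yourself whether two domains really serve the same page, one rough way is to compare fingerprints of the visible text. The following is only a minimal sketch: the function names `fingerprint` and `is_duplicate` are made up for illustration, the sample HTML is invented, and the regex tag stripping is a crude assumption, not how Google detects duplicates.

```python
import hashlib
import re


def fingerprint(html: str) -> str:
    """Hash the visible text of a page, ignoring tags, case and whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag stripping (assumption)
    normalized = " ".join(text.split()).lower()  # collapse whitespace, lowercase
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(html_a: str, html_b: str) -> bool:
    """True if both pages have identical normalized text."""
    return fingerprint(html_a) == fingerprint(html_b)


# Invented sample pages standing in for the two homepages.
page_ca = "<html><body><h1>University of Toronto</h1>\n<p>Welcome</p></body></html>"
page_edu = "<html><body>\n<h1>University  of Toronto</h1><p>welcome</p>\n</body></html>"
page_other = "<html><body><h1>Some other site</h1></body></html>"

print(is_duplicate(page_ca, page_edu))
print(is_duplicate(page_ca, page_other))
```

In practice you would fetch the same path from both domains and feed the two responses into `is_duplicate`. An exact hash only catches identical text, so near-duplicates would need a fuzzier comparison.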
Dupe Site Stats compared
Here we go with some of the stats for the two domains
| Url | Juice | G!PR | DomKW | DFS | aFD | whoisdate | Y!BL | Y!edu | Y!ac.uk | Y!uniDE | Y!uniCA | Y!gov |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| utoronto.ca | Juice/9 | 8 | 38,201 | 01.08.1995 | 01.12.1997 | 01.08.1995 | 232,000 | 10,100 | 746 | 489 | 243,734 | 225 |
| toronto.edu | Dupe/22 | 8 | 5,713 | 01.08.1995 | 01.06.1997 | 08.05.1986 | 2,910 | 1,310 | 5 | 16 | 357 | 10 |
I had to split the table as even these few columns break the layout here
| Url | G!BL | G!idx | G!cache | Y!BLint | Y!LD | Alexa | Cuqv | M!FL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| utoronto.ca | 5,990 | 42,500 | 03.02.2009 | 10,300.00 | 232,000 | 3,312 | 313,734 | 38,400,000 |
| toronto.edu | 5,990 | 3,090 | 03.02.2009 | 0 | 2,910 | 33,511 | 29,490 | 4,950,000 |
Ok, lots of weird headings – I know. What you see is a subset of the 55 different search engine marketing related parameters we track historically, so here is a brief explanation of the abbreviations used above:
- Juice: Juice Indicator (cemper.com)
- G!PR: Google Page Rank (PR)
- DomKW: Keywords driving traffic to the domain (SemRush)
- DFS: Date first seen using netcraft.com
- aFD: First time a domain was cached by archive.org
- whoisdate: Domain creation date
- Y!BL: Yahoo backlinks
- Y!edu: Yahoo .edu domain backlinks
- Y!gov: Yahoo .gov domain backlinks
- Y!uniDE: Yahoo backlinks from German universities
- Y!uniCA: Yahoo backlinks from Canadian universities
- G!BL: Google backlinks (attention: crippled!)
- G!idx: Google indexed pages
- G!cache: Google page cache date (attention: often crippled!)
- Y!BLint: Yahoo internal backlinks (from a page to its home-site)
- Y!LD: Yahoo domain backlinks
- Alexa: Alexa traffic rank
- Cuqv: Unique visitors per month (Compete.com)
- M!FL: MSN forward links
Interpretation
Now let’s see what we can read from these few datapoints already
- Both domains seem to have existed since 1986 – although the .ca registry was messed up in 1995. Yes, they are very old & crusty.
- Both domains maintain a PR of 8. No sign of a pagerank removal to signal a penalty
- utoronto.ca ranks for 38,201 unique keyword traffic phrases according to SemRush (an awesome organic keyword tracker!)
- toronto.edu ranks for 5,713 unique phrases according to SemRush
- There are ZERO internal links from toronto.edu going to toronto.edu … Broken? No – this just shows that they simply serve the content made for utoronto.ca under the toronto.edu domain.
- There’s a fraction of the backlinks going to toronto.edu
- 3000 accidental links you would LOVE to have
- 1310 accidental .edu links!
- 357 accidental .ca uni links (not to mention those 16 German and 5 British university links!)
- 10 accidental .gov links – at least give me those
- Our Juice Indicator tool correctly identifies the .edu as duplicate (“Dupe/22”) – at least the homepage doesn’t rank as it should for unique phrases (well yes, it’s the same stuff as on utoronto.ca, which is JUICY – what do you expect)
- The number of Google backlinks reported is EQUAL for both domains. This again shows that the Google backlink command was crippled 5 years ago and has changed its rules internally several times since (as to which weird numbers to show). It appears that they somehow understand that the two sites are equal, yet let both of them rank.
- The Google cache command shows something accurate for the homepage – but note that the Google cache results have been crazy for 2+ years now
- Also, Live/MSN shows 5 million links going OUT of toronto.edu as opposed to 38.4M for utoronto.ca – we can assume that those would count in Live.
To sum up, it’s quite interesting to see that BOTH domains rank pretty well, for thousands of phrases – despite the duplicate content.
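To put the "fraction" in numbers, here is a small sketch that expresses the toronto.edu figures as a percentage of the utoronto.ca figures. The values are copied from the two stats tables above; the `STATS` dictionary and the `dupe_share` helper are just illustrative names, not part of any tool we use.

```python
# Figures copied from the stats tables: (utoronto.ca, toronto.edu)
STATS = {
    "DomKW": (38_201, 5_713),
    "Y!BL": (232_000, 2_910),
    "Y!edu": (10_100, 1_310),
    "Cuqv": (313_734, 29_490),
    "M!FL": (38_400_000, 4_950_000),
}


def dupe_share(metric: str) -> float:
    """Return the toronto.edu value as a percentage of the utoronto.ca value."""
    main, dupe = STATS[metric]
    return round(100 * dupe / main, 1)


for metric in STATS:
    print(f"{metric:6} {dupe_share(metric):5.1f}%")
```

The point of the comparison: the dupe domain holds only a sliver of the backlinks, yet a noticeably larger slice of the keywords and visitors.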
It’s still about Age & Trust
What’s the double standard here? The duplicate domain toronto.edu simply
- has TONS of great links
- is OLD & CRUSTY
- therefore seems to be TRUSTED ENOUGH by Google to serve 100% dupe content
So again, it’s still a matter of AGE & TRUST - large sites get away with dupe content as we see. Or as Todd Malicoat said years ago already (think it was him) – big brands get away with a lot more than small webmasters!
What would you do if you were the SEO for the University of Toronto?
Redirect the .edu to the utoronto.ca with a 301?
Probably, but what if, according to Patrick Altoft, the anchor text passed via a 301 stops counting at some point? They would then lose the rankings from those 3000 links… I’m sure.
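For reference, this is roughly what such a 301 looks like on the wire. It is a bare-bones sketch using only Python's standard library; the canonical origin `http://www.utoronto.ca`, the port and the names `canonical_location`, `RedirectHandler` and `serve` are assumptions for illustration, and a real site would normally configure this in its web server rather than in a script.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the .ca host is the version that should keep ranking.
CANONICAL_ORIGIN = "http://www.utoronto.ca"


def canonical_location(path: str) -> str:
    """Map a path requested on the dupe domain to the same path on the canonical one."""
    if not path.startswith("/"):
        path = "/" + path
    return CANONICAL_ORIGIN + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD with a permanent redirect to the canonical domain."""

    def _redirect(self):
        self.send_response(301)  # 301 = Moved Permanently
        self.send_header("Location", canonical_location(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect


def serve(port: int = 8080) -> None:
    """Run the redirect-only server until interrupted (hypothetical port)."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

Calling `serve()` on the host behind the dupe domain would send every request to the same path on the canonical domain, so deep links keep their target instead of all landing on the homepage.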
What’s your take?



February 6th, 2009 at 4:47 pm
I’m with Gero on this one, I’ve seen this dozens of times. Google is associating some of the pages with one domain and the rest of the pages with the other domain. This is pretty common when a website is mirrored on two domains. Just look at the cache of the homepage. http://74.125.47.132/search?sourceid=navclient-ff&ie=UTF-8&q=cache%3Ahttp%3A%2F%2Fwww.toronto.edu%2F
The homepage cache is associated with the utoronto.ca domain. You won’t find the homepage cached on the other domain (check for yourself). If you want to investigate dup content, the first place to look is Google’s index, to see which pages are getting cached.
February 5th, 2009 at 4:02 pm
Wow, what a great analysis! Thanks for this really fine piece of information about trusted sites ranking with dupe content. Also thanks for doing so much for the community by sharing all this on your site for free Chris.
Best from Hannover,
Max
February 5th, 2009 at 5:15 pm
Thank you for this good article. It is a nice analysis about duplicate content.
February 5th, 2009 at 5:57 pm
I have found quite by accident exactly the same results as you found. Last year I was testing a CMS we built to manage a bunch of sites. I simply copied and pasted an article about the property market in Paris that had just been published by a major news outlet into the “new post” form for an old, trusted site I owned. I published the exact copy and within 20 minutes it was indexed and was beating the major news outlet in Google SERPs!
February 5th, 2009 at 6:09 pm
For most of the duplicate content, only one version is indexed and ranked, e.g. http://www.utoronto.ca/tsq/ vs. http://www.toronto.edu/tsq/, http://www.utoronto.ca/ceres/ vs. http://www.toronto.edu/ceres/ – one version has good PageRank and the other version has “bad rank”… the site: query results for the two domains are also completely different…
I’ve seen a lot of domains with identical content and their own backlinks, where Google indexed some documents under one domain, some under the other and some under both.
In your example both domains are strong enough to keep some of their pages in the index and rank them.
I still believe there is no such thing as a DC penalty, but rather a DC filter that kills the dupes. The problem is that you split domain & link popularity across both domains and lose the incoming links for the dropped dupes. Therefore a 301 would be a good idea.
February 5th, 2009 at 6:39 pm
Did you do any analysis on wapedia.mobi? I’m impressed that they often rank very well just having duplicated the content from wikipedia.org.
February 8th, 2009 at 3:49 am
Have you ever searched for song lyrics, recent news stories or even common public domain articles? You’ll find that the so-called “duplicate content penalty”, where Google will only show one URL with ‘duplicate content’ in results, is a complete myth. Now, which one is on top of the results is a function of the site’s authority and backlinks, but there is no penalty as such.
July 13th, 2009 at 4:03 pm
I used to publish my articles, but now I wonder whether I should stop doing this because of the risk of a duplicate content penalty. Should I stop publishing my articles on article directories?
July 13th, 2009 at 4:28 pm
@Introspective: yes, in fact if you are publishing your own content on your own site AND the article directories you’re running the risk of doing more harm than good to your site
@frank carr: I agree that Google sometimes lists a fraction of the duplicated pages in the unfiltered results. It’s not a myth, but Google tries to present only a small selection of DUPE content… From my observations, typically a 10th of the dupe content pages are shown – at max. So if you see 100 results for a duplicated song lyric, you can be SURE that there are 1000 pages carrying that lyric that are filtered (and only a part of them shown in the unfiltered results).
January 27th, 2010 at 4:14 pm
I did some testing recently: if you search for a keyword that is so common that nobody would even think of optimizing a website for it, like maybe “com”, Google preferably digs up websites from the .com TLD, which is quite natural. What I found amazing is that large sites with a lot of content and/or a high change frequency (of the website in total, not necessarily of a single document) seem to be preferred. So this would confirm your conclusion that smaller sites are disadvantaged.