Digg Goes to the Source to Avoid Duplicates
Digg’s CSS started to go a little crazy about 10 minutes ago, which is usually a good sign that new features are about to show up… and Digg didn’t let us down.
The main part of this update is a new ‘Dupe Detection’ system that users have been screaming for. Like some of the other recent changes Digg has rolled out, this looks like a really useful one that should make the submission process easier and cut down on duplicate submissions.
The changes begin as soon as you attempt to submit an article. Digg immediately goes to the source of your submission and scrapes the page, analyzing various aspects to determine if the content might be a duplicate.
If Digg suspects the content might be a duplicate, it displays the possible duplicate submissions for you to review before you even begin the submission process.
Not only will this save you from wasting time on a submission that is only going to end up being a duplicate anyhow, it will also help to defeat some of the tricks users would try in order to fool the duplicate content system, such as adding extra parameters to the end of the submission URL.
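To see why the extra-parameter trick works against a check that only compares URLs, here is a minimal Python sketch. The URLs and the `strip_query` helper are made up for illustration and have nothing to do with Digg's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Hypothetical helper: drop the query string and fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


# Example URLs are invented for this sketch.
original = "http://example.com/story/123"
padded = "http://example.com/story/123?ref=digg&x=1"

# A plain string comparison treats these as two different stories.
print(original == padded)

# Stripping the query makes them compare equal again.
print(strip_query(original) == strip_query(padded))
```

Stripping parameters is a blunt fix, since some sites use the query string to identify the article itself, which is presumably one reason to compare the page content rather than the URL.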
“To better understand the nature of the problem, we analyzed the types of duplicate stories being submitted. Most common are the same stories from the same site, but with different URLs. Our R&D team came up with a solution that identifies these types of duplicates by using a document similarity algorithm. Look for a separate tech blog post on how this works, but it has proven to be a reliable way of identifying identical content from the same source.” — Digg Blog
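Digg has not published the details of its document similarity algorithm yet, so the following is only a rough sketch of one common approach: word shingles compared with Jaccard similarity. The function names, the shingle size and the 0.5 threshold are all assumptions made for illustration, not Digg's method.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Split text into lowercase words and return the set of overlapping word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: 1.0 is identical, 0.0 is disjoint."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def looks_like_dupe(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag two page texts as possible duplicates (threshold is an arbitrary guess)."""
    return jaccard(a, b) >= threshold


page_a = "Digg adds a new duplicate detection system today"
page_b = "Digg adds a new duplicate detection system this week"
print(looks_like_dupe(page_a, page_b))
```

Because the comparison runs on the scraped text, two copies of the same story should score high no matter what their URLs look like.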
At this point you either have to abandon the submission or click the ‘My story has no duplicates’ button, which could potentially put your account in jeopardy if it is indeed a duplicate story. (Keep in mind that users were warned about submitting duplicate content even before this system existed. Digg will ban accounts for submitting obvious duplicate content.)
“We’ll also be monitoring when certain Diggers choose to bypass high-confidence duplicates and will use this data to continue to improve the process going forward.” — Digg
Is the new system perfect?
No, as you can see in the picture below of a story that was released just a few minutes before its submission was attempted.
Digg found the following potential dupes:
However, at least you can make this determination at the beginning of the process instead of after you have filled out all the submission details.
If Digg does not detect any duplicate submissions after scanning the source, then you will not see this screen and will be shown Step 2 of the submission process.
One additional benefit of Digg scraping the source before you can submit is that it now automatically pulls the title and a description (the first paragraph) as a recommendation.
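For the curious, pulling a suggested title and description out of a page can be sketched in a few lines of Python with the standard library's `html.parser`. The class and function names here are hypothetical, and this is only a guess at the general idea, not how Digg actually does it.

```python
from html.parser import HTMLParser


class TitleAndFirstParagraph(HTMLParser):
    """Collects the text of <title> and of the first <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False
        self._in_p = False
        self._p_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p" and not self._p_done:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            # Stop after the first paragraph closes.
            self._in_p = False
            self._p_done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.description += data


def suggest_fields(html: str):
    """Return a (title, description) suggestion for the submission form."""
    parser = TitleAndFirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description.strip()


sample = (
    "<html><head><title> Example Story </title></head>"
    "<body><p>First <b>paragraph</b>.</p><p>Second.</p></body></html>"
)
print(suggest_fields(sample))
```

Real pages are much messier than this sample (navigation text, ads, empty paragraphs), so treat the output as a starting point that the submitter would still edit.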
In addition to the ‘Dupe Detection’ system being rolled out, Digg also made a few other minor changes.
Not sure how many people noticed, but Digg had removed the ‘views’ count from the DiggBar. Originally it only showed DiggBar views, and since non-Digg users were being redirected, the data was inaccurate at best.
With the new updates today, the ‘views’ counter has returned, and should be showing all views to the content, not just the DiggBar views.
When looking at the various sections of Digg, you will notice a ‘more’ button that appears when you hover over an article’s description. Clicking the ‘more’ button takes you to the article page, where you can view the ongoing discussion in the comments.
All in all, I think the addition of the new ‘Dupe Detection’ and the new submission process will help a lot in reducing duplication and the time it takes to submit your stories.
Edit: This post just hit the front page of Digg… Funny part is that it went to the front page at the exact same time as the Digg Blog post about… the same exact topic.
I actually did not know the Digg blog post was out when I wrote this article, and did not know it was up until Jen emailed me the link.
Also, the two articles are quite different: Digg’s announces the feature, while mine explains what it does and walks through the process.
So I would hardly consider the example above to be a failure in the detection system.