Simple brute force duplicate file identification

October 9, 2015

Here is a way to identify files that have duplicates.

find dir -type f -print0 | xargs -0 md5sum > filelist.txt
sort filelist.txt > filesort.txt
uniq -w 33 -D filesort.txt

# more legible
uniq -w 33 --all-repeated=separate filesort.txt # also --all-rep=sep works

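This shows which files have duplicates. The -w 33 option makes uniq compare only the first 33 characters of each line: the 32-character md5 digest plus the space that follows it. I saved the intermediate results to files instead of piping everything together, so you can go back to filesort.txt and look up every file that shares a given md5.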
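As a rough sketch of that lookup step: the helper below prints every file name in the sorted list that carries a given checksum. The name files_with_md5 is made up for this example, and it assumes the standard md5sum line layout (32 hex digits, two separator characters, then the file name).

```shell
# files_with_md5 HASH LISTFILE
# Hypothetical helper: print the names of all files recorded with HASH.
# Assumes md5sum's layout: columns 1-32 hold the digest, columns 33-34
# the separator, and the file name starts at column 35.
files_with_md5() {
    grep "^$1 " "$2" | cut -c35-
}

# Usage (the hash here is only a placeholder value):
#   files_with_md5 d41d8cd98f00b204e9800998ecf8427e filesort.txt
```

Anchoring the pattern with ^ and a trailing space keeps grep from matching a hash-like string that happens to appear inside a file name.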

Make sure you actually compare the files before treating them as duplicates. Two different files can have the same md5sum. Such files will likely differ in size, but it is also possible for two different files of the same size to have the same md5sum. For fewer false positives, use sha256sum instead (slower). Its digest is 64 characters long, so change -w 33 to -w 65.
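One way to do that comparison, as a minimal sketch: feed the grouped uniq output to a small function that byte-compares every file in a group against the first file of that group with cmp. The function name verify_groups is made up for this example, and it assumes file names without newlines or backslashes (md5sum escapes those).

```shell
# verify_groups: read `uniq --all-repeated=separate` output on stdin and
# byte-compare each file in a group against the first file of that group.
verify_groups() {
    first=""
    while IFS= read -r line; do
        if [ -z "$line" ]; then      # a blank line separates two groups
            first=""
            continue
        fi
        file=${line#* }              # drop the checksum and the first space
        file=${file#?}               # drop the second separator (space or *)
        if [ -z "$first" ]; then
            first=$file              # remember the reference file of the group
        elif cmp -s "$first" "$file"; then
            echo "identical: $first == $file"
        else
            echo "DIFFERENT: $first != $file"
        fi
    done
}

# Usage:
#   uniq -w 33 --all-repeated=separate filesort.txt | verify_groups
```

cmp -s prints nothing and only sets its exit status, so the function's own messages are the only output. Any DIFFERENT line would mean a real checksum collision.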
