[ale] File De-duplication

Jeff Hubbs jhubbslist at att.net
Fri Oct 18 17:00:23 EDT 2013


When I was running a previous employer's file server (which I built on 
Gentoo, btw, referencing the other thread), I would pipe find output to 
xargs to md5sum to sort, so that I got a text file I could eyeball to 
see where the dupes tended to be.  In my view it wasn't a big deal until 
you had, like, ISO images that a dozen or more people had copies of - if 
that's going on, there needs to be some housecleaning and organization 
taking place.  I suppose if you wanted, you could script something that 
moved dupes to a common area and generated links in their place, but I'm 
not sure that wouldn't introduce more problems than it solves.
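The pipeline described above can be sketched roughly like this (the demo
directory and file names are made up for illustration; on a real server
you'd point find at the share's root instead):

```shell
# Build a small demo tree containing a duplicate (hypothetical paths).
tmp=$(mktemp -d)
echo "alpha" > "$tmp/a.txt"
echo "alpha" > "$tmp/b.txt"   # same content as a.txt
echo "beta"  > "$tmp/c.txt"

# Hash every regular file, then sort by checksum so identical files
# land on adjacent lines - easy to eyeball in a text file.
find "$tmp" -type f -print0 | xargs -0 md5sum | sort > "$tmp/sums.txt"

# Optional: print only checksums that occur more than once.
cut -d' ' -f1 "$tmp/sums.txt" | uniq -d
```

Using -print0/-0 keeps filenames with spaces from splitting; sorting on
the checksum column is what makes the dupes cluster together.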

As for auto-de-duping filesystems - which I suppose involve some sort of 
abstraction between what the OS thinks are files and what actually goes 
on disk - I wonder whether some rather casual disk operations could set 
off a whole flurry of r/w activity and plug up the works for a little 
while.  Fun to experiment with, I'm sure.
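If you did want to experiment with the link-in-place idea from earlier, a
minimal sketch using hard links might look like this (file names are
hypothetical, and a real script would verify checksums match before
linking - this assumes the two files are already known to be identical):

```shell
# Two files with identical content (hypothetical demo paths).
tmp=$(mktemp -d)
echo "payload" > "$tmp/master.iso"
echo "payload" > "$tmp/copy.iso"   # the duplicate

# Replace the duplicate with a hard link to the kept copy.
# -f removes copy.iso first, then links it to master.iso's inode.
ln -f "$tmp/master.iso" "$tmp/copy.iso"

# Both names now point at the same inode, so the data is stored once.
ls -li "$tmp"
```

Hard links only work within one filesystem, and any in-place edit to one
name now changes "both" files - which is exactly the kind of surprise
that might introduce more problems than it solves.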

On 10/18/13 12:34 PM, Calvin Harrigan wrote:
> Good Afternoon,
>     I'm looking for a little advice/recommendation on file 
> de-duplication software. I've have a disk filled with files that most 
> certainly have duplicates.  What's the best way to get rid of the 
> duplicates.  I'd like to check deeper than just file name/date/size.  
> If possible I'd like to check content (checksum?).  Are you aware of 
> anything like that?  Linux or windows is fine.  Thanks
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://mail.ale.org/mailman/listinfo/ale
> See JOBS, ANNOUNCE and SCHOOLS lists at
> http://mail.ale.org/mailman/listinfo
>
