Library Kata "Detect file duplicates"

Develop a library that can be used to find file duplicates in a directory tree.

The library's contract should look like this:

interface IDubletteCheck {
	IEnumerable<IDublette> Collect_Candidates(string path);
	IEnumerable<IDublette> Collect_Candidates(string path, ComparisonModes mode);
	
	IEnumerable<IDublette> Check_Candidates(IEnumerable<IDublette> candidates);
}

interface IDublette {
	IEnumerable<string> filepaths {get;}
}

enum ComparisonModes {
	size_and_name,
	size
}

First, the method Collect_Candidates() is called. It walks all files in a directory tree and compares them only roughly. By default, file name and file size are compared; optionally, only the file size is compared. Files that are considered equal on this basis are returned together as an IDublette.
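
The rough first pass could be sketched as follows. This is only one possible approach, not the kata's reference solution: the names CandidateCollector, Collect and Mode are hypothetical stand-ins, and each group of paths plays the role of one IDublette.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical stand-in for the kata's comparison mode enum.
public enum Mode { SizeAndName, Size }

public static class CandidateCollector
{
    // Groups all files below `root` by size (and, by default, by name).
    // Only groups with at least two files are candidates for duplicates.
    public static List<List<string>> Collect(string root, Mode mode = Mode.SizeAndName)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Select(path => new FileInfo(path))
            // In Size mode the name part of the key is constant, so only size matters.
            // Names are compared case-sensitively in this sketch.
            .GroupBy(info => (Size: info.Length,
                              Name: mode == Mode.SizeAndName ? info.Name : string.Empty))
            .Where(group => group.Count() > 1)
            .Select(group => group.Select(info => info.FullName).ToList())
            .ToList();
    }
}
```

A call such as CandidateCollector.Collect(@"C:\data", Mode.Size) would return one list of full paths per candidate group. The sketch does not handle directories that cannot be read; a real implementation would need to decide how to treat access errors.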

In a second pass, the potential duplicates are checked for actual equality. This time the complete content is compared, represented by an [MD5 Hash]. Files that are still equal are again returned as an IDublette [1].
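
A minimal sketch of the second pass, assuming each candidate is simply a list of file paths. CandidateChecker, Check and Md5Of are hypothetical names. Within each candidate the files are regrouped by their MD5 digest, so one candidate can yield several groups of actual duplicates, or none at all.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class CandidateChecker
{
    // Returns the MD5 digest of the file's complete content as a hex string.
    private static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }

    // Splits every candidate into groups of files with identical digests
    // and keeps only the groups that contain at least two files.
    public static List<List<string>> Check(IEnumerable<IEnumerable<string>> candidates)
    {
        return candidates
            .SelectMany(candidate => candidate
                .GroupBy(path => Md5Of(path))
                .Where(group => group.Count() > 1)
                .Select(group => group.ToList()))
            .ToList();
    }
}
```

Because the digest is computed from a stream, large files are not loaded into memory as a whole. Equal MD5 digests are treated as equal content here, as the kata describes.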

Variation #1

Extend the duplicate-check interface with a mechanism for reporting the progress of its methods.
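
One way to approach this variation, shown only as a sketch, is to pass an optional IProgress<T> from the .NET base class library into the methods. The ScanProgress type, the ProgressDemo class and the Walk method are assumptions made for illustration; the kata does not prescribe any particular mechanism.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical progress payload; not part of the kata's contract.
public sealed class ScanProgress
{
    public int FilesVisited { get; set; }
    public string CurrentPath { get; set; }
}

public static class ProgressDemo
{
    // Walks the directory tree lazily and reports every visited file
    // to the optional progress receiver.
    public static IEnumerable<string> Walk(string root, IProgress<ScanProgress> progress = null)
    {
        int visited = 0;
        foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            visited++;
            progress?.Report(new ScanProgress { FilesVisited = visited, CurrentPath = path });
            yield return path;
        }
    }
}
```

A caller could pass new Progress<ScanProgress>(p => Console.WriteLine(p.FilesVisited)). Note that Progress<T> invokes its callback on the captured synchronization context or on the thread pool, so reports may arrive asynchronously. Since Walk is an iterator, nothing is reported until the result is enumerated.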

Resources

[MD5 Hash] http://de.wikipedia.org/wiki/Message-Digest_Algorithm_5

Endnotes

[1] Bear in mind that a single list of candidates may contain several groups of actual duplicates.
