Detecting Duplicate Code

I’ve recently stumbled across a tool called CCFinder that detects duplicate code (aka code clones). It’s definitely not the simplest thing in the world to download and install, and it’s a little bit painful to exclude generated files (it has to be done manually), but it does make finding repeated code a lot easier than with a tool like Simian. Why? Because you get a visual indication of where the duplicates are and can see the code itself. Plus you have greater control over the rules used to detect duplication.

Now before I run through things, it should be noted that I’d expect to see some level of "detected duplication" in any non-trivial code base. Sometimes one part of the code will have a similar structure to other parts even though they're performing different functions, sometimes you'll have generated code, and sometimes avoiding duplication adds more complexity than it is worth. Even so, the amount of duplication in any code base should be kept as small as possible, as it keeps your code maintainable and means that fixes and changes only need to be made once. There's nothing worse than fixing a bug in one part of the code without realising the same code exists elsewhere and also needs fixing.

To give you an idea of what CCFinder shows, have a look at the following screenshots:

This first picture shows code clone metrics per file and a visual indication of where the code clones are.

In the left panel (the file listing) various metrics are shown per file:

LEN: File length (in tokens – variable names, method calls, etc)
CLN: Number of Code Clones
NBR: Neighbors – Number of other files that share a code clone with this file
RSA: Ratio of Similarity to another file. Lower is better.
RSI: Ratio of Similarity within the file. Lower is better.
CVR: Coverage – Percentage of tokens covered by a code clone (an indication of how much of the code is duplicated)
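
To make a couple of those metrics concrete, here's a rough Python sketch of how NBR and CVR could be calculated from a list of detected clone pairs. This is not CCFinder's code or its output format. The clone_metrics function, the (file, start, end) token ranges and the file names are all made up for illustration, and the remaining metrics are left out because I'd only be guessing at their exact definitions.

```python
from collections import defaultdict


def clone_metrics(file_lengths, clone_pairs):
    """Compute LEN, NBR and CVR per file.

    file_lengths: {file name: length in tokens}
    clone_pairs:  list of ((file_a, start_a, end_a), (file_b, start_b, end_b)),
                  where start/end are token offsets (end is exclusive).
    Both shapes are assumptions for this sketch, not a CCFinder format.
    """
    covered = defaultdict(set)     # token positions that sit inside some clone
    neighbours = defaultdict(set)  # other files sharing a clone with this file

    for (file_a, start_a, end_a), (file_b, start_b, end_b) in clone_pairs:
        covered[file_a].update(range(start_a, end_a))
        covered[file_b].update(range(start_b, end_b))
        if file_a != file_b:
            neighbours[file_a].add(file_b)
            neighbours[file_b].add(file_a)

    metrics = {}
    for name, length in file_lengths.items():
        coverage = 100.0 * len(covered[name]) / length if length else 0.0
        metrics[name] = {"LEN": length, "NBR": len(neighbours[name]), "CVR": coverage}
    return metrics


if __name__ == "__main__":
    lengths = {"a.cs": 100, "b.cs": 50, "c.cs": 20}
    pairs = [
        (("a.cs", 0, 10), ("b.cs", 5, 15)),   # clone shared between two files
        (("a.cs", 5, 20), ("a.cs", 40, 55)),  # clone within a single file
    ]
    for name, values in clone_metrics(lengths, pairs).items():
        print(name, values)
```

Overlapping clones are only counted once towards coverage, which is why the sketch collects token positions in a set rather than adding up clone lengths.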

In the right hand panel we have a visual indicator of where the clones are. The long diagonal line can be thought of as a mirror line and the black marks on each side of that diagonal are the clones. The large boxes are directory boundaries, so we can see which directories have more duplication than others.
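
If you're wondering why the marks mirror each other across the diagonal, here's a small Python sketch of the general idea behind this kind of plot. It compares a token sequence against itself and marks every position where a run of matching tokens starts, so a match at (row, column) always has a twin at (column, row). It's a toy illustration only: the dot_plot function and its min_run parameter are my own invention, it works on a single token list rather than a set of files and directories, and it makes no attempt to match how CCFinder actually draws things.

```python
def dot_plot(tokens, min_run=3):
    """Return a square grid of characters comparing tokens with themselves.

    '\\' marks the main diagonal, '#' marks cells that belong to a run of
    at least min_run matching tokens, '.' marks everything else.
    """
    n = len(tokens)
    grid = [["." for _ in range(n)] for _ in range(n)]
    for i in range(n):
        grid[i][i] = "\\"

    for i in range(n):
        for j in range(n):
            if i == j or tokens[i] != tokens[j]:
                continue
            # Only start from the first cell of a run, so each run is handled once.
            if i > 0 and j > 0 and tokens[i - 1] == tokens[j - 1]:
                continue
            run = 0
            while i + run < n and j + run < n and tokens[i + run] == tokens[j + run]:
                run += 1
            if run >= min_run:
                for step in range(run):
                    grid[i + step][j + step] = "#"
    return grid


if __name__ == "__main__":
    sample = "a b c x a b c y".split()
    for row in dot_plot(sample):
        print("".join(row))
```

Running it on the sample prints a short diagonal streak of # marks on each side of the main diagonal, which is a miniature version of the black marks in the screenshot.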

We can also use Source view to see the duplicates between files. For example, this shows a SetDateLabel() method in two different files where the code differs only by the parameter being used. It would be a great refactoring candidate.
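
A common way for token-based clone detectors to catch code that differs only in names is to replace identifiers with a placeholder before comparing. Here's a minimal Python sketch of that idea. I'm not claiming this is what CCFinder does internally. The normalise function, the tiny keyword list and the two sample statements are all hypothetical (they are not the real SetDateLabel() code from the screenshot).

```python
import re

# A deliberately tiny keyword list, just enough for the illustration.
KEYWORDS = {"if", "else", "for", "while", "return", "new", "null", "void"}

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")  # identifier, number, or any single symbol
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalise(source):
    """Split source into tokens, replacing identifiers and numbers with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if IDENT_RE.fullmatch(tok) and tok not in KEYWORDS:
            tokens.append("$id")
        elif tok.isdigit():
            tokens.append("$num")
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    # Two made-up statements that differ only in the variable being passed.
    first = "label.Text = Format(startDate);"
    second = "label.Text = Format(endDate);"
    print(normalise(first))
    print(normalise(first) == normalise(second))
```

Once the names are replaced, the two statements produce identical token lists, so a detector comparing token sequences would treat them as the same code. The trade-off is that aggressive normalisation also produces more false positives, which is part of why I'd expect some "detected duplication" in any code base.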

If it's not obvious, the section between the file listing and the code windows is a visual indicator of which file sections the two source windows are showing, and where the duplicated code is within those files.

You can also see code clones within the same file:

I've only used it for a short while, but I'm finding it to be very, very useful. If you can work through the crappy web site and the awful download/registration process, hopefully you'll find it just as useful.

If you want the software, it’s free but you will need to register to get a license. Go to http://www.ccfinder.net/index.html and download it from there.