Feb 11, 2008

Detecting Duplicate Code

I’ve recently stumbled across CCFinder, a tool that detects duplicate code (aka code clones).  It’s definitely not the simplest thing in the world to download and install, and it’s a little bit painful to exclude generated files (it has to be done manually), but it does make finding repeated code a lot easier than with a tool like Simian.  Why?  Because you get a visual indication of where the duplicates are and can see the code itself.  Plus you have greater control over the rules used to detect duplication.

Now before I run through things, it should be noted that I’d expect to see some level of "detected duplication" in any non-trivial code base.  There are often scenarios where one part of the code has a similar structure to other parts even though they're performing different functions, or you'll have generated code, or there are situations where avoiding duplication adds more complexity than it is worth.  Even so, the amount of duplication in any code base should be kept as small as possible, as that keeps your code maintainable and means that fixes and changes only need to be made once.  There's nothing worse than fixing a bug in one part of the code without realising the same code exists elsewhere and also needs fixing.

To give you an idea of what CCFinder shows, have a look at the following screen shots:

This first picture shows code clone metrics per file and a visual indication of where the code clones are.

In the left panel (the file listing) various metrics are shown per file:

  • LEN: File length (in tokens – variable names, method calls, etc)
  • CLN: Number of Code Clones
  • NBR: Neighbors – Number of other files that share a code clone in this file
  • RSA: Ratio of Similarity to another file.  Lower is better.
  • RSI: Ratio of Similarity within the file. Lower is better.
  • CVR: Coverage – Percentage of tokens covered by a code clone (an indication of how much of the code is duplicated)
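
To make these metrics a little more concrete, here is a rough Python sketch of how numbers like CLN, NBR and CVR could be derived from a list of clone pairs.  The Fragment layout and the file_metrics function are made up for illustration; this is not CCFinder's file format or its actual algorithm, and CLN here simply counts distinct cloned fragments per file.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """Tokens [start, end) of one file (hypothetical layout, not CCFinder's)."""
    file: str
    start: int
    end: int


def file_metrics(file_lengths, clone_pairs):
    """Return {file: {"LEN", "CLN", "NBR", "CVR"}} from (Fragment, Fragment) pairs."""
    covered = defaultdict(set)     # file -> token positions inside any clone
    fragments = defaultdict(set)   # file -> distinct cloned fragments
    neighbours = defaultdict(set)  # file -> other files sharing a clone

    for a, b in clone_pairs:
        for this, other in ((a, b), (b, a)):
            covered[this.file].update(range(this.start, this.end))
            fragments[this.file].add((this.start, this.end))
            if other.file != this.file:
                neighbours[this.file].add(other.file)

    return {
        name: {
            "LEN": length,
            "CLN": len(fragments[name]),
            "NBR": len(neighbours[name]),
            # Fraction of the file's tokens that sit inside at least one clone
            "CVR": len(covered[name]) / length if length else 0.0,
        }
        for name, length in file_lengths.items()
    }


# Made-up example data: file lengths in tokens and two clone pairs
lengths = {"a.cs": 100, "b.cs": 50, "c.cs": 40}
pairs = [
    (Fragment("a.cs", 0, 20), Fragment("b.cs", 10, 30)),   # clone across files
    (Fragment("a.cs", 10, 30), Fragment("a.cs", 60, 80)),  # clone within a file
]
metrics = file_metrics(lengths, pairs)
for name, values in metrics.items():
    print(name, values)
```

Overlapping clones are only counted once towards CVR, which is why a set of token positions is used rather than a running total.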

In the right hand panel we have a visual indicator of where the clones are.  The long diagonal line can be thought of as a mirror line and the black marks on each side of that diagonal are the clones.  The large boxes are directory boundaries, so we can see which directories have more duplication than others.
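
If the idea of the scatter plot is unfamiliar, the following Python sketch builds a tiny text version of one.  It is an assumption about the general technique (a token "dot plot"), not CCFinder's implementation: the dot_plot function and the min_run parameter are invented for this example.  Comparing a token sequence with itself always produces the full diagonal, and any repeated run shows up as a short mark mirrored on both sides of it.

```python
def dot_plot(tokens_a, tokens_b, min_run=3):
    """Mark every cell that lies on a diagonal run of >= min_run equal tokens."""
    n, m = len(tokens_a), len(tokens_b)
    grid = [[False] * m for _ in range(n)]
    # run[i + 1][j + 1] = length of the matching run ending at cell (i, j)
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if tokens_a[i] == tokens_b[j]:
                length = run[i][j] + 1
                run[i + 1][j + 1] = length
                if length >= min_run:
                    # Mark the whole run back along the diagonal
                    for k in range(length):
                        grid[i - k][j - k] = True
    return grid


def render(grid):
    """Draw the grid with '#' for a clone mark and '.' for everything else."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


# A made-up token stream in which "a b c" appears twice
tokens = "a b c x a b c y".split()
grid = dot_plot(tokens, tokens, min_run=3)
print(render(grid))
```

The quadratic loop is fine for a toy example but would be far too slow for a real code base; it is only meant to show where the marks come from.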


We can also use Source view to see the duplicates between files.  For example, this shows a SetDateLabel() method in two different files where the code differs only in the parameter being passed.  It would be a great refactoring candidate.

If it's not obvious, the section between the file listing and the code windows is a visual indicator of which file sections the two source windows are showing, and where the duplicated code is within those files.
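
As a rough illustration of the kind of refactoring this points to, here is a small Python sketch.  The real method bodies aren't reproduced in this post, so every name and body below is invented; the only point is that when two copies differ by a single value, that value can become a parameter of one shared helper.

```python
from datetime import date


# Before: the same logic copied into two files, differing only in the
# value each copy works on (hypothetical bodies, not the real code).
def set_start_date_label(labels, start_date):
    labels["start"] = start_date.isoformat() if start_date else "(not set)"


def set_end_date_label(labels, end_date):
    labels["end"] = end_date.isoformat() if end_date else "(not set)"


# After: one shared helper; the difference becomes a parameter.
def set_date_label(labels, key, value):
    labels[key] = value.isoformat() if value else "(not set)"


before, after = {}, {}
set_start_date_label(before, date(2008, 2, 11))
set_end_date_label(before, None)
set_date_label(after, "start", date(2008, 2, 11))
set_date_label(after, "end", None)
print(before == after)
```

With the helper in one place, a later fix to the formatting only has to be made once.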


You can also see code clones within the same file:


I've only used it for a short while, but I'm finding it to be very, very useful.  If you can work through the crappy web site and the awful download/registration process, hopefully you'll find it just as useful.

If you want the software, it’s free but you will need to register to get a license.  Go to http://www.ccfinder.net/index.html and download it from there.


  1. Any idea about the relationship between Minimum Clone Length and Minimum TKS when configuring clone detection options? Together these two determine what counts as a clone, but I don't have a clue how they relate.

  2. Hi ,

    I was trying to run CCFinder but it seems I am getting a licencekey invalid error.

    I have:
    1)CCFinderX ver. 10.2.5 for WinXP
    2) java version "1.5.0_07"
    3)python 2.5
    4)licensedata.eml file in bin directory.

    Any idea why it is still giving the error?

  3. Actually it is free for non-commercial and/or educational purposes.

    Please refer to the license agreement:

  4. Hi Richard,

    If you are interested in code duplication detection you can also check out SolidSDD. It has great presentation features, very easy to configure, and it is free as well for educational and OSS projects.

    See: Source Code Duplication Detector (SolidSDD)

  5. @Lucian Thanks. Don't you think the price is rather high for a tool like that though?

  6. I guess the price of SolidSDD makes it less accessible to freelance developers. On the other hand it all depends on how much work it saves you. Even for a small company that can turn out to be very acceptable. Besides that, if you are using it on Open Source you can get it for free.
    It might, however, be a good idea for the SolidSDD developers to come up with a version targeted at individual/freelance developers as well.

  7. See CloneDR for a tool that finds duplicates according to the source language structure rather than text, so it isn't fooled by reformatting, variable name changes, etc. Several sample clone detection reports are available at the site.

  8. Hi Richard,
    I tried to open this page, but it takes minutes to load. The CSS from dropbox.com couldn't be loaded, so the site looks like plain text right now.
    Nevertheless, thanks for the review. I'm also looking for a tool that can detect code clones in PHP ...
    (after the Preview the css could be loaded!)