
Over half of GitHub is duplicate code

One of GitHub’s missions is to share code, so it’s no surprise to find that up to 70% of the code on its platform is duplicated. Measuring duplicate code on GitHub was not the original plan, however. An international team of eight researchers set out to look at file differences between cloned repositories, and found such an alarming percentage of file-level copies that it changed the direction of the research.

X = files, Y = commits, Color = dupes. Source: DéjàVu: A Map of Code Duplicates on GitHub, Lopes et al., ACM

 

Researchers found that only 85 million of the 428 million files on GitHub are unique. The study was published at the OOPSLA SPLASH conference. JavaScript is the most heavily cloned language: 94% of JavaScript files are copies. For C++ the figure is 73%, and for Python it is 71%. Java has the largest share of unique files, but even there the duplication rate reaches 40%.
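To make "file-level copy" concrete, here is a minimal sketch of how such a count could be made. It assumes that two files are duplicates only when their bytes are identical, which is a simplification chosen for illustration and not necessarily the method used in the study. The function name `duplicate_share` and the toy corpus `sample_files` are hypothetical.

```python
import hashlib
from collections import defaultdict


def duplicate_share(files):
    """Return (distinct_contents, total_files, copy_fraction).

    `files` maps a path to the file's raw bytes. Files count as copies of
    each other only if their contents are byte-for-byte identical
    (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for path, content in files.items():
        # Identical contents produce the same digest and land in one group.
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)

    total = len(files)
    distinct = len(groups)
    # Every file beyond the first in a group is counted as a copy.
    fraction = (total - distinct) / total if total else 0.0
    return distinct, total, fraction


# Hypothetical toy corpus: one utility file vendored into three repositories.
sample_files = {
    "repo_a/util.js": b"export const id = x => x;\n",
    "repo_b/vendor/util.js": b"export const id = x => x;\n",
    "repo_c/lib/util.js": b"export const id = x => x;\n",
    "repo_a/main.py": b"print('hello')\n",
    "repo_b/app.py": b"print('goodbye')\n",
}

distinct, total, fraction = duplicate_share(sample_files)
print(f"{distinct} distinct contents among {total} files; {fraction:.0%} are copies")
```

Hashing the contents keeps the comparison linear in the number of files, since each file is read once and never compared pairwise against the others. A byte-level hash would miss copies that differ only in whitespace or comments, so a real measurement would likely need some normalization first.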

Reference: The Register