Article
Detecting similar repositories on GitHub
SANER 2017: Proceedings of 24th IEEE International Conference on Software Analysis, Evolution and Reengineering: Klagenfurt, Austria, February 20-24, 2017
  • Yun ZHANG
  • David LO, Singapore Management University
  • PAVNEET SINGH KOCHHAR, Singapore Management University
  • Xin XIA
  • Quanlai LI
  • Jianling SUN
Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
2-2017
Abstract

GitHub contains millions of repositories, many of which are similar to one another (i.e., having similar source code or implementing similar functionality). Finding similar repositories on GitHub can help software engineers reuse source code, build prototypes, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism. Previous studies have proposed techniques to detect similar applications by analyzing API usage patterns and software tags. However, these prior studies either make use of only a limited source of information or use information not available for projects on GitHub. In this paper, we propose a novel approach that can effectively detect similar repositories on GitHub. Our approach is designed based on three heuristics leveraging two data sources (i.e., GitHub stars and readme files) which are not considered in previous work. The three heuristics are: repositories whose readme files contain similar content are likely to be similar to one another, repositories starred by users of similar interests are likely to be similar, and repositories starred together within a short period of time by the same user are likely to be similar. Based on these three heuristics, we compute three relevance scores (i.e., readme-based relevance, stargazer-based relevance, and time-based relevance) to assess the similarity between two repositories. By integrating the three relevance scores, we build a recommendation system called RepoPal to detect similar repositories. We compare RepoPal to a prior state-of-the-art approach, CLAN, using one thousand Java repositories on GitHub. Our empirical evaluation demonstrates that RepoPal achieves a higher success rate, precision, and confidence than CLAN.
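
The three relevance scores named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration and is not RepoPal's actual method: readme-based relevance is approximated by cosine similarity over word counts, stargazer-based relevance by Jaccard overlap of stargazer sets, time-based relevance by an inverse function of the gap (in days) between one user's stars on the two repositories, and the three scores are combined with a weighted sum. All function names, parameters, and weights are hypothetical.

```python
import math
from collections import Counter

SECONDS_PER_DAY = 86400


def readme_relevance(readme_a: str, readme_b: str) -> float:
    """Cosine similarity of word-count vectors (assumed stand-in, not the paper's formula)."""
    va = Counter(readme_a.lower().split())
    vb = Counter(readme_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stargazer_relevance(stargazers_a: set, stargazers_b: set) -> float:
    """Jaccard overlap of stargazer sets (assumed proxy for 'users of similar interests')."""
    union = stargazers_a | stargazers_b
    return len(stargazers_a & stargazers_b) / len(union) if union else 0.0


def time_relevance(stars_a: dict, stars_b: dict) -> float:
    """Mean of 1 / (1 + gap in days) over users who starred both repositories.

    stars_a and stars_b map a user name to the Unix timestamp of that user's star.
    The decay function is an assumption chosen only to reward short gaps.
    """
    common = stars_a.keys() & stars_b.keys()
    if not common:
        return 0.0
    total = sum(
        1.0 / (1.0 + abs(stars_a[u] - stars_b[u]) / SECONDS_PER_DAY) for u in common
    )
    return total / len(common)


def combined_relevance(readme_a, readme_b, stars_a, stars_b,
                       weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three scores (the integration rule is an assumption)."""
    w_readme, w_star, w_time = weights
    return (
        w_readme * readme_relevance(readme_a, readme_b)
        + w_star * stargazer_relevance(set(stars_a), set(stars_b))
        + w_time * time_relevance(stars_a, stars_b)
    )
```

Under these assumptions, ranking candidate repositories by `combined_relevance` against a query repository would yield a list of likely similar repositories; a real system would need the paper's own definitions of each score.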

Keywords
  • Recommendation System
  • Similar Repositories
  • GitHub
  • Information Retrieval
  • Search Engines
ISBN
9781509055012
Identifier
10.1109/SANER.2017.7884605
Publisher
IEEE
City or Country
Piscataway, NJ
Creative Commons License
Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International
Additional URL
http://doi.org/10.1109/SANER.2017.7884605
Citation Information
Yun ZHANG, David LO, PAVNEET SINGH KOCHHAR, Xin XIA, et al. "Detecting similar repositories on GitHub" SANER 2017: Proceedings of 24th IEEE International Conference on Software Analysis, Evolution and Reengineering: Klagenfurt, Austria, February 20-24, 2017 (2017) p. 1-10
Available at: http://works.bepress.com/david_lo/238/