Article
Cataloging GitHub repositories
EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16
  • Abhishek SHARMA, Singapore Management University
  • Ferdian THUNG, Singapore Management University
  • Pavneet Singh KOCHHAR, Singapore Management University
  • Agus SULISTYA, Singapore Management University
  • David LO, Singapore Management University
Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2017
Abstract

GitHub is one of the largest and most popular repository hosting services today, with about 14 million users and more than 54 million repositories as of March 2017. This makes it an excellent platform for finding projects that developers are interested in exploring. GitHub showcases its most popular projects by manually cataloging them into categories such as DevOps tools, web application frameworks, and game engines. We propose that such cataloging should not be limited to popular projects. We explore the possibility of developing such a cataloging system by automatically extracting functionality-descriptive text segments from the readme files of GitHub repositories. These descriptions are then input to LDA-GA, a state-of-the-art topic modeling algorithm, to identify categories. Our preliminary experiments demonstrate that additional meaningful categories, which complement existing GitHub categories, can be inferred. Moreover, for inferred categories that match GitHub categories, our approach can identify additional projects belonging to them. Our experimental results establish a promising direction toward realizing an automatic cataloging system for GitHub.
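
The first step described in the abstract, pulling a functionality-descriptive text segment out of a readme file, might look roughly like the sketch below. It is only an illustration under stated assumptions: the heuristic (take the first prose paragraph, skipping headings, badges, and fenced code), the stop-word list, the function names `first_description` and `bag_of_words`, and the sample readme are all hypothetical and are not taken from the paper. The resulting word counts are the kind of input a topic model such as LDA would consume; LDA-GA itself is not implemented here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "to", "is", "in", "that", "with", "it"}


def first_description(readme: str) -> str:
    """Return the first prose paragraph of a readme.

    Assumed heuristic: skip headings, badge/image lines, and fenced code
    blocks, then collect consecutive prose lines until the paragraph ends.
    """
    paragraph = []
    in_fence = False
    for line in readme.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            if paragraph:
                break  # a code fence ends the paragraph being collected
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        is_noise = stripped.startswith(("#", "![", "[!["))
        if not stripped or is_noise:
            if paragraph:
                break  # blank line or heading ends the paragraph
            continue
        paragraph.append(stripped)
    return " ".join(paragraph)


def bag_of_words(text: str) -> Counter:
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)


# Made-up sample readme; the project name and URLs are placeholders.
SAMPLE_README = """# tinyweb

[![build](https://example.invalid/badge.svg)](https://example.invalid)

A lightweight web application framework for
building REST services.

## Install

    pip install tinyweb
"""

description = first_description(SAMPLE_README)
word_counts = bag_of_words(description)
```

In a fuller pipeline, one such bag of words per repository would form the document collection handed to the topic model, and the inferred topics would then be inspected as candidate categories.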

Keywords
  • GitHub
  • Latent Dirichlet Allocation
  • Genetic Algorithm
ISBN
9781450348041
Identifier
10.1145/3084226.3084287
Publisher
Association for Computing Machinery
City or Country
Karlskrona
Creative Commons License
Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International
Additional URL
http://doi.org./10.1145/3084226.3084287
Citation Information
Abhishek SHARMA, Ferdian THUNG, Pavneet Singh KOCHHAR, Agus SULISTYA, et al. "Cataloging GitHub repositories" EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16 (2017) p. 314 - 319
Available at: http://works.bepress.com/david_lo/212/