This is the second semester of working with GitHub, though this semester's project differs from the last. Whereas last semester the team built a fast classifier that identifies the programming language of raw code, this semester's project is larger in scope. GitHub has over 40 million public repositories from which to compose a dataset, and one novel and useful way to exploit this data is to analyze the code people have written and determine what makes well-written, modular code good and bad code unmaintainable and buggy. The ultimate goal of this project is to better understand what it means for a code base to be successful.
The problem statement is open-ended; there are two ways to frame it: (1) design a well-defined metric of code quality and implement it to categorize, cluster, or rank GitHub repos, or (2) start from initial correlates of successful code, such as repo metadata (e.g., number of stars, amount of activity), and extrapolate to find markers that indicate good or bad code. In either framing, the intention is to apply the resulting model to predict the success of new projects whose code is not in the training set.
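Under framing (2), a success label would have to be derived from repo metadata before any model is trained. The sketch below is one hypothetical way to do that: the `RepoMetadata` class, the field names, and the thresholds are all assumptions for illustration, not part of the project plan, and real thresholds would be tuned against the actual distribution of repos.

```python
from dataclasses import dataclass

@dataclass
class RepoMetadata:
    stars: int              # popularity proxy
    commits_last_year: int  # activity proxy

def success_label(meta: RepoMetadata,
                  star_threshold: int = 100,
                  activity_threshold: int = 50) -> int:
    """Binary 'successful' label derived from repo metadata.

    A repo counts as successful only if it is both popular (stars)
    and actively maintained (recent commits). The thresholds are
    placeholder values chosen for illustration.
    """
    popular = meta.stars >= star_threshold
    active = meta.commits_last_year >= activity_threshold
    return int(popular and active)
```

A label like this is noisy (stars measure attention, not code quality), which is exactly why the framing calls for extrapolating from such correlates to markers in the code itself.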
At the end of the semester, the team will deliver a model that takes raw code and outputs some metric of how well written it is, such as a score, rank, or class. The first phase, and likely a large chunk of the project, will therefore involve defining a success metric and a corresponding featurization for raw code. The second phase will entail sketching the model architecture, then implementing and training it.
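To make the featurization phase concrete, here is a toy sketch of turning raw code into a feature vector. The specific features (comment density, average line length, blank-line ratio) are assumptions for illustration; a real feature set for this project would be far richer, drawing on AST structure, naming, and modularity metrics.

```python
def featurize(source: str) -> list[float]:
    """Map raw source code to a small vector of shallow
    readability proxies. Illustrative only: comment density,
    average line length, and blank-line ratio stand in for a
    much richer feature set.
    """
    lines = source.splitlines() or [""]
    n = len(lines)
    comment_ratio = sum(1 for l in lines if l.strip().startswith("#")) / n
    avg_line_len = sum(len(l) for l in lines) / n
    blank_ratio = sum(1 for l in lines if not l.strip()) / n
    return [comment_ratio, avg_line_len, blank_ratio]
```

A vector like this could feed directly into a classifier or ranker in the second phase, which is why the two phases are sequenced as they are.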