Integrated Visual Software Analytics on the GitHub Platform

Abstract: Readily available software analysis and analytics tools are often operated within external


Introduction
During the software development process, a large amount of data is created and stored in various software repositories. For example, changes to the code are managed in a version control system, tasks are organized in an issue tracking system, and errors that occur are documented in a bug tracking system. Software analytics uses software data analysis and information visualization techniques "to obtain insightful and actionable information from software artifacts that help practitioners accomplish tasks related to software development, systems, and users" [1]. The applications in which software analytics is used are diverse [2], e.g., effort estimation [3], social network analysis [4], or using visualization to support program comprehension tasks [5][6][7]. Of particular relevance is the analysis of git repositories [8], as a widely used type of repository, and GitHub as a popular social coding platform [9]. Various platforms have been developed to provide software analytics services to stakeholders [10][11][12]. These analytics services either integrate directly into the Continuous Integration (CI) pipeline or are operated externally [13]. In both cases, only a higher-level view on the analysis results is reported back to the developer by means of a review comment, a dashboard overview, or a visualization on the service's side. On the other hand, there are low-level tools available for direct use 1, but they are usually operated within those analytics services or their results are only used at a higher level. While the techniques and tools are available for open source and industry projects, the processing steps as well as the storage of software analysis data are usually considered separate from the source code repository. For example, the source code of an open source project like Angular can be hosted on GitHub and built using GitHub Actions [14], but software analysis is performed through external services with external storage: here, GitHub CodeQL 2 and
OpenSSF Scorecards 3. Using readily available, external services allows for easy-to-integrate software analysis, but the analysis results are kept internally by the operators of those services; an association of source code with the derived analysis data is not considered. This comes with a number of limitations on the availability and reusability of those software data. For one, the performed analyses are I/O-intensive, implementation-specific, and usually time-consuming, as whole software projects and further software data repositories are parsed and analyzed. Second, the derived data is not externally available for further processing and visualization. Third, using external services limits the available analyses by means of mining tools, software metrics, and higher-level analyses and reports. The latter two impede easy access to "resources and tools needed for practitioners to experiment and use MSR techniques on their repositories" [15]. Last, this unavailability of the analysis data for third parties leads to repeated computation of such analyses, as there is broad interest in software measurements, e.g., by the Mining Software Repositories community and for software quality assurance and modern development processes and practices. To summarize, the current state of the art has the following limitations:

1. Readily available software analytics tools are often operated as external services,
2. where measured software analysis data is kept internal,
3. and no external use of the data is available.
We propose an approach to derive software analysis data during the execution of a project's CI pipeline and store the results within its source code repository. This approach is exemplified using GitHub and GitHub Actions together with an exemplary set of static source code complexity metrics. For this, we propose a default component to run for software analysis, such that software metrics are computed and stored on a per-commit basis. As an accessible storage location, we use the git object database and mirror the commit graph structure to augment existing commits with software analysis data. We use the GitHub API to store the software analysis data within the git repository. This data can later be used for further software analyses and software visualization (Figure 1). Although CI and GitHub Actions are often used to ensure quality and thus approachability of a project, using them to provide a form of public self-representation whose underlying data is reusable is underrepresented [16,17]. We validate our approach with a case study on 64 open source GitHub projects written in TypeScript and show the performance impact on the CI and the memory impact on the git repository. Last, we discuss the approach in the context of the diverse set of open source projects, different development environments, and analysis scenarios.
The remainder of this paper is structured as follows. Section 2 introduces related work. In Section 3, we present our approach and prototypical implementation for integrated software analytics. In Section 4, we describe our case study and evaluation of run-time performance and memory overhead. We discuss the approach in Section 5, focusing on limitations and extensibility. In Section 6, we conclude this work.

Related Work
Software analysis has become a standard activity during software development that is usually executed as part of the CI pipeline. Thereby, the activity can be decomposed into several phases: (1) software repository mining, (2) optional intermediate storage, and (3) communication of the results. Specific to our proposed approach, the corresponding related work can be categorized into (1) tools for mining software repositories, (2) software metric storage and storage formats, and (3) software visualization. As the overall process targets an integration of software analytics into the GitHub platform, general software analytics systems are related work as well.

Tools for Mining Software Repositories
Version control systems, such as git, enable collaborative work on software projects. All activities and the entire history of a project are stored in a repository, which provides much information for further analysis. Example applications for analyzing git repositories include capturing static and dynamic software metrics [18][19][20], locating expertise among developers [21], or measuring environmental sustainability [22]. The extraction of relevant data requires efficient processing tools, e.g., for compiling software metrics [23]. An example of such a tool is PyDriller, which allows efficient extraction of software metrics from a git repository [24]. By combining different optimizations, e.g., in-memory storage and caching, pyrepositoryminer provides an alternative tool that shows better performance. Other examples with different aspects of variation are (1) ModelMine [25], a tool focusing on mining model-based artifacts, (2) GitcProc [26], a tool based on regular expressions for extracting fine-grained source code information, (3) Analizo [27], a tool with support for object-oriented metrics in multiple languages, (4) LineVul [28], an approach for predicting vulnerability within source code, and (5) srcML [29], an infrastructure for the exploration, analysis, and manipulation of source code.
In addition to efficiently processing individual projects, it is often necessary to process entire collections of projects, for example, to generate data for training ML procedures. One of the first attempts to make data from GitHub accessible for research is Boa [30]. Besides the infrastructure, it provides a domain-specific language and web-based interface to enable researchers to analyze GitHub data. Similarly, GHTorrent provides an infrastructure for generating datasets from GitHub [31], which can further be made available for local storage [32]. An infrastructure that also provides a frontend is given by SmartSHARK [33]. A technical hurdle in crawling large datasets from GitHub is the limitation of API requests. Crossflow addresses this problem through a distributed infrastructure [34]. Besides source code, other software repositories, e.g., issue tracking systems or mailing lists, are also suitable for collecting information for subsequent analyses [35].

Metric Storage Formats
Source code metrics and similar software analyses, which are directly derived from recorded software data, are often cached or stored after computation. This is feasible as such metrics and analyses are deterministic, and desirable as their computation can be time- and memory-intensive. For such storage, state-of-the-art approaches are applicable and usually chosen based on structural complexity, amount of data, and a developer's personal preference [36]. As a result, there is a broad diversity in used data models, storage systems, and formats. With a file focus, the common formats XML [37], ARFF [38], CSV [39], and JSON, more specifically JSONL [40], are used as well. Specific to the Moose system, there is also the MSE file format to store static source code metrics [41]. As a standardized format for static source code analysis results, there is the SARIF 4 file format that is also used by GitHub for their security dashboard. These approaches are not strictly used in isolation, but can be used in combination as well [11,42]. Although stored as files, for subsequent analyses in individual MSR use cases, these metrics are further gathered and stored in dedicated databases [43]. For example, relational databases such as Postgres are used by projects such as source{d} 5 and Sonarqube 6.

Software Visualization
For the observation of recorded metrics by a user, they can be depicted using a table-structured representation. However, this approach does not scale for even mid-sized projects [44]. As software itself has no intrinsic shape or gestalt, the area of software visualization provides techniques for representing software projects' structure, behavior, or evolution to support the stakeholders in different program comprehension tasks. In many cases, the layout of a visualization is derived from a project's folder hierarchy [45], e.g., when using treemaps [46]. Software metrics can be mapped onto the visual attributes of treemaps, e.g., texture, color, and size [47]. In particular, 2.5D treemaps provide further visual attributes, which motivates their use for exploring large software projects by means of code cities [48], software cities [49], or software maps [5]. Besides hierarchy-preserving visualizations, layouts can also be generated based on the semantic composition of software projects [50,51]. In this case, abstract concepts in the source code are captured by applying a topic model, which results in a high-dimensional representation of each source code file. The local and global structures within the high-dimensional representation are captured in a two-dimensional scatter plot after applying dimensionality reduction techniques. By enriching the visualization with cartographic metaphors or the placement of glyphs, software metrics can be mapped in the visualization.

Software Analytics Systems
Various Software as a Service (SaaS) platforms have been developed to gain insights from the development process and support developers in their work. Thereby, the intended use case is either (1) software analytics for a single project or (2) software repository mining for a large set of projects. The former use case is supported by platforms such as Sonarqube and the source{d} Community Edition. The latter use case is supported by research platforms such as MetricMiner [52] and GrimoireLab [53]. For metrics already measured by GitHub, there is also Google BigQuery for GitHub 7, which allows accessing the data using an SQL interface. Last, there are some software analytics platforms that are intended to be used for both use cases, serving both researchers and software developers, such as Microsoft CODEMINE [11]. Another example is Nalanda, which comprises a socio-technical graph and index system to support expert and artifact recommendation tasks [12]. As the main demarcation from readily available tools, infrastructures, and full-featured, external software analytics services, we propose an extension to visual software analytics by means of an integrated approach within the GitHub platform.

Approach
Our proposed approach consists of two components: software analysis and software visualization. The software analysis component builds upon GitHub Actions to provide per-commit software analysis while storing the results as blobs in the git object database of a project. The results are available for further processing and visualization for internal and external use cases, e.g., software visualization (Figure 1). Our software visualization demonstrator is implemented as a web application that fetches the analyzed data and renders it in an interactive software map client.

Process Overview
Both the analysis and the visualization operate in an isolated manner with a shared point of interaction: the git repository of the software project on GitHub (Figure 2). The analysis component integrates into the GitHub CI process and the visualization component integrates into web pages, e.g., hosted by GitHub Pages. The overall process is split into phases matching the two components and is summarized as follows: the analysis phase including storage of the results (1-3), and the visualization phase (4-5). The analysis phase is started when a developer creates and pushes a commit to the git repository, starting its CI (1). After project-specific analysis (2), the software analytics data is added to the repository as git blob objects (3). This allows annotating each commit of a repository with project-specific software analysis data, such as source code metrics. Later, this data can be queried and fetched by a client component (4) and used for visualizing the software project (5). For example, we use the data to derive a representative visualization of a software project that can be shown to maintainers, developers, contributors, stakeholders, and visitors (examples in Figure 6). Such a visualization can be embedded into a project's landing page and serve as a self-presentation to potential new collaborators and even long-time collaborators.

Analysis
The analysis is designed to be part of a project's CI process. As such, we designed an extension to available CI processes on GitHub by means of a GitHub Action. This action is specifically designed to analyze the source code for a given commit (1), i.e., the CI can be configured to execute this action on push to a branch. The general processing approach for this action is to collect the source code, apply static source code metrics, and store the results. However, choosing metrics for analysis is highly dependent on the used programming languages, the quality goals, and available implementations. As such, we see this as a major point of variation for future work. The interface of GitHub Actions for integrating potential metrics implementations is a Docker container, which allows for a highly flexible use of available tools and custom metric implementations.

Storage
The output of the analysis component is then stored within the git repository. Such a repository could contain different types of objects, but for interoperability and available APIs we focused on files to represent software analysis data. Specific to our prototype, we use a CSV file format where each line contains the measurements for a source code file, identified by its file path. Although these metric files are created within a Docker container, this container has only read-only access to the git repository. Instead, we use the GitHub API to store these files within blobs 8. The API allows manipulating the git trees and refs using the /repos/{owner}/{repo}/git/trees and /repos/{owner}/{repo}/git/refs endpoints, respectively. This file is then committed to the git repository using a commit-specific git refs tree in the location refs/metrics/{sha} (Figure 3). This allows querying the software analysis data within the refs/metrics subtree from a given git SHA later on.
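The storage layout can be sketched as follows; the metrics interface and the CSV column names are illustrative assumptions, not the prototype's exact schema, while the ref location mirrors refs/metrics/{sha} as described above.

```typescript
// Illustrative per-file measurement record (column names are assumptions).
interface FileMetrics {
  path: string;
  loc: number;
  cloc: number;
  nof: number;
}

// Serialize measurements to the CSV format described above: one line per
// source code file, identified by its file path.
function toCsv(rows: FileMetrics[]): string {
  const header = "path,loc,cloc,nof";
  const lines = rows.map((r) => `${r.path},${r.loc},${r.cloc},${r.nof}`);
  return [header, ...lines].join("\n");
}

// Build the commit-specific ref location, mirroring refs/metrics/{sha}.
function metricsRef(commitSha: string): string {
  return `refs/metrics/${commitSha}`;
}
```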
For convenience, we create and maintain specific git refs for branches as well. The sequence of requests is as follows. We first create a tree by sending a POST request to the /repos/{owner}/{repo}/git/trees endpoint. The API's response contains the SHA-1 hash of the newly created tree. We then create a reference under refs/metrics/{sha}, storing the SHA reference to the tree. This is achieved by a further POST request to the /repos/{owner}/{repo}/git/refs endpoint. This ensures that the blob tree is retrievable for every analyzed commit. Last, we populate the tree with the CSV file.
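The payloads for the two POST requests in this sequence can be sketched as plain objects. The field names follow the GitHub REST API for the git trees and git refs endpoints; the metrics file name is an assumption, and the surrounding HTTP plumbing is omitted.

```typescript
// Payload for POST /repos/{owner}/{repo}/git/trees: a single-entry tree
// holding the metrics CSV as an inline blob (file name is an assumption).
function treePayload(csv: string) {
  return {
    tree: [
      { path: "metrics.csv", mode: "100644", type: "blob", content: csv },
    ],
  };
}

// Payload for POST /repos/{owner}/{repo}/git/refs: point the commit-specific
// metrics ref at the tree SHA taken from the previous response.
function refPayload(commitSha: string, treeSha: string) {
  return { ref: `refs/metrics/${commitSha}`, sha: treeSha };
}
```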

Visualization
The per-commit software analysis data is then available for fetching and visualization by the visualization component. This visualization is a hierarchy visualization by means of a software map, as we chose to measure software metrics per file, organized in a file tree. The data retrieval consists of multiple requests and uses the GitHub API as follows. The prototype first requests the metrics reference for a certain commit using a GET request to the endpoint /repos/{owner}/{repo}/git/refs. The retrieved tree SHA is then used to request an intermediate blob tree at the /repos/{owner}/{repo}/git/trees endpoint. This gives us a tree that stores the SHA reference to the blob containing our metrics data. This hash is then used to request the blob using another GET request, this time to the /repos/{owner}/{repo}/git/blobs endpoint. Once the blob is retrieved, the last step is to decode the base64-encoded content of the blob to retrieve the metrics content, which is stored as a CSV string.
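The final decoding step can be sketched as follows, assuming a Node environment. The blob response shape (content plus encoding fields) follows the GitHub REST API for the git blobs endpoint; the preceding GET requests are omitted here.

```typescript
// Shape of a blob response from /repos/{owner}/{repo}/git/blobs.
interface BlobResponse {
  content: string; // base64-encoded file content
  encoding: string; // "base64"
}

// Decode the base64-encoded blob content back into the CSV string.
function decodeMetricsBlob(blob: BlobResponse): string {
  if (blob.encoding !== "base64") {
    throw new Error(`unexpected encoding: ${blob.encoding}`);
  }
  // Node's Buffer handles the decoding; in a browser, atob would be used.
  return Buffer.from(blob.content, "base64").toString("utf-8");
}
```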
Parsing this string as tabular data results in a dataset suitable for software maps. Thereby, the software map visualization technique is a 3D-extruded information landscape that is derived from a 2D treemap layout. The tree structure for the treemap layout is hereby derived from the tree structure of the file paths. The available visual variables in the visualization are footprint area (weight), extruded height (height), and leaf color (color). The visualization allows for basic navigation through the 3D scene, allowing users to familiarize themselves with the project and build up a mental map [54].
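Deriving the treemap hierarchy from the slash-separated file paths can be sketched as follows; the node structure and field names are illustrative, with the weight field standing in for whichever metric is mapped to footprint area.

```typescript
// Illustrative treemap node: inner nodes come from path segments,
// leaves carry the metric value mapped to footprint area (weight).
interface TreeNode {
  name: string;
  children: Map<string, TreeNode>;
  weight: number; // 0 for inner nodes
}

// Build the hierarchy by splitting each file path into segments and
// inserting intermediate nodes on demand.
function buildTree(files: { path: string; weight: number }[]): TreeNode {
  const root: TreeNode = { name: "", children: new Map(), weight: 0 };
  for (const file of files) {
    let node = root;
    for (const segment of file.path.split("/")) {
      if (!node.children.has(segment)) {
        node.children.set(segment, { name: segment, children: new Map(), weight: 0 });
      }
      node = node.children.get(segment)!;
    }
    node.weight = file.weight; // leaf: carries the metric value
  }
  return root;
}
```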

Prototype Implementation Details
We prototypically implemented the proposed approach as an open source project on GitHub. It is available within the project github-software-analytics-embedding 9. Additionally, we provide the GitHub Action on the marketplace 10. Adding this action to a repository enables the integration of the prototypical TypeScript source code metrics for new commits. An example client 11 that is built with React is hosted on GitHub Pages (Figure 4). However, the client could also be embedded on any self-hosted web page (such as GitHub Pages) using just an HTML script tag (Figure 5). Our prototypical analysis module is written in TypeScript. We decided to use TypeScript as a programming language because it provides first-class support for TypeScript code analysis using the TypeScript compiler API. The analysis code first creates an abstract syntax tree (AST) for each TypeScript file in the specified repository path. Then, the AST is used for static source code analysis. We decided to focus on a few simple software metrics: The LoC metric returns the total number of source code lines a source file contains. NoC counts the occurrence of comments, counting both single-line comments and multi-line comments as one, while CLoC focuses on the code lines comments take up in a file. A single-line comment would therefore count as one, while multi-line comments would count as their respective number of lines. The DoC is calculated by dividing the sum of CLoC and LoC by the CLoC. The number of functions NoF counts the number of method declarations and function declarations within a source code file.

9 hpicgs/github-software-analytics-embedding
10 https://github.com/marketplace/actions/analytics-treemap-embedding-action
11 https://hpicgs.github.io/github-software-analytics-embedding
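A simplified, line-based sketch of a subset of these metrics is shown below, using the definitions given above. The actual prototype walks the compiler-generated AST; this approximation handles single-line comments only and is purely illustrative.

```typescript
// Line-based approximation of LoC, CLoC, and DoC (single-line comments
// only; the prototype's AST-based analysis is more precise).
function computeMetrics(source: string) {
  const lines = source.split("\n");
  const loc = lines.length; // LoC: total number of source code lines
  // CLoC: lines taken up by comments (here: single-line comments only)
  const cloc = lines.filter((l) => l.trim().startsWith("//")).length;
  // DoC as defined above: (CLoC + LoC) / CLoC (left at 0 without comments)
  const doc = cloc > 0 ? (cloc + loc) / cloc : 0;
  return { loc, cloc, doc };
}
```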

Evaluation
We integrated our approach as a GitHub Action into 64 open source TypeScript projects of various sizes. Then, we benchmarked the performance of this action and the resource consumption within the git repository. Specifically, we compared the transmission size of a single metric blob, the pure metric calculation time for all TypeScript files in the repository, the total execution time of our GitHub Action, and an extrapolated metric blob memory consumption when used for every commit on the main 12 branch. Thereby, the integration process consisted of forking and adding the GitHub workflow file to each of the repositories, which took approximately two minutes per project.

Case Study
The projects were chosen to use TypeScript as one of their programming languages while being either known to the authors or popular within the community (see details in Table A1 and Table A2). These projects differ largely in size, application area, and development processes. The only common characteristic is the chosen programming language TypeScript or the availability of TypeScript typings, i.e., that the project contains .ts files. The size of the projects ranges from only a couple of files with a few hundred lines of code to almost 35 k source code files with above 6.5 M lines of code. Four example projects are highlighted in Table 1 and Figure 6; the remainder is available in the appendix, supplemental material, and online prototype (Table A1, Table A2, Figure A1, and Figure A2).

Table 1. Excerpt of the TypeScript repositories used for evaluation. The number of commits relates to the observed branch. The number of files represents the number of TypeScript source code files in the most current commit on the branch. The lines of code (LoC) are the lines of code from the TypeScript source code files. The overall share of TypeScript relative to the other programming languages (TS) is the self-declaration of GitHub and is a rough estimate. The full list is provided in Table A1.

Repository Memory Impact
We measure the memory footprint by the size of the base64-encoded metrics file response of the API, although it may be stored compressed within the git repository. The memory footprint of our analysis of a single commit scales linearly with the number of files within a project (Figure 7). This is to be expected as each file in the repository is represented by a single line in the metrics file, where each line stores the numerical values of each metric with a strict upper bound on the character length. The memory footprint seems rather high for large software projects such as Angular or Visual Studio Code, with a couple of hundred kilobytes per commit. However, smaller projects can profit from a low-consumption software analysis component. Further, the per-commit blob size is a trade-off between a full CSV file of all files and their metrics and a file of only the changed files. While the former approach allows fetching all metrics for all files at once, which is especially suitable for visualization, the latter approach allows for a much smaller memory footprint and is considered a default approach in software analytics [24]. However, providing a full visualization for the latter approach results in a multitude of requests.
When extrapolating the per-commit blob size to whole repositories naively, i.e., simulating an integration of our approach from the first commit, the proposed technique shows strong limitations (Figure 8). The simulated extrapolation assumes that each and every commit of the main branch would have its files analyzed and stored within the repository with no data retention policy. As an upper bound, the results indicate a median increase of the repository size by a factor of two, with an absolute increase of 180 MB. This number will be considerably smaller when taking into account (1) the compressed, binary representation of the git blob, (2) a more sensible application of the approach to only major commits instead of every one on the main branch, and (3) differential metric files containing only changed files. Reducing this to an empirically validated factor is still future work.
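The naive extrapolation can be sketched as a simple product, under the stated assumptions of a constant per-commit blob size and no data retention; the function names are illustrative.

```typescript
// Upper-bound extrapolation: every commit on the main branch contributes
// one full metrics blob of constant size.
function extrapolatedMetricsSize(perCommitBlobBytes: number, commitCount: number): number {
  return perCommitBlobBytes * commitCount;
}

// Resulting repository growth factor for a given base repository size.
function growthFactor(baseRepoBytes: number, addedBytes: number): number {
  return (baseRepoBytes + addedBytes) / baseRepoBytes;
}
```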

CI Execution Time Impact
The time our metrics computation takes does not scale linearly with the lines of code of a project (Figure 9). However, even for large projects such as Visual Studio Code and Angular, the time to measure all files is limited to a couple of seconds (up to 8.2 s for Visual Studio Code). The maximum measured time was approximately 58 s for the Definitely Typed project. Considering the overall execution time of the GitHub Action (Figure 10), the process does not seem to scale linearly with either lines of code or number of files. However, for projects below 1 000 000 LoC or below 10 000 files, this process does not run longer than 10 seconds.

Practical Considerations & Recommendations
We conclude that the general runtime and repository size overhead is sensible for small and mid-sized open source projects. The proposed approach in its current state (prototypical, unoptimized, and limited in features) does scale for open source projects up to medium size. An example project would be Angular CLI, which comes with 14.5 k commits, around 1 k files, and above 100 k LoC. The corresponding memory and runtime impact would be 3 s of GitHub Action time (whereof 1.5 s is the metrics computation) and 102 kB of base64-encoded metric blob size, which would result in a doubled repository size when measured for every tenth commit on the main branch since the very start of the project. Within our sample of 64 TypeScript projects, measured by memory impact for each commit on the main branch, Angular CLI is larger than 54 projects and smaller than 9 projects, placing it at the 86th percentile. Thus, the majority of projects are smaller and applicable for our proposed approach.

Figure 8. Extrapolated repository size impact if every commit of the main branch were augmented with software metrics information, measured against base repository size (log-log axes). Color represents the per-commit metric blob size as a second visual indicator. A derived linear regression (gray line) suggests that a repository would increase its size by 1.3-fold, i.e., the final size would be larger by a factor of 2.3. However, the spread is rather high and corresponds to the number of commits on the main branch of a repository.

Discussion
This analysis, however, comes with multiple assumptions and design alternatives. As such, the measurements and results are specific to the chosen implementation and environment, i.e., GitHub, its Actions as CI, git, the GitHub API, the TypeScript language, our own metrics analysis component, and the according integration and assumed usage by open source developers. This comes with a number of threats to the validity of our results, as well as points for discussion on limitations through the specific environment we have chosen, and a broad set of opportunities for extensions to the proposed approach.

Threats to Validity
We identified several potential threats to the validity of the results, covering both the runtime analysis and the storage consumption analysis.

Runtime Analysis
For example, one limitation is our choice of a prototype implementation for the metrics computation rather than employing existing, established tooling. This approach allowed for a focused, controlled, and low-profile metrics computation component to be used for the proposed approach. However, we see our measured timings as a lower bound for the execution time of a static source code analysis. Further, the analysis component cannot be considered production-ready by means of stability and available features.
As the analysis component with the specific metrics does not reflect the usual load an actual analysis component would bring into a CI pipeline, the execution time is expected to further increase through computational costs for additional or more complex metrics. We assume that an alternative use of real-world metrics computation tools would increase the measured timings, but not by multiple orders of magnitude. Further, the allocated runners for the CI pose a threat to validity. To properly control for the allocated runners, the study should be conducted with self-hosted runners. However, these runners are the default runners that would be used by a majority of open source projects.

Storage Consumption Analysis
Regarding the storage consumption analysis, one threat is the inaccuracy in measuring the metrics blob size. We measured the base64-encoded API response string, which represents an upper bound for the required storage within the repository. Further, the employed extrapolations on the assumed storage are based on unknown actual usage scenarios. For one, we suggest using a GitHub Action that gets triggered on each commit on a set of target branches. This may or may not be a sensible configuration. However, this configuration largely influences the overall memory consumption over the history of a software project. Further, the extrapolation assumes that the metrics blob file is constant in size, which correlates with the number of files in a repository being constant. This is a factor that will likely change over the history of a software project.

Limitations
An application of our approach to further open source projects on GitHub may be subject to technical limitations, for example, scalability issues, advanced git workflows, and security issues.

Scalability
Scalability is a main concern for the proposed approach, as GitHub wants to ensure continuous service for all its users, which concerns the available space per repository and the execution time on the shared GitHub Action runners. While the default timeout for the shared runners is at six hours 13 and not likely to be a direct limitation based on our tested open source projects, a more comprehensive analysis covering multiple commits within one GitHub Action may run out of time. For those cases, GitHub allows registering and using self-hosted runners 14. Likewise, switching to an external CI service that would also allow running the analysis component, available as a Docker container, may come with higher limits for computation. As another alternative, a developer of the project could execute the Docker image on their local machine.
Further, git repositories on GitHub have a soft limit in size 15. Executing the metrics computation process for each and every commit and storing the full dataset in an ever-growing software repository is bound to reach those limits. Mitigations point in different directions: (1) switching to an external file storage, such as git LFS, external databases, or foreign git repositories 16, (2) integrating data retention policies and removing metrics data when superseded or obsolete, and (3) thinning out the measured commits and focusing on more important commits such as pull requests and releases.

Advanced git Workflows
As a distributed version control system, git allows for more advanced usage scenarios to advance and handle the history of a software project. One such feature is the rebase, another would be a commit filter, but the overall category is a history rewrite. Such a rewrite derives new commits from existing ones while invalidating the latter. Currently, our proposed approach would naively handle such rewrites by recomputing the new commits as if they were normal commits. Any invalidation of stored metrics data for the obsolete commits is currently missing. Specific to this issue, but also applicable in a general sense, would be a handling of obsolete metrics data through the git garbage collector.

Security Considerations
Further, the proposed public, side-by-side availability of software metrics is subject to security considerations, as the measured software may represent sensitive information. The targeted use cases for our approach are open source repositories that want to apply lightweight software analysis on their already public source code. This public availability makes these repositories subject to external source code mining on a regular basis [55]. Anyone with software mining tools can download the source code, derive software metrics, host them anywhere, and analyze them at their discretion. We argue that any security-related attack vector is introduced with publishing the source code and not with making one's own software metrics available. On the contrary, with our approach, we connect to the original idea of developing source code publicly. A broad community can participate and ensure a healthier software development process and thus a healthier software project. One adaptation of our approach to protect the measured software data is to use an external database. This adaptation, however, would prevent other use cases such as the public availability of visualizations of the software project. Security considerations in the area of open source development remain their own field of study [56,57].

Extensibility
The current state of the approach and prototype allows for a number of extensions in various directions, namely other modes of integration into the development process, the supported languages, the supported metrics, the available visualization techniques, and the types of stored artifacts. The current, narrow focus on single implementation paths limits the applicability of the approach considerably, as it is specifically designed and implemented to work for the CI process of git repositories of the TypeScript parts of open source projects hosted on GitHub, where a small set of static source code metrics is derived and later visualized using the software map visualization technique. Applying further state-of-the-art approaches in these directions would increase the fit for more use cases, application scenarios, and software projects.

Modes of Integration into Development Process
To allow for a low-threshold integration into an open source project's development process, we proposed the integration into the GitHub CI processes using GitHub Actions on a single commit at a time. However, there are further modes in which this software analysis component can be integrated into the development process. For example, the trigger can be changed to fire on pull requests or releases, or even on manual start by a contributor or a software component. In the end, this storage can be considered a caching mechanism, where the cache is populated by triggering the execution of the software analysis component and storing the data through the GitHub API. As an alternative to the GitHub API, it is feasible to use the git API directly and pull and push the corresponding refs. This would also render the approach available to other software project management platforms and even to plain hosting of git archives. Further, each analysis process is not technically limited to measuring a single commit in isolation. This allows for (1) an extension to handle multiple individual commits and whole commit ranges within a single analysis process, and (2) the use of more information sources in addition to the checked-out commit, such as issue databases, development logs, CI logs, or the source code of other commits. An extended analysis, however, would increase the computation time considerably. Specific to GitHub, there is currently a six-hour time limit for the shared runners, which would still allow for such an increased amount of analysis.
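The git-API alternative mentioned above can be illustrated by assembling the push command for a batch of measured commits. The `refs/metrics/<sha>` layout is our hypothetical convention for this sketch, not a git standard; any platform hosting plain git archives would accept such refspecs.

```typescript
// Sketch: pushing analysis refs directly via the git CLI instead of the
// GitHub API, which makes the approach platform-independent. The ref layout
// refs/metrics/<sha> is an assumed convention for illustration.
function buildMetricsPushCommand(remote: string, shas: string[]): string {
  // One explicit refspec per measured commit: <local-ref>:<remote-ref>.
  const refspecs = shas.map(
    (sha) => `refs/metrics/${sha}:refs/metrics/${sha}`,
  );
  return ["git", "push", remote, ...refspecs].join(" ");
}
```

Executing the resulting command distributes the metric blobs alongside the regular history, without any dependency on a platform-specific API.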

Supported Programming Languages
Next to the integration into GitHub and the development process, the approach and prototype could be adopted to support further languages. As the implementation details surrounding the analysis component are designed to be language-agnostic and do not rely on any specific language, supporting further programming languages is straightforward and usually implemented using language-agnostic tools. Allowing for multiple programming languages is further important, as software projects likely use multiple languages within one repository [58].
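A language-agnostic analysis component could, for instance, dispatch per-file analyzers by file extension. The analyzer names below are placeholders for illustration; in the prototype's setting, each would correspond to a containerized tool.

```typescript
// Sketch: language-agnostic dispatch of per-file analyzers by extension.
// The analyzer names are hypothetical placeholders, not real tools.
const analyzers: Record<string, string> = {
  ".ts": "typescript-metrics",
  ".py": "python-metrics",
  ".go": "go-metrics",
};

function analyzerFor(path: string): string | undefined {
  const dot = path.lastIndexOf(".");
  // Files without an extension have no registered analyzer.
  return dot >= 0 ? analyzers[path.slice(dot)] : undefined;
}
```

New languages then only require registering a further extension-to-analyzer mapping, leaving the surrounding pipeline untouched.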

Supported Metrics
For demonstration purposes, we focused on static source code analysis metrics for our analysis component. However, the design and implementation of the prototype specifically allow the use of a broad range of software analysis tools and custom implementations, and thereby further languages as well. More importantly, a broad view on the state and evolution of a software project requires metrics that explicitly cover system dynamics and the evolution of metrics over time. As such, the current approach of storing file-focused software metrics will become obsolete, and more diverse storage formats need to be used. However, for low-threshold access to those metrics and no further dependency on third-party services, we suggest retaining file-based storage within the git repository.
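A minimal file-based storage format along these lines could look as follows. The field names mirror the metrics used in the prototype (lines of code, number of functions, density of comments), but the concrete schema is an assumption for illustration.

```typescript
// Sketch: a minimal per-commit, file-based metrics format stored as a JSON
// blob inside the repository. The schema is an assumed example; the field
// names (loc, nof, doc) mirror the metrics used in the prototype.
interface FileMetrics {
  path: string;
  loc: number; // lines of code
  nof: number; // number of functions
  doc: number; // density of comments, in [0, 1]
}

function serializeMetrics(commit: string, files: FileMetrics[]): string {
  // Pretty-printed JSON keeps the blob diffable and human-readable.
  return JSON.stringify({ commit, files }, null, 2);
}
```

Because the format is plain JSON, any client with repository access can consume it without third-party services.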

Visualization Approaches
While our current prototype is built upon static source code analysis metrics and the software map visualization technique, the underlying idea of fetching the software metrics directly from the repository does not limit the use to specific software visualization techniques, e.g., source-code-similarity-based forest metaphors [59,60]. More specifically, the integrated software analysis data is a specific kind of database to which each technique should be adaptable. Potential limitations come from the chosen metrics and the chosen file formats, both of which remain unrestricted by our proposed approach. This flexibility enables contributors and developers to tailor the representation of their project, and researchers to test novel visualization techniques on already measured software projects.
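The adaptation of a visualization technique to this data reduces to a mapping from stored metrics to visual variables, as used for the software maps in Figure 6 (LoC to weight, NoF to height, DoC to color). The normalization against project-wide maxima is an assumption of this sketch.

```typescript
// Sketch: mapping stored file metrics onto software-map visual variables,
// following the assignment used in Figure 6 (LoC -> weight, NoF -> height,
// DoC -> color). Normalization by project maxima is an assumption.
interface MapNode {
  weight: number; // footprint weight, normalized 0..1
  height: number; // block height, normalized 0..1
  color: number;  // later resolved through a color scale, 0..1
}

function toMapNode(
  m: { loc: number; nof: number; doc: number },
  max: { loc: number; nof: number },
): MapNode {
  return {
    weight: m.loc / Math.max(1, max.loc),
    height: m.nof / Math.max(1, max.nof),
    color: Math.min(1, Math.max(0, m.doc)),
  };
}
```

Swapping the visualization technique then only means swapping this mapping, not the analysis or storage pipeline.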

Stored Artifacts
Similar to the supported programming languages, metrics, and visualization techniques, the files stored as blobs within the git software repository are not limited to the proposed use of storing software metrics. Instead, there are only a few limiting factors for the blobs stored within the repository: the base blob size, the overall repository size, the access speed through APIs, and possibly rate limits to ensure fair use of the APIs. This allows for a more diverse use of the available storage to augment software repositories. One example is to skip storage of the software metrics and to derive and store a static image of the software system instead. Although more complex, this corresponds to the creation and storage of project badges -such as those of the shields.io service 17 -directly within the software repository.
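Whether such additional artifacts fit within these limits can be estimated from the evaluation: Figure 7 suggests roughly one kB of base64-encoded metric blob storage per file per commit. The following back-of-the-envelope calculation is purely illustrative and extrapolates that single empirical estimate.

```typescript
// Sketch: back-of-the-envelope estimate of repository growth from stored
// metric blobs, based on the ~1 kB per file per commit observed in the
// evaluation (Figure 7). Illustrative only; real growth varies per project.
function estimateMetricsOverheadKB(files: number, commits: number): number {
  const kbPerFilePerCommit = 1; // empirical estimate from the evaluation
  return files * commits * kbPerFilePerCommit;
}
```

For example, a repository with 200 files measured over 50 commits would accumulate on the order of 10 MB of metric blobs, which stays small against typical repository sizes.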

Conclusions
When a software development team wants to integrate software analysis into their project, selecting tools or services is a trade-off that usually results in (1) no control over metric computation, or (2) no persistent availability of low-level analysis results. We proposed an approach to augment git commits of GitHub projects with software analysis data, using TypeScript projects and static source code metrics as an example. The analysis is performed as part of a GitHub Actions CI pipeline, and its results are added to the git project as dedicated blobs. These results are thus persistently stored within the project and accessible through standard git interfaces and the GitHub API. The used analysis tool and visualization technique are designed to be exchangeable. The requirements to satisfy are the availability of analysis tools as Docker containers and the storage of software data within the git repository. To demonstrate this approach, we visualized GitHub projects using a basic React client and software maps as the visualization technique. We further performed an evaluation on 64 open source GitHub projects using TypeScript as their main or auxiliary language. The analyzed data suggests that small and mid-sized software repositories see only little impact on their CI runtime and repository size, even with extensive use of the proposed approach.
As such, we see primarily a low-threshold and low-cost adoption of our approach for small and mid-sized open source projects that otherwise struggle to set up their own software analysis pipeline, e.g., using external services. With our approach, we strive for direct access to abstract software information for the broad range of open source projects and their public representation, allowing for a quick overview and a gestalt-providing component. Directly concerning open source projects and their development, we hope to increase a project's "ability to be appealing" [61] to both existing and new collaborators. We further argue for the versatility and flexibility of the underlying approach of storing commit-related data directly within the git repository. Concerning the MSR community, such a broad integration of software metrics into the git repository would change the availability and use of the data for novel analyses and the replicability of published results. Extrapolating, large-scale evaluations of source code metrics can profit from already computed metrics within each repository through our approach [62]. Further, dedicated software analysis data repositories can either be derived directly from the software repositories, or these repositories can be considered distributed datasets instead [55].
For future work, we envision replacing the analysis component with one that broadly supports programming languages and software metrics. As such, we see the other areas of software metrics -dynamic metrics, process metrics, developer metrics -as well as higher-level key performance indicators that should be available as well. Next to software measurements, the proposed approach can be used to store and provision derived visualization artifacts [39]. Further, we consider also allowing developers to perform the analyses on their machines and commit the results alongside their changes into the repository. This would allow both CI and developers to perform measurements and distribute the workload, e.g., when computing measurements for whole branches of a project. From an MSR researcher's perspective, augmenting the commits of distributed software projects, for example through forks, by means of "rooted" repositories 18 would provide a greater impact, even with lower impact on overall repository size through reduced copies. Concluding, augmenting software repositories and providing low-threshold and easily accessible tooling further contributes to visual software analytics as a key component in software development.

Figure 1 .
Figure 1. A 2.5D interactive software map visualization of the Microsoft vscode software project.

Figure 2 .
Figure 2. Process overview showing the participation of different actors through our data processing pipeline triggered by a new commit. After processing, a visualization component can query the resulting software analytics data and derive visualization artifacts, such as software maps.

Figure 3 .
Figure 3. Proposed data structure to save commit-based metadata in the git object database. Each commit with software data references the original commit through name matching.

Figure 5 .
Figure 5. HTML script tag that loads the client and initializes the visualization with the given GitHub project and commit.

Figure 6 .
Figure 6. Excerpt comparison of TypeScript projects with increasing size and complexity using a software map visualization. The number of lines of code (LoC) is mapped to weight, the number of functions (NoF) is mapped to height, and the density of comments (DoC) is mapped to color. The full overview is provided in Figure A1 and Figure A2.

Figure 7 .
Figure 7. Memory impact of the metric file blob on the repository, in kB per commit, measured by number of files (log-log axes). Color represents the number of lines of code as a second visual indicator of correlation. A derived linear regression (gray line) suggests that each file in the repository contributes approximately one kB of base64-encoded metric blob storage per commit.

Figure 9 .
Figure 9. Run-time performance impact of the proposed software analysis component, measured by lines of code (log-log axes). Color represents the number of files as a second visual indicator that the analysis correlates with the number of files as well. A derived linear regression (gray line) suggests that the analysis component does not scale linearly with the project size.

Figure 10 .
Figure 10. Run-time performance of the full GitHub Action that includes the proposed software analysis component and metrics blob storage, measured by lines of code (log-log axes). Color represents the number of files as a second visual indicator that the analysis correlates with the number of files as well. A derived linear regression (gray line) suggests that the analysis component does not scale linearly with the project size.

Figure A1 .
Figure A1. Comparison of TypeScript projects with increasing size and complexity using a software map visualization. The number of lines of code (LoC) is mapped to weight, the number of functions (NoF) is mapped to height, and the density of comments (DoC) is mapped to color.