DevSecTrust: Standardising How We Measure Software Development Security

Jones, Lachlan; Turnbull, Benjamin; Moustafa, Nour

doi:10.3390/fi18060279


by Lachlan Jones, Benjamin Turnbull and Nour Moustafa *

School of Systems and Computing, University of New South Wales, Canberra 2612, Australia

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(6), 279; https://doi.org/10.3390/fi18060279

Submission received: 22 April 2026 / Revised: 15 May 2026 / Accepted: 19 May 2026 / Published: 25 May 2026

(This article belongs to the Section Cybersecurity)


Abstract

Metric-based software security plays a crucial role in allowing software developers to make informed decisions about their development practices, while also allowing software users to evaluate the security risks associated with the software they use. Metrics are increasingly used to ensure code security, but there has been little formal evaluation of their broader applicability to date, and interpretation of their results remains a qualitative task. To address this gap, we introduce DevSecTrust, a standardised evaluation framework for measuring and comparing software development security metrics. DevSecTrust provides: (i) a unified control-mapped metric schema, (ii) outcome-based calibration and validation against real vulnerability and maintenance data, and (iii) robustness and manipulability testing to assess metric reliability. This paper analyses two software development security tools, MITRE’s Hipcheck and OpenSSF’s Scorecard, to evaluate and contrast the metrics they produce for widely used open-source software projects. Our quantitative comparison identified low correlation and inconsistent distributions between the tools’ outputs, and our qualitative analysis of feature weighting and scoring logic revealed foundational differences in how each tool conceptualises “secure development”. These inconsistencies complicate trust in development security metrics and hinder their interpretability and operational value. By exposing these inconsistencies and proposing DevSecTrust, this work contributes to a path toward standardised measurement of software development security.

Keywords:

open-source software; software security; OpenSSF Scorecard; MITRE Hipcheck; software supply chain; software security metrics; software development

1. Introduction

Cybersecurity threats facing organisations continue to grow and evolve [1]. Software vulnerabilities are introduced during the development process and may not be identified until much later, when it is often difficult in terms of time and complexity to remedy them, highlighting the importance of software security reviews occurring throughout the development process rather than at a later stage [2]. Uncaught vulnerabilities in production code can lead to consequences as catastrophic as total organisational compromise and massive system failure.

Software developers and users continue to use various security metrics to inform processes and decisions, and to remain as secure as possible from software vulnerabilities and the resulting threats [3]. Because metric-based software security review and evaluation should occur during software development, the choice of metrics is essential to ensure that behaviours focus on achieving the proper outcomes [4]. While several software security metrics exist (including CVE and CVSS), these are typically applied after the development phase and can be unclear or incomplete [5]. The software industry continues to examine which metrics are the most meaningful for ensuring software security [6].

Although automated security evaluation tools such as SAST, DAST, dependency scanning and AI-based tools are increasingly used in both open-source and commercial software development [7,8], definitions of what it means for a software project to be “secure” from a development practices perspective remain varied. The US National Institute of Standards and Technology (NIST) produced the Secure Software Development Framework, identifying practices to be implemented in secure software development; however, it does not describe implementations of these practices and notably does not define properties that are inherent in what “secure software” looks like [9]. While implementing these practices would likely improve progress toward secure software development, it does not assist in helping to identify whether an existing piece of software may be described or declared as secure.

Security evaluation tools, such as Hipcheck [10] and Scorecard [11], quantify software trustworthiness by scoring development behaviours like code review practices, contributor diversity, dependency management, or release signing. Measuring these practices allows users to make risk-based decisions around using the software resulting from these practices, rather than attempting to measure the absence of security within software. However, different implementations may choose to encode different assumptions, examine different features, and weigh those features differently. If these implementation choices produce different results, this undermines user trust (whether end-users or other developers using library code) that software is at least somewhat secure.

Inconsistency in tool results stems from the absence of a shared basis for defining what “secure development” entails, how security-related development behaviours should be measured, and how metric outputs should be interpreted in practice. Without a standardised approach, organisations cannot confidently compare risk across projects, development teams receive unclear guidance on which practices genuinely reduce security risk, and security scoring can be influenced by superficial or easily manipulated signals.

Khan et al. [12] defined software security as “creating and developing software that assures the integrity, confidentiality, and availability of its code, data and services”. Similar definitions are proposed in [13,14], focusing on assuring these three properties. These definitions validly focus on protecting the confidentiality, integrity and availability of software and data, but they do not suggest any properties or features that would be inherent to secure software. It therefore remains an open question how one may determine whether a particular piece of software effectively protects these three properties, or how to write software that confidently achieves this.

Ali et al. [15] proposed centring the requirements for software security around resilience against or resistance to threats, attacks or vulnerabilities. While this definition is excellent for drawing the focus of discussion to the desired results of any such conversation, it does not assist in identifying features inherent to software itself or the development practices that produced it. This seems to frame software security as a goal to be achieved, rather than a state which could be examined or a process which could be undertaken. It also has the effect of defining software security as the absence of the success of another process (the attack in question), rather than being a concept that can be reasoned about in its own right. For those who seek to identify properties of software or development practices that are correlated with security itself, this style of definition does not fit this need.

To address this gap, we introduce DevSecTrust, a framework for bringing structure, validation, and comparability to development security metrics. DevSecTrust establishes a standard schema for describing metrics, evaluates whether scores align with actual security outcomes, and tests the stability and resistance to manipulation of those scores across different software ecosystems. By doing so, the framework provides a more reliable foundation for measuring and reasoning about software development security risk.

A new class of programs seeks to provide holistic interpretations of the security of open-source projects, based on a wider array of metrics than simple code analysis provides. This paper examines how these programs evaluate the security risk associated with software development practices, as this risk is an underlying factor that impacts software security, reliability, and claims of secure development practices. Examining the metrics used to determine this level of risk will also improve trust that those metrics are relevant and reliable.

The core contributions of this paper are as follows:

We propose DevSecTrust, a framework for standardising the measurement of software development security.
We define a standard metric schema, outcome-based validation approach, and robustness (anti-manipulation) testing method.
We demonstrate how DevSecTrust can provide more reliable, interpretable, and comparable development security assessments across ecosystems.
We identify inconsistency and low correlation between the existing tools Hipcheck and Scorecard when evaluating the same open-source software projects.

The remainder of this paper is structured as follows: Section 2 provides the background and related work, Section 3 introduces a novel framework to address identified shortcomings in the literature, Section 4 validates this identified need through experimentation, and Section 5 performs an analysis of these results with novel conclusions.

2. Background and Related Work

Open-source software (OSS), while often viewed as more secure due to the ease of code auditing, may also introduce cybersecurity risks by allowing the execution of externally produced, potentially untrusted code [16]. This risk is compounded because many open-source projects in turn rely on other projects, resulting in several nested layers of dependencies and exposing the software supply chain to risk. This was highlighted by incidents such as the 2021 “Log4Shell” vulnerability, which affected one of the most widely used software programs in the world [17], a program that was nonetheless maintained by a small, hardworking team [18]. While the risks associated with using code produced by other (sometimes unknown) authors are not unique to OSS, this software is often provided “as-is”, with no warranties or ongoing support for those using the code.

Efforts to mitigate this risk include techniques such as Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST) and Software Composition Analysis (SCA) [19], which identify and measure known code patterns that may be indicative of vulnerabilities. Recent studies have begun to examine the development practices behind a codebase that may lead to vulnerabilities being introduced, rather than relying on SAST or DAST, in order to avoid issues such as high false-positive rates or probabilistic results [20]. Recent work [21] aims to automate the evaluation of source code and development practices for identifying cybersecurity risks, resulting in various solutions to achieve this. Identified examples of these solutions include the MITRE Corporation’s Hipcheck [10] and the Open Source Security Foundation’s (OpenSSF) Scorecard [11].

Hipcheck is a tool developed to assist in assessing OSS for potential risks and vulnerability to malicious attack [10]. This is achieved by examining the development practices of a project, such as commit history and developer behaviours, rather than code analysis, to judge the level of risk that the project may have introduced unwanted vulnerabilities or attack vectors. Scorecard is a tool developed to help OSS maintainers assess the security risks of their development practices, and help users of these software projects to determine the risk in using them [11]. This is achieved by evaluating practices such as existing security checks, contributor data, and workflows in order to identify risky behaviours that may introduce unwanted vulnerabilities.

The aim of this study is not to develop a comprehensive, tool-agnostic framework, but to focus on tools that are user-centric. While interpreting output from a static analysis program requires a depth of technical understanding (especially to meaningfully mitigate identified issues), the niche of tools such as Hipcheck and Scorecard is designed to provide an overview of security even for those without that depth of technical security expertise.

Both Hipcheck and Scorecard provide the ability to review an OSS repository and produce a quantitative score, aiming to indicate the potential cybersecurity risk involved in using the software project. Neither tool claims to be a total solution for evaluating the security of a project, but they are designed as a starting point for further discussion or investigation [11,22]. Despite these tools being designed to serve as a starting point for investigation, they nonetheless produce a total score for a repository and thus begin to compare the expected security risk across different software repositories. These tools examine several features in the code and metadata of software repositories to produce a score for a given repository, including metrics for development paradigms, frequency and volume of code changes, and diversity among development contributors.

The different tools do not necessarily examine the same features as each other to arrive at their final results, and it is unclear which metrics each tool used to decide on the features it would examine [21]. Existing software security metrics are discussed throughout the literature, such as the CVE or CVSS systems [18], which examine aspects of software or cybersecurity [23,24,25]. These systems are not intended to be a standalone cybersecurity solution, but rather an effort to define and catalogue publicly known vulnerabilities through their respective processes. Because they are designed for other purposes, they are unsuitable for determining software security quality. For example, more frequently used code may be reviewed more often, resulting in a higher number of vulnerabilities being found, or high-profile vulnerabilities may attract more attention to specific areas of a codebase, leading to additional CVEs being identified (as seen in the case of the Log4Shell vulnerability).

Previous studies have examined other metrics that may be used to infer the state of security in a software project. Some propose measuring features of software itself to produce a quantitative measurement of security risk [4,5]. However, these acknowledge that the best metrics to examine differ between different codebases [4], or acknowledge that false positives mean that the number of results produced on large codebases is still an infeasible quantity [5]. Kudriavtseva and Gadyatskaya [26], in a study of existing software security metrics, identified that current metrics do not sufficiently measure the level of security in a software project (and that these metrics are seen as outdated), and that existing measures are quantitative despite practitioners seeking qualitative measures. This work particularly highlights the need for metrics that focus on filling the needs of practitioners, rather than theoretical or outdated metrics. Siavvas et al. [27] introduced an automated security assessment model to allow evaluation of the security of software written in Java. This work includes detailed validation of the model using available alternative methods, and while it identifies that relying on a single static security analysis tool as a part of the model is potentially limiting, it does justify this effectively by ensuring that information output to a user is actionable.

While this model appears to be a strong candidate for assessing the security of software projects written in Java, it is limited to a single programming language and does not utilise the potential findings from DAST tools, although Siavvas et al. [27] did present justifications for this within the context of their intended use-case. While Saeed et al. [28] explained existing difficulties in embedding security in software development and achieving a comprehensive review, the conclusions drawn from this work do not yet provide actionable suggestions or acknowledge the competing priorities that exist within an organisation conducting software development. They do, however, highlight that automated threat analysis and security testing is an area for further exploration.

Ala et al. [29] proposed a series of metrics which can be used to assess risk during software development; however, their metrics are specific to the agile software development process and focus on quantifying the portion of work which has some form of security consideration, without considering the quality or depth of this consideration. These metrics often do not provide immediately implementable recommendations to developers as to how to write more secure code. While each of these studies has proposed some form of metrics-based solution for quantifying and measuring the state of security within a software development project, there remains a gap in the existing literature for metrics that can be applied across multiple programming languages and provide immediately actionable results for practitioners to be able to monitor and improve the security of projects.

AI-based solutions have been proposed to identify security vulnerabilities in software projects; however, studies in the field have noted that proposed AI solutions are not yet ready to be used in large complex systems and require further work before being suitable for these [30], particularly to cover additional software scenarios [31]. For these reasons, this paper examines solutions that do not depend on AI, allowing further time for this field to mature.

3. DevSecTrust Framework

3.1. Novelty and Contribution

To address the inconsistency and limited interpretability of existing development security scoring tools, we propose DevSecTrust, a framework designed to standardise how secure development metrics are defined, validated, and compared across tools and ecosystems. Rather than introducing another scoring system, DevSecTrust acts as an independent evaluation layer that examines how security metrics are constructed and whether their outputs can be meaningfully interpreted. Current tools often produce numerical scores without a shared measurement foundation, making comparisons difficult and sometimes misleading. DevSecTrust introduces a structured reference model that aligns metrics with clearly defined controls, consistent measurement semantics, and comparable scoring assumptions, enabling security scores to be analysed within a common analytical framework.

The key novelty of DevSecTrust lies in shifting the focus from metric availability to metric trustworthiness. Existing approaches implicitly assume that observable development practices, such as workflow configuration, dependency management, or repository policies, directly indicate software security. DevSecTrust challenges this assumption by treating metrics as hypotheses that must be empirically validated. The framework evaluates whether a metric meaningfully reflects real-world security outcomes, such as vulnerability occurrence, remediation speed, or long-term maintenance behaviour. By grounding metric evaluation in observable outcomes, DevSecTrust moves development security assessment from heuristic scoring toward evidence-based measurement.

In addition, DevSecTrust introduces systematic testing of stability and manipulability, recognising that trustworthy metrics must remain consistent across environments and resistant to superficial optimisation. A metric that changes significantly due to minor configuration adjustments or can be easily improved without genuine security improvement undermines decision-making and organisational trust. The framework therefore evaluates whether scores remain stable under benign variations and whether they can be artificially inflated through low-cost actions. Through this combination of standardisation, outcome validation, and robustness analysis, DevSecTrust provides a foundation for scientifically grounded and operationally reliable measurement of secure software development practices.

This framework is significant because without a validated and standardised approach, current development security scores may lead teams to adopt practices that improve scores without improving security, shift trust burden to individual tool vendors, and prevent organisations from making fair comparisons between software projects or supply-chain risks. DevSecTrust provides a scientifically grounded foundation for understanding what development security scoring tools are truly measuring.

3.2. Components of DevSecTrust Framework

The DevSecTrust framework is a structured evaluation pipeline designed to explain and reduce inconsistencies observed across existing development security scoring tools. As shown in Figure 1, the framework begins with the outputs of automated assessment tools, represented on the left side of the figure by the OpenSSF Scorecard and MITRE Hipcheck. Although both tools aim to assess secure development practices, they rely on different data sources, scoring rules, and implicit assumptions about what constitutes secure software engineering. These differences produce divergent scoring patterns, which our empirical analysis identifies as the Observed Evaluation Gap shown on the left of the figure. This gap reflects the absence of a shared measurement model, where scores generated by different tools cannot be interpreted consistently or compared reliably across projects or ecosystems.

To address this issue, the centre of the figure illustrates the interconnected components of the DevSecTrust framework. These components are pursued in parallel, as each provides a distinct yet crucial function expected of reliable and trustworthy metrics. The Unified Metric Schema establishes a common semantic layer by mapping tool-specific metrics to shared security controls and observable repository behaviours using a common vocabulary, ensuring conceptual alignment across tools. The Outcome-Based Calibration component then evaluates whether these metrics correspond to measurable security outcomes, such as vulnerability occurrence, patch response time, or maintenance activity, grounding scores in empirical evidence rather than assumptions. Finally, the Robustness and Manipulation Testing component assesses whether metrics remain stable under benign changes and are resistant to superficial optimisation, ensuring that scores reflect genuine security posture rather than easily gamed configurations. These parallel components result in transparent and comparable outputs that transform isolated tool scores into evidence-based security assessments, providing a consistent foundation for interpreting and trusting development security metrics. When all three components are implemented and adhered to, the outcome is a normalised and comparable set of results across different evaluation metrics, improving the trustworthiness of each metric.

3.3. Unified Metric Schema

The Unified Metric Schema forms the foundational layer of the DevSecTrust framework by establishing a shared structure for describing and interpreting development security metrics across different tools and ecosystems. Existing security evaluation tools often define metrics independently, using different terminology, scales, and assumptions about what constitutes secure development behaviour. As a result, identical repository characteristics may be interpreted differently depending on the tool used. The Unified Metric Schema addresses this issue by introducing a common vocabulary that maps tool-specific checks to recognised security controls and observable repository signals. This mapping enables metrics originating from different tools to be interpreted within the same conceptual framework, reducing ambiguity and improving comparability.

Beyond simple normalisation, the schema explicitly captures contextual information such as metric intent, measurement method, expected directionality, and applicability constraints across programming languages or development environments. By formalising these properties, DevSecTrust treats metrics as structured measurement instruments rather than isolated numerical outputs. This approach allows researchers and practitioners to understand what a metric measures and why it exists, improving interpretability and enabling cross-tool analysis. The Unified Metric Schema therefore provides the semantic alignment necessary for downstream validation and comparison stages, ensuring that subsequent analyses operate on consistent and well-defined representations of development security practices.
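As an illustration of the properties listed above, the following is a minimal sketch of how one schema record could be represented. It is not the schema defined by DevSecTrust: the field names, the control label `peer_review`, and the check identifiers are all hypothetical and were chosen only to mirror the properties named in the text (intent, measurement method, directionality, applicability).

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    """Expected directionality of a metric's raw value."""
    HIGHER_IS_SAFER = "higher_is_safer"
    LOWER_IS_SAFER = "lower_is_safer"


@dataclass(frozen=True)
class MetricRecord:
    tool: str               # tool that emits the metric
    check_name: str         # tool-specific identifier (hypothetical here)
    control: str            # shared security control the check maps to
    intent: str             # what the metric is meant to capture
    method: str             # how the value is measured
    direction: Direction    # how to read the raw value
    languages: tuple = ()   # applicability constraint; empty means any language


def group_by_control(records):
    """Group records from different tools under the control they share."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.control, []).append(record)
    return grouped


# Hypothetical records: two tools measuring the same control differently.
records = [
    MetricRecord("Scorecard", "example-review-check", "peer_review",
                 "changes are reviewed before merge",
                 "share of recent changes with an approving review",
                 Direction.HIGHER_IS_SAFER),
    MetricRecord("Hipcheck", "example-review-analysis", "peer_review",
                 "changes are reviewed before merge",
                 "share of merged changes lacking review",
                 Direction.LOWER_IS_SAFER),
    MetricRecord("Scorecard", "example-signing-check", "release_integrity",
                 "releases can be verified by consumers",
                 "presence of signatures on recent releases",
                 Direction.HIGHER_IS_SAFER),
]

by_control = group_by_control(records)
```

Grouping by control makes visible that the two hypothetical review checks point in opposite directions, which is the kind of mismatch that a record of directionality is meant to expose before scores are compared.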

3.4. Outcome-Based Calibration

The Outcome-Based Calibration component evaluates whether development security metrics meaningfully reflect real-world security outcomes. Many existing tools implicitly assume that certain development practices, such as enabling branch protection or signing releases, directly improve security, yet these assumptions are rarely validated empirically. DevSecTrust introduces an evidence-driven approach by analysing the relationship between metric scores and observable outcomes, including vulnerability emergence, patch latency, incident history, and long-term maintenance activity. This step transforms metric evaluation from heuristic reasoning into a data-supported validation process.

Through statistical comparison and outcome alignment, the framework assesses whether higher metric scores consistently correspond to improved security performance across diverse projects and ecosystems. Metrics that demonstrate weak or inconsistent relationships with real outcomes can therefore be identified as unreliable indicators, even if widely adopted in practice. This calibration stage also helps reveal hidden biases introduced by ecosystem differences, project scale, or development maturity. By grounding security metrics in observable evidence rather than assumed best practices, Outcome-Based Calibration strengthens the scientific validity of development security measurement and provides a clearer basis for interpreting automated security scores.
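One way to test whether higher scores accompany better outcomes is a rank correlation between a metric and an outcome variable. The sketch below uses Spearman's rank correlation, which is an assumption made for illustration; the text does not name a specific statistic for this stage. The score and patch latency values are invented.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented data: normalised metric scores and patch latency in days.
metric_scores = [82, 45, 67, 30, 91]
patch_latency_days = [5, 40, 12, 60, 3]

rho = spearman(metric_scores, patch_latency_days)
```

In this invented example the ordering of scores is the exact reverse of the ordering of latencies, so the coefficient is -1. A metric whose coefficient sits near zero across projects would be a candidate for the "unreliable indicator" label described above.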

3.5. Robustness and Manipulation Testing

The Robustness and Manipulation Testing component evaluates the reliability of development security metrics under realistic operational conditions. Automated scoring systems can unintentionally incentivise behaviour that improves scores without improving actual security, particularly when metrics rely on easily configurable repository features. DevSecTrust therefore examines whether metric output remains stable when projects undergo benign changes, such as minor configuration updates or natural repository evolution. Stability under such conditions indicates that a metric captures meaningful security characteristics rather than noise or incidental variation.

In addition to robustness, this component investigates the susceptibility of metrics to manipulation. The framework tests whether scores can be artificially increased through superficial or low-cost actions that do not materially improve security, such as adding symbolic configurations or satisfying minimal checklist conditions. Metrics that are easily gamed undermine trust and may distort organisational decision-making by rewarding appearance over substance. By systematically analysing both stability and resistance to manipulation, DevSecTrust ensures that evaluated metrics support reliable and trustworthy security assessments. This stage ultimately strengthens confidence in metric-based evaluation by distinguishing genuinely informative measurements from those vulnerable to instability or strategic optimisation.
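The two properties examined by this component can be expressed as simple quantities: the largest score shift seen under benign variations, and the largest score gain obtained from superficial changes. The following sketch assumes scores on the 0 to 100 scale; the function names and both tolerance values are hypothetical and are not thresholds proposed by the framework.

```python
def max_drift(baseline, benign_scores):
    """Largest absolute score shift observed across benign variations."""
    return max(abs(score - baseline) for score in benign_scores)


def max_gain(baseline, gamed_scores):
    """Largest score increase obtained from superficial, low-cost changes."""
    return max(score - baseline for score in gamed_scores)


def assess_metric(baseline, benign_scores, gamed_scores,
                  drift_tolerance=5, gain_tolerance=5):
    """Summarise stability and manipulation resistance for one metric.

    Both tolerances are illustrative values, not recommended thresholds.
    """
    drift = max_drift(baseline, benign_scores)
    gain = max_gain(baseline, gamed_scores)
    return {
        "drift": drift,
        "stable": drift <= drift_tolerance,
        "gain": gain,
        "manipulation_resistant": gain <= gain_tolerance,
    }


# Invented example: a score that barely moves under benign changes but
# jumps after a purely symbolic configuration change.
report = assess_metric(baseline=62,
                       benign_scores=[61, 63, 62],
                       gamed_scores=[64, 80])
```

Here the metric would be judged stable (drift of 1) but not manipulation resistant (gain of 18), showing why the two properties are assessed separately.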

4. Preliminary Framework Validation

To validate the need for a new framework examining software development security risk, and for its proposed components, the results of existing tooling are examined for consistency to understand which practices software developers consider to be “more secure”. The hypothesis is that if there is a clear understanding of this across the software development field, then these tools should evaluate software projects consistently, regardless of the implementation details of each tool.

4.1. Identifying Evaluation Tooling

Substantial effort was made to identify publicly available software tooling that claims to perform security risk evaluation of the software development process. To be considered, these tools must be openly and freely available online and produce a quantitative result indicating the level of identified risk in development practices for a given software project. The tool must also support multiple programming languages, allowing for a broader evaluation of risk beyond language-specific features.

OSS tools relating to the key terms “software”, “development”, “risk”, “score”, and “evaluate” were identified, namely the GitHub projects “mitre/hipcheck” (v3.3.1) and “ossf/scorecard” (v5.0.0). The documentation for each project was examined to identify shared key terms, and these terms were then used to search for additional projects performing similar functions. This second iteration of the query identified the GitHub project “Legit-Labs/legitify” as an additional potentially relevant tool; however, this tool incorporates several other distinct steps, including the previously identified “ossf/scorecard” project, so it was excluded from this experiment to avoid double-counting. No further tooling was identified by this secondary inclusion step. At the end of this process, “mitre/hipcheck” and “ossf/scorecard” were the only projects identified that examine risk in software development practices.

Both mitre/hipcheck and ossf/scorecard claim to evaluate the security posture of software projects by evaluating the development practices of that project, and so should in theory evaluate projects in a similar way.

4.2. Identifying Popular Software

This step was performed using the available ecosyste.ms packages dataset [32] to identify the top 100 most downloaded projects from the registries npmjs.org (JavaScript), proxy.golang.org (Golang), pypi.org (Python), repo1.maven.org (Java), rubygems.org (Ruby), and crates.io (Rust). Retrieval of each project was attempted via git using the repository URL supplied by ecosyste.ms; projects whose URL was invalid or whose retrieval failed were skipped. To be broadly applicable for practitioners, existing solutions and new frameworks like DevSecTrust must be language agnostic, which requires examining multiple programming languages. Examining the most often used software ensures that this process produces results which are widely applicable across realistic software development projects; however, it is acknowledged that drawing exclusively from open-source software may neglect software development practices followed in closed-source projects due to commercial or privacy restrictions.
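The selection step can be sketched as follows, assuming the dataset has already been loaded into a list of dictionaries. The keys `registry`, `downloads`, and `repository_url` are hypothetical names used for illustration and may not match the fields of the ecosyste.ms dataset; the sample records are invented.

```python
from collections import defaultdict


def top_n_per_registry(packages, n=100):
    """Pick the n most downloaded packages per registry, then drop those
    without a usable repository URL (mirroring the skip rule in the text)."""
    by_registry = defaultdict(list)
    for package in packages:
        by_registry[package["registry"]].append(package)

    selected = {}
    for registry, items in by_registry.items():
        ranked = sorted(items, key=lambda p: p["downloads"], reverse=True)
        selected[registry] = [p for p in ranked[:n] if p.get("repository_url")]
    return selected


# Invented records with hypothetical field names.
packages = [
    {"name": "a", "registry": "pypi.org", "downloads": 500,
     "repository_url": "https://example.org/a.git"},
    {"name": "b", "registry": "pypi.org", "downloads": 900,
     "repository_url": None},
    {"name": "c", "registry": "pypi.org", "downloads": 100,
     "repository_url": "https://example.org/c.git"},
    {"name": "d", "registry": "crates.io", "downloads": 50,
     "repository_url": "https://example.org/d.git"},
]

selection = top_n_per_registry(packages, n=2)
```

Because skipped projects are not replaced by lower-ranked ones in this sketch, a registry can contribute fewer than n projects, which is consistent with 547 rather than 600 projects being retrieved across six registries.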

4.3. Evaluating Each Project with Each Tool

With 547 projects retrieved from the various registries, Hipcheck and Scorecard were each used (in their container-provided form) to evaluate every project. Each tool was run via its command-line interface (CLI) within the container and was provided with the repository URL of each project (every project examined at this stage was known to have a retrievable repository URL). This was conducted on 18 November 2024, so projects were evaluated in their state on that date. If the container or application failed to evaluate a project for any reason, the project was skipped. For each registry, the results for each project were recorded with the time of project retrieval, project name, registry, and evaluated risk score. The final score given by each tool was normalised to be an integer between 0 and 100. Scorecard scores were normalised as in Equation (1):

\mathrm{Score}_{\mathrm{normalised}} = \mathrm{Score}_{\mathrm{raw}} \times 100 \qquad (1)

The scores given by Hipcheck were normalised as in Equation (2), which adjusts the raw score from a number between 0 and 1 representing the calculated risk level of the project (with 0 being the most secure) to allow for comparative analysis:

\mathrm{Score}_{\mathrm{normalised}} = 100 - \mathrm{Score}_{\mathrm{raw}} \times 100 \qquad (2)

In cases where only one tool successfully evaluated a software project, that project was removed from the dataset, as it does not allow for comparisons of results. While examining causes of individual tool failure on certain projects may be of benefit for improving additional tool functionality, this was considered beyond the scope of this work as it does not help to quantify how variable the results of the two tools are. Of the original 547 projects, 466 were successfully evaluated by both tools.
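The normalisation in Equations (1) and (2), together with the removal of projects that only one tool evaluated, can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names and the dictionaries keyed by project name are hypothetical, and the raw Scorecard value is assumed to already be expressed as a fraction in [0, 1] so that Equation (1) yields a value in [0, 100].

```python
def normalise_scorecard(raw: float) -> int:
    """Equation (1): scale the raw score to an integer in [0, 100].

    Assumption: `raw` is already expressed as a fraction in [0, 1].
    """
    return round(raw * 100)


def normalise_hipcheck(raw_risk: float) -> int:
    """Equation (2): invert the raw risk (0 = most secure) so higher is better."""
    return round(100 - raw_risk * 100)


def paired_scores(scorecard_raw: dict, hipcheck_raw: dict) -> dict:
    """Keep only projects scored by both tools, as (scorecard, hipcheck) pairs."""
    shared = scorecard_raw.keys() & hipcheck_raw.keys()
    return {
        name: (normalise_scorecard(scorecard_raw[name]),
               normalise_hipcheck(hipcheck_raw[name]))
        for name in sorted(shared)
    }


# Hypothetical raw results: "b" and "c" were each scored by only one tool.
example = paired_scores({"a": 0.5, "b": 0.75}, {"a": 0.25, "c": 0.5})
```

Only project "a" survives the pairing here, mirroring how the study's dataset was reduced from 547 retrieved projects to the 466 evaluated by both tools.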

4.4. Comparing the Resulting Datasets

The datasets produced by each tool were compared to identify points of both agreement and difference. Given that both of these tools aim to quantify the risk introduced by software development practices in software projects, it follows that the datasets containing the results of the tools should largely agree. Minor points of difference may be expected due to implementation differences, but if each of these tools meaningfully evaluates the same thing, then the results should be statistically similar.

4.4.1. Aggregate Datasets

The dataset of each set of scores produced by both Scorecard and Hipcheck is summarised in Table 1.

Examining the normalised datasets visually in Figure 2 shows that these datasets are not identical, contrary to what would be expected if the different evaluation tools evaluated software projects in meaningfully similar ways.

See Figure 3 for a histogram of this same dataset, with a bin size of 5. This figure demonstrates that Hipcheck results are grouped (compared to Scorecard) at or around 60, with very few projects scoring more highly than 70. By contrast, Scorecard appears to have a more widely spread result set rather than grouping around a certain value.

Given that the purpose of these tools is to assist in evaluating software security risk, Scorecard seems to better fulfil the requirement of advising whether specific projects are more secure than others, whereas Hipcheck tends to group results rather than distinguish between them. This is also identifiable in Figure 4, which plots the resulting score from both Hipcheck and Scorecard on the Y axis against an arbitrary ordering of projects on the X axis (so that the same project evaluated by both Hipcheck and Scorecard has the same X value, but no other meaning is assigned to this value). In this figure, Hipcheck results tend to form horizontal lines, indicating that certain scores are awarded by Hipcheck particularly often. This is in direct contrast with Scorecard, which has no similar trend, but rather a greater spread of results.

In order to validate the exploratory statistical examination above, Welch’s t-test was performed to examine whether the mean value of each dataset was similar. When the datasets were not assumed to have identical variances, this t-test produced a statistic of $-6.60 \times 10^{0}$ and a p-value of $7.48 \times 10^{0}$, demonstrating that the datasets have statistically differing mean values. To further examine the differences between the two datasets (which would be near identical if the two tools evaluated software projects equally), the distribution of each dataset was examined as in Table 2.
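As a rough cross-check of the Welch statistic, the test can be recomputed from the rounded summary statistics in Table 1. The sketch below is illustrative only and uses the standard library; the function name `welch_t` is hypothetical, and because the inputs are rounded the result is only expected to approximate the value computed on the raw scores.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_term_a = sd_a ** 2 / n_a
    var_term_b = sd_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(var_term_a + var_term_b)
    dof = (var_term_a + var_term_b) ** 2 / (
        var_term_a ** 2 / (n_a - 1) + var_term_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Rounded summary statistics from Table 1: Scorecard first, Hipcheck second.
t_stat, dof = welch_t(46.6, 16.7, 466, 52.8, 11.7, 466)
```

With these rounded inputs the statistic comes out at roughly −6.56, close to the reported −6.60, a gap that the rounding in Table 1 could plausibly account for. On the raw score lists, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.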

The results in Table 2 further demonstrate that when evaluating the same software projects, Scorecard and Hipcheck evaluate these projects in different ways and produce notably different scores for them. A Bland–Altman analysis is used to determine whether two different tests concur in their results [33]. This process examines the differences between test scores as well as the means of the scores, and graphically contrasts these two values for interpretation [34]. Applying this process to evaluate agreement in methods as used by Bland and Altman [35] considers Hipcheck and Scorecard to be independent tests of a particular feature (in this case, the level of security risk present in a software project). This plot also contains a measurement of systematic bias (calculated as the mean of all the differences between measurements), as well as Limit of Agreement (LoA) values, which are typically calculated as in Equation (3):

\mathrm{LoA} = \mathrm{bias} \pm 1.96 \times \mathrm{SD} \qquad (3)

The value $\mathrm{SD}$ in Equation (3) refers to the standard deviation of the differences. Using the standard deviation of the differences assumes that the differences are normally distributed. Conducting a Shapiro–Wilk test on the differences of these tests, with the null hypothesis that the data are normally distributed, produces a p-value of approximately 0.0115. As this value is less than a threshold of 0.05, the null hypothesis is rejected and these data cannot be assumed to be normally distributed. Instead of using LoAs, a confidence interval (CI) of 95% will be substituted into the methodology to demonstrate a similar threshold. This methodology produces the plot in Figure 5.
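The quantities behind a Bland–Altman plot can be sketched as below. This is a minimal illustration of the standard parametric form in Equation (3), not the substituted 95% CI used for Figure 5, whose exact construction is not specified here. The function name and sample scores are hypothetical, and differences are taken as Scorecard minus Hipcheck on the assumption that this matches the sign of the reported bias.

```python
import statistics


def bland_altman(scores_a, scores_b):
    """Return per-project means and differences, the bias, and the limits of agreement."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(differences)  # systematic bias: mean of the differences
    sd = statistics.stdev(differences)   # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)  # Equation (3)
    return means, differences, bias, limits


# Hypothetical normalised scores for three projects (Scorecard, then Hipcheck).
means, differences, bias, limits = bland_altman([40, 50, 60], [50, 55, 70])
```

Plotting `differences` against `means`, with horizontal lines at `bias` and at each limit, gives the usual Bland–Altman layout.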

Figure 5 demonstrates that the systematic bias is slight: a value of approximately −6.60 indicates that Hipcheck tends to award a normalised score 6.60 points higher than Scorecard. However, the 95% CI spans a wide range of values considering that the plot represents the difference between two numbers bound by [0, 100], indicating that the differences between the two datasets are substantial.

The differences between the results of the two tests do not exhibit an identifiable overall trend, suggesting that while systematic bias is relatively low in this data, random differences are substantial between the two tests. While outliers do fall outside the 95% confidence interval on the plot, this is to be expected due to using a CI rather than LoAs and therefore will not be examined further.

Overall, the differences between the two datasets vary too greatly and randomly to be accepted. The lack of trend in the difference also means there is no proposed method to mitigate any trend or systematic bias, and that the two tests (in this case, the two tools) do not have any identifiable relation in their results.

4.4.2. Registry-Specific Datasets

Having demonstrated that the two different tools evaluate the risk of software development projects differently, projects will be examined by the software registry (one registry per programming language) to determine whether any trends are identifiable between programming languages. The results per-registry are portrayed in Figure 6.

While critiquing the inconsistent differences between the results per tool per registry, it would be unfair not to acknowledge that different registries (by virtue of being specific to a particular programming language) may encourage different development practices, which are considered more applicable to that programming language. This has not yet been investigated; however, each tool examines the same set of factors regardless of the programming language being examined. This may be an explanation for at least part of the reason why a given tool yields noticeably different results in datasets for each registry. If this is the case, it presents an opportunity for future work to examine the practices of each language and identify additional practices that the community around each language may benefit from implementing or discarding.

Previously, while discussing Figure 2, it was noted that Hipcheck results tended to be tightly grouped around a specific point, rather than spread across the range of possible results (in contrast to Scorecard results, which had a greater spread relative to Hipcheck). The results for the proxy.golang.org registry in Figure 6 contrast with this trend, where Hipcheck results are more spread across the possible range, and Scorecard results are tightly grouped around the value of 26 with several outliers. This will be examined in further detail in later sections, but is noted here as an outlier.

5. Analysis and Recommendations

While both Hipcheck and Scorecard claim to evaluate software development security practices, they each implement this in different ways. Each tool has a set of features it examines in a codebase, which contribute to the final score given by the tool. While some checks are for similar topics, Hipcheck conducts nine checks and weights each numerical score produced by these checks equally. In contrast, Scorecard conducts 19 checks and weights each check according to one of four possible weightings determined by its configured importance. This is not to say that one tool is a superset of the other, as there are certainly features that are examined by each tool and not the other. The difference in the number of checks is believed to (very reasonably) stem from one of Hipcheck’s primary goals being to be fast [36]. Hipcheck could choose to implement the additional checks that Scorecard chooses; however, this would likely come at the cost of runtime speed and require compromising on one of Hipcheck’s three core values. This focus on runtime speed may form the beginning of an explanation for the groupings of results, which were previously noted in Figure 3; that specificity is sacrificed for speed.

5.1. Hipcheck

Previously, it was identified that Hipcheck tended to cluster its results around specific values (refer to the discussion of Figure 4). Frequencies of Hipcheck results are plotted in Figure 7, showing that only 25 different scores were awarded by Hipcheck. The frequency of these scores follows a logarithmic trend, with a small number of scores being awarded disproportionately frequently.

It is hypothesised that this occurs due to the way Hipcheck calculates its risk score for each project. According to Hipcheck documentation, each analysis is scored as either 0 or 1 (does not or does present risk), and these scores are then aggregated according to analysis weightings to produce a final risk score [22]. Taking the example from the Hipcheck documentation, we can calculate the number of potential total scores (represented as $|S|$, the cardinality of the set $S$ of possible total scores), since the ‘SCORE’ value of each analysis is either 0 or 1 and is multiplied by the weight of that analysis. Grouping the analyses by those with identical weightings, we form three sets of analyses: X (weight: 0.1), Y (weight: 0.25) and Z (weight: 0.1665). We may then define sets $R_{X}$, $R_{Y}$ and $R_{Z}$, which are the possible total scores of sets X, Y and Z respectively. Given that the total score is the sum of each of these analyses, the number of possible total scores may be denoted as

|S| = |R_{X}| \times |R_{Y}| \times |R_{Z}| \qquad (4)

By utilising both the commutative and additive identity properties of addition, we may calculate the cardinality of each set $R_{X}$, $R_{Y}$ and $R_{Z}$ as a function of the cardinality of each of X, Y and Z, respectively. For an arbitrary set $R_{I}$ and a corresponding set of analyses I, cardinality is represented as the following, where the constant 1 accounts for the case where no elements of I have a score of 1:

|R_{I}| = |I| + 1 \qquad (5)

This then becomes a simple substitution for $|I|$, given that $|X|$, $|Y|$ and $|Z|$ are clearly defined in this example as 5, 1 and 3 respectively. By substituting these values into Equation (5), and then Equation (5) into Equation (4) for each of X, Y and Z in place of I, the resulting formula is

|S| = (5 + 1) \times (1 + 1) \times (3 + 1) \qquad (6)

This results in 48 possible total scores from Hipcheck, as per Equation (6). Given that only 25 of the possible 48 scores were observed in testing, it is hypothesised that certain combinations of analyses passing or failing are more common than others, indicating that specific individual analyses are more likely to pass or fail. This could potentially be due to either the tests being overly prone to Type I or Type II errors, or to the fact that current software development practices tend to focus on or ignore certain aspects. Identifying the root cause of this within Hipcheck is considered outside the scope of this work.
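The count of 48 can be checked by brute force. The sketch below enumerates every pass/fail combination for the nine analyses in the worked example and counts the distinct weighted totals; it uses exact fractions so that floating-point rounding cannot merge or split totals. The variable names are illustrative, and the weights are taken from the example above rather than from any particular Hipcheck configuration.

```python
from fractions import Fraction
from itertools import product

# Weight groups from the worked example: (weight, number of analyses).
GROUPS = [(Fraction("0.1"), 5), (Fraction("0.25"), 1), (Fraction("0.1665"), 3)]

# Expand the groups into one weight per analysis (9 analyses in total).
weights = [weight for weight, count in GROUPS for _ in range(count)]

# Enumerate every 0/1 outcome combination and collect the distinct totals.
totals = {
    sum(w * s for w, s in zip(weights, outcome))
    for outcome in product((0, 1), repeat=len(weights))
}

# Product form from Equations (4) and (5).
predicted = 1
for _, count in GROUPS:
    predicted *= count + 1
```

In general the product in Equation (4) is an upper bound, since totals from different weight groups could coincide; for these particular weights the enumeration and the product form agree, so no such collisions occur.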

5.2. Scorecard

With Scorecard’s increased number of checks and the chosen method of implementation of these comes the ability to produce a greater variety of final numerical scores. Each Scorecard test may return a score representing a pass or fail (MaxScoreResult or MinScoreResult, respectively), similar to Hipcheck, but may also return a score that is numerically between these (ProportionalScoreResult), with a value calculated from multiple smaller components or several predefined score values (such as HalfResultScore) [37]. This difference in implementation means that Scorecard’s final scores are not bound to a finite set of possible values but rather may be far more dynamic in their allocation of scores.

Examining the implementation of the webhooks evaluation in Scorecard provides an example of the variety of scoring available in Scorecard modules. Depending on how the software evaluates, this module may return the minimum score (MinScoreResult), maximum score (MaxScoreResult), or instead a ProportionalScoreResult representing a certain proportion of webhooks in the repository that pass the given test [38]. This variation in the possible return value significantly increases the range of possible final scores, as both the numerator and denominator in a proportional score result depend on the contents of the repository, rather than a binary pass/fail result. For this reason, the theoretical number of possible results for Scorecard analysis has not been calculated.

When examining Scorecard evaluation methods, it is essential to note that this tool attempts to quantify largely qualitative concepts, such as the variation in developer affiliation. This evaluation scores a project based on whether the developers contributing within a recent time period are affiliated with different organisations on GitHub, with each of these contributors authoring at least five commits out of the last 30 [39]. It is assumed that this check intends to identify projects whose contributing developers all come from a small number of organisations, and who may therefore fall victim to similar biases or errors. However, the check seems to implement arbitrary (or at least poorly explained) numerical thresholds, and assumes that all developers are affiliated with a single organisation at a time and that this is meaningfully reflected in GitHub profiles.

5.3. Evaluating at Scale

The previous examination of specific methodology employed by Scorecard to quantify largely qualitative concepts provides an example of what these tools both attempt to do at their core: they codify expertise. By applying these quantitative methods through code, expertise can be leveraged at a much greater scale than would be possible through manual review by experts. This enables many who previously would not have been able to perform similar reviews on codebases to do so. As with any increase in scale, however, this also carries the inability to address nuance effectively. In applying this approach, these tools both apply the same standards across all projects regardless of programming language, framework, or intended application.

As shown in Figure 6, each tool produces different trends across different programming languages, very possibly because of the differences in usage, structure or convention in each language. While it may be beneficial to hold each language to the same standard, there may be a potential advantage in creating more specific variants for each language to target features or checks that are more relevant to each language or ecosystem.

While there are aspects of cybersecurity that are easier to measure quantitatively, there are questions in cybersecurity that are better evaluated qualitatively [40]. Both Hipcheck and Scorecard attempt to translate qualitative features into quantitative outputs, such as Hipcheck’s “affiliation” plugin and Scorecard’s “contributors” check. Each of these checks evaluates some form of trust in the project’s developers, an inherently qualitative element, and produces a numerical result to fit into the larger evaluation framework of the respective tool. To function effectively (i.e., as a largely automated evaluation of software development security), these tools require that the checks produce a numerical result, even when the check attempts to evaluate a qualitative feature. Converting a qualitative result into a quantitative one entails a loss of detail, and risks mistranslating the result (as the translation is automated), in exchange for the advantages of a quantitative automated approach. In noting the risk introduced through this conversion, it is also important to observe that both Hipcheck and Scorecard acknowledge in their documentation that their results are not definitive evaluations but rather serve as a starting point for a human review of the software in question. The risks introduced by converting from qualitative to quantitative measures are believed to be wholly reasonable, and the resulting measures fulfil the purpose for which they are intended, so long as these risks are understood by users of the tools.

In introducing the DevSecTrust framework, it is acknowledged that this framework will inevitably include similar conversions from qualitative to quantitative results, in order to allow the framework to be implemented in automatable and repeatable ways in existing development processes. While this does introduce the previously identified risk of losing detail or quality in this conversion, it is considered that these risks are proportionate to the overall benefit that DevSecTrust offers in terms of automatable security review of in-development and established codebases. In much the same way that Hipcheck and Scorecard accept this risk in reasonable ways to still achieve their respective purposes, it is considered that accepting this risk in the DevSecTrust framework will allow for more meaningful and accessible security reviews at scale, especially if this risk is documented and made plainly clear in implementations of this framework.

5.4. Implications of Identified Inconsistency

While each of these tools claims to measure the software development security risk of projects, the differing datasets produced by each tool indicate that these tools unintentionally measure different constructs or features of software projects, rather than a common state.

The inconsistency between available tools for evaluating the security risk associated with software development projects raises a problem for those who would use these software projects, whether developers using software libraries in their own software projects, end-users of software, or software developers who aim to improve their own secure software development practices in order to offer greater confidence to their users. The security score of a project is determined not only by the practices of the software project, but also by the choice of tool used to evaluate this risk. Confidence in the security risk of a project (or lack thereof) is undermined by the knowledge that a better score may be an artefact of the choice of evaluation tool rather than the project itself, or could potentially be gamed by choosing an evaluation tool which reports a better score for the same project.

This identified inconsistency reduces the confidence of decision-makers in the risk they are accepting when using externally-produced software, and introduces uncertainty about whether reported security scores have been gamed. This reinforces the requirement for a framework such as DevSecTrust, which introduces gaming-resistance, score robustness and outcome-based validation to software development security risk evaluations in order to ensure results are standardised and comparable.

6. Limitations

In proposing the DevSecTrust framework it is acknowledged that there are limitations in this currently theoretical framework which will require further work to mitigate. This paper proposes the conceptual framework but does not offer a prototype implementation, which is acknowledged as an opportunity for significant future work. This future work will likely benefit from beginning with mathematical definitions of the schema and/or clearly defined interfaces between stages, as this may offer a natural starting point; however, this is considered beyond the scope of this proposal. This implementation will require further work after development in order to validate that the intended properties of this framework (such as robustness and resistance to gaming of outcomes) are maintained through implementation, and that the required inputs and outputs of each stage are reviewed for validity. Developing this framework at scale will require ongoing work to achieve and to validate that detail has not been sacrificed for general applicability.

It is also noted that the validation work in this paper examines inconsistencies between only a small number of existing software development risk scoring tools, and would benefit from being extended to more tools (including enterprise or commercially available tools) so that the findings generalise more widely. It is also acknowledged that the validation work completed here applied these tools to open-source software and did not examine closed-source software projects, which may have different development practices worth considering.

This paper has not investigated correlations between historical software development security risk scores and known security incidents via datasets such as the CVE database or historical cybersecurity incidents. This is instead acknowledged as an opportunity for future work, noting that a lack of CVEs or known cybersecurity incidents may not correlate with a state of security, but rather may simply mean that no vulnerability or incident has yet been identified (as vulnerabilities may exist in software for extended time periods between introduction and discovery).

As noted, AI-based systems are still nascent. AI is making significant changes to the open-source community and the ongoing ramifications of this are not yet known. There are significant research opportunities in understanding these impacts from community, cybersecurity and code quality perspectives. As these changes are ongoing and believed to be significant in the open-source ecosystem, AI-based tools have been excluded from this work; however, we acknowledge that examining AI-led changes and their effects in this space would be of value to academia and practitioners alike.

7. Conclusions

This paper has introduced the DevSecTrust framework, which provides a structured approach to addressing a fundamental challenge identified in this work: the lack of a consistent and interpretable basis for measuring software development security. Through the application of DevSecTrust, this paper has demonstrated that existing development security evaluation tools, specifically MITRE’s Hipcheck and the OpenSSF Scorecard, often produce substantially different results when analysing the same set of 466 widely used open-source projects. These differences are not merely statistical variations but reflect deeper inconsistencies in how tools conceptualise and operationalise “secure software development,” highlighting that current metric-based security scores are neither directly comparable nor interchangeable.

The findings reveal that without a standardised validation model, organisations risk relying on metrics that provide unclear or potentially misleading guidance. DevSecTrust addresses this gap by introducing a unified framework that maps metrics to recognised security practices, evaluates their alignment with real-world security outcomes, and assesses their robustness and resistance to manipulation, thereby improving interpretability and trust in automated security assessments. By enabling evidence-based comparison and validation of development security metrics, the framework supports more informed decision-making for developers, maintainers, and security practitioners. Future work will extend DevSecTrust to incorporate additional evaluation tools, broaden coverage across software ecosystems, and perform longitudinal validation using extended vulnerability and maintenance datasets to enhance measurement of software development security.

Author Contributions

Conceptualization, L.J., B.T. and N.M.; methodology, L.J. and B.T.; software, L.J.; validation, B.T., N.M. and L.J.; investigation, L.J.; data curation, L.J.; writing—original draft preparation, L.J.; writing—review and editing, B.T. and N.M.; visualization, L.J. and N.M.; supervision, B.T. and N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available on GitHub at https://github.com/LachJones/DevSecTrust_ValidationData, published on 5 March 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chidukwani, A.; Zander, S.; Koutsakis, P. A Survey on the Cyber Security of Small-to-Medium Businesses: Challenges, Research Focus and Recommendations. IEEE Access 2022, 10, 85701–85719. [Google Scholar] [CrossRef]
Thomas, T.W.; Tabassum, M.; Chu, B.; Lipford, H. Security During Application Development: An Application Security Expert Perspective. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018. [Google Scholar] [CrossRef]
Khalid, A.; Raza, M.; Afsar, P.; Khan, R.A.; Mohm, M.I.; Rahman, H.U. A SWOT Analysis of Software Development Life Cycle Security Metrics. J. Softw. Evol. Process. 2025, 37, e2744. [Google Scholar] [CrossRef]
Hauser, J.; Katz, G. Metrics: You are what you measure! Eur. Manag. J. 1998, 16, 517–528. [Google Scholar] [CrossRef]
Croft, R.; Babar, M.A.; Li, L. An Investigation into Inconsistency of Software Vulnerability Severity across Data Sources. arXiv 2021. [Google Scholar] [CrossRef]
Zahan, N.; Shohan, S.; Harris, D.; Williams, L. Do Software Security Practices Yield Fewer Vulnerabilities? In Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Melbourne, Australia, 14–20 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 292–303. [Google Scholar] [CrossRef]
Angermeir, F.; Voggenreiter, M.; Moyon, F.; Mendez, D. Enterprise-Driven Open Source Software: A Case Study on Security Automation. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Madrid, Spain, 25–28 May 2021. [Google Scholar] [CrossRef]
Nutalapati, V. Automated Security Testing for Mobile Apps: Tools, Techniques, and Best Practices. Int. Res. J. Eng. Appl. Sci. (IRJEAS) 2023, 11, 26–31. [Google Scholar]
Souppaya, M.; Scarfone, K.; Dodson, D. Secure Software Development Framework (SSDF) Version 1.1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2022. [CrossRef]
Mitre Corporation. Hipcheck. Available online: https://hipcheck.mitre.org/ (accessed on 3 March 2025).
Open Source Software Foundation. OpenSSF Scorecard. 2020. Available online: https://github.com/ossf/scorecard/blob/main/README.md (accessed on 3 March 2025).
Khan, R.A.; Khan, S.U.; Khan, H.U.; Ilyas, M. Systematic Literature Review on Security Risks and its Practices in Secure Software Development. IEEE Access 2022, 10, 5456–5481. [Google Scholar] [CrossRef]
Edward, E.; Nyamawe, A.S.; Elisa, N. A Survey on Secure Refactoring. SN Comput. Sci. 2024, 5, 952. [Google Scholar] [CrossRef]
Otieno, M.; Odera, D.; Ounza, J.E. Theory and practice in secure software development lifecycle: A comprehensive survey. World J. Adv. Res. Rev. 2023, 18, 53–78. [Google Scholar] [CrossRef]
Ali, M.; Ullah, A.; Islam, M.R.; Hossain, R. Assessing of software security reliability: Dimensional security assurance techniques. Comput. Secur. 2025, 150, 104230. [Google Scholar] [CrossRef]
Mead, N.R.; Woody, C.; Hissam, S. Open Source Software: The Ultimate in Reuse or a Risk Not Worth Taking? Computer 2025, 58, 78–83. [Google Scholar] [CrossRef]
What Is the Log4j Vulnerability? Available online: https://www.ibm.com/think/topics/log4j (accessed on 7 April 2025).
Maeprasart, V.; Ouni, A.; Kula, R.G. Drop it All or Pick it Up? How Developers Responded to the Log4JShell Vulnerability. In Proceedings of the 2024 IEEE/ACIS 22nd International Conference on Software Engineering Research, Management and Applications (SERA), Honolulu, HI, USA, 30 May–1 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 249–254. [Google Scholar] [CrossRef]
Cruz, D.B.; Almeida, J.R.; Oliveira, J.L. Open Source Solutions for Vulnerability Assessment: A Comparative Analysis. IEEE Access 2023, 11, 100234–100255. [Google Scholar] [CrossRef]
Mitre Corporation. Why Hipcheck? 2024. Available online: https://hipcheck.mitre.org/docs/getting-started/why/ (accessed on 3 March 2025).
Mead, N.; Woody, C.; Hissam, S. Open Source Software (OSS) Transparency for DoD Acquisition. arXiv 2024, arXiv:2404.16737. [Google Scholar] [CrossRef]
Mitre Corporation Hipcheck Scoring. 2024. Available online: https://hipcheck.mitre.org/docs/guide/concepts/scoring/ (accessed on 10 February 2025).
Chalyi, O.; Driaunys, K.; Rudžionis, V. Assessing Browser Security: A Detailed Study Based on CVE Metrics. Future Internet 2025, 17, 104. [Google Scholar] [CrossRef]
Mellado, D.; Fernández-Medina, E.; Piattini, M. A comparison of software design security metrics. In Proceedings of the Fourth European Conference on Software Architecture: Companion Volume, Copenhagen Denmark, 23–26 August 2026; ACM: New York, NY, USA, 2010; pp. 236–242. [Google Scholar] [CrossRef]
Wang, J.A.; Wang, H.; Guo, M.; Xia, M. Security metrics for software systems. In Proceedings of the 47th annual ACM Southeast Conference, Clemson, SC, USA, 19–21 March 2009; ACM: New York, NY, USA, 2009; pp. 1–6. [Google Scholar] [CrossRef]
Kudriavtseva, A.; Gadyatskaya, O. You cannot improve what you do not measure: A triangulation study of software security metrics. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Avila, Spain, 8–12 April 2024; ACM: New York, NY, USA, 2024; pp. 1223–1232. [Google Scholar] [CrossRef]
Siavvas, M.; Kehagias, D.; Tzovaras, D.; Gelenbe, E. A hierarchical model for quantifying software security based on static analysis alerts and software metrics. Softw. Qual. J. 2021, 29, 431–507. [Google Scholar] [CrossRef]
Saeed, H.; Shafi, I.; Ahmad, J.; Khan, A.A.; Khurshaid, T.; Ashraf, I. Review of Techniques for Integrating Security in Software Development Lifecycle. Comput. Mater. Contin. 2025, 82, 139–172. [Google Scholar] [CrossRef]
Ala, A.A.; Salim, A.A.; Faitouri, A.A. Security Metrics for Assessing Security Risks of Software in Agile Development Methods. Bani Waleed Univ. J. Humanit. Appl. Sci. 2025, 10, 213–224. [Google Scholar] [CrossRef]
Kommrusch, S. Artificial Intelligence Techniques for Security Vulnerability Prevention. arXiv 2019. [Google Scholar] [CrossRef]
Rajapaksha, S.; Senanayake, J.; Kalutarage, H.; Al-Kadri, M.O. Enhancing Security Assurance in Software Development: AI-Based Vulnerable Code Detection with Static Analysis; Springer Nature: Cham, Switzerland, 2024; pp. 341–356. [Google Scholar] [CrossRef]
Open Source Collective. Packages. 2024. Available online: https://packages.ecosyste.ms/ (accessed on 10 December 2024).
Gerke, O. Reporting Standards for a Bland–Altman Agreement Analysis: A Review of Methodological Reviews. Diagnostics 2020, 10, 334.
Bland, J.M.; Altman, D.G. Statistical Methods for Assessing Agreement Between Two Methods of Clinical Measurement. Lancet 1986, 327, 307–310.
Bland, J.M.; Altman, D.G. Measuring agreement in method comparison studies. Stat. Methods Med. Res. 1999, 8, 135–160.
MITRE. Hipcheck/README.md at Main-Mitre/Hipcheck. 2023. Available online: https://github.com/mitre/hipcheck/blob/main/README.md (accessed on 17 February 2025).
Open Source Security Foundation. Requirements for a Check. 2021. Available online: https://github.com/ossf/scorecard/blob/main/checks/write.md (accessed on 17 February 2025).
Open Source Security Foundation. Webhooks.go. 2022. Available online: https://github.com/ossf/scorecard/blob/main/checks/evaluation/webhooks.go (accessed on 17 February 2025).
Open Source Security Foundation. Checks.md. 2021. Available online: https://github.com/ossf/scorecard/blob/main/docs/checks.md (accessed on 17 February 2025).
Fujs, D.; Mihelič, A.; Vrhovec, S.L.R. The power of interpretation: Qualitative methods in cybersecurity research. In Proceedings of the 14th International Conference on Availability, Reliability and Security, Canterbury, UK, 26–29 August 2019.

Figure 1. Introducing the DevSecTrust framework.

Figure 2. Boxplot of normalised aggregate score datasets.

Figure 3. Histogram comparing Scorecard and Hipcheck scores.

Figure 4. Comparing distribution of Hipcheck and Scorecard score datasets.

Figure 5. Bland–Altman plot of the difference between Scorecard and Hipcheck results.

Figure 6. Boxplots of per-registry normalised score datasets for Scorecard and Hipcheck results.

Figure 7. Frequency plot of scores from Hipcheck.

Table 1. Aggregate datasets for Scorecard and Hipcheck results.

Detail	Scorecard Results	Hipcheck Results
n	466	466
Mean	46.6	52.8
Std. Deviation	16.7	11.7
Minimum Value	9.0	11.0
5th percentile	22.0	33.0
25th percentile	34.0	46.0
Median	46.0	53.0
75th percentile	58.0	63.0
95th percentile	75.0	67.0
Maximum Value	89.0	83.0

Table 2. Comparing skewness of Hipcheck and Scorecard aggregate datasets.

	Scorecard Results Dataset	Hipcheck Results Dataset
Normal Distribution?	No	No
Skews	Right	Left
Skew value	0.258	−0.791
Skew description	Almost symmetrical	Moderately skewed
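
The direction and description rows of Table 2 can be read as a simple classification of the skew value. The sketch below is illustrative only: the 0.5 and 1.0 cut-offs are a common rule of thumb and are an assumption here, not thresholds confirmed by the paper, and `describe_skew` is a hypothetical helper name. The inputs are the Table 2 skew values rounded to three decimals.

```python
def describe_skew(skew: float) -> tuple[str, str]:
    """Return (direction, description) for a sample skewness value.

    The 0.5 / 1.0 thresholds are an assumed rule of thumb, not taken
    from the paper.
    """
    magnitude = abs(skew)
    if magnitude < 0.5:
        description = "Almost symmetrical"
    elif magnitude < 1.0:
        description = "Moderately skewed"
    else:
        description = "Highly skewed"

    # Positive skew means a longer right tail; negative means a longer left tail.
    if skew > 0:
        direction = "Right"
    elif skew < 0:
        direction = "Left"
    else:
        direction = "None"
    return direction, description


# Skew values from Table 2, rounded to three decimals.
reported = {"Scorecard": 0.258, "Hipcheck": -0.791}

for tool, skew in reported.items():
    direction, description = describe_skew(skew)
    print(f"{tool}: skews {direction}, {description}")
```

Under these assumed thresholds, the two tools fall into the same categories that Table 2 reports.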
