How Top Open Source Projects Protect Their Code: Insights and Best Practices

Chris Abraham
Head of Data Science
February 7, 2022

TL;DR

We conducted a systematic analysis of how the top 250 starred open-source projects on GitHub protect their source code. We defined a few metrics that allowed us to benchmark the effect that turning on security capabilities has on Pull Requests, repo interactions and quality outcomes. Based on the analysis, we note the positive impact that Branch Protection and CODEOWNERS have on Pull Request review quality.

Introduction: The value and vulnerability of open source code

The proliferation of software supply chain attacks highlights the need for better security of source code, CI/CD pipelines and the entire DevOps toolchain.

To better understand which DevOps hardening aspect will be most impactful, we researched the common practices of the top 250 starred projects on GitHub.

Examining risk data for top open source projects on GitHub

The data for this analysis was obtained through queries executed against open-source repositories on GitHub. The dataset contains the following collected metadata:

  • Whether Branch Protection policies are enabled.
  • Which of the protected branches use CODEOWNERS.
  • All Pull Requests (PRs) and the top 50 reviews of each in the last year.
  • Count of collaborators.
  • All languages that make up more than 20% of each repository.

Branch Protection: A security setting that can be set up on all or specific branches and that enforces DevOps practices such as Pull Request reviews, status checks, etc.
CODEOWNERS: A way to specify the individuals or teams responsible for reviewing code changes in a repository. It is a sub-setting of Branch Protection policies and is defined in a file in the repository.

How do the teams behind these repos manage PRs? Over the period of a year, 42% of the repos had between 0 and 50 PRs. The second and third intervals, 50-200 and 200-800 PRs, account for an additional 19% and 17% of the repos, respectively.

Figure 1. Distribution of Pull Requests within the top 250 starred GitHub repos

Given this data, we posed a few key questions that will shed light on what development teams are facing. These questions explore the interplay between quality and security controls. The following dimensions were considered:

  1. Branch Protection policy enablement
  2. Pull Requests (PR) handling
  3. CODEOWNERS enablement
  4. Repo contributors count

Let’s run through the combinations of these dimensions.

How do organizations use Branch Protection policies in relation to the number of contributors in their repos?

While Branch Protection usage is roughly evenly split among the top 250 repos, those that use Branch Protection have an average contributor base nearly double that of those that don’t.

Branch Protection Enabled | % of Repos | Average # Contributors
TRUE                      | 52%        | 337
FALSE                     | 48%        | 184

Table 1. Relationship between Branch Protection setting and Contributors in a Repo

Based on this data, we can infer that Branch Protection policies are used to deal with scale.

How do organizations use CODEOWNERS when Branch Protection is enabled?

The use of CODEOWNERS settings (a sub-option of Branch Protection) in the top 250 repos doesn’t seem to have much traction at the moment.

Figure 2. The proportion of CODEOWNERS within Branch Protection setting

The hidden points of view

To analyze the other dimensions, we introduce 3 additional measurements.

Time Between Interactions (TBI)

TBI is the time elapsed between successive PR events in the lifecycle of a single PR, such as comments, reviews, changes, etc. In the example below, the PR was created by ‘itsamyth’ at 2021-08-29 20:31:03.

Figure 3. An example of PR process for the Express JS repo

The first CHANGES_REQUESTED event was entered by ‘dougwilson’ at 2021-08-29 21:01:48. Therefore, the TBI is 30.75 minutes. The next COMMENTED event from ‘itsamyth’ incurs a TBI of 2,623 minutes (the time difference between 2021-08-31 16:44:30 and 2021-08-29 21:01:48).

Since the TBI is assumed to be proportional to the number of changed lines of code, we normalize TBI by lines of code.
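The TBI arithmetic in the Express JS example can be reproduced with a minimal Python sketch. The function names and the timestamp format string are our own assumptions for illustration, and normalization is assumed to be a plain division by the number of changed lines; the article does not publish the exact implementation.

```python
from datetime import datetime

# Timestamp format used in the example above (an assumption about the raw data).
FMT = "%Y-%m-%d %H:%M:%S"

def tbi_minutes(event_times):
    """Return the minutes elapsed between each pair of successive PR events."""
    stamps = [datetime.strptime(t, FMT) for t in event_times]
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]

def normalized_tbi(tbi, lines_changed):
    """Divide a TBI value by the number of changed lines (assumed to be > 0)."""
    return tbi / lines_changed

events = [
    "2021-08-29 20:31:03",  # PR created by 'itsamyth'
    "2021-08-29 21:01:48",  # CHANGES_REQUESTED by 'dougwilson'
    "2021-08-31 16:44:30",  # COMMENTED by 'itsamyth'
]
gaps = tbi_minutes(events)
print([round(g, 2) for g in gaps])  # [30.75, 2622.7]
```

The second gap of 2622.7 minutes rounds to the 2,623 minutes quoted in the example.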

Pull Request Interactions Score (PRIS)

This metric captures the qualitative aspects of PR comments, reviews and changes. PRIS is a score derived from user association (member, contributor, etc.) and specific review actions (approved, commented, changes requested, etc.). The score is additive across all the events of a PR.

The following pseudo code is used to compute the initial basis for PRIS:


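comment_score_unnormalized = 0  # running total for this PR

for review in pr:
    # A review that requests changes adds 2 points.
    if review.state == "CHANGES_REQUESTED":
        comment_score_unnormalized += 2
    # Each comment adds points according to the commenter's association.
    for comment in review:
        if comment.authorAssociation == "MEMBER":
            comment_score_unnormalized += 2
        elif comment.authorAssociation == "CONTRIBUTOR":
            comment_score_unnormalized += 1.5
        elif comment.authorAssociation == "NONE":
            comment_score_unnormalized += 1
        elif comment.authorAssociation == "AUTHOR":
            comment_score_unnormalized += 0.5

# Normalize by lines changed and by the number of reviews in the PR.
comment_score = (comment_score_unnormalized / lines_changed) / number_of_reviews_in_pr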

We have also normalized the score by lines of code added/deleted and by the number of reviews in the PR. For the sake of discussion, we loosely consider a higher PRIS as an indicator of a higher-quality PR review process, although, due to Goodhart’s law, we don’t imply that it should be used as a definitive PR quality metric.

Pull Request Review Quality Score

To quantify the quality of PR reviews, it is important to place PRIS in context. Each contributor in each repository carries a different load. We compute a Mean Reviewer Load Factor that captures how “busy” the reviewers in a repo are, based on the actual lines of code (both additions and deletions) they reviewed.

The Pull Request Review Quality Score is calculated by multiplying the Mean Reviewer Load Factor by the Mean PRIS (based on the role of each reviewer), which avoids situations where comments from the PR author or an unknown user unduly impact the score. The score thus reflects a collaborative effort.

Based on the formula “Pull Request Review Quality Score” = (“Mean Reviewer Load Factor” * “Mean PRIS”), a higher score indicates better quality.
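As a quick sanity check of the formula, the sketch below recomputes the score for three rows copied from Table 6 later in this article. The function names are ours, and the small gaps against the reported scores are presumably due to the published PRIS values being rounded to two decimals.

```python
def review_quality_score(mean_reviewer_load_factor, mean_pris):
    """PR Review Quality Score = Mean Reviewer Load Factor * Mean PRIS."""
    return mean_reviewer_load_factor * mean_pris

# (Mean Reviewer Load Factor, Mean PRIS, reported score), copied from Table 6.
reported_rows = [
    (893, 1.20, 1071),
    (76415, 0.62, 47378),
    (27811, 0.42, 11681),
]

def relative_gap(load, pris, reported):
    """Relative difference between the recomputed and the reported score."""
    return abs(review_quality_score(load, pris) - reported) / reported

for load, pris, reported in reported_rows:
    recomputed = review_quality_score(load, pris)
    gap = relative_gap(load, pris, reported)
    print(f"{recomputed:,.0f} recomputed vs {reported:,} reported (gap {gap:.2%})")
```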

Now that we have defined these metrics, we present a first cut exploratory data analysis below. For the analysis, we split PR volumes into bins across all the 250 repos.
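The binning step can be sketched in Python as follows. The bin edges are taken from the tables in this article; the right-closed treatment of each interval, the placement of a repo with zero PRs in the first bin, and the helper name `pr_bin` are assumptions made for illustration, and the PR counts are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Bin edges as they appear in the PR # Intervals column of the tables.
EDGES = [0, 50, 200, 800, 2000, 12000]

def pr_bin(pr_count):
    """Return the (low, high) interval for a yearly PR count, or None if out of range.

    Intervals are treated as right-closed, e.g. 50 PRs falls in (0, 50);
    a repo with 0 PRs is placed in the first bin (an assumption).
    """
    if pr_count < 0:
        return None
    idx = max(bisect_left(EDGES, pr_count) - 1, 0)
    if idx >= len(EDGES) - 1:
        return None
    return (EDGES[idx], EDGES[idx + 1])

# Hypothetical yearly PR counts for a handful of repos.
counts = [0, 12, 50, 51, 640, 1999, 2000, 11500]
print(Counter(pr_bin(c) for c in counts))
```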

PR # Intervals | Mean PR TBI | Mean normalized PR TBI | Mean # Contributors | Mean # Lines Changed | # Repos | Mean PRIS
(0, 50)        | 5260        | 1484                   | 210                 | 725                  | 104     | 1.24
(50, 200)      | 5746        | 956                    | 284                 | 2410                 | 47      | 0.79
(200, 800)     | 2642        | 291                    | 348                 | 1673                 | 42      | 0.64
(800, 2000)    | 2569        | 183                    | 389                 | 1559                 | 31      | 0.60
(2000, 12000)  | 2144        | 130                    | 383                 | 977                  | 24      | 0.57

Table 2. Table of PR handling vs Contributors

The above table captures a paradox: the more PRs in a repo, the lower the mean normalized TBI. We believe the explanation is that repos attracting greater contributor interest and a greater scope of code changes, yet processing them much more quickly, have more efficient processes. Also, as the number of PRs increases while members remain a limited resource, we observe a decrease in Mean PRIS, which suggests that the quality of the PR review process is moderately impacted.

Does enabling CODEOWNERS impact the PR Review Quality Score vs Protected Branches without CODEOWNERS?

In the previous section, we looked at the repos’ PR TBI regardless of their CODEOWNERS and/or Branch Protection settings.

The effect of enabling the Branch Protection setting is shown below:

Branch Protection Enabled | Mean normalized PR TBI | Mean Reviewer Load Factor | Mean PRIS | PR Review Quality Score
TRUE                      | 178                    | 54,746                    | 0.56      | 30,761
FALSE                     | 1,084                  | 5,141                     | 1.41      | 7,250

Table 3. PR Processing metrics with Branch Protection setting

In general, Branch Protection has a positive effect: repos with it enabled show a lower mean normalized PR TBI (178 vs 1,084) and a higher PR Review Quality Score (30,761 vs 7,250).

Next, we look at the effects of turning on the CODEOWNERS setting. We’ll use two analyses of the PRs processed by each organization: an overall approach and a binned approach. The overall results are below:

CODEOWNERS Setting | Mean normalized PR TBI | Mean Reviewer Load Factor | PRIS | PR Review Quality Score
TRUE               | 197                    | 50,222                    | 0.63 | 31,845
FALSE              | 203                    | 54,085                    | 0.57 | 31,073

Table 4. PR processing metrics with CODEOWNERS setting

We see that introducing the CODEOWNERS setting has a negative impact on mean TBI but not on PR interactions. Thus, we infer that specifying a defined pool of reviewers might play a role here in trading off mean TBI for greater code review interactions and therefore quality.

Does enabling Branch Protection impact the PR Review Quality Score vs Unprotected Branches?

First, analyzing using overall metrics with both settings available:

Branch Protection Enabled | CODEOWNERS Setting | Mean normalized PR TBI | Mean Reviewer Load Factor | PRIS | PR Review Quality Score
TRUE                      | TRUE               | 197                    | 47,025                    | 1.14 | 53,654
TRUE                      | FALSE              | 174                    | 26,304                    | 1.06 | 27,809
FALSE                     | FALSE              | 1,084                  | 1,969                     | 2.76 | 5,435

Table 5. PR metrics with CODEOWNERS and Branch Protection settings

We can infer that, in all combinations above, enabling Branch Protection policies increased the overall PR Review Quality Score.
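To put a number on that increase, the short sketch below divides each Branch Protection row of Table 5 by the unprotected baseline. The scores are copied from the table; the variable names are ours.

```python
# PR Review Quality Scores copied from Table 5, keyed by
# (Branch Protection enabled, CODEOWNERS enabled).
quality_scores = {
    (True, True): 53654,
    (True, False): 27809,
    (False, False): 5435,
}

baseline = quality_scores[(False, False)]
uplift = {
    settings: round(score / baseline, 1)
    for settings, score in quality_scores.items()
    if settings[0]  # keep only rows with Branch Protection enabled
}
print(uplift)  # {(True, True): 9.9, (True, False): 5.1}
```

In other words, the reported score is roughly five times the unprotected baseline with Branch Protection alone and roughly ten times with CODEOWNERS added.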

Second, we turn to a PR volume binned approach to the various settings – Branch Protection and CODEOWNERS.

Setting: Branch Protection – ON, CODEOWNERS – ON/OFF

PR # Intervals | CODEOWNERS Setting | Mean normalized PR TBI | Mean PRIS | Mean Reviewer Load Factor | PR Review Quality Score
(0, 50)        | TRUE               | N/A                    | N/A       | N/A                       | N/A
(0, 50)        | FALSE              | 1,777                  | 1.20      | 893                       | 1,071
(50, 200)      | TRUE               | 2,357                  | 1.92      | 818                       | 1,571
(50, 200)      | FALSE              | 1,070                  | 0.77      | 4,636                     | 3,570
(200, 800)     | TRUE               | 334                    | 0.62      | 76,415                    | 47,378
(200, 800)     | FALSE              | 168                    | 0.53      | 49,990                    | 26,495
(800, 2000)    | TRUE               | 150                    | 0.42      | 27,811                    | 11,681
(800, 2000)    | FALSE              | 182                    | 0.57      | 72,442                    | 41,292
(2000, 12000)  | TRUE               | 169                    | 0.70      | 53,829                    | 37,680
(2000, 12000)  | FALSE              | 2,111                  | 0.54      | 48,615                    | 26,252

Table 6. PR handling metrics with Branch Protection – ON, CODEOWNERS – ON/OFF

As can be seen from the above table, with both CODEOWNERS and Branch Protection turned on, the mean normalized PR TBI in the (50, 200) and (200, 800) bins is higher than for repos that have only Branch Protection turned on. Again, we reason that the CODEOWNERS setting restricts the review process to specified individuals or teams, which we hypothesize results in a higher-quality PR process and outcomes at the expense of a higher mean TBI (the same holds for normalized mean TBI).

However, the PR # Interval of (800, 2000) bears further investigation because the PR Review Quality Score (as well as PRIS) degrades when the CODEOWNERS option is turned on.

Setting: Branch Protection – OFF, CODEOWNERS - OFF

PR # Intervals | CODEOWNERS Setting | Mean normalized PR TBI | Mean PRIS | Mean Reviewer Load Factor | PR Review Quality Score
(0, 50)        | FALSE              | 1,154                  | 1.24      | 1,130                     | 1,397
(50, 200)      | FALSE              | 757                    | 0.69      | 2,659                     | 1,841
(200, 800)     | FALSE              | 837                    | 1.54      | 3,938                     | 6,051
(800, 2000)    | FALSE              | 278                    | 1.64      | 5,707                     | 9,354
(2000, 12000)  | FALSE              | N/A                    | N/A       | N/A                       | N/A

Table 7. PR handling metrics with Branch Protection - OFF

Comparing repos with Branch Protection and CODEOWNERS turned off (Table 7) against repos in the same PR volume bins with Branch Protection turned on (Table 6) shows the positive effect of Branch Protection on mean TBI and PR Review Quality Score, most clearly in the (200, 800) and (800, 2000) bins.

Conclusion: Implement effective protections on your open source dependencies

In conclusion, we investigated the Branch Protection landscape of the top 250 starred GitHub repos. The team analyzed the Branch Protection strategies, configuration options and organizational strategies currently in use in the open-source space. We were able to quantify the effect of using the Branch Protection and CODEOWNERS settings across some of the top open-source repos on GitHub. While these options represent static approaches to setting up and managing code protection, the Arnica team is looking at dynamic approaches to automation in both open and closed source projects, at the intersection of large contributor teams, DevOps governance and security. We see a tremendous opportunity in introducing tools and insights that improve the efficiency of DevOps processes while also delivering better code quality.
