Using git-upload-pack for a simpler CI integration

Arnold Noronha

Founder, Screenshotbot

One of the early decisions we made in Screenshotbot was to not have read-access to your GitHub repositories.

This turned out to be a huge advantage for us. It makes our customers much more confident using our product, and makes it easier to get through Security reviews at bigger enterprises.

But it also makes it easier for us to integrate with other Git providers such as GitLab, Bitbucket or Phabricator. If we don’t depend on custom APIs, our code just works for everyone. (We still need to access their APIs to comment on Pull Requests, but on many platforms that permission is a lot more granular.) In fact, a subset of our features will also work with a self-hosted Git repository.

But even though we don’t need read access to your repository, we do need access to the “graph” to make decisions such as “find the last commit for which we have a build available”, or “find the best merge-base”. We’ll talk about how we did this in the past, and we’ll also talk about our new feature using the git-upload-pack protocol, which might be a fun read if you’re looking to understand Git internals.

How we did this until yesterday

In order to implement this, Screenshotbot stores a “commit-graph” for every repository on our server. A commit-graph is just the commit SHAs: for each commit we store its parents. We store no other information about your Git repository.
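As a rough illustration (not Screenshotbot's actual code), a commit-graph like this can be held as a mapping from each SHA to its parent SHAs. The Python sketch below uses made-up short SHAs and a hypothetical helper, nearest_ancestor_with_build, to answer a question like “find the last commit for which we have a build available” with a breadth-first walk. The breadth-first order is an assumption; a real implementation may rank candidates differently.

```python
from collections import deque

# Hypothetical commit-graph: each commit SHA maps to the list of its parent SHAs.
graph = {
    "d4": ["c3"],
    "c3": ["b2", "x9"],  # a merge commit has two parents
    "x9": ["a1"],
    "b2": ["a1"],
    "a1": [],            # the root commit has no parents
}


def nearest_ancestor_with_build(graph, start, has_build):
    """Walk breadth-first from `start` towards the roots and return the
    first commit found in `has_build`, or None if there is none."""
    seen = {start}
    queue = deque([start])
    while queue:
        sha = queue.popleft()
        if sha in has_build:
            return sha
        for parent in graph.get(sha, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


print(nearest_ancestor_with_build(graph, "d4", {"b2", "a1"}))  # prints b2
```

Note that nothing here needs file contents, commit messages, or authors: the SHAs and parent links are enough.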

In your CI runs, we would look at the last 1000 commits using something like git log --all --pretty="%H %P" --max-count 1000. We upload them to our server (again, only SHAs), and the server does the job of merging the graphs together with what we already know. In practice, we’ll always have the full graph available to us.
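To make the %H %P format concrete, here is a small Python sketch that turns that kind of output into a graph and merges it into an existing one. The function names parse_git_log and merge_graphs, the sample SHAs, and the merge rule are all assumptions for illustration; the real server-side merge is not described in this post.

```python
def parse_git_log(output):
    """Parse `git log --pretty="%H %P"` output: each line is a commit SHA
    followed by zero or more parent SHAs, separated by spaces."""
    graph = {}
    for line in output.splitlines():
        fields = line.split()
        if fields:
            graph[fields[0]] = fields[1:]
    return graph


def merge_graphs(known, uploaded):
    """Return a new graph containing every commit from both inputs.

    Assumption: an entry the server already has wins, because a shallow
    clone may report a boundary commit as if it had no parents."""
    merged = {sha: list(parents) for sha, parents in known.items()}
    for sha, parents in uploaded.items():
        merged.setdefault(sha, list(parents))
    return merged


# Made-up short SHAs standing in for real 40-character ones.
sample_output = "c3 b2 x9\nb2 a1\na1\n"
uploaded = parse_git_log(sample_output)
print(merge_graphs({"x9": ["a1"]}, uploaded))
```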

This does cause some issues:

It’s a lot harder to support shallow clones in CI, since the local repository does not have access to all the commits to generate the graph locally.
For a large monorepo, we might be generating and merging an identical graph many times per commit. Surely we could merge graphs more efficiently.

Using git upload-pack

We recently worked with a customer that absolutely needed shallow clones. Cloning their repository without shallow clones added several minutes to their builds.

We wanted to avoid having to depend on GitHub-specific APIs to solve this. Again, that’s hard to maintain and even harder to test. It also means our users would need an additional configuration step to give us API access, which we didn’t want to require.

However, we realized this: most of our customers’ CI jobs already had SSH access to their Git repositories. Surely there must be a way to get the information we needed directly via SSH, and hopefully efficiently?

And indeed there is: enter git-upload-pack.

When you clone a repository, or pull from a repository, your local client makes an SSH connection to your remote server and runs git-upload-pack. This starts an interactive session that negotiates what information needs to be transferred.

The official protocol is an ugly and messy binary protocol, and micro-optimizes on things it really shouldn’t be micro-optimizing (IMHO). But if you disregard the exact wire format, the protocol roughly goes like this:

The server tells us all the refs (such as refs/heads/main) and the associated commit SHAs. It also tells us which features the server supports (the “filter” feature will be particularly helpful, as will “allow-reachable-sha1-in-want”).
The client tells the server which refs it wants, and approximately which objects it already has. Objects could be commits, blobs, or information about the filesystem tree. If the server supports the “filter” feature, we can also tell it to avoid sending the blobs, which significantly reduces the network traffic.
The server sends all the objects required for us to complete our graph, using a packfile format.
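For a feel of the first step on the wire, here is a minimal Python sketch of parsing a v1 ref advertisement. It relies on Git's pkt-line framing: each packet starts with four hex digits giving its total length (including those four bytes), and 0000 is a flush packet. The function names and the sample SHAs are made up for illustration, and this is not the implementation linked later in this post.

```python
def pkt_line(payload):
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(data):
    """Yield each payload in a pkt-line stream; None marks a flush packet."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:          # "0000" flush packet
            yield None
            pos += 4
        elif length < 4:
            raise ValueError("unsupported special packet")
        else:
            yield data[pos + 4:pos + length]
            pos += length


def parse_ref_advertisement(data):
    """Return ({refname: sha}, [capabilities]) from a v1 advertisement.

    The first line carries the capabilities after a NUL byte."""
    refs, capabilities = {}, []
    for payload in read_pkt_lines(data):
        if payload is None:
            break
        line = payload.rstrip(b"\n")
        if b"\0" in line:
            line, caps = line.split(b"\0", 1)
            capabilities = caps.decode().split()
        sha, name = line.decode().split(" ", 1)
        refs[name] = sha
    return refs, capabilities


# Hand-built sample with fake SHAs, standing in for a real server response.
sample = (
    pkt_line(b"a" * 40 + b" HEAD\0filter allow-reachable-sha1-in-want\n")
    + pkt_line(b"b" * 40 + b" refs/heads/main\n")
    + b"0000"
)
refs, capabilities = parse_ref_advertisement(sample)
print(refs["refs/heads/main"], "filter" in capabilities)
```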

A Packfile is just a collection of objects. Objects can be commits, “tree” information about the files and directories, or blobs. Roughly speaking, the Packfile is a collection of <type, length, zlib compressed contents>.
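As a sketch of what that looks like in bytes: a Packfile starts with a 12-byte header (the magic PACK, a version, and an object count), and each object begins with a variable-length header that packs the type into 3 bits next to the low bits of the uncompressed length. The Python below decodes both. The helper names are made up, the sample pack is hand-built, and delta objects and the trailing checksum are left out.

```python
import struct
import zlib

OBJECT_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
                6: "ofs-delta", 7: "ref-delta"}


def parse_pack_header(data):
    """Return (version, object_count) from the 12-byte pack header."""
    magic, version, count = struct.unpack(">4sII", data[:12])
    if magic != b"PACK":
        raise ValueError("not a packfile")
    return version, count


def parse_object_header(data, pos):
    """Decode one object header starting at `pos`.

    First byte: bit 7 = "more bytes follow", bits 6-4 = type, bits 3-0 = low
    size bits. Each following byte adds 7 more size bits.
    Returns (type, uncompressed_size, position_of_zlib_data)."""
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07
    size = byte & 0x0F
    shift = 4
    while byte & 0x80:
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


# Hand-built pack with one fake 300-byte "commit" (0x9C 0x12 = type 1, size 300).
content = b"x" * 300
pack = b"PACK" + struct.pack(">II", 2, 1) + bytes([0x9C, 0x12]) + zlib.compress(content)

version, count = parse_pack_header(pack)
obj_type, size, body_start = parse_object_header(pack, 12)
print(version, count, OBJECT_TYPES[obj_type], size)
```

The length in the header is the uncompressed size, so it does not tell you where the compressed data ends.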

Connecting `git-upload-pack` to Screenshotbot

Now that we know about git-upload-pack, we can do the following:

On the CI job, our CLI tool opens a connection to the remote git-upload-pack.
The Git server tells us about the refs and commits.
Our CLI tool then checks with the Screenshotbot server about which refs and commits we need to complete Screenshotbot’s version of the graph.
The CLI tool tells the Git server which commits it needs and which it already has.
The Git server sends the commit information efficiently, and we transform it into something we can send to Screenshotbot servers.

And that’s really it. As far as we could tell, there wasn’t an existing command line way of doing this, so we had to implement the protocol from scratch. Here’s the implementation if you’re interested.
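Independently of that implementation, the step where the CLI tool tells the Git server what it needs could be sketched in Python as below. The helper names (commits_to_request, build_upload_request) and the fake SHAs are assumptions for illustration. The sketch sends capabilities on the first want line, asks for the blob:none filter, and skips the back-and-forth ACK negotiation that a full client would handle.

```python
def pkt_line(payload):
    """Frame a payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def commits_to_request(advertised_refs, known_shas):
    """Advertised commits that the (hypothetical) graph server lacks."""
    return sorted({sha for sha in advertised_refs.values()
                   if sha not in known_shas})


def build_upload_request(wants, haves, capabilities=("filter",),
                         filter_spec="blob:none"):
    """Build a simplified v1 request: wants, optional filter, flush,
    haves, then done."""
    packets = []
    for index, sha in enumerate(wants):
        line = "want " + sha
        if index == 0 and capabilities:
            # Capabilities ride along on the first want line only.
            line += " " + " ".join(capabilities)
        packets.append(pkt_line((line + "\n").encode()))
    if wants and filter_spec:
        packets.append(pkt_line(("filter " + filter_spec + "\n").encode()))
    packets.append(b"0000")  # flush packet ends the want list
    for sha in haves:
        packets.append(pkt_line(("have " + sha + "\n").encode()))
    packets.append(pkt_line(b"done\n"))
    return b"".join(packets)


# Fake SHAs: the server advertises two commits, and we already know one.
advertised = {"refs/heads/main": "a" * 40, "refs/heads/old": "b" * 40}
wants = commits_to_request(advertised, known_shas={"b" * 40})
request = build_upload_request(wants, haves=["b" * 40])
print(request.decode())
```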

Caveats

If you’re trying to work with the git-upload-pack protocol, here are some things to be aware of:

The protocol is ugly. For example, why are the “features” sent as part of the first “have” and “want”, as opposed to just being on their own line? Why is the object type optimized to fit in 3 bits, instead of 8? It makes the code that much more complicated, and the number of objects is going to be far, far smaller than the size of their contents, so the savings are negligible. There is a v2 of the protocol that I haven’t looked at, but we went with v1 for broader support.
The Packfile format is also ugly. For instance, why is the format different based on the type of the object? But I somewhat understand why the Packfile needs to be ugly: it’s a bit more in the hot path of actually doing queries.
Some Git servers (particularly Phabricator) do not send an EOF after sending the Packfile. This matters depending on how your zlib library processes data. For example, if your zlib library reads multiple bytes at a time before processing them, it might not know when to stop. You’ll want to use a zlib library that reads one byte at a time.
Different Git servers might have different restrictions. Azure DevOps requires clients to use multi_ack mode, which we didn’t build. We decided not to support this new flow on servers that have more specific requirements; instead, we fall back to our previous implementation in these cases.
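To illustrate the zlib caveat, here is a small Python sketch that feeds the decompressor one byte at a time and stops as soon as it reports the end of the stream, so it never reads past the current object. The helper name inflate_one and the sample data are made up; the eof attribute comes from Python's standard zlib module.

```python
import zlib


def inflate_one(data, pos):
    """Decompress exactly one zlib stream starting at `pos`.

    Feeds one byte at a time so that nothing beyond the stream is consumed.
    Returns (decompressed_bytes, position_just_after_the_stream)."""
    decompressor = zlib.decompressobj()
    out = bytearray()
    while not decompressor.eof:
        if pos >= len(data):
            raise ValueError("truncated zlib stream")
        out += decompressor.decompress(data[pos:pos + 1])
        pos += 1
    return bytes(out), pos


# Two compressed objects back to back, followed by 20 fake trailer bytes,
# roughly the shape of the tail end of a Packfile.
first = zlib.compress(b"commit one")
second = zlib.compress(b"commit two")
stream = first + second + b"\x00" * 20

obj_a, next_pos = inflate_one(stream, 0)
obj_b, end_pos = inflate_one(stream, next_pos)
print(obj_a, obj_b, len(stream) - end_pos)
```

With an in-memory buffer you could feed larger chunks and recover the overshoot afterwards, but on a socket that never sends EOF, a blocking read for more bytes than the server will send is exactly what hangs.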

Summary

TL;DR: If you’re a Screenshotbot user, you can now use shallow clones! However, please reach out to us, as we’re still rolling this out at the time of writing.

If you’re here looking for how to work with git-upload-pack, I hope this review and our implementation will help you out. Happy to answer any questions below.