One of the early decisions we made in Screenshotbot was to not have read-access to your GitHub repositories.
This turned out to be a huge advantage for us. It makes our customers much more confident using our product, and makes it easier to get through Security reviews at bigger enterprises.
But it also makes it easier for us to integrate with other Git providers such as GitLab, Bitbucket or Phabricator. If we don’t depend on custom APIs, our code just works for everyone. (We still need to access their APIs to comment on Pull Requests, but on many platforms that permission is a lot more granular.) In fact, a subset of our features will also work with a self-hosted Git repository.
But even though we don’t need read access to your repository, we do need access to the “graph” to make decisions such as “find the last commit for which we have a build available”, or “find the best merge-base”. We’ll talk about how we did this in the past, and we’ll also talk about our new feature using the git-upload-pack protocol, which might be a fun read if you’re looking to understand Git internals.
In order to implement this, Screenshotbot stores a “commit-graph” for every repository on our server. A commit-graph is just the commit SHAs: for each commit we store its parents. We store no other information about your Git repository.
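To make the idea concrete, here is a minimal sketch in Python of such a commit-graph and of the “find the last commit for which we have a build available” query. It is an illustration rather than the actual Screenshotbot code: the function name `nearest_ancestor_with_build`, the dictionary layout, and the short fake SHAs are all assumptions made for the example.

```python
from collections import deque

def nearest_ancestor_with_build(graph, start, has_build):
    """Walk from `start` through parent links, breadth-first, and return
    the first commit that has a build, or None if there is none.

    `graph` maps each commit SHA to the list of its parent SHAs.
    `has_build` is the set of SHAs for which a build is available.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        sha = queue.popleft()
        if sha in has_build:
            return sha
        # Commits missing from the graph are treated as having no parents.
        for parent in graph.get(sha, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None

# A tiny graph with one merge commit (c3 has two parents).
graph = {
    "d4": ["c3"],
    "c3": ["b2", "x9"],
    "b2": ["a1"],
    "x9": ["a1"],
    "a1": [],
}
print(nearest_ancestor_with_build(graph, "d4", {"a1", "x9"}))
```

Nothing here needs file contents, commit messages or authors: parent links are enough to answer this kind of question.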
In your CI runs, we would look at the last 1000 commits using something like git log --all --pretty="%H %P" --max-count 1000. We upload them to our server (again, only SHAs), and the server does the job of merging the graphs together with what we already know. In practice, we’ll always have the full graph available to us.
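As a rough sketch of that flow in Python: each line printed by the `%H %P` format is a commit SHA followed by its parent SHAs, so parsing and merging are both small. The helper names `parse_log_output` and `merge_graphs` are hypothetical, and the merge rule (keep the entry we already have) is an assumption for illustration, not a description of what our server does.

```python
def parse_log_output(text):
    """Turn `git log --pretty="%H %P"` output into {sha: [parent shas]}."""
    graph = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        graph[fields[0]] = fields[1:]
    return graph

def merge_graphs(known, uploaded):
    """Return a new graph containing every commit from both inputs.

    Assumption: a commit's parents never change, so an entry that is
    already known is kept and only new commits are added.
    """
    merged = dict(known)
    for sha, parents in uploaded.items():
        merged.setdefault(sha, parents)
    return merged

# Short fake SHAs for readability; real ones are 40 hex characters.
sample = "c3 b2 x9\nb2 a1\nx9 a1\n"
known = {"a1": []}
merged = merge_graphs(known, parse_log_output(sample))
print(sorted(merged))
```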
This does cause some issues:
We recently worked with a customer that absolutely needed shallow clones: a full clone of their repository added several minutes to their builds.
We wanted to avoid depending on GitHub-specific APIs to solve this. Again, that’s hard to maintain and even harder to test. It also means our users would need an additional configuration step to give us API access, which we didn’t want to require.
However, we realized this: most of our customers’ CI jobs already had SSH access to their Git repositories. Surely there must be a way to get the information we needed directly via SSH, and hopefully efficiently?
And indeed there is: enter git-upload-pack.
When you clone a repository, or pull from a repository, your local client makes an SSH connection to your remote server and runs git-upload-pack. This starts an interactive session that negotiates what information needs to be transferred.
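For a taste of what that session looks like on the wire, here is a small Python sketch of the framing used by the original (version 0) protocol. Every message is a “pkt-line”: four hex digits giving the total length (including those four digits), followed by the payload, with `0000` acting as a flush marker. The first thing the server sends is its list of refs, with its capabilities tucked behind a NUL byte on the first line. The function names are made up for this example, and it ignores protocol v2 and error handling.

```python
FLUSH = b"0000"

def encode_pkt_line(payload: bytes) -> bytes:
    """Prefix the payload with its length (payload + 4) as 4 hex digits."""
    return b"%04x" % (len(payload) + 4) + payload

def decode_pkt_lines(data: bytes):
    """Yield each payload in `data`; yield None for a flush packet."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length < 4:
            # 0000 is the flush packet; lengths below 4 carry no payload.
            yield None
            pos += 4
        else:
            yield data[pos + 4:pos + length]
            pos += length

def parse_ref_advertisement(data: bytes):
    """Return ({ref name: sha}, [capabilities]) from the server's first message."""
    refs, capabilities = {}, []
    for payload in decode_pkt_lines(data):
        if payload is None:
            break  # the flush packet ends the advertisement
        line = payload.rstrip(b"\n")
        if b"\0" in line:
            # Capabilities follow a NUL byte on the first ref line.
            line, caps = line.split(b"\0", 1)
            capabilities = caps.decode().split()
        sha, name = line.decode().split(" ", 1)
        refs[name] = sha
    return refs, capabilities
```

Feeding it the bytes a server prints right after the connection opens gives back a mapping such as `refs/heads/main` to a commit SHA, plus the list of supported features.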
The official protocol is an ugly and messy binary protocol, and micro-optimizes on things it really shouldn’t be micro-optimizing (IMHO). But if you disregard the exact wire format, the protocol roughly goes like this:
The server starts by listing its refs (such as refs/heads/main), and the associated commit SHAs. It also tells us which features the server supports (the “filter” feature will be particularly helpful, as will “allow-reachable-sha1-in-want”).
A Packfile is just a collection of objects. Objects can be commits, “tree” information about the files and directories, or blobs. Roughly speaking, the Packfile is a collection of <type, length, zlib compressed contents>.
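The <type, length, zlib compressed contents> layout can be read with very little code. The Python sketch below parses the 12-byte Packfile header and then a single non-delta object: the type lives in three bits of the first byte, and the length is a variable-width integer spread over the following bytes. The names `read_pack_header` and `read_object` are invented for this example, and delta objects (which carry extra data before the compressed body) are deliberately rejected.

```python
import struct
import zlib

OBJECT_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag"}

def read_pack_header(data: bytes):
    """Return (version, object count) from the 12-byte Packfile header."""
    magic, version, count = struct.unpack(">4sII", data[:12])
    if magic != b"PACK":
        raise ValueError("not a Packfile")
    return version, count

def read_object(data: bytes, pos: int):
    """Read one non-delta object starting at `pos`.

    Returns (type name, declared size, contents, position of next object).
    """
    byte = data[pos]
    pos += 1
    type_id = (byte >> 4) & 0x07   # bits 4-6 of the first byte
    size = byte & 0x0F             # low 4 bits start the size
    shift = 4
    while byte & 0x80:             # high bit set: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    if type_id not in OBJECT_TYPES:
        raise ValueError(f"unsupported object type {type_id} (delta?)")
    # The compressed stream has no length prefix, so let zlib find its end.
    decompressor = zlib.decompressobj()
    contents = decompressor.decompress(data[pos:])
    consumed = len(data) - pos - len(decompressor.unused_data)
    return OBJECT_TYPES[type_id], size, contents, pos + consumed
```

For a commit object, the decompressed contents are plain text whose first lines start with `tree` and `parent`, which is where the parent SHAs come from.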
git-upload-pack to Screenshotbot
Now that we know about git-upload-pack, we can do the following:
git-upload-pack
And that’s really it. As far as we could tell, there wasn’t an existing command line way of doing this, so we had to implement the protocol from scratch. Here’s the implementation if you’re interested.
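If you want to experiment yourself, the Python sketch below builds the bytes a client could send after reading the ref list in order to ask for commits only. It is an assumption-laden illustration and not our implementation: it uses the version 0 request layout from Git’s pack-protocol documentation, it assumes the server advertised the “filter” capability, and it picks the `tree:0` filter spec (which omits trees and blobs) as one plausible way to keep the response small. The function names are hypothetical.

```python
def pkt_line(text: str) -> bytes:
    """Frame a line as a pkt-line: 4 hex digits of total length, then payload."""
    payload = text.encode()
    return b"%04x" % (len(payload) + 4) + payload

def build_commits_only_request(wants):
    """Build an upload-pack request for the given commit SHAs.

    Assumes the server advertised "filter"; the capability is requested
    on the first want line, as the protocol requires.
    """
    if not wants:
        raise ValueError("need at least one commit SHA")
    parts = [pkt_line(f"want {wants[0]} filter\n")]
    parts += [pkt_line(f"want {sha}\n") for sha in wants[1:]]
    parts.append(pkt_line("filter tree:0\n"))  # ask the server to omit trees and blobs
    parts.append(b"0000")                      # flush: end of the want list
    parts.append(pkt_line("done\n"))           # we have nothing to offer as "have"
    return b"".join(parts)

request = build_commits_only_request(["a" * 40])
print(request.decode())
```

Writing these bytes to the standard input of git-upload-pack over SSH should make the server answer with a Packfile, which can then be scanned for commit objects.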
If you’re trying to work with the git-upload-pack protocol, here are some things to be aware of:
TL;DR: If you’re a Screenshotbot user, you can now use shallow clones! However, please reach out to us first, as we’re still rolling this out as of this writing.
If you’re here looking for how to work with git-upload-pack, I hope this review and our implementation will help you out. Happy to answer any questions below.