Module cargo::sources::registry::index


Management of the index of a registry source.

This module contains management of the index and various operations, such as actually parsing the index, looking for crates, etc. This is intended to be abstract over remote indices (downloaded via Git or HTTP) and local registry indices (which are all just present on the filesystem).

How the index works

Here is a simple flow when loading a Summary (metadata) from the index:

1. A query is fired via RegistryIndex::query_inner.
2. Tries loading all summaries via RegistryIndex::load_summaries, and under the hood calling Summaries::parse to parse an index file.
   1. If an on-disk index cache is present, loads it via Summaries::parse_cache.
   2. Otherwise goes to the slower path RegistryData::load to get the specific index file.
3. A Summary is now ready in callback f in RegistryIndex::query_inner.
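The flow above can be sketched roughly as follows. This is a minimal, std-only illustration and not Cargo's actual code: `SlowSource`, `CountingSource`, and `Index` are hypothetical stand-ins, the "cache" is an in-memory map rather than an on-disk file, and the method names only mirror the roles of `RegistryIndex::load_summaries` and `RegistryIndex::query_inner`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the slow backend (the role `RegistryData::load` plays).
trait SlowSource {
    fn load(&mut self, name: &str) -> Option<String>;
}

/// Toy source that counts how often the slow path is taken.
struct CountingSource {
    loads: usize,
    files: HashMap<String, String>,
}

impl SlowSource for CountingSource {
    fn load(&mut self, name: &str) -> Option<String> {
        self.loads += 1;
        self.files.get(name).cloned()
    }
}

/// Hypothetical index; `cache` is an in-memory stand-in for the on-disk cache.
struct Index<S: SlowSource> {
    cache: HashMap<String, String>,
    source: S,
}

impl<S: SlowSource> Index<S> {
    /// Returns the raw index file for `name`, preferring the cache and
    /// falling back to the slow source only on a cache miss.
    fn load_summaries(&mut self, name: &str) -> Option<&str> {
        if !self.cache.contains_key(name) {
            let raw = self.source.load(name)?;
            self.cache.insert(name.to_string(), raw);
        }
        self.cache.get(name).map(String::as_str)
    }

    /// Hands every non-empty line of the index file to the callback `f`
    /// and returns how many lines were delivered.
    fn query_inner(&mut self, name: &str, f: &mut dyn FnMut(&str)) -> usize {
        let raw = match self.load_summaries(name) {
            Some(raw) => raw,
            None => return 0,
        };
        let mut count = 0;
        for line in raw.lines().filter(|l| !l.is_empty()) {
            f(line);
            count += 1;
        }
        count
    }
}
```

In this sketch, querying the same package twice touches the slow source only once; the second query is served entirely from the cache.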

This is just an overview. To understand the rationale behind it, continue reading.

A layer of on-disk index cache for performance

One important aspect of the index is that we want to optimize the “happy path” as much as possible. Whenever you type cargo build, Cargo will always reparse the registry and learn about dependency information. This is done because Cargo needs to learn about the upstream crates.io crates that you’re using and ensure that the preexisting Cargo.lock still matches the current state of the world.

Consequently, Cargo “null builds” (the overhead that Cargo adds to each build itself) need to be fast when accessing the index. The primary performance optimization here is to avoid parsing JSON blobs from the registry if we don’t need them. Most secondary optimizations are centered around removing allocations and such, but avoiding parsing JSON is the #1 optimization.

When we get queries from the resolver we’re given a Dependency. This dependency in turn has a version requirement, and with lock files that already exist these version requirements are exact version requirements =a.b.c. This means that we in theory only need to parse one line of JSON per query in the registry, the one that matches version a.b.c.

The crates.io index, however, is not amenable to this form of query. Instead the crates.io index simply is a file where each line is a JSON blob, aka IndexPackage. To learn about the versions in each JSON blob we would need to parse the JSON via IndexSummary::parse, defeating the purpose of trying to parse as little as possible.

Note that as a small aside even loading the JSON from the registry is actually pretty slow. For crates.io and RemoteRegistry we don’t actually check out the git index on disk because that takes quite some time and is quite large. Instead we use libgit2 to read the JSON from the raw git objects. This in turn can be slow (aka show up high in profiles) because libgit2 has to do deflate decompression and such.

To solve all these issues a strategy is employed here where Cargo basically creates an index into the index. The first time a package is queried (first time being for an entire computer) Cargo will load the contents (slowly via libgit2) from the registry. It will then (slowly) parse every single line to learn about its versions. Afterwards, however, Cargo will emit a new file (a cache, represented as SummariesCache) which is amenable to speedy parsing in future invocations.

This cache file is currently organized by basically having the semver version extracted from each JSON blob. That way Cargo can quickly and easily parse all versions contained and which JSON blob they’re associated with. The JSON blob then doesn’t actually need to get parsed unless its version is one the query actually needs.
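As a rough illustration of this layout, the sketch below stores each version next to its still-unparsed blob and parses a blob only when its version is requested. It is a minimal sketch under stated assumptions: `VersionCache`, `MaybeParsed`, `ParsedEntry`, and `expensive_parse` are hypothetical names, the blob format is a toy `name@version` string standing in for JSON, and the real SummariesCache format is not reproduced here.

```rust
/// Hypothetical stand-in for a fully parsed index entry.
#[derive(Debug, Clone, PartialEq)]
struct ParsedEntry {
    name: String,
    vers: String,
}

/// A lazily parsed entry, loosely modeled on the role of `MaybeIndexSummary`.
enum MaybeParsed {
    Unparsed(String),
    Parsed(ParsedEntry),
}

/// Toy cache: each version is stored alongside its raw, unparsed blob.
struct VersionCache {
    entries: Vec<(String, MaybeParsed)>,
    /// Counts how many blobs were actually parsed (for illustration only).
    parses: usize,
}

/// Placeholder for the expensive JSON parse; expects a toy "name@version" blob.
fn expensive_parse(blob: &str) -> Option<ParsedEntry> {
    let (name, vers) = blob.split_once('@')?;
    Some(ParsedEntry { name: name.to_string(), vers: vers.to_string() })
}

impl VersionCache {
    /// Builds the cache from (version, blob) pairs without parsing any blob.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let entries = pairs
            .iter()
            .map(|(v, blob)| (v.to_string(), MaybeParsed::Unparsed(blob.to_string())))
            .collect();
        VersionCache { entries, parses: 0 }
    }

    /// Finds `version` by scanning the cheap version keys, then parses the
    /// matching blob on first access and reuses the result afterwards.
    fn get(&mut self, version: &str) -> Option<&ParsedEntry> {
        let idx = self.entries.iter().position(|(v, _)| v.as_str() == version)?;
        if let MaybeParsed::Unparsed(blob) = &self.entries[idx].1 {
            let parsed = expensive_parse(blob)?;
            self.parses += 1;
            self.entries[idx].1 = MaybeParsed::Parsed(parsed);
        }
        match &self.entries[idx].1 {
            MaybeParsed::Parsed(p) => Some(p),
            MaybeParsed::Unparsed(_) => None,
        }
    }
}
```

With an exact requirement such as =1.1.0, only the blob for 1.1.0 is ever parsed here; the blobs for all other versions stay as raw strings.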

Altogether, the initial measurements show a massive improvement for Cargo null build performance. The improvement is expected to grow over time, because the previous implementation (parse all lines each time) keeps getting slower as new versions of a crate are published. When first implemented, a null build of Cargo itself would parse 3700 JSON blobs from the registry and load 150 blobs from git. Afterwards it parses 150 JSON blobs and loads 0 files from git. Removing 200ms or more from Cargo’s startup time is certainly nothing to sneeze at!

Note that this is just a high-level overview. There are of course lots of details, like invalidating caches and whatnot, which are handled below, but hopefully those are more obvious inline in the code itself.

Structs

IndexPackage
A single line in the index representing a single version of a package.
RegistryDependency 🔒
A dependency as encoded in the IndexPackage index JSON.
RegistryIndex
Manager for handling the on-disk index.
Summaries 🔒
An internal cache of summaries for a particular package.
SummariesCache 🔒
A representation of the cache on disk that Cargo maintains of summaries.

Enums

IndexSummary
A parsed representation of a summary from the index. This is usually parsed from a line from a raw index file, or a JSON blob from on-disk index cache.
MaybeIndexSummary 🔒
A lazily parsed IndexSummary.

Constants

CURRENT_CACHE_VERSION 🔒
The current version of SummariesCache.
INDEX_V_MAX 🔒
The maximum schema version of the v field in the index this version of cargo understands. See IndexPackage::v for the detail.

Functions

split 🔒
Like slice::split but optimized with memchr.
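
For orientation, a byte splitter of this kind might look like the sketch below. It is an assumption-laden, std-only illustration: `split_bytes` and `SplitBytes` are hypothetical names, and `Iterator::position` stands in for the memchr search, so this is not the function's actual implementation. In this sketch an empty remainder yields nothing, so a trailing newline does not produce a final empty item.

```rust
/// Hypothetical iterator that yields the pieces of `haystack` separated by `needle`.
struct SplitBytes<'a> {
    haystack: &'a [u8],
    needle: u8,
}

impl<'a> Iterator for SplitBytes<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        if self.haystack.is_empty() {
            return None;
        }
        // A plain linear search stands in for a memchr-accelerated one.
        let found = self.haystack.iter().position(|&b| b == self.needle);
        let (piece, rest) = match found {
            Some(pos) => (&self.haystack[..pos], &self.haystack[pos + 1..]),
            None => (self.haystack, &self.haystack[self.haystack.len()..]),
        };
        self.haystack = rest;
        Some(piece)
    }
}

/// Splits `haystack` on every occurrence of `needle`.
fn split_bytes(haystack: &[u8], needle: u8) -> SplitBytes<'_> {
    SplitBytes { haystack, needle }
}
```

Splitting an index file on b'\n' this way gives one byte slice per line, each of which would hold one JSON blob.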