Management of the index of a registry source.
This module contains management of the index and various operations, such as actually parsing the index, looking for crates, etc. This is intended to be abstract over remote indices (downloaded via Git or HTTP) and local registry indices (which are all just present on the filesystem).
How the index works
Here is a simple flow when loading a Summary (metadata) from the index:
- A query is fired via RegistryIndex::query_inner.
- Tries loading all summaries via RegistryIndex::load_summaries, and under the hood calling Summaries::parse to parse an index file.
  - If an on-disk index cache is present, loads it via Summaries::parse_cache.
  - Otherwise goes to the slower path RegistryData::load to get the specific index file.
- A Summary is now ready in callback f in RegistryIndex::query_inner.
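The flow above can be sketched with simplified stand-ins. This is only an illustration, not Cargo's actual code: the types below reuse the names from the list, but their fields, signatures, the in-memory map standing in for the on-disk cache, and the one-version-per-line file format are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed summary (name plus version only).
#[derive(Debug, Clone, PartialEq)]
struct Summary {
    name: String,
    version: String,
}

/// Hypothetical stand-in for the registry backend, i.e. the slow path.
trait RegistryData {
    /// Returns the raw index file for `name`; here simply one version per line.
    fn load(&mut self, name: &str) -> Option<String>;
}

/// Hypothetical index manager. The map stands in for the on-disk cache.
struct RegistryIndex {
    cache: HashMap<String, Vec<Summary>>,
    /// Counts how often the slow path was taken.
    slow_loads: usize,
}

impl RegistryIndex {
    fn new() -> Self {
        RegistryIndex { cache: HashMap::new(), slow_loads: 0 }
    }

    /// Loads every summary for `name`, preferring the cache over the backend.
    fn load_summaries(&mut self, name: &str, data: &mut dyn RegistryData) -> Vec<Summary> {
        if let Some(hit) = self.cache.get(name) {
            return hit.clone(); // fast path: cache is present
        }
        self.slow_loads += 1; // slow path: ask the backend for the index file
        let raw = data.load(name).unwrap_or_default();
        let parsed: Vec<Summary> = raw
            .lines()
            .filter(|line| !line.is_empty())
            .map(|v| Summary { name: name.to_string(), version: v.to_string() })
            .collect();
        self.cache.insert(name.to_string(), parsed.clone());
        parsed
    }

    /// Fires a query and hands each matching summary to the callback `f`.
    fn query_inner(
        &mut self,
        name: &str,
        version: &str,
        data: &mut dyn RegistryData,
        f: &mut dyn FnMut(Summary),
    ) {
        for summary in self.load_summaries(name, data) {
            if summary.version == version {
                f(summary);
            }
        }
    }
}

/// In-memory registry used only for demonstration.
struct FakeRegistry {
    files: HashMap<String, String>,
}

impl RegistryData for FakeRegistry {
    fn load(&mut self, name: &str) -> Option<String> {
        self.files.get(name).cloned()
    }
}
```

Querying the same package twice through `query_inner` reaches the backend only once in this sketch; the second query is answered from the cache.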
This is just an overview. To understand the rationale behind it, continue reading.
A layer of on-disk index cache for performance
One important aspect of the index is that we want to optimize the “happy path” as much as possible. Whenever you type cargo build, Cargo will always reparse the registry and learn about dependency information. This is done because Cargo needs to learn about the upstream crates.io crates that you’re using and ensure that the preexisting Cargo.lock still matches the current state of the world.
Consequently, Cargo “null builds” (builds where nothing needs recompiling, so the time spent is the overhead that Cargo itself adds to each build) need to be fast when accessing the index. The primary performance optimization here is to avoid parsing JSON blobs from the registry if we don’t need them. Most secondary optimizations are centered around removing allocations and such, but avoiding parsing JSON is the #1 optimization.
When we get queries from the resolver we’re given a Dependency. This dependency in turn has a version requirement, and with lock files that already exist these version requirements are exact version requirements =a.b.c. This means that we in theory only need to parse one line of JSON per query in the registry, the one that matches version a.b.c.
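To make the exact-requirement case concrete, here is a minimal sketch. The helper names `exact_version` and `select_locked` are hypothetical and are not Cargo or semver APIs; the sketch handles only plain `major.minor.patch` versions and ignores pre-release and build metadata.

```rust
/// Parses an exact requirement such as "=1.2.3" into (major, minor, patch).
/// Returns `None` for any other requirement form or for malformed input.
fn exact_version(req: &str) -> Option<(u64, u64, u64)> {
    let rest = req.trim().strip_prefix('=')?;
    let mut parts = rest.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Picks the single candidate that an exact requirement can match, if present.
fn select_locked<'a>(
    req: &str,
    versions: &'a [(u64, u64, u64)],
) -> Option<&'a (u64, u64, u64)> {
    let wanted = exact_version(req)?;
    versions.iter().find(|v| **v == wanted)
}
```

Because an exact requirement matches at most one version, a lookup keyed by version is all that a locked query needs.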
The crates.io index, however, is not amenable to this form of query. Instead the crates.io index simply is a file where each line is a JSON blob, aka IndexPackage. To learn about the versions in each JSON blob we would need to parse the JSON via IndexSummary::parse, defeating the purpose of trying to parse as little as possible.
Note that as a small aside even loading the JSON from the registry is actually pretty slow. For crates.io and RemoteRegistry we don’t actually check out the git index on disk because that takes quite some time and is quite large. Instead we use libgit2 to read the JSON from the raw git objects. This in turn can be slow (aka show up high in profiles) because libgit2 has to do deflate decompression and such.
To solve all these issues a strategy is employed here where Cargo basically creates an index into the index. The first time a package is queried (first time being for an entire computer) Cargo will load the contents (slowly via libgit2) from the registry. It will then (slowly) parse every single line to learn about its versions. Afterwards, however, Cargo will emit a new file (a cache, represented as SummariesCache) which is amenable to speedy parsing in future invocations.
This cache file is currently organized by basically having the semver version extracted from each JSON blob. That way Cargo can quickly and easily parse all versions contained and which JSON blob they’re associated with. The JSON blob then doesn’t actually need to get parsed unless its version is the one being requested.
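The idea can be illustrated with a deliberately simplified layout. The sketch below assumes a hypothetical encoding of repeated `version NUL json NUL` records; the real SummariesCache format has additional header fields that are not modeled here, and the function names are invented for the example.

```rust
/// Encodes (version, json) pairs as repeated `version NUL json NUL` records.
/// This layout is an assumption made for the example.
fn encode_cache(entries: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (version, json) in entries {
        out.extend_from_slice(version.as_bytes());
        out.push(0);
        out.extend_from_slice(json.as_bytes());
        out.push(0);
    }
    out
}

/// Finds the raw JSON blob for `wanted` by comparing version bytes only.
/// No JSON is parsed here; the caller parses just the blob it gets back.
fn find_blob<'a>(cache: &'a [u8], wanted: &str) -> Option<&'a [u8]> {
    let mut fields = cache.split(|b| *b == 0);
    loop {
        let version = fields.next()?;
        let blob = fields.next()?; // `None` once the records are exhausted
        if version == wanted.as_bytes() {
            return Some(blob);
        }
    }
}
```

With this shape, a query for one exact version costs a scan over short version strings plus a single JSON parse, instead of one JSON parse per published version.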
Altogether, the initial measurements show a massive improvement for Cargo null build performance. The improvement is expected to grow over time, because the previous implementation (parse all lines each time) slows down as new versions of a crate are published. When first implemented, a null build of Cargo itself would parse 3700 JSON blobs from the registry and load 150 blobs from git. Afterwards it parses 150 JSON blobs and loads 0 files from git. Removing 200ms or more from Cargo’s startup time is certainly nothing to sneeze at!
Note that this is just a high-level overview. There are of course lots of details, like invalidating caches and whatnot, which are handled below, but hopefully those are more obvious inline in the code itself.
Structs
- A single line in the index representing a single version of a package.
- A dependency as encoded in the IndexPackage index JSON.
- Manager for handling the on-disk index.
- An internal cache of summaries for a particular package.
- A representation of the cache on disk that Cargo maintains of summaries.
Enums
- A parsed representation of a summary from the index. This is usually parsed from a line from a raw index file, or a JSON blob from on-disk index cache.
- A lazily parsed IndexSummary.
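As a rough illustration of lazy parsing, the sketch below keeps either a byte range into the raw contents or an already parsed value. The names `MaybeSummary` and `ParsedSummary` are hypothetical, and "parsing" here just reads the bytes as a version string; it is not the actual definition used by Cargo.

```rust
/// Hypothetical parsed form: only the version string is kept.
#[derive(Debug, Clone, PartialEq)]
struct ParsedSummary {
    version: String,
}

/// Either a byte range into the raw contents, or an already parsed value.
enum MaybeSummary {
    Unparsed { start: usize, end: usize },
    Parsed(ParsedSummary),
}

impl MaybeSummary {
    /// Parses on first access and stores the result, so later calls do no
    /// parsing work. Returns `None` if the range is invalid or not UTF-8.
    fn parse(&mut self, raw: &[u8]) -> Option<&ParsedSummary> {
        if let MaybeSummary::Unparsed { start, end } = *self {
            let text = std::str::from_utf8(raw.get(start..end)?).ok()?;
            *self = MaybeSummary::Parsed(ParsedSummary {
                version: text.trim().to_string(),
            });
        }
        match &*self {
            MaybeSummary::Parsed(parsed) => Some(parsed),
            MaybeSummary::Unparsed { .. } => None,
        }
    }
}
```

Entries that are never requested stay in the `Unparsed` state and cost nothing beyond two integers.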
Constants
- The current version of SummariesCache.
- The maximum schema version of the v field in the index this version of cargo understands. See IndexPackage::v for the detail.
Functions
- split 🔒 Like slice::split but is optimized by memchr.
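A byte-oriented split of this kind might look like the sketch below. It is an approximation using only the standard library: `Iterator::position` stands in for the memchr search, and the signature is an assumption made for the example, not necessarily the one in Cargo.

```rust
/// Splits `haystack` on every occurrence of `needle`, yielding the pieces in
/// between, like `slice::split` with an equality predicate.
/// `position` is a stand-in for the memchr search used by the real function.
fn split(haystack: &[u8], needle: u8) -> impl Iterator<Item = &[u8]> {
    // `None` means the iterator is finished.
    let mut rest = Some(haystack);
    std::iter::from_fn(move || {
        let current = rest?;
        match current.iter().position(|b| *b == needle) {
            Some(i) => {
                // Yield the piece before the needle and continue after it.
                rest = Some(&current[i + 1..]);
                Some(&current[..i])
            }
            None => {
                // No more needles: yield the remainder, then stop.
                rest = None;
                Some(current)
            }
        }
    })
}
```

Searching for one fixed byte, rather than calling a general predicate on each element, is what makes a memchr-backed search an option here.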