jepler | 1 year ago

My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in terms of optimization and the level of detail it gets to: just global function identifiers), and it looks like it runs at about 400MB/s on a single core of an i5-1235U. Still, it looks like ctags could process about 100TB in 4 hours across 16 threads on a workstation-class CPU...
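A quick sanity check of that estimate, as a sketch only: it takes the 400MB/s per-core figure at face value, assumes perfect linear scaling across 16 threads (unlikely in practice), and uses decimal units (1 TB = 10^12 bytes). The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the ctags throughput estimate.
MB = 10**6   # decimal megabyte
TB = 10**12  # decimal terabyte

per_core_bytes_per_s = 400 * MB  # single-core figure quoted in the comment
threads = 16                     # assumption: perfect linear scaling
seconds = 4 * 60 * 60            # 4 hours

total_bytes = per_core_bytes_per_s * threads * seconds
total_tb = total_bytes / TB

print(f"{total_tb:.1f} TB")  # 92.2 TB
```

So "about 100TB" is in the right ballpark under those idealized assumptions; real runs would also be bounded by disk and memory bandwidth.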

DylanSp|1 year ago

It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.

menaerus|1 year ago

Correct, you need to build an AST representation of the code that you want to index. Essentially, it's a compiler frontend pass, which is why it takes so much longer than what the ctags heuristics do. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, are also a big computation problem.

That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.

> The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.

[1] https://glean.software/docs/indexer/cxx
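To illustrate the gap between a text heuristic and an AST pass, here is a minimal sketch in Python rather than C++, using only the standard `ast` and `re` modules. It is an analogy, not how Glean or the clang indexer works: the sample source, the regex, and the helper names `heuristic_definitions` and `ast_index` are all invented for this example, and "unused import" stands in loosely for the `using`-resolution analysis mentioned above.

```python
import ast
import re

SOURCE = '''
import os
import sys

def helper():
    return os.getcwd()

text = "def fake(): pass"
'''

def heuristic_definitions(source):
    """Text-pattern guess: any 'def name(' counts, even inside a string."""
    return re.findall(r"def\s+(\w+)\s*\(", source)

def ast_index(source):
    """Parse the source and return (defined functions, unused imports)."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    defined = [n.name for n in nodes if isinstance(n, ast.FunctionDef)]
    imported = {
        alias.asname or alias.name
        for n in nodes if isinstance(n, ast.Import)
        for alias in n.names
    }
    # Names that are actually read somewhere in the module.
    used = {
        n.id for n in nodes
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    return defined, sorted(imported - used)

print(heuristic_definitions(SOURCE))  # ['helper', 'fake']
print(ast_index(SOURCE))              # (['helper'], ['sys'])
```

The regex is cheap but reports a function that only exists inside a string literal; the parsed version gets the definition right and can also say which import is never referenced, at the cost of parsing the whole file.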

UltraSane|1 year ago

The whole point of indexing data is to perform a very expensive computation once and leverage the result many, many times, and it works really well.

phyrex|1 year ago

It's a monorepo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you spent the hours indexing it locally, it would be out of date right away.

kllrnohj|1 year ago

You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.

Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base, say Android's repository ought to do it. Are you still getting 400MB/s? Nope.