
DeepWiki: Understand Any Codebase

231 points | childishnemo | 6 months ago | aitidbits.ai

53 comments


buovjaga|6 months ago

It's a pity that there is no clear way to send takedown requests. We didn't ask for deceptive garbage to be generated as documentation for LibreOffice, but here it is and newbies are discovering it: https://deepwiki.com/LibreOffice/core/2-build-system (spoiler: LibreOffice has never used Buck as a build system)

baby|6 months ago

From my experience using DeepWiki, what it generates is not deceptive garbage.

conartist6|6 months ago

I sent them a politely worded threat and they responded right away opting me out:

> Hello, I am writing you as an author of Open Source software seeking to protect my security and that of my users.

> What I would like to know is: how may I prevent deepwiki from indexing my projects, specifically those in the ----- GitHub organization? If you consider yourselves to have implicit legal permission to train on my projects and write about them, know that I hereby explicitly and permanently revoke that permission.

> Since you likely believe that I lack the authority to get you to stop, I will add this:

> To the extent allowed by law I will consider any incorrect information you publish about my projects to be libelous and, given this notice, made with your intention. LLMs have no will to act, so publishing misinformation about my project, at such time as that happens, could only be the result of human will.

> Kind regards,
> Conrad Buck

rgoulter|6 months ago

The DeepWiki tool itself seems pretty neat. It makes a pretty good go of collecting documentation from across the codebase and organising it in one place, and it takes a pretty good guess at coming up with documentation which isn't there.

It strikes me as an example of automated code assistance that's e.g. more useful than "the item under the cursor has type <X>, here's its documentation".

There are things which benefit from being automatically described, and there are things where "the map is not the territory", and you do want someone to have come up with a map.

> "Treat it like a patient senior engineer."

I trust that LLMs are patient (you can ask them stupid questions without consequence).

I do not trust LLMs to act as 'senior'. (i.e. Unless you ask it to, it won't push back against dumb ideas, or suggest better ideas that would achieve what you're trying to do. -- And if you just ask it to 'push back', it's going to push back more than necessary).

ignoramous|6 months ago

> It strikes me as an example of automated code assistance that's e.g. more useful than "the item under the cursor has type <X>, here's its documentation". ... I trust that LLMs are patient (you can ask them stupid questions without consequence).

DeepWiki does add tremendous value already: I maintain open source projects and frequently direct volunteers to use DeepWiki to explore those (fairly convoluted) codebases. But ... I've caught DeepWiki hallucinating pretty convincingly far more than once, just because a struct / a package / a function was named for something it wasn't doing anymore, or wasn't doing by the book (think: RFCs, docs, specifications, etc). That isn't a criticism of DeepWiki so much as of the maintainers' own refactoring practices. "Code readability" and tests (part of the "gyms" for agents), I imagine, are going to come in clutch for OSS projects that want to benefit from a constant stream of productive external contributions.

giancarlostoro|6 months ago

I'm trying it out with a repo that has no code comments or documentation. Let's see how it does. :)

Edit: it's been over 10 minutes now and still nothing, that's interesting. I did choose a Lingo source project, so it has probably given up by now.

fergie|6 months ago

I hesitate to dump on Deepwiki, because what it does is to some degree impressive and timesaving (especially the system diagrams).

But for my libs (that aren't super popular, but OTOH have a few million downloads per year) it generates documentation that is incorrect, and this is not good for users.

tacker2000|6 months ago

The Elkjs project uses this, and I'm not really sure I like it. [1]

It's a bit hard to find stuff. I was looking for the structure of the main configuration JSON object and couldn't find it in the DeepWiki.

I found it on the “non-AI-created” doc page of the main Elk project [2] (Elkjs is a JS implementation of Elk).

But yes this is of course just one data point.

[1]https://deepwiki.com/kieler/elkjs/5-usage-guide

[2]https://eclipse.dev/elk/documentation/tooldevelopers/graphda...

Nullabillity|6 months ago

"Uses it" sounds strong... I don't see any link to it from https://github.com/kieler/elkjs?

Annoyingly, anyone can just... request a DeepWiki for any GitHub repo. The fact that one exists doesn't mean it's endorsed or reviewed by the project.

They just kind of barged in, welcome or not. Just another SEO slop-spammer.

jcranmer|6 months ago

So I decided to look at some open source repos I know decently well. The only one that seems to have a wiki is LLVM (https://deepwiki.com/llvm/llvm-project).

Thoughts on the overview page: Okay, weird subset of the top-level directories. The high-level compilation pipeline diagram is... wrong? Like, the Clang AST is definitely part of the Clang frontend, and once you get to the optimization pipeline, it clearly fucks up the flow through vectorization and instruction selection (completely omitting GlobalISel, for that matter). The choice of backends to highlight is weird, and at the end of the day, it manages to completely omit some of the most important passes in LLVM (like InstCombine).

Drilling down into the other pages... I mean, at no point does it even discuss or give an idea of what LLVM IR is. There's nothing about the pass manager, nothing about the canonicalization expected of passes. It's got a weird fixation on some things (like the role of TargetLowering), but manages to elide pretty much any detail that is actually useful. The role of TableGen in several components is completely missing--and FWIW, understanding TableGen and its error messages is probably the single hardest part of putting together an LLVM backend, precisely the thing you'd want it to focus on.

If I had to guess, it's overly fixated on things that happen to be very large files--I think everything it decided to focus on in a single page happens to be a 30kloc file or something. But that means it also misses the things that are so gargantuan they're split into multiple files--Clang codegen is ~100kloc and InstCombine is ~40kloc but since they're in several 4-5kloc files instead of a large 26kloc file (SLPVectorizer) or 62kloc file (X86ISelLowering), they're simply not considered important and ignored.

IceHegel|6 months ago

Yeah this is my experience too. For projects I know well, the diagrams are not engineering quality.

grokblah|6 months ago

That’s a very intriguing observation.

(I haven’t read how it works but…) I wonder if removing file sizes, commit counts, and other numerical metadata would have a significant impact on the output. Or if all of the files were glommed into one large input with path+filename markers?

menaerus|6 months ago

Nonetheless, I still think it's impressive, considering that the LLVM codebase is one of the most complex to be found in the wild.

swiftcoder|6 months ago

Ok, yeah, this feels like a reasonable use case for AI. I generated a DeepWiki from one of my repos, and it's pretty informative. It goes into way too much depth on some trivial details, and glosses over more important stuff in places, but overall it seems to have produced a pretty detailed summary of what the package does, and why it does many things the way that it does.

nikisweeting|6 months ago

Deepwiki was instrumental in our refactor of a large codebase away from playwright to pure CDP @ browser-use. Huge props to the team that built it, I regularly refer to it as one of the few strictly net positive AI coding tools.

The auto-overviews and diagrams are great, but where it truly shines is the "deep research" follow-up questions system at the bottom. It's much better than using OpenAI deep research or Perplexity to ask questions about complex codebases like puppeteer/playwright/chromium/etc.

dataviz1000|6 months ago

I want to respond to our last interaction here.

This is Cordyceps [0]. My Chrome extension port of the TypeScript port of Browser Use needs some love, but it is contained there.

I wanted to post the port of an MCP with a pure-CDP library I found, to use Chrome.debugger and therefore remove Playwright, but the TypeScript and monorepo tooling is out of date. Hopefully I'll get it done tomorrow.

[0] https://github.com/adam-s/cordyceps

opdahl|6 months ago

Isn’t this supposed to be a short technical blog? Why does it seem like they’re a salesman and it’s a sales pitch?

> "We are generating more code than ever. With LLMs like Claude already writing most of Anthropic’s code, the challenge is no longer producing code, it is understanding it."

The first sentence is already obviously AI generated, and reading through it, the whole thing is obviously completely written by AI, to the point of being distracting.

I understand the author probably feels that AI is better at writing than they are, but I would heavily recommend they use their own voice.

I’ve personally started trying to pick out the points someone prompted the AI with (the actual thoughts of the author) so that I can more easily skim past the AI-generated slop such as: "… you’ll get the env setup, required services, and dependency graph with citations to README, Dockerfile, and scripts, so you can hit the ground running".

npinsker|6 months ago

Though I agree with you, both of the sentences you cited (the first two in the piece) have mistakes in their English and wouldn’t be written by AI.

IceHegel|6 months ago

I really want to like deepwiki, but just looking at the diagrams of repos, they are too handwavy to be useful.

They are a conceptual overview and don’t seem tied down enough to the actual implementation details of a particular project.

Perhaps this could be improved.

neilv|6 months ago

> Suppose you find a clever mechanism in another repository, such as an authentication flow or a clever way to persist state locally. In that case, you can ask DeepWiki to provide a Markdown cheat sheet: a breakdown of how it works, which files define it, and what it depends on. You can then drop that summary directly into Claude Code or Cursor as structured context and ask it to implement it in your project.

Bonus if the LLM was trained on the original repository.

Then it would be that much more clear that you're just laundering open source code.

1317|6 months ago

It would be nice if it could also read GitHub issues etc. where available, so it had more context about the decisions that were made. For large projects, or those with a lot of issues, this might be a bit impossible though.

manishsharan|6 months ago

Who is paying for those tokens? I had a long conversation with "Devin" and I must have burnt up a large number of tokens. In any case, thank you "Devin"

mkagenius|6 months ago

> my go-to tool for understanding unfamiliar codebases

Somehow voices are a little easier for me to pay attention to than text.

I had done a show HN for https://gitpodcast.com earlier, created with a similar goal in mind.

mxmilkiib|6 months ago

neat system

tried it out on Mixxx; https://deepwiki.com/mixxxdj/mixxx

don't know how to zoom the diagrams on mobile though, and they can easily almost disappear from view when panning around

the prompt box could do with a way to move it out of the bottom couple of cm in portrait, or to stop it covering more than a quarter of the screen in landscape orientation

fedeb95|6 months ago

I'm wondering about the "any" part. How does it perform on big codebases? Like ~20 repositories with lots of classes.

lelouch9099|6 months ago

What if I don't trust this third party with my code? Are there any open-source / local ways to run this?

letaem77|6 months ago

This is my way to do it:

1. Archive the whole repository into a single text file with Repomix (formerly Repopack): https://github.com/yamadashy/repomix

2. To reduce tokens, compress the file with LLMLingua-2: https://github.com/microsoft/LLMLingua

(fewer tokens = more context can be given to the LLM = the LLM understands your repository better)

3. Copy & paste the compressed archive's text contents as context into ChatGPT's input field as-is, or into a local LLM.

4. Ask the LLM to generate documentation. For example: “This is a repository's source code. Given this context, generate a table of contents.” Then you will get a ToC. If it looks good, ask it to generate the first chapter, and keep going until the whole documentation is finished.
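Step 1 can be sketched with the standard library. This is a minimal sketch, not what Repomix actually does (it also handles ignore rules, binary detection, and token counting); the `pack_repo` helper name, extension filter, and marker format are my own:

```python
# Roughly what "pack the repo into one text file" means: concatenate
# matching files, each prefixed with a path marker so the LLM can tell
# where one file ends and the next begins.
import pathlib

def pack_repo(root: str, exts: tuple = (".py", ".ts", ".js", ".md")) -> str:
    """Concatenate matching files under `root` into one LLM-ready string."""
    parts = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            # Path marker line, followed by the file's contents.
            parts.append(f"================ {path} ================\n"
                         f"{path.read_text(errors='replace')}")
    return "\n\n".join(parts)
```

The output of this is what you would then hand to step 2 for compression.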

If you are trying to document a TypeScript/JavaScript codebase, you may use a bundler like esbuild for step 2, which helps with token reduction.

If you're interested in step 2's LLMLingua-2, check out my TypeScript port, which runs without any installation, at: https://atjsh.github.io/llmlingua-2-js/

Cheer2171|6 months ago

Is deepwiki related to wikis in any way, or just grifting on the name?

raphman|6 months ago

Yeah, it's not a wiki at all. However, in my experience, many people just know the term 'wiki' as a short form of Wikipedia - and for many, Wikipedia is the only encyclopedia they know. So I guess the DeepWiki authors see it as some kind of deep/specialised encyclopedia.

faangguyindia|6 months ago

I just use Context7; they launched an API recently. It's my go-to solution for coding-agent docs.

manishsharan|6 months ago

Gemini, ChatGPT, and GitHub Copilot subscriptions also provide similar functionality.

caboteria|6 months ago

I recently received an AI-slop bug report for a small open source project (PureLB) that I maintain, and the slop was generated by DeepWiki. It was very incorrect, but I didn't know what "DeepWiki" was, so I wasted about an hour. If DeepWiki is causing garbage bug reports even on tiny projects like mine, I can't imagine how much maintainer time it's wasting overall.