It's a pity that there is no clear way to send takedown requests. We didn't ask for deceptive garbage to be generated as documentation for LibreOffice, but here it is and newbies are discovering it: https://deepwiki.com/LibreOffice/core/2-build-system (spoiler: LibreOffice has never used Buck as a build system)
Out of curiosity, how come LibreOffice has .buckversion, BUCK, .buckconfig, etc.? This commit[1] does seem to indicate Buck was used for building at one point, though it is 10 years old.
[1]: https://github.com/LibreOffice/core/commit/1fd41f43eb73c373c...
I sent them a politely worded threat and they responded right away opting me out:
> Hello, I am writing you as an author of Open Source software seeking to protect my security and that of my users.
> What I would like to know is: how may I prevent deepwiki from indexing my projects, specifically those in the ----- GitHub organization? If you consider yourselves to have implicit legal permission to train on my projects and write about them, know that I hereby explicitly and permanently revoke that permission.
> Since you likely believe that I lack the authority to get you to stop, I will add this:
> To the extent allowed by law I will consider any incorrect information you publish about my projects to be libelous and, given this notice, made with your intention. LLMs have no will to act, so publishing misinformation about my project, at such time as that happens, could only be the result of human will.
> Kind regards,
> Conrad Buck
The DeepWiki tool itself seems pretty neat. It makes a good go of collecting documentation from across the codebase and organising it in one place, and takes a decent guess at documentation which isn't there.
It strikes me as an example of automated code assistance that's e.g. more useful than "the item under the cursor has type <X>, here's its documentation".
There are things which benefit from being automatically described, and there are things where "the map is not the territory", and you do want someone to have come up with a map.
> "Treat it like a patient senior engineer."
I trust that LLMs are patient (you can ask them stupid questions without consequence).
I do not trust LLMs to act as 'senior'. (i.e. Unless you ask it to, it won't push back against dumb ideas, or suggest better ideas that would achieve what you're trying to do. -- And if you just ask it to 'push back', it's going to push back more than necessary).
> It strikes me as an example of automated code assistance that's e.g. more useful than "the item under the cursor has type <X>, here's its documentation". ... I trust that LLMs are patient (you can ask them stupid questions without consequence).
DeepWiki does add tremendous value already: I maintain open source projects and frequently direct volunteers to use DeepWiki to explore those (fairly convoluted) codebases. But ... I've caught DeepWiki hallucinating pretty convincingly far more than once, just because a struct / a package / a function was named for something it wasn't doing anymore, or wasn't doing it by the book (think: RFCs, docs, specifications etc). This isn't a criticism of DeepWiki so much as of the refactoring practices of the maintainers themselves. "Code readability" and tests (part of "gyms" for agents) are, I imagine, going to come in clutch for OSS projects that want to benefit from a constant stream of productive external contributions.
I hesitate to dump on Deepwiki, because what it does is to some degree impressive and timesaving (especially the system diagrams).
But for my libs (that aren't super popular, but OTOH have a few million downloads per year) it generates documentation that is incorrect, and this is not good for users.
So I decided to look at some open source repos I know decently well. The only one that seems to have a wiki is LLVM (https://deepwiki.com/llvm/llvm-project).
Thoughts on the overview page: Okay, weird subset of the top-level directories. The high-level compilation pipeline diagram is... wrong? Like, Clang-AST is definitely part of the clang frontend, and you get to the optimization pipeline, which clearly fucks up the flow through vectorization and instruction selection (completely omitting GlobalISel as well, for that matter). The choice of backends to highlight is weird, and at the end of the day, it manages to completely omit some of the most important passes in LLVM (like InstCombine).
Drilling down into the other pages... I mean at no point does it even discuss or give an idea of what LLVM IR is. There's nothing about the pass manager, nothing about expected canonicalization of passes. It's got a weird fixation on some things (like the role of TargetLowering), but manages to elide pretty much any detail that is actually useful. The role of TableGen in several components is completely missing--and FWIW, understanding TableGen and its error messages is probably the single hardest part of putting together an LLVM backend, precisely the thing you'd want it to focus on.
If I had to guess, it's overly fixated on things that happen to be very large files--I think everything it decided to focus on in a single page happens to be a 30kloc file or something. But that means it also misses the things that are so gargantuan they're split into multiple files--Clang codegen is ~100kloc and InstCombine is ~40kloc but since they're in several 4-5kloc files instead of a large 26kloc file (SLPVectorizer) or 62kloc file (X86ISelLowering), they're simply not considered important and ignored.
(I haven’t read how it works but…)
I wonder if removing file sizes, commit counts, and other numerical metadata would have a significant impact on the output. Or if all of the files were glommed into one large input with path+filename markers?
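To make the speculation concrete, here's a minimal sketch of the "glommed into one large input with path+filename markers" idea. This is hypothetical illustration code, not how DeepWiki actually works; the function name and marker format are made up:

```python
import os

def glom_repo(root, exts=(".py", ".ts", ".c", ".cpp", ".h")):
    """Concatenate source files into one LLM input, with path markers
    but no sizes, commit counts, or other numerical metadata."""
    parts = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue
            # The marker carries only the relative path, so the model
            # sees the repo's structure but not each file's size.
            rel = os.path.relpath(path, root)
            parts.append(f"=== {rel} ===\n{text}")
    return "\n\n".join(parts)
```

With an input like this, a model could only infer "importance" from the content itself, not from how many lines happen to live in one file.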
Ok, yeah, this feels like a reasonable use case for AI. I generated a DeepWiki from one of my repos, and it's pretty informative. It goes into way too much depth on some trivial details, and glosses over more important stuff in places, but overall it seems to have produced a pretty detailed summary of what the package does, and why it does many things the way that it does.
Deepwiki was instrumental in our refactor of a large codebase away from playwright to pure CDP @ browser-use. Huge props to the team that built it, I regularly refer to it as one of the few strictly net positive AI coding tools.
The auto-overviews and diagrams are great, but where it truly shines is the "deep research" follow-up questions system at the bottom. It's much better than using OpenAI deep research or Perplexity to ask questions about complex codebases like puppeteer/playwright/chromium/etc.
This is Cordyceps [0]. My Chrome extension port of the typescript port of Browser Use needs some love but it is contained there.
I wanted to post the port of an MCP with a pure CDP library I found, to use chrome.debugger and thereby remove Playwright, but the typescript and monorepo tooling is out of date. Hopefully I'll get it done tomorrow.
[0] https://github.com/adam-s/cordyceps
Isn’t this supposed to be a short technical blog? Why does it seem like they’re a salesman and it’s a sales pitch?
> "We are generating more code than ever. With LLMs like Claude already writing most of Anthropic’s code, the challenge is no longer producing code, it is understanding it."
The first sentence is already obviously AI generated, and reading through it, the whole post is obviously written by AI, to the point of being distracting.
I understand the author probably feels that AI is better at writing than they are, but I would heavily recommend they use their own voice.
I’ve personally started trying to pick out the points someone prompted the AI with (the actual thoughts of the author), so that I can more easily skim past AI-generated slop such as: "… you’ll get the env setup, required services, and dependency graph with citations to README, Dockerfile, and scripts, so you can hit the ground running".
> Suppose you find a clever mechanism in another repository, such as an authentication flow or a clever way to persist state locally. In that case, you can ask DeepWiki to provide a Markdown cheat sheet: a breakdown of how it works, which files define it, and what it depends on. You can then drop that summary directly into Claude Code or Cursor as structured context and ask it to implement it in your project.
Bonus if the LLM was trained on the original repository.
Then it would be that much more clear that you're just laundering open source code.
It would be nice if it could also read GitHub issues etc. if they were available, so it could have more context about the decisions that were made. For large projects, or those with a lot of issues, this might be impractical though.
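For what it's worth, pulling issues as extra context is straightforward with the GitHub REST API (the `/repos/{owner}/{repo}/issues` endpoint is real; the function names and formatting here are my own sketch, and unauthenticated calls are heavily rate-limited):

```python
import json
import urllib.request

def fetch_issues(owner, repo, state="all", per_page=50):
    """Fetch recent issues (GitHub's issues endpoint also returns PRs)."""
    url = (f"https://api.github.com/repos/{owner}/{repo}/issues"
           f"?state={state}&per_page={per_page}")
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def issues_as_context(issues, limit=20):
    """Flatten issue titles/bodies into a plain-text block an LLM can read."""
    lines = []
    for issue in issues[:limit]:
        lines.append(f"#{issue['number']} [{issue['state']}] {issue['title']}")
        if issue.get("body"):
            lines.append(issue["body"].strip())
        lines.append("")
    return "\n".join(lines)
```

The cost problem the commenter mentions is real, though: a project with tens of thousands of issues won't fit in any context window, so you'd have to filter or summarise first.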
Who is paying for those tokens? I had a long conversation with "Devin" and I must have burnt up a large number of tokens. In any case, thank you "Devin"
Tried it out on Mixxx: https://deepwiki.com/mixxxdj/mixxx
Don't know how to zoom the diagrams on mobile though, and they can easily almost disappear from view when panning around.
The prompt box could do with a way to move it out of the way of the bottom couple of cm in portrait, or to keep it from covering more than a quarter of the screen in landscape orientation.
Yeah, it's not a wiki at all. However, in my experience, many people just know the term 'wiki' as a short form of Wikipedia - and for many, Wikipedia is the only encyclopedia they know. So I guess the Deepwiki authors see it as some kind of deep/specialised encyclopedia.
I recently received an AI-slop bug report for a small open source project (PureLB) that I maintain, and the slop was generated by DeepWiki. It was very incorrect, but I didn't know what "DeepWiki" was so I wasted about an hour. If DeepWiki is causing garbage bug reports even on tiny projects like mine, I can't imagine how much maintainer time it's wasting over all.
giancarlostoro|6 months ago
Edit: it's been like over 10 minutes, and still nothing. That's interesting - I did choose a Lingo source project, so it probably gave up by now.
tacker2000|6 months ago
It's a bit hard to find stuff. I was looking for the structure of the main configuration JSON object and couldn't find it in the deepwiki.
I found it on the "non-AI-created" doc page of the main Elk project[2] (Elkjs is a JS implementation of Elk).
But yes this is of course just one data point.
[1]: https://deepwiki.com/kieler/elkjs/5-usage-guide
[2]: https://eclipse.dev/elk/documentation/tooldevelopers/graphda...
Nullabillity|6 months ago
Annoyingly, anyone can just... request a deepwiki for any GitHub repo. The fact that one exists doesn't mean that it's endorsed or reviewed by the project.
They just kind of barged in, welcome or not. Just another SEO slop-spammer.
oriettaxx|6 months ago
I would love for the code to be open source: I just saw a couple of attempts
* https://github.com/AsyncFuncAI/deepwiki-open
* https://github.com/AIDotNet/OpenDeepWiki
with several stars
IceHegel|6 months ago
They are a conceptual overview and don’t seem tied down enough to the actual implementation details of a particular project.
Perhaps this could be improved.
mkagenius|6 months ago
Somehow voices are a little easier for me to pay attention to than text.
I had done a show HN for https://gitpodcast.com earlier, created with a similar goal in mind.
letaem77|6 months ago
1. Archive the whole repository into a single text file with Repomix (formerly Repopack): https://github.com/yamadashy/repomix
2. To reduce tokens, compress the file with LLMLingua-2: https://github.com/microsoft/LLMLingua
(fewer tokens = more context can be given to the LLM = the LLM better understands your repository)
3. Copy and paste the compressed archive text as context into ChatGPT's input field as-is, or into a local LLM.
4. Ask the LLM to generate documentation. For example: "this is a repository's source code. Given this context, generate a table of contents." Then you will get a ToC. If it looks good, you can ask it to generate the first chapter, and keep going until you finish the whole documentation.
If you are trying to document a TypeScript/JavaScript codebase, you may use a bundler like esbuild for step 2, which will help with token reduction.
If you are interested in step 2's LLMLingua-2, check out my TypeScript port that runs without any installation: https://atjsh.github.io/llmlingua-2-js/
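To illustrate why step 2 matters, here's a toy version of the token-budget arithmetic. The compressor below is a naive stand-in (dropping blank lines and full-line comments), not LLMLingua-2, and the ~4-characters-per-token estimate is a common rough heuristic, not an exact tokenizer:

```python
def rough_token_count(text):
    """Very crude token estimate: roughly 4 characters per token
    for English text and code. Use a real tokenizer for accuracy."""
    return len(text) // 4

def strip_for_tokens(source):
    """Naive stand-in for real prompt compression (LLMLingua-2):
    drop blank lines and full-line comments before pasting the
    repo dump into the model's context."""
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            continue
        kept.append(line)
    return "\n".join(kept)
```

Every token saved on boilerplate is a token of actual code the model gets to see, which is the whole point of the "fewer tokens = more context" note above.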