
Show HN: Replace "hub" by "ingest" in GitHub URLs for a prompt-friendly extract

185 points | cyclotruc | 1 year ago | gitingest.com

Gitingest is an open-source micro dev-tool that I made over the last week.

It turns any public GitHub repository into a text extract that you can easily give to your favourite LLM.

Today I added this url trick to make it even easier to use!
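The URL trick from the title amounts to a single substring replacement; a minimal Python sketch (the function name is mine, not part of gitingest):

```python
def to_ingest_url(github_url: str) -> str:
    """Rewrite a GitHub repo URL by replacing "hub" with "ingest"."""
    return github_url.replace("hub", "ingest", 1)

print(to_ingest_url("https://github.com/cyclotruc/gitingest"))
# -> https://gitingest.com/cyclotruc/gitingest
```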

How I use it myself:

- Quickly generate a README.md boilerplate for a project
- Ask LLMs questions about an undocumented codebase

It is still very much a work in progress, and I plan to add many more options (file size limits, exclude patterns...) and a public API.

I hope this tool can help you. Your feedback is very valuable to help me prioritize, and contributions are welcome!

51 comments


wwoessi|1 year ago

Hi, great tool!

I made https://uithub.com two months ago. Its specialty is that seeing a repo's raw extract is just a matter of changing 'g' to 'u'. It also works for subdirectories, so if you just want the docs of Upstash QStash, for example, go to https://uithub.com/upstash/docs/tree/main/qstash

Great to see this keeps being worthwhile!

Arcuru|1 year ago

That looks awesome. You didn't mention it, but uithub.com also has an API; I can definitely see myself using this for a new tool.

helsinki|1 year ago

I wonder why nobody uses the JSONL format to represent an entire codebase? It's what I do, and LLMs seem to prefer it. In fact, an LLM suggested this strategy to me. It uses fewer characters, too.
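One plausible reading of the JSONL idea, sketched with a made-up one-object-per-file `{"path", "content"}` schema (not any standard):

```python
import json
from pathlib import Path

def repo_to_jsonl(root: str) -> str:
    """Emit one JSON object per text file: {"path": ..., "content": ...}."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        lines.append(json.dumps({"path": path.relative_to(root).as_posix(),
                                 "content": text}))
    return "\n".join(lines)
```

Because each record is a single line, the output diffs and truncates cleanly, which may be part of why models handle it well.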

Fokamul|1 year ago

Nothing against gitingest.com, but this is really the peak of technology. Having LLMs which require feeding them info via copy&paste: peak of effectiveness, too. OMFG.

evmunro|1 year ago

Great idea to make it just a simple URL change. Reminds me of the YouTube download websites.

I made a similar CLI tool[0] with the added feature that you can pass `--outline` and it'll omit function bodies (while leaving their signatures). I've found it works really well for giving a high-level overview of huge repos.

You can then progressively expand specific functions as the LLM needs to see their implementation, without bloating up your context window.

[0] https://github.com/everestmz/llmcat
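llmcat is its own tool, but the `--outline` idea (keep signatures, drop bodies) can be sketched for Python source with the stdlib `ast` module; this is my illustration, not llmcat's implementation:

```python
import ast

def outline(source: str) -> str:
    """Replace every function body with `...`, keeping the signature."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(ast.Constant(...))]  # stub out the body
    return ast.unparse(tree)
```

Calling `outline("def add(a, b):\n    total = a + b\n    return total")` keeps the `def add(a, b):` line but collapses the body to `...`, so a large module shrinks to its API surface.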

lukejagg|1 year ago

Is Unicode really the best way to display the file structure? The special Unicode characters each encode to two tokens, so I doubt it would work better overall for larger repos.

shawnz|1 year ago

Also, even if different characters were used, the 2D ASCII-art style representation of the directory tree strikes me as something that's not going to be easily interpreted by an LLM, which may not have a conception of how characters are laid out in 2D space.
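A low-effort mitigation for the token cost (whether it actually helps a given model is an empirical question) is to translate the box-drawing characters to plain ASCII, which usually tokenizes more cheaply:

```python
# Map common box-drawing characters to ASCII lookalikes.
ASCII_TREE = str.maketrans({"├": "|", "└": "`", "─": "-", "│": "|"})

tree = "src\n├── main.py\n└── utils\n    └── io.py"
print(tree.translate(ASCII_TREE))
```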

Jet_Xu|1 year ago

Interesting approach! While URL-based extraction is convenient, I've been working on a more comprehensive solution for repository knowledge retrieval (llama-github). The key challenge isn't just extracting code, but understanding the semantic relationships and evolution patterns within repositories.

A few observations from building large-scale repo analysis systems:

1. Simple text extraction often misses critical context about code dependencies and architectural decisions
2. Repository structure varies significantly across languages and frameworks - what works for Python might fail for complex C++ projects
3. Caching strategies become crucial when dealing with enterprise-scale monorepos

The real challenge is building a universal knowledge graph that captures both explicit (code, dependencies) and implicit (architectural patterns, evolution history) relationships. We've found that combining static analysis with selective LLM augmentation provides better context than pure extraction approaches.

Curious about others' experiences with handling cross-repository knowledge transfer, especially in polyrepo environments?

ComputerGuru|1 year ago

Instead of a copy icon, it would be better to just generate the entire content as plain text in the result (not in an HTML div on a rich HTML page), so the URL could be used as an attachment or its contents piped directly into an agent/tool.

Ctrl-a + ctrl-c would remain fast.

vallode|1 year ago

Agreed, it's a missed opportunity not to be able to change a URL from github.com/cyclotruc/gitingest to gitingest.com/cyclotruc/gitingest and simply receive the result as plain text. A very useful little tool nonetheless.

wwoessi|1 year ago

for that you can use https://uithub.com (g -> u)

- for browsers it shows HTML
- for curl it returns raw text
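I don't know uithub's actual implementation, but serving HTML to browsers and plain text to curl is typically keyed on the `User-Agent` header; a stdlib sketch of that behavior:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExtractHandler(BaseHTTPRequestHandler):
    """Plain text for curl, HTML for everything else (hypothetical logic)."""
    def do_GET(self):
        if self.headers.get("User-Agent", "").startswith("curl/"):
            body, ctype = b"repo extract as plain text\n", "text/plain"
        else:
            body, ctype = b"<html><body>rich repo view</body></html>", "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("127.0.0.1", 8000), ExtractHandler).serve_forever() would run it
```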

nfilzi|1 year ago

Looks neat! From what I understand, it's like zipping up your codebase into a streamlined TXT version for LLMs to ingest better?

What would you say are the differences from using something like Cursor, which already has access to your codebase?

cyclotruc|1 year ago

It's in the same lane; it's just that sometimes you need a quick and handy way to get that streamlined TXT from a public repo without leaving your browser.

fastball|1 year ago

Might be good to have some filtering as well. I added a repo that has a heap of localized docs that don't make much sense to ingest into an LLM but probably use up a majority of the tokens.

cyclotruc|1 year ago

Hey! OP here: gitingest is getting a lot of love right now, sorry if it's unstable but please tell me what goes wrong so I can fix it!

smcleod|1 year ago

I wrote a tool some time ago called ingest ... to do exactly this from both local directories, files, web urls etc... as well as estimating tokens and vram usage: https://github.com/sammcj/ingest

nonethewiser|1 year ago

I implemented this same idea in bash for local use. Useful but only up to a certain size of codebase.
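A bash version like the one described might look like this (my own sketch, not nonethewiser's script); it prefixes each file with a header so the model can tell where one file ends and the next begins:

```shell
# Dump every .py file under a directory into one prompt-ready stream.
# The '=== path ===' header format and the .py filter are illustrative choices.
ingest_dir() {
  find "$1" -type f -name '*.py' ! -path '*/.git/*' | sort | while read -r f; do
    printf '=== %s ===\n' "$f"
    cat "$f"
    printf '\n'
  done
}

# usage: ingest_dir src > codebase.txt
```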

Cedricgc|1 year ago

Does this use the txtar format created for developing the Go language?

I actually use txtar with a custom CLI to quickly copy multiple files to my clipboard and paste it into an LLM chat. I try not to get too far from the chat paradigm so I can stay flexible with which LLM provider I use
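txtar (from golang.org/x/tools) is a deliberately trivial archive format: an optional comment, then each file introduced by a `-- name --` line. A small Python emitter, as a sketch:

```python
def to_txtar(files: dict, comment: str = "") -> str:
    """Serialize {path: contents} into txtar: an optional comment,
    then a `-- name --` header line before each file's contents."""
    parts = [comment if comment.endswith("\n") else comment + "\n"] if comment else []
    for name, body in files.items():
        if not body.endswith("\n"):
            body += "\n"  # txtar file sections are newline-terminated
        parts.append(f"-- {name} --\n{body}")
    return "".join(parts)
```

The flat, line-oriented layout is exactly why it pastes well into a chat window.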

anamexis|1 year ago

It seems to be broken, getting errors like "Error processing repository: Path ../tmp/pallets-flask does not exist"

cyclotruc|1 year ago

Thank you, I'll look into it

modelorona|1 year ago

Very cool! I will try this over the weekend with a new android app to see what kind of README I can generate.

Do you have any plans to expand it?

cyclotruc|1 year ago

Yes, I want to add a way to target a token count to control your LLM costs.
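A token-count target can be approximated even without a tokenizer using the common rough heuristic of ~4 characters per token for English text and code; this sketch is the heuristic, not gitingest's planned implementation:

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Trim text to roughly max_tokens via the ~4 chars/token heuristic.

    A real tokenizer (e.g. the model vendor's own) would be more accurate.
    """
    limit = int(max_tokens * chars_per_token)
    return text if len(text) <= limit else text[:limit]
```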

bosky101|1 year ago

For some reason it was giving a large file instead of reading from the README.

Exuma|1 year ago

Isn't there a limit on prompt size? How would you actually use this? I'm not very up to speed on this stuff.

xnx|1 year ago

Gemini Pro has a 2 million character context window which is ~1000 pages of code.

lolinder|1 year ago

Most projects would be way too big to put into a prompt. Even if you're technically within the official context window, those numbers are often misleading: the window where input is actually useful is usually much smaller than advertised.

What you can do with something like this is store it in a database and then query it for relevant chunks, which you then feed to the LLM as needed.
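The store-and-query approach described above is the usual retrieval-augmented-generation pattern; a toy sketch with fixed-size overlapping chunks and keyword scoring standing in for a real vector database:

```python
def chunk(text: str, size: int = 800, overlap: int = 100) -> list:
    """Split text into overlapping fixed-size chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def retrieve(chunks: list, query: str, k: int = 3) -> list:
    """Rank chunks by naive keyword overlap with the query (a stand-in
    for embedding similarity in a real setup)."""
    terms = set(query.lower().split())
    return sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))[:k]
```

You would feed only the top-k chunks to the LLM per question, keeping each prompt well inside the usable context window.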

matt3210|1 year ago

The example buttons are a nice touch

hereme888|1 year ago

It's like a web version of Repomix

spencerchubb|1 year ago

GitHub already has a way to get the raw text files.

barbazoo|1 year ago

All of them in one operation? How?

moralestapia|1 year ago

This is really nice, congrats on shipping.

I also really like this idea in general of APIs being domains, eventually making the web a giant supercomputer.

Edit: There is literally nothing wrong with this comment but feel free to keep downvoting, only 5,600 clicks to go!