
Ask HN: How do you familiarize yourself with a new codebase?

407 points| roflc0ptic | 10 years ago

My question is pretty straightforward: how do you, hacker news enthusiast, familiarize yourself with a new codebase? Obviously your answer is going to be contingent on the kind of work that you do.

Some background: What's motivating me to ask is that I am flirting with the idea of trying to add a couple of features to SlickGrid (https://github.com/mleibman/SlickGrid), Michael Leibman's phenomenal javascript grid widget. Unfortunately Leibman got busy and isn't actively supporting it anymore.

The codebase is something like 8k lines of javascript, so it's not ludicrously big, but I'm kind of intimidated thinking about trying to make sense of it. My first strategy is just to open up important-looking javascript files (slick.core.js, slick.grid.js) and read through for comprehension. This seems like a pretty slow way to build a mental model of the code, though. Some features I want to implement are 1. an ajax data source that doesn't require paging, and 2. frozen columns. Someone else has implemented a buggy version of frozen columns (and since abandoned the project), and I might like to use it, but I can't tell if it's buggy because it's a hard problem, or because their implementation strategy was poor (or both!). So at the moment I can't evaluate if I should implement my own, or try to fix the issues with theirs.

Picking up other people's code seems to be one of the harder tasks developers face, as evidenced by how much code gets abandoned, so I wondered if the voices of experience on here could point me in the right direction, either by talking about this problem in particular, or more generally, how you build knowledge about a new codebase.

Thanks!

238 comments

[+] tessierashpool|10 years ago|reply
I wrote some simple bash scripts around git which allow me to very quickly identify the most frequently-edited files, the most recently-edited files, the largest files, etc.

https://github.com/gilesbowkett/rewind

it's for assessing a project on day one, when you join, especially for "rescue mission" consulting. it's most useful for large projects.

the idea is, you need to know as much as possible right away. so you run these scripts and you get a map which immediately identifies which files are most significant. if it's edited frequently, it was edited yesterday, it was edited on the day the project began, and it's a much bigger file than any other, that's obviously the file to look at first.
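A rough sketch of the "most frequently edited files" idea, in Python rather than bash. This is not the code from the rewind repo; the function names `count_edits` and `most_edited` are made up for illustration, and it assumes `git` is on your PATH.

```python
import subprocess
from collections import Counter


def count_edits(log_text: str) -> Counter:
    """Count how often each path appears in the output of
    `git log --name-only --pretty=format:` (one path per line,
    blank lines between commits)."""
    paths = (line.strip() for line in log_text.splitlines())
    return Counter(path for path in paths if path)


def most_edited(repo_dir: str, top: int = 10) -> list[tuple[str, int]]:
    """Run git inside repo_dir and return the `top` most frequently edited paths."""
    log_text = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. repo_dir is not a repository
    ).stdout
    return count_edits(log_text).most_common(top)
```

Calling `most_edited(".")` from inside a checkout returns `(path, edit_count)` pairs, busiest first. Renamed files are counted under each name they have had, so treat the numbers as a first map rather than a precise measure.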

we tend to view files in a list, but in reality, some files are very central, some files are out on the periphery and only interact with a few other files. you could actually draw that map, by analyzing "require" and "import" statements, but I didn't go that far with this. those vary tremendously on a language-by-language basis and would require much cleverer code. this is just a good way to hit the ground running with a basic understanding which you will very probably revise, re-evaluate, or throw away completely once you have more context.
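For what it's worth, a very crude version of that "require"/"import" map can be sketched with regular expressions. This assumes JavaScript-style `require("x")` calls and `import ... from "x"` statements with literal string paths; it is not a real parser, and the names `dependencies` and `inbound_counts` are hypothetical.

```python
import re
from collections import Counter

# require("x") or require('x') with a literal string argument.
REQUIRE_RE = re.compile(r"""require\(\s*['"]([^'"]+)['"]\s*\)""")
# `import ... from "x"` or a bare `import "x"` at the start of a line.
IMPORT_RE = re.compile(
    r"""^\s*import\s+(?:[^'"]*?\s+from\s+)?['"]([^'"]+)['"]""",
    re.MULTILINE,
)


def dependencies(source: str) -> list[str]:
    """Return the module paths a single source file pulls in."""
    return REQUIRE_RE.findall(source) + IMPORT_RE.findall(source)


def inbound_counts(sources: dict[str, str]) -> Counter:
    """Map each module path to the number of files that depend on it.

    `sources` maps a file name to that file's text.
    """
    counts = Counter()
    for text in sources.values():
        counts.update(set(dependencies(text)))  # count each file once per module
    return counts
```

Modules with the highest inbound count are candidates for the "central" files; ones that nothing depends on sit at the periphery. Paths are compared as written, so `./core` and `../lib/core` would be counted separately unless you normalize them first.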

but to answer your actual question, you do some analysis like this every time you go into an unfamiliar code base. you also need to get an idea of the basic paradigms involved, the coding style, etc. -- stuff which would be much harder to capture in a format as simple as bash scripts.

one of the best places to start is of course writing tests. Michael Feathers wrote a great book about this called "Working Effectively with Legacy Code." brudgers's comment on this is good too but I have some small disagreements with it.

[+] fokz|10 years ago|reply
Thanks for sharing. I often get lost in large projects. Blindly jumping around is quite inefficient and frustrating.

How hard do you think it is to write a tool to draw a dependency map for a specific language?

Maybe there are built-in code analysis tools in compilers for popular languages that I'm not aware of?

[+] rch|10 years ago|reply
Great answer. I can't begin to estimate how many times I've considered writing something along these lines.

It might be nice to surface files that are frequently edited together as well.
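One possible way to surface files that change together, sketched in Python. It assumes the log was produced with `git log --name-only --pretty=format:%x00`, so every commit starts with a NUL byte; `parse_commits` and `co_changes` are made-up names.

```python
from collections import Counter
from itertools import combinations


def parse_commits(log_text: str) -> list[list[str]]:
    """Split NUL-delimited `git log --name-only` output into per-commit file lists."""
    commits = []
    for chunk in log_text.split("\x00"):
        files = [line.strip() for line in chunk.splitlines() if line.strip()]
        if files:
            commits.append(files)
    return commits


def co_changes(commits: list[list[str]]) -> Counter:
    """Count how many commits touched each unordered pair of files."""
    pairs = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) land on the same key.
        pairs.update(combinations(sorted(set(files)), 2))
    return pairs
```

`co_changes(parse_commits(log_text)).most_common(20)` then lists the pairs most often committed together. Commits that touch hundreds of files (mass reformatting, vendored code) produce a lot of pairs, so it may be worth skipping commits above some size.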

[+] thameera|10 years ago|reply
Does anybody know of a similar analysis tool for an SVN project? One option would be to convert the SVN to git for analysis purposes, but I'd be interested in a better solution.
[+] baby|10 years ago|reply
Is it normal that it takes a while (> 2 minutes) to run your program on some big Java repo?

edit: took like 4 minutes, worked well!

[+] rajathagasthya|10 years ago|reply
Thanks for sharing. For someone in their early career like me, this is very useful.
[+] lsiebert|10 years ago|reply
Hmm... this is an interesting project. No License info though.
[+] zodvik|10 years ago|reply
Very simple & quite effective. Thanks for sharing.
[+] rpcope1|10 years ago|reply
Very cool! Thanks for sharing your scripts.
[+] nmb|10 years ago|reply
thanks for making this and sharing!
[+] scott_s|10 years ago|reply
A post from last year, "Strategies to quickly become productive in an unfamiliar codebase": https://news.ycombinator.com/item?id=8263402

My comment from that thread:

I do the deep-dive.

I start with a relatively high level interface point, such as an important function in a public API. Such functions and methods tend to accomplish easily understandable things. And by "important" I mean something that is fundamental to what the system accomplishes.

Then you dive.

Your goal is to have a decent understanding of how this fundamental thing is accomplished. You start at the public facing function, then find the actual implementation of that function, and start reading code. If things make sense, you keep going. If you can't make sense of it, then you will probably need to start diving into related APIs and - most importantly - data structures.

This process will tend to have a point where you have dozens of files open, which have non-trivial relationships with each other, and they are a variety of interfaces and data structures. That's okay. You're just trying to get a feel for all of it; you're not necessarily going for total, complete understanding.

What you're going for is that Aha! moment where you can feel confident in saying, "Oh, that's how it's done." This will tend to happen once you find those fundamental data structures, and have finally pieced together some understanding of how they all fit together. Once you've had the Aha! moment, you can start to trace the results back out, to make sure that is how the thing is accomplished, or what is returned. I do this with all large codebases I encounter that I want to understand. It's quite fun to do this with the Linux source code.

My philosophy is that "It's all just code", which means that with enough patience, it's all understandable. Sometimes a good strategy is to just start diving into it.

[+] bentcorner|10 years ago|reply
I find it frustrating that language features work actively against you when you're trying to understand something.

Wide inheritance and macro usage are probably the worst. Good naming can aid understanding, but basic things like searchability are harmed by this.

Of those two, macros are the most trouble. You can't take anything for granted, and must look at every expression with an eye for detail. Taking notes becomes essential.

[+] goshx|10 years ago|reply
This is my approach too. I like to understand the entire flow, from the beginning to the end. To me this is the best way to get familiar because once you dive from different entry points you start noticing the patterns and similar paths in the code to the point where you don't need to dive to those areas again, as you quickly assimilate them by simply going over multiple times.
[+] slightlycuban|10 years ago|reply
Interesting. I take a similar approach, but then I add testing (which usually coincides with fixing some bug).

Find an entry point to the system, then make it compile, then make the test run, then just keep on nudging the code until I'm satisfied I've covered what I'm interested in.

If I stay true to my mantra "only add test code when it is absolutely necessary" (is this argument needed? pass null and find out), I find an accurate (albeit not pretty) description of the flow through that procedure.

Then you commit your test and save your discovery for posterity.

[+] matwood|10 years ago|reply
I also like to find a high level interface or function and follow down. Once I get to the bottom, I then start following the important data. This is particularly helpful nowadays when data frequently moves between multiple systems before seeing easily visible results.
[+] JustSomeNobody|10 years ago|reply
1. I make sure I can build and run it. I don't move past this step until I can. Period.

After that, if I don't have a particular bug I'm looking to fix or feature to add, I just go spelunking. I pick out some interesting feature and study it. I use pencil and paper to make copious notes. If there's a UI, I may start tracing through what happens when I click on things. I do this, again with pencil and paper first. This helps me use my mind to reason about what the code is doing instead of relying on the computer to tell me. If I'm working on a bug, I'll first try and recreate the bug. Again, taking copious notes in pencil and paper documenting what I've tried. Once I've found how to recreate it, I clean up my notes into legible recreate steps and make sure I can recreate it using those steps. These steps are later included in the bug tracker. Next I start tracing through the code taking copious notes, etc, etc. yada yada. You get the picture.

[+] monk_e_boy|10 years ago|reply
Debugger! Surprised no one has mentioned it yet. I work in js and php, and I use the debugger a lot in both.

Set a breakpoint, burn through the code. Chrome has some really nice features: you can tell it to skip over files (like jQuery), and you can open the console, poke around, and set variables to see what happens.

Stepping through the code line by line for a few hours will soon show you the basics.
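When stepping by hand is too slow, a lightweight alternative is to record which functions get called. The sketch below does that for Python code using the standard `sys.settrace` hook; it is an illustration of the idea in a different language, not a substitute for the Chrome or PHP debuggers mentioned above, and `trace_calls` is a made-up helper name.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, calls).

    `calls` lists the names of the Python functions entered, in call order.
    Functions implemented in C (most builtins) do not show up.
    """
    calls = []

    def tracer(frame, event, arg):
        if event == "call":
            calls.append(frame.f_code.co_name)
        return None  # no line-by-line tracing inside the frame

    previous = sys.gettrace()  # keep any debugger or coverage hook intact
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)
    return result, calls
```

Wrapping an entry point, for example `result, calls = trace_calls(main_entry)`, gives a flat list of everything that ran underneath it, which is often enough to decide where to put real breakpoints.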

[+] parshimers|10 years ago|reply
Debugging through the test cases in particular is a good way to decipher/dissect things, at least that I have found. Usually you can find a test case that is only for a specific component that you are interested in, and then the test case should only exercise those pieces, so there is not an overwhelming amount of information all at once.
[+] collyw|10 years ago|reply
I am surprised how few younger programmers use a debugger these days.
[+] shogun21|10 years ago|reply
What debugger do you use for PHP? I've yet to find one I really like.
[+] __oz|10 years ago|reply
Debugger++

Without a debugger you're a sitting duck!

[+] kabdib|10 years ago|reply
I just crack open the source base with Emacs, and start writing stuff down.

I use a large format (8x11 inch) notebook and start going through the abstractions file by file, filling up pages with summaries of things. I'll often copy out the major classes with a summary of their methods, and arrows to reflect class relationships. If there's a database involved, understanding what's being stored is usually pretty crucial, so I'll copy out the record definitions and make notes about fields. Call graphs and event diagrams go here, too.

After identifying the important stuff, I read code, and make notes about what the core functions and methods are doing. Here, a very fast global search is your friend, and "where is this declared?" and "who calls this?" are best answered in seconds. A source-base-wide grep works okay, but tools like Visual Assist's global search work better; I want answers fast.
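If no such tool is at hand, the two questions can be approximated with a small script. This is a sketch only: it assumes JavaScript-like sources where functions are declared as `function name(`, `name = function` or `name: function`, works line by line with regular expressions, and uses the hypothetical helper name `find_symbol`.

```python
import re


def find_symbol(sources: dict[str, str], name: str):
    """Return (definitions, references) for `name`.

    `sources` maps file names to file text. Each result entry is a
    (file_name, line_number) pair, with line numbers starting at 1.
    """
    escaped = re.escape(name)
    word = re.compile(r"\b" + escaped + r"\b")
    definition = re.compile(
        r"\bfunction\s+" + escaped + r"\s*\("        # function name(
        r"|\b" + escaped + r"\s*[:=]\s*function\b"   # name = function / name: function
    )
    definitions, references = [], []
    for file_name, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if definition.search(line):
                definitions.append((file_name, number))
            elif word.search(line):
                references.append((file_name, number))
    return definitions, references
```

The first list answers "where is this declared?" and the second is a starting point for "who calls this?". It will also match the name inside comments and strings, which a real code-aware search would filter out.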

Why use pen and paper? I find that this manual process helps my memory, and I can rapidly flip around in summaries that I've written in my own hand and fill in my understanding quite quickly. Usually, after a week or so I never refer to the notes again, but the initial phase of boosting my short term memory with paper, global searches and "getting my hands to know the code" works pretty well.

Also, I try to get the code running and fix a bug (or add a small feature) and check the change in, day one. I get anxious if I've been in a new code base for more than a few days without doing this.

[+] agentgt|10 years ago|reply
There is a significant number of answers that may interest you on Stackoverflow. Specifically: http://stackoverflow.com/questions/215076/whats-the-best-way...

I do two things to familiarize myself with a code base. The first is to look at how the data is stored. Particularly if it's using a database with well named tables I can get some rough ideas of how the system works. Then from there I look at other data objects. Data is easier to understand than behavior.

The other is watching the initialization process of the application with a debugger or logger. Along those lines if you're lucky (my opinion) and the application uses dependency injection of some sort you can look to see how the components are wired together. Generally there is an underlying framework to how code pieces work together and that generally reveals itself in the initialization process if it's not self evident.

[+] bite_victim|10 years ago|reply
Side rant:

I just cannot believe people praising 'Unit Test'-ing. Fellow programmers, how exactly do you unit test a method / function which draws something on the canvas for example? You assert that it doesn't break the code?!

I see some really talented people out there who write unit tests as proof that their code works without issues, that it's awesome and it cooks eggs and bacon etc. They write such laughable tests you cannot even tell if they are joking or not. They test if the properties / attributes they are using in methods are set or not at various points in the setup routine. Or if some function is being called after an event is being triggered.

My point is this: unit testing can only cover such tiny, tiny scenarios and mostly logic stuff that it is almost useless in understanding what is going on in the big picture. Take for example a backbone application like the Media Manager in WordPress. Please tell me how somebody can even begin to unit test something like that.

Unit testing is a joke. And sometimes a massive time consuming joke with a fraction of a benefit considering the obvious limitation(s).

[+] Mithaldu|10 years ago|reply
This may or may not apply to you, since i work with Perl. Typically i'm in a situation where i'm supposed to improve on code written by developers with less time under their belt.

As such my first steps are:

1. tidy/beautify all the code in accordance with a common standard

2. read through all of it, while making the code more clear (split up if/elsif/else christmas trees, make functions smaller, replace for loops with list processing)

While doing that i add todo comments, which usually come with questions like "what the fuck is this?" and make myself tickets with future tasks to do to clean up the codebase.

By the end of it i've looked at everything once, got a whole bunch of stuff to do, and have at least a rough understanding of what it does.

[+] lukaslalinsky|10 years ago|reply
Please don't take this as a criticism, but how long have you been programming? I'm asking because I used to have an opinion like this when I was just starting, but after a few years I realized that changing all of the code as the first thing is one of the worst things to do.
[+] jeremiep|10 years ago|reply
Both points aren't possible to do with large codebases.

It will take months merely to tidy the code with the effect of making the rest of the team hate you for committing thousands of files for superficial changes. It's much more productive for everyone to simply adapt to the existing style guidelines.

Reading all of the code is only an option for the smallest of codebases. Reading code will only get you so far before you get lost in the complexity of how all the parts interact with each other.

A better approach would be to limit yourself to a subset of the codebase and start poking around with a debugger while the system is running. Then you can gradually work your way through the codebase starting from the core functionality.

[+] vitorbaptistaa|10 years ago|reply
Do you follow this process even on code without tests? How do you make sure you're not introducing bugs in the process? Or you don't commit your changes afterwards?
[+] agentgt|10 years ago|reply
A while back I worked with a team that had just brought on some new developers. Some of the developers were eager and would learn without human intervention. Others would require and ask for some mentoring.

Neither way of learning is right or wrong and I appreciated both groups, but one thing I do remember is one guy who did exactly what you did, and it was highly irritating as a project lead to see relatively random files (given whatever sprint we were on) that were considered stable show up in the source control change list without some consideration or discussion. I would have to waste time and look at what change he made to see if it was ok.

[+] jlarocco|10 years ago|reply
Well, that doesn't scale at all. I don't remember the last time I've worked on a project where even skimming all of the code would be possible in a reasonable amount of time, much less actually reading it and refactoring it.

That said, it probably does scale to the OP's 8k line code base.

Also, running the code through a formatter and refactoring it all right off the bat is a sure fire way to piss off everybody else working on the project.

In any case, my 2 cents for the OP is to not bother trying to learn the whole codebase, but instead focus on just the areas you want to enhance. For example, in the case of the new data source, find out how the existing data sources are implemented, and use them as examples for adding a new one.

[+] vineet|10 years ago|reply
I studied a lot of people doing this as part of my PhD. The thing is that there are not many answers that work well in a lot of situations. Given that though, my suggestion is to iterate on developing three views of the code:

1. The Mile High View: A layered architectural diagram can be really helpful to know how the main concepts in a project are related to one another.

2. The Core: Try to figure out how the code works with regards to these main concepts. Box and arrow diagrams on paper work really well.

3. Key Use Cases: I would suggest tracing at least one key use case for your app.

[+] jpgvm|10 years ago|reply
I usually work on more traditional command line applications and daemons so my approach might be a little different to a web developer.

I always start by gauging how much source code there is and how it's structured. The *nix utility "tree" and the source code line counter "cloc" are usually the first 2 things I run on a codebase. This tells me what languages the application uses, how much of each, how well commented it is and where those files are.
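If cloc isn't installed, a much cruder stand-in gives the same first impression of "what languages, how much of each". The sketch below only counts non-blank lines per file extension; unlike cloc it does not separate comments from code, and `lines_by_extension` is a made-up name.

```python
from collections import Counter
from pathlib import Path


def lines_by_extension(root: str) -> Counter:
    """Sum non-blank lines per file extension under root, skipping .git."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        non_blank = sum(1 for line in text.splitlines() if line.strip())
        totals[path.suffix or "(none)"] += non_blank
    return totals
```

`lines_by_extension(".").most_common()` lists extensions from largest to smallest, which is usually enough to see where the bulk of the code lives before reaching for the real tools.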

The next thing I usually do is find the entry point of the program. In my case this is usually an executable that calls into the core of the library and sets up the initial application state and starts the core loop and routine that does the guts of the work.

Once I have found said core routine I try to get a grasp of what the state machine of the program looks like. If it's a complicated program this step takes quite a while but is very important for gaining an intuitive understanding of how to either add new features or fix bugs. I like to use my pen and paper to help me explore this part as I often have to back track over source files and re-evaluate what portions mean.

Once I have what I think is the state machine worked out I like to understand how the program takes input or is configured. In the case of a daemon that often means understanding how configuration files are loaded and how the configuration is represented in memory. Important to cover here is how default values are handled etc. I actually prioritise this over exploring the core loop's ancillary functions (the bits that do the "real" work) as I find it hard to progress to that stage without understanding how the initial state is setup.

Which brings us to said "real" work. Hanging off of the core loop will be all the functions/modules that are called to do the various parts of the program's function. By this time you should already know what these do even if you don't know how they work. Because you already have a good high level understanding at this point you can pick and choose which modules you need to cover and when to cover them.

[+] droppedasakid|10 years ago|reply
Whatever your IDE/editor of choice is, I think having these three functions is critical to learning a new codebase, or even developing for that matter:

1. Go to definition

2. Find all references

3. Navigate back

This allows you to go down any code rabbit hole, figure stuff out, then get back to where you were. If you can't do those things it will take much longer to understand how things are interconnected.

[+] gshx|10 years ago|reply
I start with running the tests if there are any. Typically peeling layers of the onion starting with the boundary. If there are no tests, then I'll try to write them. Then running tests in debug mode helps step through the code. If I have the luxury of asking questions to an engineer experienced with the codebase, I request a high level whiteboarding session all the while being cognizant of their time.

Some others have mentioned recency/touchTime as another signal. For large complex codebases, that may or may not always work.

[+] brudgers|10 years ago|reply
When you think you understand something write a test and test your belief. If the test passes then both your knowledge and the code base are better for it. If the test fails then rewrite the test to the failure and write another test. Again you will know more and the code base will be better.

Good luck.

[+] tessierashpool|10 years ago|reply
I feel your comment should be the top one. but I disagree with this bit:

> the code base will be better.

I'd change "will be" to "might get." because this is true if you're doing unit tests that the code base can use. but sometimes you do characterization tests, which are not worth keeping around. or you might build a couple variations on "hello world" with the unfamiliar code base, just to be sure that it works the way you think it does.
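To make the "test your belief" loop and the characterization-test idea concrete, here is a small sketch using Python's `unittest`. `legacy_format_price` is an invented stand-in for unfamiliar code, not anything from SlickGrid; the point is only that the assertions record what the code does today, including behaviour you did not expect.

```python
import unittest


def legacy_format_price(cents):
    """Hypothetical piece of inherited code whose behaviour we want to pin down."""
    return "$" + str(cents // 100) + "." + str(cents % 100)


class CharacterizeFormatPrice(unittest.TestCase):
    def test_belief_two_digit_cents(self):
        # Belief: prices come out as dollars.cents. This one holds.
        self.assertEqual(legacy_format_price(1999), "$19.99")

    def test_observed_no_zero_padding(self):
        # Belief was "$5.05"; the test failed, so it was rewritten
        # to the observed result. Now we know cents are not padded.
        self.assertEqual(legacy_format_price(505), "$5.5")

    def test_observed_whole_dollars(self):
        # Same discovery from another angle: 500 cents renders as "$5.0".
        self.assertEqual(legacy_format_price(500), "$5.0")
```

Run it with `python -m unittest` on the file. Whether tests like the last two stay in the code base is the judgment call discussed above: they document current behaviour, which may be a bug rather than a requirement.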

[+] contentfairplay|10 years ago|reply
What if you wrote a test that passes at the time of writing because of how something is implemented at that time but it's not actually an invariant?
[+] nissimk|10 years ago|reply
I agree with what many others on here have said. It's also a personal thing. In general I like to try to force myself to learn only the minimum required to do what I need to do. If that philosophy sounds good to you, I would recommend taking the buggy version of frozen columns and try to fix the bugs. You may learn that the bugs are structural and you need to implement it differently, or you might be able to fix it with minimal changes. You will certainly get an understanding of the parts of slickgrid that you need to interact with to add this feature.

For the ajax data source thing, I would try to modify or extend the existing data source code to add the behavior you are looking for. As you mess around with it trying to figure out what you need to change, you will encounter the areas of the code that you need to understand.

With this sort of strategy you can avoid having to fully understand all the code while still being able to modify it. You might end up implementing stuff in a way which is not the best, but you will probably be able to implement it faster. It's the classic technical debt dilemma: understanding the complete codebase will allow you to design features that fit in better and are easier to maintain and enhance, but it will take a lot longer than just hacking something together that works.

[+] fourier|10 years ago|reply
I work a lot with huge legacy codebases in C/C++. Here is some advice:

1. Be sure that you can compile and run the program

2. Have good tools to navigate around the code (I use git grep mostly)

3. Most apps contain some user or other service interaction - try to pick some easy bit (like a request for capabilities or some simple operation) and follow it until the end. You don't need a debugger for it - grep/git grep is enough, these simple tools will force you to understand the codebase deeply.

4. Sometimes writing UML diagrams works -

- Draw the diagrams (class diagrams, sequence diagrams) of the current state of things

- Draw the diagrams with a proposal of how you would like to change

5. If it is possible, use a debugger, start with the main() function.

[+] Sakes|10 years ago|reply
I wish I had a better answer, but I honestly just stumble around it. I typically start by trying to understand how they structured their files, then I'll start diving into the code. I wouldn't try to "understand" it completely. Just look over it until you feel comfortable enough to try to make some modifications.

Michael's code looks clean and well organized. Shouldn't be terribly difficult for someone proficient at JS.

[+] eterm|10 years ago|reply
My approach is to break stuff. If I can break it (and I am good at finding bugs, so I usually can) then I now have a narrow focus which helps me avoid getting "lost" in the code base.

Once I've found and fixed a few things, or if the code base is so small or clean that I can't find bugs to fix, I'll set about hacking in the feature I'd like.

I usually start by doing it in the most hacky way possible. That sounds like a bad approach but it narrows the search of how to implement it and means I'm not constraining myself to fit the code base that I don't yet appreciate.

In hacking that feature I'll often break a few things through my carelessness. In then trying to alter my hacked approach so it no longer breaks stuff I'll become more aware of the wider code base from the point of view of my initial narrow focus. This lets me build up the mental model.

Eventually I'll be comfortable enough I can re-write the feature in a way more consistent with the wider code base.

I don't normally start by trying to "read all the code" because that guarantees I won't understand much of it (I'm not quick at picking up function from code). I might have a skim if it is well organised, but I find the "better" written a lot of stuff is, the harder it is to grok what it is actually doing from reading it. To me, reading good code is often like trying to read the FizzBuzz Enterprise Edition[1].

I've worked on many legacy systems: I was last year implementing new features into a VB6 code base, this year (at a different job) I am helping migrate from asp webforms to a more modern system. I've found that starting with trying to fix an issue to be the best way to dive into the code base.

Use good source control so you're never "worried" about changing anything or worrying that you might lose your current state. Commit early, commit often, even when "playing around".

[1] https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

[+] jdefr89|10 years ago|reply
I tend to use a hybrid approach, but in general I try to identify the entry point of the code, which will lead me to the core data structures and possibly event loops that act as a central hub for any other code that is called. That is, I look for some kind of dispatch pattern that integrates the rest of the system, routing and calling different code when needed. Once you identify this "hub" you will have a good mental model of the system and its high level components. From there you can delve into different subsystems and slowly tweak and make changes to be sure a code path does what you conjecture it may do. Using a debugger is helpful at certain points to explore the depth of the code. When you can get a small tweak working as expected you probably have a decent starting model of the code base that you can easily add to.
[+] spion|10 years ago|reply
Another thing that is helpful, especially if you don't even have knowledge of the problem domain of the codebase: Write a glossary.

As you read the code and encounter terms/words you don't know, write them down. Try to explain what they mean and how they relate to other terms. Make it a hyperlinked document (markdown #links plus headings on github works pretty well). That way you can constantly refresh your memory of previous items while writing.

Items in the glossary can range from class names / function names to datatype names to common prefixes to parts of the file names (what is `core`? what belongs there?)

Bonus: parts of the end result can be contributed back to the project as documentation.

[+] aleem|10 years ago|reply
Some good pointers and links here, surprisingly they miss both my favourite approaches.

1. If it's on Github, find an issue that seems up your alley and check the commits against it. Or the commit log in general for some interesting commits. I often use this approach to guide other devs to implement a new feature using nothing more than a previous commit or issue as a reference and starting point.

2. Unit tests are a great way to get jump started. They function as a comprehensive examples reference, with both simple and complex examples and workflows. Not only will they contain API examples, but they will also let you experiment with the library using the unit test code as a sandbox.