benfrederickson's comments

benfrederickson | 4 years ago | on: Why Python needs to be paused during profiling – but Ruby doesn't always

Sorry about leaving your program suspended - I believe that was because of a bug that should be fixed now https://github.com/benfred/py-spy/issues/390 =(

benfrederickson | 5 years ago | on: Show HN: Austin-Tui – Spy inside a running Python program at no performance cost

> you can either race with the program and hope that you read its memory to get the function stack before it changes what function it's running (and you're likely to win the race, because C is faster than Python), or you can pause the program briefly while taking a sample.

Py-spy defaults to blocking because the results can be pretty wrong otherwise: https://github.com/benfred/py-spy/issues/56 . You can see this problem profiling a program like https://github.com/benfred/py-spy/blob/master/tests/scripts/... with or without the nonblocking flag in py-spy - the nonblocking version produces garbage output.

Somewhat interestingly, this problem doesn't seem to occur with Ruby - and rbspy can get away without pausing the target program with only minor errors seen when profiling a similar function. I suspect this is because of differences between how the Ruby and Python interpreters store call stack information, but haven't had a chance to dig into the specifics.
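
The race described above can be illustrated with a toy model. This is only a sketch: `ToyTarget`, its two hard-coded stacks, and the `sample` function are made up for illustration, and nothing here reflects how py-spy or rbspy actually read another process's memory. It only shows why copying a stack frame by frame from a program that keeps running can yield a stack that never existed.

```python
class ToyTarget:
    """Hypothetical stand-in for a profiled program that alternates between two call stacks."""

    STACKS = [
        ["main", "parse", "read_token"],
        ["main", "render", "draw_line"],
    ]

    def __init__(self):
        self.ticks = 0
        self.stack = list(self.STACKS[0])

    def step(self):
        # The program makes progress and switches to its other call stack.
        self.ticks += 1
        self.stack = list(self.STACKS[self.ticks % 2])


def sample(target, pause):
    """Copy the target's stack one frame at a time.

    With pause=False the target keeps running between reads,
    so frames from different real stacks can get mixed together.
    """
    frames = []
    for depth in range(len(target.stack)):
        frames.append(target.stack[depth])
        if not pause:
            target.step()
    return frames


blocking = sample(ToyTarget(), pause=True)
nonblocking = sample(ToyTarget(), pause=False)
print("blocking:   ", blocking)
print("nonblocking:", nonblocking)
print("nonblocking stack ever existed:", nonblocking in ToyTarget.STACKS)
```

In this model the paused sample is always one of the two real stacks, while the unpaused one mixes frames from both.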

benfrederickson | 6 years ago | on: Making Python Programs Blazingly Fast

Interesting article. While I definitely think you should be profiling your code to figure out the hot spots, cProfile has some limitations for profiling: cProfile doesn't give you line numbers, doesn’t work with threads, and significantly slows your program down.
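
As a rough illustration of the first limitation, here is a minimal sketch using only the standard library (`slow_square_sum` and `main` are made-up example functions). The cProfile report is aggregated per function: the line number it prints is where the function is defined, not which line inside it was slow.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    total = 0
    for i in range(n):
        total += i * i  # the hot line, but the report will not single it out
    return total


def main():
    return slow_square_sum(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Render the top entries, sorted by cumulative time, into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```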

I wrote a tool, py-spy (https://github.com/benfred/py-spy), that is worth checking out if you’re interested in profiling Python programs. Not only does it solve those problems with cProfile, py-spy also lets you generate a flamegraph, profile running programs in production, profile multiprocess Python applications, profile native Python extensions, etc.

benfrederickson | 6 years ago | on: Profiling Native Python Extensions

Author here. It's worth noting that since I wrote this post, py-spy has gained the ability to profile multiprocess python applications - and can also now show local variables in the dump command.

benfrederickson | 7 years ago | on: Using /proc to get a process' current stack trace

I wrote something that will get you the python interpreter stack from any running cpython process : https://github.com/benfred/py-spy/ , and rbspy can do the same for ruby https://github.com/rbspy/rbspy

benfrederickson | 7 years ago | on: Show HN: Py-spy – A new sampling profiler for Python programs

the name pyspy was taken on pypi already : https://github.com/tdfischer/pyspy =)

benfrederickson | 7 years ago | on: Show HN: Py-spy – A new sampling profiler for Python programs

Thanks! Both of your suggestions totally make sense. I've created an issue to track the poll() issue here https://github.com/benfred/py-spy/issues/13 - I think that should be an easy fix.

benfrederickson | 7 years ago | on: Show HN: Py-spy – A new sampling profiler for Python programs

Not yet - but I'm hoping to have a version that supports this next week. Will update this issue when it's done: https://github.com/benfred/py-spy/issues/3

benfrederickson | 7 years ago | on: Show HN: Py-spy – A new sampling profiler for Python programs

Check out rbspy https://github.com/rbspy/rbspy (rbspy was the inspiration for this project =)

benfrederickson | 7 years ago | on: How to crawl a quarter billion webpages in 40 hours (2012)

I analyzed the top 1 million robots.txt files looking for sites that allow google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern for major websites
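
For reference, a minimal sketch of how that pattern could be detected with Python's standard `urllib.robotparser`. The sample robots.txt contents, the `SomeOtherBot` user agent, and the helper names `allows` and `google_only` are assumptions for illustration, not the code used in the analysis.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that lets Googlebot in and blocks everyone else.
GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that allows every crawler.
OPEN_TO_ALL = """\
User-agent: *
Disallow:
"""


def allows(robots_txt, user_agent, url="https://example.com/page"):
    """Return True if the given user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def google_only(robots_txt):
    """True when Googlebot is allowed but a generic crawler is not."""
    return allows(robots_txt, "Googlebot") and not allows(robots_txt, "SomeOtherBot")


print(google_only(GOOGLE_ONLY))
print(google_only(OPEN_TO_ALL))
```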

benfrederickson | 8 years ago | on: Drawing Venn Diagrams

There is a pretty good breakdown of a bunch of different options for visualizing sets here: http://www.cvast.tuwien.ac.at/SetViz

Venn/Euler diagrams don't work all that well past 3 sets: not all areas will be shown if using circles, so unless some of the sets are disjoint it will be a misleading diagram (like in the music example). However, I think it works well for 3-set diagrams. I have an interactive example on last.fm data here https://www.benfrederickson.com/distance-metrics/ in the context of explaining some simple distance metrics.

benfrederickson | 8 years ago | on: Drawing Venn Diagrams

Neat demonstration!

A while back I wrote a small package in JavaScript for computing area-proportional Venn and Euler diagrams: https://github.com/benfred/venn.js . The 2-circle case here is relatively easy, but the problem gets tricky when you have 3+ sets. I wrote up my approach here https://www.benfrederickson.com/venn-diagrams-with-d3.js/ and https://www.benfrederickson.com/better-venn-diagrams/
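
To show why the 2-circle case is the easy one, here is a minimal Python sketch (not the venn.js code; the function names and the example set sizes are assumptions). It uses the standard formula for the intersection area of two circles and bisects on the distance between centers, since the overlap shrinks monotonically as the circles move apart.

```python
import math


def circle_overlap(r1, r2, d):
    """Intersection area of two circles with radii r1, r2 and center distance d."""
    if d >= r1 + r2:
        return 0.0  # circles do not touch
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # smaller circle is fully inside
    # Clamp the acos arguments to guard against floating-point drift.
    c1 = max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    c2 = max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    sector1 = r1 * r1 * math.acos(c1)
    sector2 = r2 * r2 * math.acos(c2)
    kite = 0.5 * math.sqrt(
        (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    )
    return sector1 + sector2 - kite


def distance_for_overlap(r1, r2, target, tol=1e-10):
    """Bisect for the center distance that gives the target overlap area."""
    lo, hi = abs(r1 - r2), r1 + r2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if circle_overlap(r1, r2, mid) > target:
            lo = mid  # too much overlap: move the circles apart
        else:
            hi = mid
    return (lo + hi) / 2


# Example: sets of size 100 and 50 sharing 20 elements, with area = set size.
r_a = math.sqrt(100 / math.pi)
r_b = math.sqrt(50 / math.pi)
d = distance_for_overlap(r_a, r_b, 20)
print(d, circle_overlap(r_a, r_b, d))
```

With three or more sets the pairwise distances generally cannot all be satisfied at once, which is where the optimization described in the linked posts comes in.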

benfrederickson | 8 years ago | on: Darts, Dice, and Coins: Sampling from a Discrete Distribution (2011)

Alias tables are pretty cool. I wrote an interactive visualization of how they get built as part of this post a while ago: http://engineering.flipboard.com/2017/02/storyclustering . We used alias tables with MCMC and the Metropolis-Hastings test to build a super fast LDA.

Also worth reading up on are sum-heaps. Alias tables are O(1) to sample from but O(n) to build/modify. Sum-heaps let you modify in O(log(n)) at the cost of sampling in O(log(n)) as well. A good writeup is here: https://timvieira.github.io/blog/post/2016/11/21/heaps-for-i...
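
A minimal Python sketch of the trade-off, assuming a textbook alias-table construction (Vose's method) and an array-backed sum-heap. The names `build_alias`, `sample_alias`, and `SumHeap` are made up for this example and are not taken from the linked posts.

```python
import random


def build_alias(weights):
    """O(n) build of an alias table (Vose's method)."""
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]
    alias = list(range(n))
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                 # the rest of bucket s is filled by l
        prob[l] -= 1.0 - prob[s]     # l gave away that much mass
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0                # leftovers caused by rounding
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """O(1) sample: pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


class SumHeap:
    """Binary tree of partial sums: O(log n) update and O(log n) sample."""

    def __init__(self, weights):
        self.n = len(weights)
        # Leaves live at indices n..2n-1; node k stores the sum of its children.
        self.tree = [0.0] * self.n + [float(w) for w in weights]
        for k in range(self.n - 1, 0, -1):
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def update(self, i, weight):
        k = self.n + i
        self.tree[k] = float(weight)
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def sample(self, rng=random):
        u = rng.random() * self.tree[1]  # tree[1] holds the total weight
        k = 1
        while k < self.n:
            left = self.tree[2 * k]
            if u < left:
                k = 2 * k
            else:
                u -= left
                k = 2 * k + 1
        return k - self.n


rng = random.Random(0)
prob, alias = build_alias([1, 2, 3, 4])
heap = SumHeap([1, 2, 3, 4])
print(sample_alias(prob, alias, rng), heap.sample(rng))
heap.update(3, 0.0)  # cheap change; an alias table would need a full rebuild
```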

benfrederickson | 8 years ago | on: Why GitHub Won't Help with Hiring

Should have been clearer here: I've only been interviewed once, but I've given hundreds of interviews over the same time frame.

benfrederickson | 8 years ago | on: Ranking Programming Languages by GitHub Users

I just saw your question on proggit, and I lazily cut-and-pasted the answer here =):

For your first question - yes, this means few people use more than one language in a month. There is also a power law distribution happening with user activity each month, so most users only have a handful of events each month (which happen to be mostly in a single language). I'm trying to measure how broad support is, so this was mostly done on purpose. I was finding that counting total events was getting biased by things that must have been automated activity (I was seeing single accounts with 10K commits a day, for instance).

Percent of MAU in the charts is the percentage of unique users active that month who used that language. I haven't tried it out with yearly active users =(
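
A minimal sketch of the difference between the two measures, using made-up events (the user names, month, and `ci-bot` account are hypothetical, and this is not the code behind the charts). Counting events lets one automated account dominate, while counting unique active users gives each account one vote per language.

```python
from collections import Counter, defaultdict

# Hypothetical (user, month, language) events; one account is clearly automated.
events = [
    ("alice", "2017-01", "Python"),
    ("alice", "2017-01", "Python"),
    ("bob", "2017-01", "Rust"),
] + [("ci-bot", "2017-01", "Java")] * 10_000


def event_share(events, month):
    """Share of total events per language (biased by heavy accounts)."""
    counts = Counter(lang for _, m, lang in events if m == month)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def mau_share(events, month):
    """Share of that month's unique active users who touched each language."""
    users_by_lang = defaultdict(set)
    active = set()
    for user, m, lang in events:
        if m != month:
            continue
        users_by_lang[lang].add(user)
        active.add(user)
    return {lang: len(users) / len(active) for lang, users in users_by_lang.items()}


print(event_share(events, "2017-01"))
print(mau_share(events, "2017-01"))
```

Note that the MAU shares can add up to more than 100% when users are active in several languages in the same month.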

benfrederickson | 8 years ago | on: Ranking Programming Languages by GitHub Users

Thanks! It was fun to put together.

benfrederickson | 8 years ago | on: Ranking Programming Languages by GitHub Users

Author here - happy to answer any questions anyone has.

benfrederickson | 8 years ago | on: Analyzing One Million Robots.txt Files

I also wrote up an analysis of the top 1M robots.txt files: http://www.benfrederickson.com/robots-txt-analysis/

I ended up analyzing very different things from this article though, so this article was still pretty interesting to me.

benfrederickson | 9 years ago | on: Interactive Numerical Optimization Tutorial

Your visualization of using momentum with gradient descent in that post is really great - nice work!

benfrederickson | 9 years ago | on: Interactive Numerical Optimization Tutorial

For the venn code, I wrote up the approach here : http://www.benfrederickson.com/better-venn-diagrams/

Basically though, I'm using the non-linear CG method - so it doesn't require a positive definite matrix. The loss function is a little funky in how it handles the disjoint set/subset relationships in the Euler diagrams (it defines the loss/gradient to be 0 if these constraints are satisfied), but this approach still works pretty well.
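
A rough Python sketch of what such a loss term could look like for one pair of circles. This is an assumption-laden illustration, not the venn.js implementation: the squared-distance form, the `relation` labels, and the function name `pair_loss_and_grad` are all made up here.

```python
import math


def pair_loss_and_grad(p, q, r_p, r_q, target_dist, relation):
    """Loss and gradient (with respect to p) for one pair of circle centers.

    relation is "disjoint", "subset", or "overlap" (hypothetical labels).
    The loss and gradient are zero once a disjoint or subset constraint holds,
    no matter how far past the constraint the circles are.
    """
    dx, dy = p[0] - q[0], p[1] - q[1]
    d = math.hypot(dx, dy)
    if relation == "disjoint" and d >= r_p + r_q:
        return 0.0, (0.0, 0.0)
    if relation == "subset" and d <= abs(r_p - r_q):
        return 0.0, (0.0, 0.0)
    diff = d - target_dist
    loss = diff * diff
    if d == 0.0:
        return loss, (0.0, 0.0)  # direction undefined when centers coincide
    # d(loss)/dp; the gradient with respect to q is the negation.
    return loss, (2 * diff * dx / d, 2 * diff * dy / d)


# Two unit circles that should be disjoint but currently overlap.
print(pair_loss_and_grad((0.0, 0.0), (1.0, 0.0), 1.0, 1.0, 2.0, "disjoint"))
# The same pair once they are far enough apart: nothing left to fix.
print(pair_loss_and_grad((0.0, 0.0), (3.0, 0.0), 1.0, 1.0, 2.0, "disjoint"))
```

Because the summed loss is piecewise and not a quadratic form, an optimizer that only needs the loss and its gradient, such as non-linear CG, is a natural fit.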

That venn diagram post has a couple interactive demos of how this works, and also a randomized test showing overall performance.

I actually believe it's the best known algorithm for laying out area-proportional Venn diagrams. I benchmarked against the code from the venneuler paper here: http://benfred.github.io/venn.js/tests/venneuler_comparison/