benfrederickson's comments

benfrederickson | 5 years ago | on: Show HN: Austin-Tui – Spy inside a running Python program at no performance cost

> you can either race with the program and hope that you read its memory to get the function stack before it changes what function it's running (and you're likely to win the race, because C is faster than Python), or you can pause the program briefly while taking a sample.

Py-spy defaults to blocking because the results can be pretty wrong otherwise: https://github.com/benfred/py-spy/issues/56 . You can see this problem profiling a program like https://github.com/benfred/py-spy/blob/master/tests/scripts/... with or without the nonblocking flag in py-spy - the nonblocking version produces garbage output.

Somewhat interestingly, this problem doesn't seem to occur with Ruby - and rbspy can get away without pausing the target program with only minor errors seen when profiling a similar function. I suspect this is because of differences between how the Ruby and Python interpreters store call stack information, but haven't had a chance to dig into the specifics.

benfrederickson | 6 years ago | on: Making Python Programs Blazingly Fast

Interesting article. While I definitely think you should be profiling your code to figure out the hot spots, cProfile has some limitations: it doesn't give you line numbers, doesn't work with threads, and significantly slows your program down.

I wrote a tool, py-spy (https://github.com/benfred/py-spy), that is worth checking out if you're interested in profiling Python programs. Besides solving those problems with cProfile, py-spy also lets you generate a flamegraph, profile running programs in production, profile multiprocess Python applications, and profile native Python extensions.

benfrederickson | 6 years ago | on: Profiling Native Python Extensions

Author here. It's worth noting that since I wrote this post, py-spy has gained the ability to profile multiprocess python applications - and can also now show local variables in the dump command.

benfrederickson | 8 years ago | on: Drawing Venn Diagrams

There is a pretty good breakdown of a bunch of different options for visualizing sets here: http://www.cvast.tuwien.ac.at/SetViz

Venn/Euler diagrams don't work all that well past 3 sets: not all areas can be shown when using circles, so unless some of the sets are disjoint the diagram will be misleading (like in the music example). However, I think they work well for 3-set diagrams. I have an interactive example on last.fm data here: https://www.benfrederickson.com/distance-metrics/ , in the context of explaining some simple distance metrics.

benfrederickson | 8 years ago | on: Darts, Dice, and Coins: Sampling from a Discrete Distribution (2011)

Alias tables are pretty cool. I wrote an interactive visualization of how they get built as part of this post a while ago: http://engineering.flipboard.com/2017/02/storyclustering . We used alias tables with MCMC and the Metropolis-Hastings test to build a super fast LDA.
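For readers who haven't seen one, here is a minimal sketch of building and sampling from an alias table, assuming Vose's variant of the construction. The names `build_alias` and `sample_alias` are made up for illustration, and this is not the code from the linked post.

```python
import random

def build_alias(probs):
    """Build an alias table in O(n) for probabilities that sum to 1."""
    n = len(probs)
    scaled = [p * n for p in probs]      # rescale so the average bucket is 1.0
    prob = [0.0] * n                     # chance of keeping bucket i's own item
    alias = [0] * n                      # item returned otherwise
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l                     # top up the small bucket from a large one
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:              # leftovers are full buckets (up to rounding)
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def sample_alias(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Each bucket holds at most two outcomes, which is why a sample needs only one uniform bucket pick and one coin flip.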

Also worth reading up on are sum-heaps. Alias tables are O(1) to sample from but O(n) to build/modify. Sum-heaps let you modify in O(log(n)) at the cost of sampling in O(log(n)) as well. A good writeup is here: https://timvieira.github.io/blog/post/2016/11/21/heaps-for-i...
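To make the O(log(n)) trade-off concrete, here is a minimal sketch of a sum-heap, assuming an array-backed complete binary tree padded to a power of two. The class name `SumHeap` and its methods are hypothetical and not taken from the linked writeup.

```python
import random

class SumHeap:
    """Leaves hold weights; each internal node holds the sum of its children."""

    def __init__(self, weights):
        self.n = len(weights)
        size = 1
        while size < self.n:             # pad the leaf count to a power of two
            size *= 2
        self.size = size
        self.tree = [0.0] * (2 * size)   # tree[1] is the root, leaves start at tree[size]
        for i, w in enumerate(weights):
            self.tree[size + i] = float(w)
        for node in range(size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        return self.tree[1]

    def update(self, index, weight):
        """Change one weight and repair the sums on the path to the root: O(log n)."""
        node = self.size + index
        self.tree[node] = float(weight)
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, u):
        """Return the index whose cumulative-weight interval contains u: O(log n)."""
        node = 1
        while node < self.size:
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return min(node - self.size, self.n - 1)  # guard against float rounding into padding

    def sample(self, rng=random):
        return self.find(rng.random() * self.tree[1])
```

Unlike an alias table, a single `update` touches only one root-to-leaf path, so nothing needs to be rebuilt when a weight changes.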

benfrederickson | 8 years ago | on: Ranking Programming Languages by GitHub Users

I just saw your question on proggit, so I'm lazily copying and pasting my answer here =):

For your first question - yes, this means few people use more than one language in a month. There is also a power law distribution in user activity each month, so most users only have a handful of events each month (which happen to be mostly in a single language). I'm trying to measure how broad support is, so this was mostly done on purpose. I found that counting total events was getting biased by what must have been automated activity (I was seeing single accounts with 10K commits a day, for instance).

Percent of MAU in the charts is the total percentage of unique users who were active that month. I haven't tried this with yearly active users =(
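A minimal sketch of the difference between counting events and counting unique users, assuming each event is a hypothetical `(month, user, language)` tuple. The function name and data layout are illustrative only, not the actual analysis code.

```python
from collections import defaultdict

def language_mau_share(events):
    """Return {(month, language): fraction of that month's active users}."""
    active = defaultdict(set)      # month -> all users with any event
    per_lang = defaultdict(set)    # (month, language) -> users with an event in it
    for month, user, language in events:
        active[month].add(user)
        per_lang[(month, language)].add(user)
    return {
        (month, language): len(users) / len(active[month])
        for (month, language), users in per_lang.items()
    }
```

Because each user is counted once per language per month, a single account with 10K events weighs the same as an account with one event.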

benfrederickson | 9 years ago | on: Interactive Numerical Optimization Tutorial

For the venn code, I wrote up the approach here : http://www.benfrederickson.com/better-venn-diagrams/

Basically though, I'm using the non-linear CG method - so it doesn't require a positive definite matrix. The loss function is a little funky in how it handles the disjoint set / subset relationships in the Euler diagrams (it defines the loss and gradient to be 0 if these constraints are satisfied), but this approach still works pretty well.
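A minimal sketch of what a constraint-aware pairwise loss of this kind could look like, assuming the loss is a squared error on the distance between two circle centers. The function `pair_loss`, its parameters, and the `relation` labels are hypothetical; this illustrates the idea and is not the venn.js code.

```python
def pair_loss(d, r1, r2, target_d, relation):
    """Squared distance error for one pair of circles.

    d        -- current distance between the two centers
    r1, r2   -- circle radii
    target_d -- distance that would give the desired overlap area
    relation -- "disjoint", "subset", or "overlap" (assumed labels)
    """
    if relation == "disjoint" and d >= r1 + r2:
        return 0.0   # circles already don't touch: constraint satisfied
    if relation == "subset" and d <= abs(r1 - r2):
        return 0.0   # smaller circle already lies inside the larger one
    return (d - target_d) ** 2
```

Once a disjoint or subset constraint holds, the pair stops contributing, so the optimizer is free to move those circles to improve other pairs.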

That venn diagram post has a couple interactive demos of how this works, and also a randomized test showing overall performance.

I actually believe it's the best known algorithm for laying out area-proportional Venn diagrams. I benchmarked against the code from the venneuler paper here: http://benfred.github.io/venn.js/tests/venneuler_comparison/
