luispedrocoelho | 8 years ago

That is the _FAST_ version of the code (people keep saying "of course, it's slow", when it's the fast version).

Here is an earlier version (intermediate speed): https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e...

It's not so easy to post the data to reproduce a real use-case as it's a few Terabytes :)

*

Here's some simple code that is incredibly slow in Python:

    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for line in open('data.txt'):
        id, val = line.split('\t')
        if id in interesting:
            total += int(val)
This is not unlike a lot of code I write, actually.

proto-n | 8 years ago

I've also found that loops with dictionary (or set) lookups are a pain point in python performance. However, this example strikes me as a pretty-obvious pandas use-case:

    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for c in chunks:  # I'm too lazy to actually write it
        df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id','val'])
        total += df['val'][df['id'].isin(interesting)].sum()
I'm not exactly sure, but fairly confident, that isin() doesn't use Python set lookups but rather some kind of internal implementation, and is thus really fast. I'd be quite surprised if disk IO wasn't the bottleneck in the above example.

luispedrocoelho | 8 years ago

`isin` is worse in terms of performance as it does linear iteration of the array.

Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow.
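
For reference, the kind of thing I mean (just a sketch; the chunk size is an arbitrary choice, and in my experience this is still slow):

    import pandas as pd

    with open('interesting.txt') as f:
        interesting = {line.strip() for line in f}

    total = 0
    # chunksize makes read_csv yield DataFrames piece by piece;
    # map(__contains__) replaces isin for the membership test
    for df in pd.read_csv('data.txt', sep='\t', names=['id', 'val'], chunksize=1_000_000):
        total += df.loc[df['id'].map(interesting.__contains__), 'val'].sum()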

Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug.
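
A rough way to see this effect (the size here is made up, not from our data):

    import time

    # build a large set of distinct strings, then time how long the
    # interpreter spends tearing it down when the last reference goes away
    big = {str(i) for i in range(10_000_000)}
    t0 = time.time()
    del big
    print('deallocation took', time.time() - t0, 'seconds')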

In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.

aldanor | 8 years ago

Could you give a hint of what the data ("sample1", "sample2") looks like, or how to randomly generate it in order to benchmark it sensibly? I guess these are similarly-indexed float64 series where the index may contain duplicates? Maybe you could share a chunk of data (as input to the genetic_distance() function) as an example, if it's not too proprietary and if it's sufficient to run a microbenchmark.

There's also code in the genetic_distance() function that, IIUC, is meant to handle the case when sample1 and sample2 are not similarly indexed. However, (a) you essentially never use it, since you only pass sample1 and sample2 that are columns of the same dataframe (what's the point then?), and (b) your code would actually throw an exception if you tried doing that.

P.S. I like the part where you've removed the comment "note that this is a slow computation" :)

BerislavLopac | 8 years ago

The speed could possibly be improved by using map. Also, not related to speed if this is all of the code, but it might matter in larger programs: you should make sure your file pointers are closed. Something like:

    with open('interesting.txt') as interesting_file:
        interesting = {line.strip() for line in interesting_file}
    with open('data.txt') as data_file:
        total = sum(int(val) for id, val in map(lambda line: line.split('\t'), data_file) if id in interesting)

jkabrg | 8 years ago

`map` is not going to make it faster. `map` is a loop. Only vectorized code is faster.
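
To illustrate the difference, a vectorized version would look something like this (a sketch; whether it actually wins on this data is a separate question):

    import numpy as np
    import pandas as pd

    # load both files up front, then do the membership test and the sum
    # as whole-array operations instead of a per-line Python loop
    interesting = pd.read_csv('interesting.txt', header=None)[0].values
    data = pd.read_csv('data.txt', sep='\t', names=['id', 'val'])
    mask = np.isin(data['id'].values, interesting)
    total = data['val'].values[mask].sum()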

deckiedan | 8 years ago

Have you tried using Cython to compile code like the above? Python's sets/maps/data reading etc. should already be fairly optimised, so Cython might mainly help by letting you bypass boxing the counter variables, using native C ints (or whatever) instead.

Also, if the data you're reading is numeric only, or at least non-unicode character data, you might be able to get a speed boost by reading the data as binary rather than as Python text strings.
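
Something like this for the binary idea (a sketch; it assumes the ids are plain ASCII, so comparing raw bytes is enough):

    # same file names as in the example above, but everything stays as bytes,
    # so no unicode decoding happens on the hot path
    with open('interesting.txt', 'rb') as f:
        interesting = {line.strip() for line in f}

    total = 0
    with open('data.txt', 'rb') as f:
        for line in f:
            key, val = line.split(b'\t')
            if key in interesting:
                total += int(val)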