luispedrocoelho | 8 years ago
Here is an earlier version (intermediate speed): https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e...
It's not so easy to post the data to reproduce a real use-case as it's a few Terabytes :)
Here's a simple piece of code that is incredibly slow in Python:
    interesting = set(line.strip() for line in open('interesting.txt'))
    total = 0
    for line in open('data.txt'):
        id, val = line.split('\t')
        if id in interesting:
            total += int(val)
This is not unlike a lot of code I write, actually.
proto-n|8 years ago
luispedrocoelho|8 years ago
Reading in chunks is not bad (and you can just use `chunksize=...` as a parameter to `read_csv`), but pandas `read_csv` is not so efficient either. Furthermore, even replacing `isin` with something like `df['id'].map(interesting.__contains__)` is still pretty slow.
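A minimal sketch of the chunked approach described above, wrapped in a function for clarity; the file names, column names, and chunk size are assumptions, not part of the original code:

    import pandas as pd

    def sum_interesting(data_path, ids_path, chunksize=1_000_000):
        """Sum 'val' for rows whose 'id' is in the interesting set,
        reading the TSV in chunks so memory stays bounded."""
        with open(ids_path) as fh:
            interesting = set(line.strip() for line in fh)
        total = 0
        for chunk in pd.read_csv(data_path, sep='\t',
                                 names=['id', 'val'],
                                 chunksize=chunksize):
            # Plain set membership via map(); as noted above, this can
            # beat isin() for a Python set but is no silver bullet.
            mask = chunk['id'].map(interesting.__contains__)
            total += int(chunk.loc[mask, 'val'].sum())
        return total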
Btw, deleting `interesting` (when it goes out of scope) might take hours(!) and there is no way around that. That's a bona fide performance bug.
In my experience, disk IO (even when using network disks) is not the bottleneck for the above example.
aldanor|8 years ago
There's also code in genetic_distance() function that IIUC is meant to handle the case when sample1 and sample2 are not similarly-indexed, however (a) you essentially never use it, since you only pass sample1 and sample2 that are columns of the same dataframe (what's the point then?), and (b) your code would actually throw an exception if you tried doing that.
P.S. I like the part where you've removed the comment "note that this is a slow computation" :)
onuralp|8 years ago
scikit-allel: http://scikit-allel.readthedocs.io/en/latest/index.html
scikit-allel example: http://alimanfoo.github.io/2015/09/21/estimating-fst.html
zarr: https://github.com/zarr-developers/zarr
BerislavLopac|8 years ago
jkabrg|8 years ago
deckiedan|8 years ago
Also, if the data you're reading is numeric only - or at least non-unicode / character data - you might be able to get a speed boost by reading the data as binary rather than as Python text strings.
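To illustrate the point above, here is a hedged sketch using NumPy: storing a numeric column as raw binary lets one bulk read replace per-line text parsing (the dtype and helper names here are illustrative assumptions):

    import numpy as np

    def write_binary(values, path):
        # Dump the values as a flat array of fixed-width integers.
        np.asarray(values, dtype=np.int64).tofile(path)

    def read_binary(path):
        # One bulk read straight into a typed array; no per-line
        # string decoding or int() conversion.
        return np.fromfile(path, dtype=np.int64)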