bnprks's comments
bnprks | 1 month ago | on: Ask HN: Distributed SQL engine for ultra-wide tables
bnprks | 1 year ago | on: Everything I know about the fast inverse square root algorithm
bnprks | 1 year ago | on: Energy-Efficient Llama 2 Inference on FPGAs via High Level Synthesis
An interesting twist is that this DRAM might not need to be a central pool where bandwidth must be shared globally -- e.g. the Tenstorrent strategy seems to aim at using smaller chips that each have their own memory. Splitting up the memory should yield very high aggregate bandwidth even with slower DRAM, which is great as long as they can figure out the cross-chip data flow to avoid networking bottlenecks.
bnprks | 1 year ago | on: Energy-Efficient Llama 2 Inference on FPGAs via High Level Synthesis
And from a hardware-cost perspective, the AWS f1.2xlarge instances they used are $1.65/hr on-demand, vs. say $1.29/hr for an A100 from Lambda Labs. Using FPGAs is an interesting line of thinking, but I'm not sure this really describes a viable competitor to GPUs even for inference-only scenarios.
bnprks | 1 year ago | on: Deaths at a California skydiving center, but the jumps go on
In other words, we would expect that 14 facilities with death counts similar to the one in the article would account for the total US fatalities in a year. The USPA dropzone locator [1] lists 142 facilities, so if we take everything at face value then this facility is ~10x worse than the average for USPA members.
> But I'd bet it's less than $200/jump worth of risk
In this case at least, it seems that this specific facility is higher risk than that. And with a lack of legally mandated reporting requirements, I'd say the onus is on a facility to prove safety once it's averaging a death every 1.3 years.
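A quick sketch of the arithmetic, using only the figures quoted in this thread (one death every 1.3 years at this facility, 14 equivalent facilities matching annual US totals, 142 USPA dropzones):

```python
# Back-of-envelope check of the ~10x figure.
facility_rate = 1 / 1.3                 # deaths/year at this facility (one every 1.3 years)
implied_us_total = 14 * facility_rate   # 14 such facilities ~= annual US fatalities
uspa_facilities = 142                   # from the USPA dropzone locator
average_rate = implied_us_total / uspa_facilities
print(round(facility_rate / average_rate, 1))  # ~10.1x worse than the average dropzone
```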
bnprks | 1 year ago | on: Oxide Cloud Computer. No Cables. No Assembly. Just Cloud
For example "3.2TB Enterprise NVMe Mixed Use AG Drive U.2 Gen4 with carrier" is $3,301.65 each, and you'd need 10 of those to match the Oxide storage spec -- already above the $30k total price you quoted. Similarly, "128GB LRDIMM, 3200MT/s, Quad Rank" was $3,384.79 each, and you'd need 8 of those to reach the 1TiB of memory per server Oxide provides.
With just the RAM and SSD cost quoted by Dell, I get to $60k per server (x16 = $960k), which isn't counting CPU, power, or networking.
I agree these costs are way way way higher than what I'd expect for consumer RAM or SSD, but I think if Oxide is charging in line with Dell they should be asking at least $1MM for that hardware. (At least compared to Dell's list prices -- I don't purchase enterprise hardware either so I don't know how much discounting is typical)
Edit: the specific Dell server model I was working off of for configuration was called "PowerEdge R6515 Rack Server", since it was one of the few I found that allowed selecting the exact same AMD EPYC CPU model that Oxide uses [1]
[1]: https://www.dell.com/en-us/shop/dell-poweredge-servers/power...
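For anyone wanting to check the arithmetic, here's the per-server math using only the two Dell list prices quoted above (CPU, power, and networking excluded):

```python
# Per-server cost from the quoted Dell list prices alone.
ssd_total = 3301.65 * 10   # ten 3.2TB NVMe drives to match Oxide's storage spec
ram_total = 3384.79 * 8    # eight 128GB LRDIMMs to reach 1TiB per server
per_server = ssd_total + ram_total
print(round(per_server))       # ~60,095 per server
print(round(per_server * 16))  # ~961,517 for all 16 servers
```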
bnprks | 2 years ago | on: Array Languages: R vs. APL (2023)
I think software engineers often get turned off by the weird idiosyncrasies of R, but it has some surprisingly unique (and arguably helpful) features that most people don't notice -- possibly because most of the learning material is data-science focused, so it doesn't emphasize the bonkers language features R has.
bnprks | 2 years ago | on: 8 years later: A world Go champion's reflections on AlphaGo
I haven't played Go in a while, but I'm kind of excited to try going back to use the KataGo-based analysis/training tools that exist now.
bnprks | 2 years ago | on: What the Gardasil Testing May Have Missed (2017)
It's worth noting that the benefits of HPV vaccination do seem to be quite real, though. In the US, >20% of the female population has a high-risk HPV infection [1], and cervical cancer runs at ~12k new cases and ~4k deaths a year [2]. A follow-up study found that women vaccinated before age 17 saw about an 88% reduction in cervical cancer, versus around 53% for women vaccinated at 17-30 years of age [3] (presumably later-vaccinated women had a high chance of already carrying an HPV infection, so the vaccine wouldn't be useful).
I think potentially saving >3.5k lives and >10k cervical cancer cases annually in the US is a pretty good return if we can get widespread HPV vaccination, though of course we should also work hard to study and minimize vaccine side-effects. I'm similarly hopeful of news about EBV as a cause of multiple sclerosis [4], which is another situation where preventing a widespread infection might prevent rare but serious illnesses.
[1] https://www.cdc.gov/nchs/products/databriefs/db280.htm
[2] https://gis.cdc.gov/Cancer/USCS/#/Trends/1,2,73,1,3,value,23
[3] https://www.cancer.gov/news-events/cancer-currents-blog/2020...
[4] https://www.hsph.harvard.edu/news/press-releases/epstein-bar...
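As a rough sketch, applying the 88% early-vaccination reduction figure to the US totals cited above (this assumes universal early vaccination, so treat it as an upper bound):

```python
# Upper-bound estimate: apply the 88% reduction seen for women
# vaccinated before age 17 to current US cervical cancer numbers.
annual_cases, annual_deaths = 12_000, 4_000
reduction = 0.88
print(round(annual_cases * reduction))   # ~10,560 cases potentially prevented/year
print(round(annual_deaths * reduction))  # ~3,520 deaths potentially prevented/year
```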
bnprks | 2 years ago | on: Learning From DNA: a grand challenge in biology
For design tasks like in this paper, I think computational models have a big hill to climb in order to compete with physical high-throughput screening. Most of the time the goal is to get a small number of hits (<10) out of a pool of millions of candidates. At those levels, you need to work in the >99.9% precision regime to have any hope of finding significant hits after multiple-hypothesis correction. I don't think they showed anything near that accurate in the paper.
Maybe we'll get there eventually, but the high-throughput techniques in molecular biology are also getting better at the same time.
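A toy illustration of the precision point, with hypothetical numbers (10 true hits in a million candidates; the 99.9% specificity is an imagined model, not anything from the paper):

```python
# Why screening a million candidates in silico demands extreme precision:
true_hits = 10
pool = 1_000_000
fpr = 0.001  # even a model with 99.9% specificity...
false_positives = (pool - true_hits) * fpr  # ...flags ~1,000 non-hits
precision = true_hits / (true_hits + false_positives)
print(round(precision, 3))  # ~0.01: the predicted-hit list is ~99% false positives
```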
bnprks | 2 years ago | on: How do computers calculate sine?
Unfortunately, floating point results will probably continue to differ across platforms for the foreseeable future.
bnprks | 2 years ago | on: Free data transfer out to internet when moving out of AWS
> Where a data processing service is being used in parallel with another data processing service, the providers of data processing services may impose data egress charges, but only for the purpose of passing on egress costs incurred, without exceeding such costs.
Hopefully this article doesn't end up with exploitable loopholes. Bringing AWS, GCP, and Azure egress costs down to market rates could go a long way toward reducing cloud lock-in, since it would let you migrate gradually without having to close your entire account.
[1]: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:...
bnprks | 2 years ago | on: French court issues damages award for violation of GPL
Though I can certainly imagine that a multinational company might not be confident of the copyright status of API usage in all countries they operate in.
bnprks | 2 years ago | on: Mass Retraction of unethical Chinese Forensic Genetics Papers
In a US lab, for instance, I would expect that a similar genetic study would have hard-copy signed consent forms from every participant in the study with 3-6 year retention requirements, and this could be audited by their institution's IRB if there were concerns. (Ultimately I think there is an accountability chain all the way to the US federal government, though I'm not familiar with how the institutional IRBs are monitored). I don't know what equivalent institutions might exist in China, and whether the journals got/requested any verification of consents from them.
Though for papers like these where co-authors have affiliations with police departments or academies, I'm not sure how trustworthy it would be even if the police did claim they had evidence of consent for the data in these papers. (Given that Chinese police are known to be collecting genetic samples without consent in some documented cases.)
bnprks | 2 years ago | on: French court issues damages award for violation of GPL
1. You give your user a non-GPL python package with requirements.txt file (no bundled dependencies)
2. Your user pip-installs the dependencies (including some GPL-licensed ones)
3. Your user runs the application
As long as your country doesn't consider use of an API prohibited under the copyright of the implementing code, I think steps 1-3 would be fine (though not very practical for a product).
I'd be curious for others' input, though, as this has bugged me for a while in the R community, where several core libraries (like the Matrix package) are GPL-licensed but many packages that depend on them claim to be licensed under MIT or some other license.
bnprks | 2 years ago | on: Mass Retraction of unethical Chinese Forensic Genetics Papers
The research in question is directly related to finding and cataloging genetic markers that could be used in such a surveillance database [1]. And with no way to credibly verify that the genetic samples were given with full consent, it seems probable that the studies themselves were part of this project to create an Orwellian surveillance state for certain minorities in China. Needless to say, western journals would prefer not to be accomplices to these human rights abuses, hence the retractions.
[1]: https://www.nytimes.com/2019/02/21/business/china-xinjiang-u...
My best experience has been ignoring SQL and using (sparse) matrix formats for the genomic data itself, possibly combined with some small metadata tables that can fit easily in existing solutions (often even in memory). Sparse matrix formats like CSC/CSR can store numeric data at ~12 bytes per non-zero entry, so a single one of your servers should handle 10B data points in RAM and another 10x that comfortably on a local SSD. Maybe no need to pay the cost of going distributed?
Self plug: if you're in the single cell space, I wrote a paper on my project BPCells which has some storage format benchmarks up to a 60k column, 44M row RNA-seq matrix.
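The sizing claim is easy to check, assuming the usual CSC/CSR layout of an 8-byte float value plus a 4-byte int32 index per stored entry (the per-column/row pointer array is negligible for tall matrices):

```python
# Memory footprint of a CSC/CSR sparse matrix at 10B non-zeros.
value_bytes, index_bytes = 8, 4   # float64 value + int32 row/col index
nonzeros = 10_000_000_000
total_gb = (value_bytes + index_bytes) * nonzeros / 1e9
print(total_gb)  # 120.0 GB -> fits in RAM on one large server; 10x more fits on SSD
```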