jkbonfield's comments
jkbonfield | 3 years ago | on: Bzip3 – A better and stronger spiritual successor to bzip2
On my own benchmarks, it's basically comparable in size (about 0.1% smaller than bsc), with comparable encode speeds and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.
Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.
jkbonfield | 4 years ago | on: Alarm raised after Microsoft wins data-encoding patent
Sadly the patent is woefully vague. I understand RANS and a lot of the optimisations used for doing it efficiently, but I don't understand that patent. It's just not clear enough for someone skilled in that field to know exactly what it is they're patenting.
Specifically, rANS states are updated on a symbol-by-symbol basis. A query is made to get a range, that range is looked up to determine the symbol, and then the state is updated and possibly renormalised to fit within the desired interval of valid states. There is no "whether or not a previous symbol was decoded" scenario. So I think they're talking at a higher level of block-based decoding, using stats from the previous block to decode the current block, with the possibility that a symbol in the current block was not observed in the previous one (i.e. escape codes, as used in PPM etc). However I'm really not sure. That's just me grasping at straws trying to think of something that makes sense.
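To make the symbol-by-symbol nature concrete, here's a toy single-state, byte-renormalised rANS coder. This is purely my own illustration of the query / lookup / update / renormalise cycle described above (constants, names and the tiny fixed alphabet are all mine, not anything from the patent or any particular implementation):

```python
# Toy rANS over a fixed 3-symbol alphabet. Frequencies are pre-normalised
# to sum to 1 << PROB_BITS; the state x always lives in [RANS_L, 256*RANS_L).
PROB_BITS = 8                  # total frequency = 256
M = 1 << PROB_BITS
RANS_L = 1 << 16               # lower bound of the normalised state interval

freq = {"a": 128, "b": 96, "c": 32}           # must sum to M
cum, acc = {}, 0
for s in "abc":
    cum[s] = acc
    acc += freq[s]
slot2sym = [s for s in "abc" for _ in range(freq[s])]  # slot -> symbol table

def encode(msg):
    x, out = RANS_L, []
    for s in reversed(msg):                   # rANS encodes in reverse order
        f, c = freq[s], cum[s]
        x_max = ((RANS_L >> PROB_BITS) << 8) * f
        while x >= x_max:                     # renormalise: shift bytes out
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + (x % f) + c        # the core state update
    for _ in range(4):                        # flush the final state
        out.append(x & 0xFF)
        x >>= 8
    return out

def decode(buf, n):
    buf = list(buf)
    x = 0
    for _ in range(4):                        # recover the final state
        x = (x << 8) | buf.pop()
    msg = []
    for _ in range(n):
        slot = x & (M - 1)                    # query the current range
        s = slot2sym[slot]                    # look up which symbol owns it
        x = freq[s] * (x >> PROB_BITS) + slot - cum[s]   # update the state
        while x < RANS_L:                     # renormalise: shift bytes in
            x = (x << 8) | buf.pop()
        msg.append(s)
    return "".join(msg)
```

Note there is nothing in the decode loop that depends on whether a previous symbol "was decoded" or not: every iteration is the same query/lookup/update/renormalise sequence, which is why I suspect the patent must be describing something at a higher, block-structured level.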
In short, the patent fails the most basic test of being implementable by someone else who has read it. It's deliberately vague, so they can then apply it to as much of the problem space as possible. That annoys me! As does patenting trivial, non-inventive modifications to rANS. rANS is just a drop-in replacement for the earlier (arithmetic) range coders, albeit operating in reverse order, therefore any tricks and techniques previously applied to range coders can also be applied to rANS. Being the first to do such a thing is no more an invention than being the first person to use fiberglass or carbon fiber in a boring everyday item such as a chair. The new material is the invention, not being the first to use it in a particular manner that is obvious to all.
jkbonfield | 6 years ago | on: How to Become the Best in the World at Something
However that is only because the skills I'm thinking of aren't in themselves hugely common. I'm not mega skilled at them, but the relative rarity of some means the overlap really is quite small - fortunately!
It's not going to be good advice if your key skills are, say, selling things and being super knowledgeable about cars. Oh look, an obvious union, but that combination is popular already. So I'd say start by becoming good at a couple of niche skills and then see where you end up.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
MPEG are also well aware of the prior art. The authors of various existing state-of-the-art genome compression tools were invited to one of the first MPEG-G conferences, where they presented their work. Do not assume that because they do not compare against the state of the art, they are not aware of its existence or how it performs. It's more likely simply that "10x better than BAM" is a powerful message that sells products, more so than "a little bit better than CRAM". It's a standard advertising technique.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
Secondly you missed out a key part of funding - precompetitive alliances. Eg see the Pistoia Alliance (https://www.pistoiaalliance.org/) who funded the SequenceSqueeze project into compression of FASTQ (http://www.sequencesqueeze.org/).
The notion here is simple - some technologies are so core that they cross all the commercial and academic boundaries. Collaboration rather than competition is considered to be to the mutual benefit of everyone involved. It is this mindset which led to the Alliance for Open Media (AOM), who are also in direct competition with MPEG.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
The GenomSys patents aren't even listed in the ISO patent list yet: https://www.iso.org/iso-standards-and-patents.html
I don't know if this is against ISO rules - it is unclear to me whether they only need to add their patents on grant, rather than submission.
The only reason I discovered these was due to an accidental hit from a Google Scholar alert. I didn't even realise it searched patents when I set that up.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
I agree though the message would be better if it came from a third party. I was hoping this would happen, but it didn't look likely before the GA4GH conference (where both myself and MPEG were speaking), so I self published before that to ensure people were aware and could ask appropriate questions (to both myself and MPEG of course).
As for royalties, CRAM comes under the governance of the Global Alliance for Genomics & Health (https://www.ga4gh.org). They stated explicitly in the recent conference that their standards are royalty free (as far as is possible to tell) and promote collaboration on the formats / interfaces, competition on the implementation. For the record, we are unaware of any patents covering CRAM and we have filed none of our own, nor do we intend to for CRAM 4.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
There are no "comparisons with" CRAM in the MPEG-G preprint, only comparisons between CRAM and DeeZ, taken from the DeeZ paper. Those comparisons are fair and correct, but obviously were done at the time that paper was written - some 4 years ago. Since then CRAM has moved on (as has DeeZ), but modern CRAM generally beats modern DeeZ if we restrict ourselves to formats that permit random access (DeeZ has a higher-compression non-random-access mode).
So far there has been no direct data on how well MPEG-G performs, bar an old slide from a year ago (an ISMB talk, I think).
https://mpeg.chiariglione.org/sites/default/files/events/Mat...
From that we can glean some compression ratios at least. I attempted to compress the same data set with CRAM, but that data set, while public, is in FASTQ format instead of BAM. I asked the author of the talk how he had produced the BAM, but got no response. I made my best stab at creating something similar, but it's not a satisfactory comparison yet.
jkbonfield | 7 years ago | on: MPEG-G: the ugly
If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.
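The block-of-2-bit idea can be sketched very simply: pack A/C/G/T into 2 bits each and record the rare N positions out-of-band, rather than widening every base to 3 bits just to accommodate an occasional exception. This is my own minimal illustration, not code from any particular tool:

```python
# Pack a DNA string into 2 bits per base, storing the positions of the
# rare N bases separately so they can be restored on unpack.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    n_pos = [i for i, b in enumerate(seq) if b == "N"]
    acc, bits, out = 0, 0, bytearray()
    for b in seq:
        acc = (acc << 2) | CODE.get(b, 0)   # N is stored as a dummy 'A'
        bits += 2
        if bits == 8:
            out.append(acc)
            acc, bits = 0, 0
    if bits:
        out.append(acc << (8 - bits))       # pad the final byte
    return bytes(out), n_pos, len(seq)

def unpack(packed, n_pos, n):
    seq = []
    for i in range(n):
        shift = 6 - 2 * (i % 4)             # first base sits in the top bits
        seq.append(BASE[(packed[i // 4] >> shift) & 3])
    for i in n_pos:
        seq[i] = "N"                        # restore the rare exceptions
    return "".join(seq)
```

If N really is rare, the side list of positions is tiny (and itself highly compressible), so you keep the dense 2-bit payload for the common case. An entropy coder would squeeze out a little more, but at the cost of speed and simplicity.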
jkbonfield | 7 years ago | on: How to Identify Almost Anyone in a Consumer Gene Database
Disclaimer, that's my blog and I'm the primary author of the newer version of CRAM.
I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.
Have you tried making PRs back to the libbsc GitHub to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the leg work.
Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.