jkbonfield's comments

jkbonfield | 3 years ago | on: Bzip3 – A better and stronger spiritual successor to bzip2

Well yes it was one file, but it was stated as being good on text and enwik8 is a pretty standard test corpus for text compressors.

I could have done more, but it somewhat vindicated what I was saying really. It has a very similar core to bsc (based on the same code) and gives very similar file sizes as expected. Note you may wish to use bsc -tT to disable both forms of threading. I don't know if that changes memory usage any.

Have you tried making PRs back to the libbsc github to fix the UB and fuzzing issues? I'm sure the author would welcome fixes given you've already done the legwork.

Anyway, please do consider benchmarking against libbsc. It's conspicuously absent given the shared ancestry.

jkbonfield | 3 years ago | on: Bzip3 – A better and stronger spiritual successor to bzip2

It doesn't compare itself against bsc, which feels a bit poor IMO given it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

On my own benchmarks, it's basically comparable size (about 0.1% smaller than bsc), comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet though.

jkbonfield | 4 years ago | on: Alarm raised after Microsoft wins data-encoding patent

As the author of the CRAM implementation of rANS, I can say that these sorts of articles aren't helpful. Clearly my work predates this by several years, so there is nothing here which can realistically impact CRAM; however, fear alone is sufficient to damage uptake and usage. It's back to the classic strategy of FUD: Fear, Uncertainty, Doubt.

Sadly the patent is woefully vague. I understand rANS and a lot of the optimisations used for doing it efficiently, but I don't understand that patent. It's just not clear enough for someone skilled in the field to know exactly what it is they're patenting.

Specifically, rANS states are updated on a symbol-by-symbol basis. A query is made to get a range, that is looked up to determine the symbol for that range, and then the state is updated and possibly renormalised to fit in the desired range of valid states. There is no "whether or not a previous symbol was decoded" scenario. So I think they're talking at a higher level of block-based decoding: using stats from the previous block to decode the current block, with the possibility of a symbol in the current block not being observed in the previous one (i.e. escape codes, as used in PPM etc.). However I'm really not sure. That's just me grasping at straws, trying to think of something that makes sense.
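For the curious, that decode loop can be sketched in a few lines. This is a toy illustration of the general rANS construction (12-bit frequency scale, byte-wise renormalisation), not the CRAM codec itself; the constants and table layout here are my assumptions for the sketch. A matching encoder is included so it round-trips.

```python
# Toy rANS sketch: illustrates the symbol-by-symbol decode loop
# described above. NOT the CRAM codec; constants are assumptions.

SCALE_BITS = 12
M = 1 << SCALE_BITS          # symbol frequencies must sum to M
RANS_LOW = 1 << 16           # lower bound of the normalised state range

def build_tables(freqs):
    """freqs: dict of symbol -> frequency, frequencies summing to M."""
    cum, slot_to_sym, total = {}, [None] * M, 0
    for sym, f in freqs.items():
        cum[sym] = total
        slot_to_sym[total:total + f] = [sym] * f
        total += f
    assert total == M
    return cum, slot_to_sym

def rans_encode(msg, freqs, cum):
    state, out = RANS_LOW, []
    for sym in reversed(msg):                 # rANS encodes in reverse order
        f = freqs[sym]
        while state >= (RANS_LOW >> SCALE_BITS) * 256 * f:
            out.append(state & 0xFF)          # renormalise: emit low bytes
            state >>= 8
        state = (state // f) * M + (state % f) + cum[sym]
    return state, out                         # decoder pops bytes from the end

def rans_decode_symbol(state, stream, freqs, cum, slot_to_sym):
    slot = state & (M - 1)                    # 1. query: low bits give a slot
    sym = slot_to_sym[slot]                   # 2. range lookup -> symbol
    state = freqs[sym] * (state >> SCALE_BITS) + slot - cum[sym]  # 3. update
    while state < RANS_LOW:                   # 4. renormalise back into range
        state = (state << 8) | stream.pop()
    return sym, state
```

Note how each call fully decodes one symbol (query, lookup, update, renormalise) with no dependence on whether some previous symbol happened to be decoded.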

In short, the patent fails the most basic test of being implementable by someone else who has read it. It's deliberately vague so they can apply it to as much of the problem space as possible. That annoys me! As does patenting trivial, non-inventive modifications to rANS. rANS is just a drop-in replacement for the earlier (arithmetic) range coders, albeit operating in reverse order, so any tricks and techniques previously applied to range coders can also be applied to rANS. Being the first to do such a thing is no more an invention than being the first person to use fiberglass or carbon fibre in a boring everyday item such as a chair. The new material is the invention, not being the first to use it in a particular manner that is obvious to all.

jkbonfield | 6 years ago | on: How to Become the Best in the World at Something

I know I'm not the best at anything I do, but when combined the overlap makes a niche for me that has so far worked out well. Pure luck frankly.

However that is only because the skills I'm thinking of aren't in themselves hugely common. I'm not mega skilled at them, but the relative rarity of some means the overlap really is quite small - fortunately!

It's not going to be good advice if your key skills are, say, selling things and being super knowledgeable about cars. That's an obvious combination, but it's one that's already popular. So I'd say start by becoming good at a couple of niche skills and then see where you end up.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

Some of the MPEG-G authors are experts in genomics data compression, while others are experts in video compression. It should, in theory, be a good mix.

MPEG are also well aware of the prior art. The authors of various existing state-of-the-art genome compression tools were invited to one of the first MPEG-G conferences, where they presented their work. Do not assume that because they do not compare against the state of the art, they are unaware of its existence or how it performs. It's more likely simply that "10x better than BAM" is a powerful message that sells products, more so than "a little bit better than CRAM". It's standard advertising technique.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

Firstly, GA4GH has commercial members as well as academics, and all collaborate together to produce file formats, standards and protocols.

Secondly you missed out a key part of funding - precompetitive alliances. Eg see the Pistoia Alliance (https://www.pistoiaalliance.org/) who funded the SequenceSqueeze project into compression of FASTQ (http://www.sequencesqueeze.org/).

The notion here is simple - there are some technologies that are so core they cross all the commercial and academic boundaries. Collaboration rather than competition is considered to be to the mutual benefit of everyone involved. It is this mindset which led to the Alliance for Open Media (AOM), who are also in direct competition with MPEG.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

Show me where they notified the others taking part about their patents or intent to patent. They sought out academics and invited them to take part. Yes, I was naive, but I also felt rather misled.

The GenomSys patents aren't even listed in the ISO patent list yet: https://www.iso.org/iso-standards-and-patents.html

I don't know if this is against ISO rules - it is unclear to me whether they only need to list their patents on grant rather than on submission.

The only reason I discovered these was due to an accidental hit from a Google Scholar alert. I didn't even realise it searched patents when I set that up.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

I am the author of an implementation, although not the author of the file format itself. Yes, that is still a fair point if you look at just the one blog post; however, there is a series of them in which I clearly explain the process and my involvement, so don't just look at the last.

I agree though the message would be better if it came from a third party. I was hoping this would happen, but it didn't look likely before the GA4GH conference (where both myself and MPEG were speaking), so I self published before that to ensure people were aware and could ask appropriate questions (to both myself and MPEG of course).

As for royalties, CRAM comes under the governance of the Global Alliance for Genomics & Health (https://www.ga4gh.org). They stated explicitly in the recent conference that their standards are royalty free (as far as is possible to tell) and promote collaboration on the formats / interfaces, competition on the implementation. For the record, we are unaware of any patents covering CRAM and we have filed none of our own, nor do we intend to for CRAM 4.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

Disclaimer - I am the author of the blog.

There are no "comparisons with" CRAM in the MPEG-G preprint, only a comparison between CRAM and DeeZ, taken from the DeeZ paper. Those comparisons are fair and correct, but obviously were done at the time that paper was written - some 4 years ago. Since then CRAM has moved on (as has DeeZ), but modern CRAM generally beats modern DeeZ if we restrict ourselves to formats that permit random access (DeeZ has a higher-compression non-random-access mode).

So far there has been no direct data on how well MPEG-G performs, bar an old slide from a year ago (an ISMB talk, I think):

https://mpeg.chiariglione.org/sites/default/files/events/Mat...

From that we can glean some compression ratios at least. I attempted to compress the same data set with CRAM, but that data set, while public, is in FASTQ format instead of BAM. I asked the author of the talk how he had produced the BAM, but got no response. I made my best stab at creating something similar, but it's not a satisfactory comparison yet.

jkbonfield | 7 years ago | on: MPEG-G: the ugly

It's hard with Huffman given you need to deal with N. Realistically you'll end up with 3 bases at 2 bits each, 1 base at 3 bits, and the remaining 3-bit code acting as a prefix for everything else (N, ambiguity codes, etc.), so averaging close to 2.3 bits per base is the norm.
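As a toy sketch of that code shape (the exact bit assignments here are my assumption: A=00, C=01, G=10, T=110, with 111 escaping to an 8-bit literal for N and friends):

```python
# Toy prefix code of the shape described above. Bit assignments are
# illustrative assumptions, not any particular tool's codes.
CODE = {'A': '00', 'C': '01', 'G': '10', 'T': '110'}
ESCAPE = '111'   # prefix for everything else (N, IUPAC ambiguity codes, ...)

def dna_encode(seq):
    out = []
    for base in seq:
        if base in CODE:
            out.append(CODE[base])
        else:                          # rare case: escape + 8-bit literal
            out.append(ESCAPE + format(ord(base), '08b'))
    return ''.join(out)

def dna_decode(bits):
    inv = {v: k for k, v in CODE.items()}
    out, i = [], 0
    while i < len(bits):
        if bits[i:i + 3] == ESCAPE:            # escaped literal
            out.append(chr(int(bits[i + 3:i + 11], 2)))
            i += 11
        elif bits[i:i + 2] in inv:             # 2-bit codes (A, C, G)
            out.append(inv[bits[i:i + 2]])
            i += 2
        else:                                  # remaining 3-bit code: 110 = T
            out.append(inv[bits[i:i + 3]])
            i += 3
    return ''.join(out)
```

With uniform ACGT the expected cost is 3 x 1/4 x 2 + 1/4 x 3 = 2.25 bits/base, and the escape and framing overheads push that towards the ~2.3 figure.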

If N is rare though, you're better off just doing blocks of 2-bit encoding and dropping to something more complex for the rare cases. Or of course just using an arithmetic / range / ANS coder.

jkbonfield | 7 years ago | on: How to Identify Almost Anyone in a Consumer Gene Database

To be "fair", their latest revision of the paper includes some figures from a 4-year-out-of-date version of CRAM, which is an improvement on the 10-year-old format they used initially. ;-)

Disclaimer, that's my blog and I'm the primary author of the newer version of CRAM.
