I think that's the only excerpt from the blog post suggesting that companies will do this research out of goodwill. The reality is that, for now, this work is largely being done by academia and government research groups, and it's unclear that MPEG is pushing the state of the art forward more than incrementally.
A concerted effort to improve technology can be made without patents; that's what collaboration is for. Instead, some, like MPEG-LA, are making a concerted effort to extort money and prevent competing technologies from emerging. That's the opposite of progress.
> a consortium of companies using patents and licensing to recover their investment
The title is just a title; it doesn't have any legal significance. These patents address a specific encoding that is compact and indexed, not the general idea of such encodings. The patents expressly distinguish the claimed method from the approach you're describing (applying traditional compression techniques to FASTA/FASTQ files): https://patentscope.wipo.int/search/en/detail.jsf?docId=WO20...
If there are really patents protecting this format, that makes it a complete non-starter for a great deal of work (commercial and academic). Posts like this scare me. I don't want to devote effort to supporting a format that I might not be able to use in the future. The only thing I can think of that might work is putting the patents in some sort of defensive portfolio, in much the same way that the Open Invention Network protects Linux.
I agree; the comparison with CRAM is in the MPEG-G whitepaper. But the author of the blog has some more recent posts in which he is very skeptical of the claims made about the CRAM format. They're worth a read.
Interesting, isn't it? As a biochemist I can see only a slight correspondence between a sequence of audio or visual data and one of genetic data (as a contrary notion, there is only a small statistical expectation of correlation between frame N and N+1 for DNA data, yet a lot between whole sequences in terms of evolutionary homology). And yet there it is, plopped right in the middle of the Wikipedia standards page[1]. Likely more easily explained from the business point of view than the scientific one.
All MPEG standards have patents and this one is no exception. If companies are interested, they can license its use (assuming fair terms). This is far better than having proprietary formats that are locked, or formats made by a single company for which the patent situation is unclear. Also, the companies involved invested in the development of this standard and expect some return.
> What I don't like in this post, is the call for non-adoption when the author has a competing format (CRAM) for which the patent situation and the performance is not clear. It seems a biased opinion.
I am the author of an implementation, although not the author of the file format itself. That said, yes, it is still a fair point if you look at just the one blog post. However, there is a series of them in which I clearly explain the process and my involvement, so don't just look at the last one.
There are several points to the blog post, but I think the main point you are missing is this: CRAM represents (as do other things, like BAM) prior art that calls the new patents into question. Still inconclusive for the moment, but the question has been raised, and your answer does not address it.
It is a mistake to take for granted that "more technological advance" is worth the price society would pay for it. That price, imposed through patents, is unacceptable in this case.
rayiner | 7 years ago
> There are clear non-royalty based incentives for large companies to develop new compression algorithms and drive the industry forward. Both Google and Facebook have active data compression teams, led by some of the world's top experts in the field.
Google and Facebook can afford to spend money on R&D because they throw off gobs of money from near-monopolies in important economic sectors. This is one of the archetypal models for R&D, and has a lot of precedent: AT&T Bell Labs (bankrolled by AT&T's telephone monopoly) and Xerox PARC (bankrolled by the copier monopoly built on Xerox's patents). Many of the really fundamental technologies underlying computing were developed this way.
But MPEG is thirty years old now, and the MPEG-1 standard is 25 years old. Until recently, the MPEG standards have been pushed forward not by a single giant corporation that can afford to bankroll everything, but by a consortium of companies using patents and licensing to recover their investment in the R&D. This is one of the other archetypal models for R&D. Many of the other fundamental technologies underlying computing were developed this way.
(The third archetypal model is the government-funded project, e.g. TCP/IP, which is also an example of a monopoly bankrolling R&D.)
The "benevolent monopoly" model obviously has advantages for open source--because the company bankrolls R&D by monetizing something else, it can afford to release the results of the research for everyone to use. But it's not sustainable without the sponsor (and we know this, because open source has been around for a long time, and there is little precedent for a high-performance video codec designed by an independent group of open source developers).[1]
I see people demonizing MPEG and espousing reliance on Google and FB as the way forward, but it's not clear to me that everyone fully understands the implications of that approach.
[1] Query whether Theora counts--it was based on an originally proprietary, patented codec.
tgb | 7 years ago
There's also the idea that our personal medical data (currently only a portion of all genomic data, but increasing rapidly) should be entirely open to reading, no licenses required.
shmerl | 7 years ago
doombolt | 7 years ago
If this is their model, I expect them to be straightforward about it, and I expect that most everyone will not touch their standards with a 10 ft pole. Standards are not in short supply. I expect a "this food contains known poison" sort of explicit label on their wares.
Instead they make it look like a community effort and stay silent about the strings attached.
We don't want their R&D under such terms.
pdkl95 | 7 years ago
Based on the patent titles (I'll see if I can read some of them in detail tomorrow), most of these sound almost exactly like the code I wrote while working at the JGI[1][2][3] in the early 2000s that managed moving large amounts of reads from the ABI (Sanger) sequencers, running them through phred/phrap, and storing it all so the biologists could access it easily. This included a custom Huffman-tree-based encoder/decoder to efficiently store FASTA files at (iirc) about 2.5 bits/base (quality scores were just stored as a packed array of bytes), a very large MySQL backend, and a large set of Perl libraries that provided easy access to reads/libraries/assemblies/etc. It was certainly a "method and apparatus" for "storing and accessing" + "indexing" bioinformatics data using a "compact representation" that provided many different types of "selective access".
I even had code that did an LD_PRELOAD hack on (circa 2002) Consed that intercepted calls to open(2) to load reads automagically from the DB. Reading Huffman-encoded data in bulk from the DB (instead of one file per read) reduced the network bandwidth required to open an assembly with all its aligned reads by ~90%. That sounds a lot like "transmission of bioinformatics data" over a network and "access ... structured in access units". It definitely involved "reconstruction of genomic reference sequences from compressed genomic sequence reads".
They may have a more efficient compression method, and we didn't do anything re: "multiple genomic descriptors" (was that even a thing <2004?), but... no... they didn't invent what are basically bioinformatics-specific variations of the same methods used everywhere in the computer industry for as long as "text file formats" have existed.
[1] https://jgi.doe.gov/
[2] These are my personal comments and opinions only; they are not endorsed by the Joint Genome Institute, Lawrence Berkeley National Laboratory, or the U.S. Department of Energy, and I am not currently affiliated with any of them.
[3] While I have no idea if any of that code even exists today (I left the JGI in 2004), I did mark the source files with the BSD license, since there was historical precedent.
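For anyone curious how a bits/base figure falls out of a Huffman code, here is a minimal Python sketch. It is not the JGI code (that isn't shown in this thread); the function names and sample sequences are made up for illustration, and treating `N` as a fifth symbol is an assumption about what a FASTA file might contain.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tiebreaker, {symbol: depth so far}).
    # The integer tiebreaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_bits_per_symbol(text):
    """Average Huffman code length, weighted by how often each symbol occurs."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[s] * lengths[s] for s in freq) / len(text)


# Hypothetical inputs: one with only the four bases, one with some N calls.
only_bases = "ACGT" * 100
with_unknowns = "A" * 24 + "C" * 24 + "G" * 24 + "T" * 24 + "N" * 4

print(average_bits_per_symbol(only_bases))
print(average_bits_per_symbol(with_unknowns))
```

With four equally frequent bases every code is 2 bits long; once a fifth symbol shows up, some codes have to be longer and the average rises above 2.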
twic | 7 years ago
Given that there are four bases, I would have thought you could reliably do it in 2 bits per base. What am I missing?
rayiner | 7 years ago
> [0003] The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
(Note that GZIP, mentioned in the patent as an example of the prior art, uses Huffman coding.)
The patent also expressly distinguishes using an index separate from the bitstream of the genomic data, which would be the case in your method above where you're storing compressed data in a MySQL database:
> 1. For CRAM, data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it's implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded bit stream.
It also explains:
> [0006] The present invention aims at compressing genomic sequences by organizing and partitioning data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are enabled.
Without knowing more about it, it sounds like the approach you describe wouldn't allow for incremental updates.
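To make the "separate index" versus "indexes embedded in the encoded bit stream" distinction concrete, here is a toy Python sketch. It is only an illustration under my own assumptions: the layout (a block count, a table of compressed sizes, then zlib-compressed blocks) and the names `pack_with_embedded_index` and `read_block` are hypothetical, and this is not the format described in the patent application, CRAM, or MPEG-G.

```python
import struct
import zlib


def pack_with_embedded_index(blocks):
    """Compress each block on its own and put a size table in the same stream."""
    compressed = [zlib.compress(b) for b in blocks]
    # Header: number of blocks as a little-endian 32-bit unsigned integer.
    header = struct.pack("<I", len(compressed))
    # Index: one 32-bit compressed size per block, stored right after the header.
    index = b"".join(struct.pack("<I", len(c)) for c in compressed)
    return header + index + b"".join(compressed)


def read_block(stream, i):
    """Decompress only block i, locating it through the embedded size table."""
    (count,) = struct.unpack_from("<I", stream, 0)
    if not 0 <= i < count:
        raise IndexError(i)
    sizes = struct.unpack_from(f"<{count}I", stream, 4)
    # Payload starts after the header (4 bytes) and the index (4 bytes per block).
    start = 4 + 4 * count + sum(sizes[:i])
    return zlib.decompress(stream[start:start + sizes[i]])


# Hypothetical data: three small "reads" stored as independent blocks.
reads = [b"ACGT" * 10, b"TTTT", b"GATTACA"]
stream = pack_with_embedded_index(reads)
print(read_block(stream, 2))
```

Because each block is compressed on its own and the size table travels in the same byte stream, a reader can decompress one block without touching the rest, which is not possible with a single gzip blob.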
mbreese | 7 years ago
I understand the desire to develop bioinformatics file formats in a more disciplined way than we have done in the past, but this process seems like it may be more of a pain than a benefit. Unfortunately, I couldn't see some of the MPEG-G talks at ISMB this year (other talks were concurrent).
Could anyone explain what the benefits of the MPEG-G format are over something like CRAM? I mean, we were already starting to get close to the theoretical minimum in terms of file size. I personally would like to see more support for encryption and robustness (against bitrot) in formats, but this could be done in a very similar way to current formats.
deugtniet | 7 years ago
0xcde4c3db | 7 years ago
theophrastus | 7 years ago
[1] https://en.wikipedia.org/wiki/Moving_Picture_Experts_Group
ezoe | 7 years ago
What are they thinking?
jascenso | 7 years ago
What I don't like in this post is the call for non-adoption when the author has a competing format (CRAM) for which the patent situation and the performance are not clear. It seems a biased opinion.
cure | 7 years ago
Actually, the author has been part of the genomics community for a long time. CRAM (and BAM) are existing de facto standards. There is no rent-seeking organization behind those formats; there are no patents.
MPEG, the Moving Picture Experts Group, is trying to move into the genomics space to make money. They are trying to create a 'standard' called MPEG-G. The very same people who are driving the MPEG-G spec are trying to obtain patents that cover the format.
These patent applications are probably invalid. They seem to be obvious, and there seems to be lots of prior art in CRAM and other applications and papers. Proving this in order to invalidate them will be time-consuming and expensive, but also necessary, because you can be sure that the patents, if granted, will be used to extort money from people in bioinformatics. They may also be used offensively against CRAM.
This is a complete waste of time.
jkbonfield | 7 years ago
I agree, though, that the message would have been better if it had come from a third party. I was hoping this would happen, but it didn't look likely before the GA4GH conference (where both MPEG and I were speaking), so I self-published before that to ensure people were aware and could ask appropriate questions (of both myself and MPEG, of course).
As for royalties, CRAM comes under the governance of the Global Alliance for Genomics & Health (https://www.ga4gh.org). They stated explicitly at the recent conference that their standards are royalty free (as far as it is possible to tell) and that they promote collaboration on the formats / interfaces and competition on the implementation. For the record, we are unaware of any patents covering CRAM and we have filed none of our own, nor do we intend to for CRAM 4.
jomar | 7 years ago
It's more that CRAM is an incumbent format, developed by (different members of) the same genomics community that made the preceding BAM format in the same space. Both BAM and CRAM have been in common use in the field for 5+ years.
As the newcomer, the onus is on the MPEG-G proponents to compare its performance to the formats already in common use.
natch | 7 years ago
RichardStallman | 7 years ago
We are better off if other people encode in older, less efficient codecs that we can support in free/libre software, than if they encode the files a little smaller and we are forbidden by the MPEG patent portfolio to handle it with free software.
See https://www.gnu.org/philosophy/software-literary-patents.htm... and https://www.gnu.org/philosophy/limit-patent-effect.html.
You'll note that I do not use the term "open source". Since 1983, I have led the free software movement, which campaigns to win freedom in our computing by insisting on software that respects users' freedom. Open source was coined in 1998 to discard the ethical foundation and present the software as a mere matter of convenience.
See https://gnu.org/philosophy/open-source-misses-the-point.html for more explanation of the difference between free software and open source. See also https://thebaffler.com/salvos/the-meme-hustler for Evgeny Morozov's article on the same point.
Which one you advocate is up to you. If you stand for freedom, please show it -- by saying "free" and "libre", rather than "open".