I am a CS person who works with bioinformaticians every day as part of my job.
I really like that Seq seems to have built-in some parallelization ability. I spend no small amount of time in my day job doing that manually in R with RcppParallel for loops that are totally independent across each iteration.
Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.
I've noticed this same bias towards a single language for some other academic niches. Like SAS or Stata camps in public health or psychology - I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs Russian.
Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.
Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.
Scientists like using R instead of because the language lets them get set up and coding quickly with RStudio. More importantly, the language, tooling, and ecosystem is very forgiving when it comes to code quality and style. There is good R code out there, but the R community generally lacks the wide acceptance of good coding practices you see with Python users: unit tests, sane dependency management, type hints, documentation, safe namespacing, etc.
It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.
So basically, the same thing that kept(keeps?) Visual Basic in use for so long.
My son works in polysci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the TidyVerse at MIT.
It’s very weird for me, I develop and maintain a piece of software that that has 3 OSes, and 5 languages to wrestle with as well as multiple “tool” technologies like Ansible/MQTT, etc. so I’m very much in a polyglot-best-tool-for-the-job environment. Observationally from a casual POV, I see pros/cons both ways.
Very interesting! I noticed a similar phenomenon in the GIS space. All of my colleagues with formal training in GIS use ArcGIS and its Python API, but those without such background gravitate towards FLOSS solutions.
I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.
> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.
This shows imo that BioJulia is better, precisely because it validates data and is a broader programming language invented for science, not a DSL that optimises for speed over all else. Besides the new version of BuiJulia seems to perform even better than seq.
> Seq is a Python-compatible language, and the vast majority of Python programs should work without any modifications
> Seq is able to outperform Python code by up to 160x.
So ... a reimplementation of Python that can outperform cpython by over 100 times? I know literally nothing about this project, but I have to say that rings pretty false for me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)
Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.
I also know literally nothing about this particular project, but why not? If you support a small restricted subset of Python it's completely doable under certain conditions for specific types of programs. E.g., Numba can easily outperform Python 100-1000x in numerical applications (done it myself multiple times), simply because it jit-compiles the code by first translating it to LLVM IR.
Look at the link: https://github.com/seq-lang/seq
It says 96% of the code is C++ in the "Languages" box on the right. C ( and C++ and Rust) outperforms Python in benchmarks and certain optimized C code can do 160x over very naive Python. So this is very possible, though the routines tested are probably cherry picked for bragging rights.
> We show that many important and widely-used NGS algorithms can be made up to 160× faster than their Python counterparts as well as 2× faster than the existing hand-optimized C++ implementations
It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).
Most newer languages will give you multiple orders of magnitude better performance than python.
Python’s main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably as easy (or easier).
It is now coasting off its momentum, mostly do to the vast amount of (usually poorly designed) open source libraries. That and Jupyter.
Hi everyone, I’m one of the developers on the Seq project — I was delighted to see it posted here! We started this project with a focus on bioinformatics, but since then we’ve added a lot of language features/libraries that have closed the gap with Python by a decent margin, and Seq today can be useful in other areas or even for general Python programs (although there are still limitations of course). We’re in the process of creating an extensible / plugin-able Python compiler based on Seq that allow for other domain-extensions. The upcoming release also has some neat features like OpenMP integration (e.g. “@par(num_threads=10) for i in range(N): …” will run the loop with 10 threads). Happy to answer any questions!
Have follow-up benchmarks vs BioJulia been done since 2019? If I remember correctly at the time, the result was that BioJulia was faster once you consider that it did validation.
It's an impressive project, but I'm not sure the niche is big enough. It's certainly come a long way since the last time I looked at it!
My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long long time, and I think a lot of potential users of Seq would be better suited by using a regular bioinformatics library in the language they know best.
e.g: The example of reading Fasta files in Seq:
# iterate over everything
for r in FASTA('genome.fa'):
print r.name
print r.seq
versus BioPython:
from Bio import SeqIO
for r in SeqIO.parse("genome.fa", "fasta"):
print(r.id)
print(r.seq)
It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)
> It's an impressive project, but I'm not sure the niche is big enough.
Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.
Typically, any high performance (low latency or high throughput) genomics/bioinformatics applicaiton is not going to be written in plain Python, except possibly for prototyping. Instead, nearly all codes today are written in C++ or Java, with some sort of command and control in Python or a DAG-based workflow scheduler.
I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.
IMO, spark isn't the way forward. The typical pattern with it is it lets you scale up to 100 cores really easily which is almost enough to compete with a good single threaded implementation in a fast language.
I recall that the group that created Spark had a bioinformatics project on Spark but I don't know what happened to it. All I could find now is a paper[1] hosted by databricks.
> Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.
A pitch most people doing applied bioinformatics won’t understand/appreciate.
I'm in the target market but can't use this unless it supports all of my Python libraries like Django and Numpy.
It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.
V8 shows this can be done with highly dynamic Javascript. I guess we need a big corporate sponsor or the community to fund some positions.
It's kind of crazy how few developers are working on optimising cPython, it may even be a worth it for environmental reasons.
I like this idea. However to me it is similar to using à la carte tools/programs along with bash script or DSL such as Nextflow. More often these stand-alone programs are already written in compiled languages. I am sure Seq will allow to build customized programs as compared to scripting or gluing programs.
(I'm one of the developers on Seq.) We've actually been working mostly on closing the gap with Python for the last year or so. Seq can be useful for plain Python programs as well -- I give a bit more context in my comment above.
clusterhacks|4 years ago
I really like that Seq seems to have built-in some parallelization ability. I spend no small amount of time in my day job doing that manually in R with RcppParallel for loops that are totally independent across each iteration.
Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.
I've noticed this same bias towards a single language for some other academic niches. Like SAS or Stata camps in public health or psychology - I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs Russian.
Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.
Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.
psychometry|4 years ago
It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.
travisgriggs|4 years ago
My son works in polysci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the TidyVerse at MIT.
It’s very weird for me, I develop and maintain a piece of software that that has 3 OSes, and 5 languages to wrestle with as well as multiple “tool” technologies like Ansible/MQTT, etc. so I’m very much in a polyglot-best-tool-for-the-job environment. Observationally from a casual POV, I see pros/cons both ways.
dr_kiszonka|4 years ago
I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.
encode|4 years ago
dgb23|4 years ago
> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.
Reminds me of the myriads of Excel catastrophes.
dunefox|4 years ago
bscphil|4 years ago
> Seq is able to outperform Python code by up to 160x.
So ... a reimplementation of Python that can outperform cpython by over 100 times? I know literally nothing about this project, but I have to say that rings pretty false for me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)
Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.
amelius|4 years ago
aldanor|4 years ago
drocer88|4 years ago
arc-in-space|4 years ago
It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).
snicker7|4 years ago
Python’s main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably as easy (or easier).
It is now coasting off its momentum, mostly do to the vast amount of (usually poorly designed) open source libraries. That and Jupyter.
hoseja|4 years ago
arshajii|4 years ago
adgjlsfhk1|4 years ago
fuzzythinker|4 years ago
Not claiming it as any sort of reference, but you can see how it [2] may be used to solve some basic genome sequencing.
[1] https://www.coursera.org/specializations/bioinformatics
[2] https://github.com/fuzzthink/seq-genomics
fwip|4 years ago
My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long long time, and I think a lot of potential users of Seq would be better suited by using a regular bioinformatics library in the language they know best.
e.g: The example of reading Fasta files in Seq:
versus BioPython: It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)1. https://biopython.org/
chmaynard|4 years ago
Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.
totalperspectiv|4 years ago
jpxw|4 years ago
dekhn|4 years ago
I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.
adgjlsfhk1|4 years ago
east2west|4 years ago
[1]https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...
f6v|4 years ago
A pitch most people doing applied bioinformatics won’t understand/appreciate.
car|4 years ago
Will this be available via conda? And how would seq integreate with Snakemake, since that is also based on Python?
tdido|4 years ago
haihaibye|4 years ago
It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.
V8 shows this can be done with highly dynamic Javascript. I guess we need a big corporate sponsor or the community to fund some positions.
It's kind of crazy how few developers are working on optimising cPython, it may even be a worth it for environmental reasons.
Bostonian|4 years ago
haihaibye|4 years ago
https://github.com/seq-lang/seq/issues/223
unknown|4 years ago
[deleted]
kasperset|4 years ago
unknown|4 years ago
[deleted]
tdido|4 years ago
https://dl.acm.org/doi/pdf/10.1145/3360551
https://www.nature.com/articles/s41587-021-00985-6 (paywalled)
gandalfgeek|4 years ago
chmaynard|4 years ago
arshajii|4 years ago
dunefox|4 years ago
jack_riminton|4 years ago
adgjlsfhk1|4 years ago
winter_squirrel|4 years ago
[deleted]