Seq – A programming language for computational genomics and bioinformatics

116 points| tdido | 4 years ago |github.com

59 comments

clusterhacks|4 years ago

I am a CS person who works with bioinformaticians every day as part of my job.

I really like that Seq seems to have some parallelization ability built in. I spend no small amount of time in my day job doing that manually in R with RcppParallel, for loops whose iterations are totally independent of each other.

Bioinformaticians are often educated to use a specific programming language and environment. They aren't usually looking to try other languages. For example, I support our bioinformatics group and they are basically 100% R and RStudio users. We have a single user of Python and that user is doing "typical" tensorflow stuff with images.

I've noticed this same bias towards a single language for some other academic niches. Like SAS or Stata camps in public health or psychology - I think of these languages as basically the same, but for non-CS folks the perception seems to be more like English vs Russian.

Even more complicated, researchers may be extremely committed to a specific library in a language and suspicious of languages that don't have their favorite library available.

Any shift to new tooling for these highly-committed users will almost certainly require large and obvious benefits to gain traction.

psychometry|4 years ago

Scientists like using R instead of Python because the language lets them get set up and coding quickly with RStudio. More importantly, the language, tooling, and ecosystem are very forgiving when it comes to code quality and style. There is good R code out there, but the R community generally lacks the wide acceptance of good coding practices you see with Python users: unit tests, sane dependency management, type hints, documentation, safe namespacing, etc.

It's really saying something when scientists think writing Python code is a pain, because Python's a pretty forgiving language, too.

travisgriggs|4 years ago

So basically, the same thing that kept (keeps?) Visual Basic in use for so long.

My son works in polysci analytics and I see the same thing you describe. A group will pick a tool and flog all problems with it. Change rarely occurs. He was in the Stata camp at one university, the TidyVerse at MIT.

It’s very weird for me. I develop and maintain a piece of software that has 3 OSes and 5 languages to wrestle with, as well as multiple “tool” technologies like Ansible/MQTT, etc., so I’m very much in a polyglot-best-tool-for-the-job environment. Observationally, from a casual POV, I see pros/cons both ways.

dr_kiszonka|4 years ago

Very interesting! I noticed a similar phenomenon in the GIS space. All of my colleagues with formal training in GIS use ArcGIS and its Python API, but those without such background gravitate towards FLOSS solutions.

I am aware of only one case where a community migrated to other software. Many economists I know switched from Stata to R. Some of them later moved on to Python.

encode|4 years ago

Also see this comparison between Julia's BioSequences and Seq by Jakob Nissen and Ben Ward: https://biojulia.net/post/seq-lang/

dgb23|4 years ago

An interesting takeaway:

> So it appears the primary reason BioJulia code is slower than Seq code in these three benchmarks is that BioSequences.jl is doing important work for you that Seq is not doing. As scientists, we hope you value tools that spend the time and effort to validate inputs given to it rather than fail silently.

Reminds me of the myriads of Excel catastrophes.
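To make the quoted trade-off concrete, here is a minimal Python sketch of the difference between skipping and performing input validation. It is not the BioSequences.jl or Seq implementation; the function names `revcomp_unchecked` and `revcomp_checked` are hypothetical, and the accepted alphabet (A, C, G, T, N) is an assumption chosen for illustration.

```python
# Hypothetical helpers for illustration only; not the API of Seq or BioSequences.jl.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
_VALID = frozenset("ACGTNacgtn")  # assumed alphabet for this sketch


def revcomp_unchecked(seq: str) -> str:
    """Fast path: unknown characters pass through unchanged, silently."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_checked(seq: str) -> str:
    """Validating path: reject anything outside the assumed alphabet."""
    bad = set(seq) - _VALID
    if bad:
        raise ValueError(f"invalid nucleotide(s): {sorted(bad)}")
    return seq.translate(_COMPLEMENT)[::-1]


print(revcomp_unchecked("ACGX"))  # the bogus 'X' survives without any warning
try:
    revcomp_checked("ACGX")
except ValueError as err:
    print("rejected:", err)
```

The checked version does an extra pass over the input, which is the kind of work a benchmark can penalize even though it is what catches bad data.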

dunefox|4 years ago

This shows imo that BioJulia is better, precisely because it validates data and is built on a broader programming language invented for science, not a DSL that optimises for speed over all else. Besides, the new version of BioJulia seems to perform even better than Seq.

bscphil|4 years ago

> Seq is a Python-compatible language, and the vast majority of Python programs should work without any modifications

> Seq is able to outperform Python code by up to 160x.

So ... a reimplementation of Python that can outperform cpython by over 100 times? I know literally nothing about this project, but I have to say that rings pretty false for me. Hell, even PyPy has trouble with many applications. (Plus they're claiming to outperform "equivalent" C code by 2x.)

Even if the performance claims are overblown, it's always nice to see new work on compiled languages with easy-to-read syntax. It's hard to beat Python for an education / prototyping language, so I will definitely be giving this a look.

amelius|4 years ago

It's probably in the same sense that Numpy is much faster than doing matrix operations with pure Python arrays and Python for-loops.

aldanor|4 years ago

I also know literally nothing about this particular project, but why not? If you support a small restricted subset of Python it's completely doable under certain conditions for specific types of programs. E.g., Numba can easily outperform Python 100-1000x in numerical applications (done it myself multiple times), simply because it jit-compiles the code by first translating it to LLVM IR.

drocer88|4 years ago

Look at the link: https://github.com/seq-lang/seq It says 96% of the code is C++ in the "Languages" box on the right. C (and C++ and Rust) outperforms Python in benchmarks, and certain optimized C code can do 160x over very naive Python. So this is very possible, though the routines tested are probably cherry-picked for bragging rights.

arc-in-space|4 years ago

> We show that many important and widely-used NGS algorithms can be made up to 160× faster than their Python counterparts as well as 2× faster than the existing hand-optimized C++ implementations

It seems it's better to think of this particular claim as "we made a C++ algorithm that is 2x faster than the previous SotA C++ algorithm" (with the help of a heavily optimized DSL).

snicker7|4 years ago

Most newer languages will give you multiple orders of magnitude better performance than python.

Python’s main advantage was that it was easier than some of its competitors (C++/Java). But that is no longer the case with modern languages (Nim/Crystal/Julia/JavaScript) being both faster and comparably as easy (or easier).

It is now coasting off its momentum, mostly due to the vast amount of (usually poorly designed) open source libraries. That and Jupyter.

hoseja|4 years ago

Probably, it can outperform generic python specifically for genomics payloads, versus python code/C code.

arshajii|4 years ago

Hi everyone, I’m one of the developers on the Seq project, and I was delighted to see it posted here! We started this project with a focus on bioinformatics, but since then we’ve added a lot of language features/libraries that have closed the gap with Python by a decent margin, and Seq today can be useful in other areas or even for general Python programs (although there are still limitations of course). We’re in the process of creating an extensible, plugin-able Python compiler based on Seq that allows for other domain extensions. The upcoming release also has some neat features like OpenMP integration (e.g. “@par(num_threads=10) for i in range(N): …” will run the loop with 10 threads). Happy to answer any questions!
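For readers comparing the parallel-loop feature mentioned above with plain Python, a rough standard-library analogue is a thread pool mapped over independent iterations. This is only a sketch under assumptions: `parallel_map` and `gc_content` are hypothetical names, and this is not how Seq implements its OpenMP integration. Also note that in CPython the global interpreter lock keeps threads from speeding up CPU-bound pure-Python work, so a process pool would be the usual choice there.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq: str) -> float:
    """Fraction of G/C bases; stands in for any independent per-item task."""
    return sum(base in "GCgc" for base in seq) / len(seq) if seq else 0.0


def parallel_map(func, items, num_threads=10):
    """Apply func to every item on a pool of worker threads, preserving order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(func, items))


reads = ["ACGT", "GGCC", "ATAT", ""]
print(parallel_map(gc_content, reads))
```

Each call to `gc_content` touches only its own argument, which is the "totally independent iterations" property that makes this kind of loop safe to parallelize.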

adgjlsfhk1|4 years ago

Have follow-up benchmarks vs BioJulia been done since 2019? If I remember correctly at the time, the result was that BioJulia was faster once you consider that it did validation.

fwip|4 years ago

It's an impressive project, but I'm not sure the niche is big enough. It's certainly come a long way since the last time I looked at it!

My biggest concern is that Seq sucks users into a sort of local maximum. While piping syntax is nice, and the built-in routines are handy, it's a lot less flexible than a "mainstream" programming language, simply because of the smaller community and relative paucity of libraries. BioPython[1] has been around a long, long time, and I think a lot of potential users of Seq would be better served by using a regular bioinformatics library in the language they know best.

e.g: The example of reading Fasta files in Seq:

    # iterate over everything
    for r in FASTA('genome.fa'):
        print r.name
        print r.seq

versus BioPython:

    from Bio import SeqIO
    for r in SeqIO.parse("genome.fa", "fasta"):
        print(r.id)
        print(r.seq)

It might be pretty useful as a teaching tool, but I'm skeptical of its long-term benefit to professionals. I'm not sure the ecosystem of Seq users will be large enough, y'know? Again, it's pretty impressive work, and it's come a long way. I wish the devs all the best. :)

1. https://biopython.org/
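For context on what both snippets above are doing, here is a minimal sketch of FASTA iteration using only the Python standard library. It is neither Seq's nor BioPython's implementation; `parse_fasta` is a hypothetical name, and the sketch assumes the record name is the first whitespace-delimited token of the `>` header line and that sequence lines are simply concatenated.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the previous record
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence text before any header is ignored
    if name is not None:
        yield name, "".join(chunks)  # emit the final record


# In-memory example; an open file handle can be passed the same way.
example = [">chr1 test contig\n", "ACGT\n", "TTAA\n", "\n", ">chr2\n", "GG\n"]
for name, seq in parse_fasta(example):
    print(name)
    print(seq)
```

A real library layers validation, alternative formats, and richer record objects on top of this, which is part of the ecosystem argument being made in the comment.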

chmaynard|4 years ago

> It's an impressive project, but I'm not sure the niche is big enough.

Big enough for what? Instead of a gratuitous critique of its "benefit to professionals", maybe you could comment on the project's design choices and implementation. That would be more useful to us amateurs.

dekhn|4 years ago

Typically, any high-performance (low latency or high throughput) genomics/bioinformatics application is not going to be written in plain Python, except possibly for prototyping. Instead, nearly all codes today are written in C++ or Java, with some sort of command and control in Python or a DAG-based workflow scheduler.

I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.

adgjlsfhk1|4 years ago

IMO, Spark isn't the way forward. The typical pattern with it is that it lets you scale up to 100 cores really easily, which is almost enough to compete with a good single-threaded implementation in a fast language.

f6v|4 years ago

> Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.

A pitch most people doing applied bioinformatics won’t understand/appreciate.

car|4 years ago

Looks great, will definitely give this a try since it does sequence manipulations that I otherwise have to write myself.

Will this be available via conda? And how would Seq integrate with Snakemake, since that is also based on Python?

haihaibye|4 years ago

I'm in the target market but can't use this unless it supports all of my Python libraries like Django and Numpy.

It seems to me there is a huge demand for making Python faster, whether it be via making a more optimisation friendly subset, or ideally throwing engineering talent into improving the interpreter.

V8 shows this can be done with highly dynamic Javascript. I guess we need a big corporate sponsor or the community to fund some positions.

It's kind of crazy how few developers are working on optimising CPython; it may even be worth it for environmental reasons.

kasperset|4 years ago

I like this idea. However, to me it is similar to using à la carte tools/programs along with a bash script or a DSL such as Nextflow. More often than not, these stand-alone programs are already written in compiled languages. I am sure Seq will allow building customized programs, as opposed to scripting or gluing programs together.

chmaynard|4 years ago

I'm wondering if Seq can also serve as a general-purpose replacement for Python whenever a fast executable is needed.

arshajii|4 years ago

(I'm one of the developers on Seq.) We've actually been working mostly on closing the gap with Python for the last year or so. Seq can be useful for plain Python programs as well -- I give a bit more context in my comment above.

dunefox|4 years ago

It's a domain specific language for bioinformatics. So, most likely not.