top | item 39325821

A look at the Mojo language for bioinformatics

148 points | blindseer | 2 years ago | viralinstruction.com

120 comments


fwip|2 years ago

For what it's worth, I couldn't reproduce the benchmarks cited in the post, which claimed a 50% speedup over Rust on M1. The Rust implementation was consistently two to three times as fast as Mojo with the provided test scripts and datasets. It's possible I was compiling the Mojo program suboptimally, though.

  hyperfine -N --warmup 5 test/test_fastq_record \
  'needletail_test/target/release/rust_parser data/fastq_test.fastq'
  Benchmark 1: test/test_fastq_record
    Time (mean ± σ):      1.936 s ±  0.086 s    [User: 0.171 s, System: 1.386 s]
    Range (min … max):    1.836 s …  2.139 s    10 runs
  
  Benchmark 2: needletail_test/target/release/rust_parser data/fastq_test.fastq
    Time (mean ± σ):     838.8 ms ±   4.4 ms    [User: 578.2 ms, System: 254.3 ms]
    Range (min … max):   833.7 ms … 848.2 ms    10 runs
  
  Summary
    needletail_test/target/release/rust_parser data/fastq_test.fastq ran
      2.31 ± 0.10 times faster than test/test_fastq_record
(Edit: I built the Rust version with `cargo build --release` on Rust 1.74, and Mojo with `mojo build` on Mojo 0.7.0.)

chromatin|2 years ago

Someone later noted on Twitter/X that the Rust version was not compiled with `--release`.

WeatherBrier|2 years ago

The language is far from stable, but I have had a LOT of fun writing Mojo code. I was surprised by that! The only promising new languages for low-level numerical coding that can somewhat dislodge C/C++/Fortran have, in my opinion, been Julia and Rust. I feel like I can now update that list to Julia/Rust/Mojo.

But, for my work, C++/Fortran reign supreme. I really wish Julia had easy AOT compilation and no GC, that would be perfect, but beggars can't be choosers. I am just glad that there are alternatives to C++/Fortran now.

Rust has been great, but I have noticed something: there isn't much of a community of numerical/scientific/ML library writers in Rust. That's not a big problem, BUT, the new libraries being written by the Julia/C++ communities have made me question the free time I have spent writing Rust code for my domain. When it comes time to get serious about heterogeneous compute, you have to drop Rust and go back to C++/CUDA; and when you try to replicate some of that C++/CUDA infrastructure for your own needs in Rust, you really feel alone! I don't like that feeling ... of constantly being "one of the few" interested in scientific/numerical code in Rust community discussions ...

Mojo seems to be betting heavily on a world where deep heterogeneous-compute abilities are table stakes. The language is really a frontend for MLIR, and that is very exciting to me as someone who works at the intersection of systems programming and numerical programming.

I don't feel like Mojo will cause any issues for Julia, I think that Mojo provides an alternative that complements Julia. After toiling away for years with C/C++/Fortran, I feel great about a future where I have the option of using Julia, Mojo, or Rust for my projects.

adgjlsfhk1|2 years ago

> I really wish Julia had easy AOT compilation and no GC, that would be perfect

I pretty strongly disagree with the no-GC part of this. A well-written GC has the same throughput as (or higher than) reference counting for most applications, and the Rust approach is very cool but a significant usability cliff for users who are domain first, CS second. A GC is a pretty good compromise for 99% of users, since it is a minor performance cost for a fairly large usability gain.

aldanor|2 years ago

Well, there are some big DS projects written in Rust that are now very widely used in the Python world - e.g., polars.

pjmlp|2 years ago

I just came from a CERN event, HEP seems to still be all about C++, Fortran, Python, Java, and some Go due to Kubernetes.

No Rust or Julia on their radar.

jdiaz97|2 years ago

Great post. I think claims like Mojo's speedup over Rust are a problem, as is the 65000x speedup over Python. How can we differentiate between good new tech and Silicon Valley shenanigans when they make claims like that? They write nice titles and slogans but are shady in substance.

boxed|2 years ago

A little bit of clickbait is what you need to get interest at all. That's just a fact of life.

As for this specific claim, it was coupled with a blog post that actually demonstrated the speedup on a specific problem. Getting several orders of magnitude speedup over plain Python is often quite easy. That's why we have numpy and pandas, after all!
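To make the "orders of magnitude over plain Python" point concrete, here is a rough sketch (not from the thread; numpy is assumed available, and the exact speedup varies by machine) comparing a pure-Python loop against the equivalent vectorized call:

```python
import time
import numpy as np  # assumed available; standard in this space

# Sum of squares over 100,000 integers: a pure-Python loop vs. the
# equivalent vectorized numpy expression. All intermediate sums fit
# exactly in float64, so both results match exactly.
n = 100_000
arr = np.arange(n, dtype=np.float64)

t0 = time.perf_counter()
loop_result = 0.0
for x in range(n):
    loop_result += x * x
t1 = time.perf_counter()

vec_result = float(np.dot(arr, arr))  # vectorized sum of squares
t2 = time.perf_counter()

assert loop_result == vec_result
print(f"loop: {t1 - t0:.5f}s  numpy: {t2 - t1:.5f}s")
```

The ratio depends on the machine, but the vectorized path is typically a few hundred times faster here, which is exactly the effect numpy and pandas exploit.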

whoami17357|2 years ago

Probably reasonable to label it a shenanigan if they try to differentiate with an emoji file extension.

ubj|2 years ago

Great post, but I think the author missed a few advantages of Mojo:

* Mojo provides first-class support for AoT compilation of standalone binaries [1]. Julia provides second-class support at best.

* Mojo aims to provide first-class support for traits and a modern Rust-like memory ownership model. Julia has second-class support for traits ("Tim Holy trait trick") and uses a garbage collector.

To be clear, I really like Julia and have been gravitating back to it over time. Julia has a very talented community and a massive head start on its package ecosystem. There are plenty of other strengths I could list as well.

But I'm still keeping my eye on Mojo. There's nothing wrong with having two powerful languages learning from each other's innovations.

[1]: https://docs.modular.com/mojo/manual/get-started/hello-world...

jdiaz97|2 years ago

True, but the title of the blog is about Bioinformatics, and like another comment said:

> Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R

Considering that, do you need AOT compilation or memory ownership for plotting and statistics? I'd argue not, and that's why R and Python are so popular in bio.

WeatherBrier|2 years ago

I feel the same way, I love using Julia, but the features that Mojo provides are exciting. It's great that we have both of them.

mcqueenjordan|2 years ago

Another point of clarification that is of great importance to the results, and a common Rust-newcomer error: the benchmarks for the Rust implementation (in the original post that got all the traction) were run with a /debug/ build, i.e. not an optimized binary compiled with --release.

So it was comparing something that a) didn't do meaningful parsing against b) the full parsing rust implementation in a non-optimized debug build.

SushiHippie|2 years ago

Am I missing something? In the git repository [0] it says:

> needletail_benchmark folder was compiled using the command cargo build --release and ran using the following command ./target/release/<binary> <path/to/file.fq>.

Or are you talking about something else here?

[0] https://github.com/MoSafi2/MojoFastTrim

tehsauce|2 years ago

How much does this particular result change when running in release mode?

bhansconnect|2 years ago

This is not accurate. The blog post used `--release` for its Rust numbers. The confusion comes from the 50% performance win being specific to running on an M2 Mac. On an x86_64 Linux machine, the results are more or less equivalent.

stellalo|2 years ago

> If I include the time for Julia to start up and compile the script, my implementation takes 354 ms total, on the same level as Mojo's.

I don’t think the article mentions it explicitly, but I suppose the timing is from Julia 1.10: as far as I can remember, this kind of execution time would have been impossible in Julia 1.8 even to run a simple script.

Bravo, Julia devs. Bravo.

adgjlsfhk1|2 years ago

For a script like this that doesn't have any dependencies, Julia 1.10 doesn't make a significant difference. That said, for real-world usability, Julia 1.10 is dramatically better than previous versions.

dr_kiszonka|2 years ago

Folks using multiple languages, what is your workflow?

I do most DS/ML work in Python but move to R for stats and publication-ready plots and tables (gt is really great). I switch between them frequently, which is a hassle in the EDA and prototyping stages, especially when using notebooks. I enjoy Quarto in RStudio, but the VS Code version is not that great.

How do you make it work?

Also, after so many years of using Python and R, I would love to learn a new language, even if only for a couple of use cases. I considered Elixir for parallel processing and because it has a nice syntax, but ultimately decided against it because it can be a little slow and isn't used much in my area (sadly!). Rust seems to require too much time to get decent at. Any recommendations? (Prolog?)

skwb|2 years ago

I use Python and write my results to a CSV that I quickly import into R to do my fancy stats.

To be fair, Python's stats implementations can be garbage; the last time I checked, you couldn't do multiple levels for hierarchical regression.
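The Python-to-CSV-to-R handoff described above can be as simple as dumping a results table with the stdlib `csv` module (the file name and columns here are hypothetical, just for illustration):

```python
import csv

# Hypothetical per-sample results computed on the Python side.
results = [
    {"sample": "s1", "auc": 0.91, "n": 120},
    {"sample": "s2", "auc": 0.87, "n": 98},
]

# Write a plain CSV that R can read back with read.csv("results.csv").
with open("results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample", "auc", "n"])
    writer.writeheader()
    writer.writerows(results)
```

On the R side, `df <- read.csv("results.csv")` picks it up directly, and the hierarchical model can be fit there.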

carbocation|2 years ago

My workflow is similar to yours: Python for deep learning and surface reconstruction, R for stats and plots.

I use Go extensively for data preprocessing. Sounds weird, but it works well for highly repetitive conversion tasks like DICOM parsing, converting EKGs to numpy, etc.

_huayra_|2 years ago

It's hard to learn a language for fun, so I'd pick something that fits your needs to build something (or even just your curiosity). Elixir and Prolog, although both cool, might not fit the bill because they really excel at one particular thing.

Golang is a popular answer, as you can start building stuff with it fairly quickly (especially compared to Rust). Java can also be useful if you haven't learned it and find a use case (although you will hear it bemoaned as the "New COBOL", there is still a lot of work done using it).

samuell|2 years ago

I've been thinking of learning Rust for these use cases, but I always get frustrated with the complexity.

I find Go is a great middle ground, though! And there are now a few more bio-related tools and toolkits out there, including:

- https://github.com/vertgenlab/gonomics

- https://github.com/biogo/biogo

- https://github.com/pbenner/gonetics

- https://github.com/shenwei356/bio

... apart from there already being some really popular bio tools written in Go, like:

- https://github.com/shenwei356/seqkit

I think Go lost a bit of steam in bio after Rust started to take off, but the field is growing, and people are also starting to realize Rust isn't the answer to everything. It is fantastic for fast tools, but as a replacement for Python for all of the various ad hoc coding in biology ... nah, not so much. That's where I think Go shines.

f6v|2 years ago

As someone who practices bioinformatics, it doesn’t seem appealing. Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R, by the way.

tstactplsignore|2 years ago

To disagree: I'm a computational biologist, and it's my firm belief that 99% of the scientifically important stuff happens before the stats and plotting. That's not to say I dismiss those things or that I haven't done my fair share of stats, just that the difference between real results and incorrect results most often arises before that step.

I'm a microbiologist though, for stuff like human RNA-Seq I understand that it's often plug and play to get a gene counts table at this point.

folli|2 years ago

I guess that depends on your exact ecological niche within bioinformatics.

I got my start at a NGS facility, so handling FASTQ was closer to 80% of my time, so any speedups would have been greatly appreciated.

__MatrixMan__|2 years ago

As someone who is considering a switch from generic software engineering towards bioinformatics, what would you say the pain points are?

If this is not the way to remove workflow friction, what is?

gandalfgeek|2 years ago

> It does grate me then, when someone else manages to raise 100M dollars on the premise of reinventing the wheel to solve the exact same problem, but from a worse starting point because they start from zero and they want to retain Python compatibility. Think of what money like that could do to Julia!

Python is a juggernaut with total control of the ML space and is a huge part (even if less dominant) in modern scientific computing.

A VC has way better chances of success building solutions compatible with Python rather than replacing it.

chaxor|2 years ago

I was interested in trying out Mojo. Then I looked at it and booked out quick.

No one will use a language that isn't free and open source.

If Mojo were free and open source (not run by a company), and didn't just hand out binaries with a 'trust me bro' stamp of approval, then I would have worked with it. But it's not, so I will never use it.

math_dandy|2 years ago

I’m really excited about Mojo’s potential. But I don’t think it’s ready for real use outside its AI niche yet. Being able to call Mojo functions from Python is the sentinel capability I’m waiting for before considering it for general-purpose code.

refulgentis|2 years ago

I felt like I learned more about the author than Mojo.

- Never actually runs it. Seriously.

- Wants us to know it's definitely not a real parser as compared to Needletail ... then 1000 words later, "real parser" means "handles \r\n ... and validates 1st & 3rd lines begin with @ and + ... seq and qual lines have the same length".

- At the end, "Julia is faster!!!!" off a one-off run on their own machine, comparing it to benchmark times on the Mojo website

It reads as an elaborate way to indicate they don't like that the Mojo website says it's faster, coupled with an entry-level explanation of why it is faster, coupled with disturbingly poor attempts to benchmark without running Mojo code.
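For concreteness, the validation checks under debate here (CRLF handling, the '@'/'+' record markers, matching sequence/quality lengths) amount to something like the following minimal Python sketch, assuming the plain four-line FASTQ layout. This is an illustration, not the article's actual implementation:

```python
def parse_fastq_record(lines):
    """Validate and parse one four-line FASTQ record.

    Accepts both \n and \r\n line endings; returns (name, seq, qual).
    """
    if len(lines) != 4:
        raise ValueError("a FASTQ record is exactly four lines")
    # Strip either line-ending style, covering the \r\n case.
    header, seq, plus, qual = (line.rstrip("\r\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("header line must begin with '@'")
    if not plus.startswith("+"):
        raise ValueError("separator line must begin with '+'")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have equal length")
    return header[1:], seq, qual

# A record with Windows-style line endings parses cleanly.
record = ["@read1\r\n", "ACGT\r\n", "+\r\n", "IIII\r\n"]
name, seq, qual = parse_fastq_record(record)
```

Whether a "real" parser needs more than this (quality-score ranges, multi-line records, etc.) is exactly the scope question the comments are arguing about.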

jakobnissen|2 years ago

I feel like if you believe my conclusion was that "Julia is faster" then you are missing the point.

The point is that the original blog's claim of "Mojo is faster" isn't right - it's comparing different programs. That implementation in Mojo is faster than Needletail, but that doesn't say very much, and I prove it by also beating Needletail in Julia using the same algorithm as Mojo. So it's the algorithm. Not Mojo. Not Julia.

Also, did you even read my discussion on how much a parser ought to validate? Your summary completely misses the point.

cbkeller|2 years ago

It looks like you very dramatically missed the point

zaptheimpaler|2 years ago

How does a software engineer transition into bioinformatics or computational biology? I've taken some online courses on bioinformatics and have some experience with large distributed jobs, but these jobs seem few and far between and generally want an M.S./Ph.D. in bioinformatics. Is it really a field that's not viable to enter without an M.S.?

jltsiren|2 years ago

Doing a Master's and/or PhD in bioinformatics is probably the easiest way. It's a pretty specialized field, and the first couple of years are usually spent learning the basics. You are unlikely to find anyone willing to hire you to a real job to do that.

samuell|2 years ago

I think the challenge is learning enough of the biology outside of academia. It is fully possible, e.g. from books and videos ... but it will take a lot of determination.

For the bioinformatics part, I think something like the "Genomics data science" specialization on Coursera should be a pretty good start.

jakobnissen|2 years ago

I'm not sure what's the best strategy to get hired, but professionally, you need to learn as much biology as you can. Cell biology, molecular biology, genetics, physiology. My experience has been that there are a bunch of software engineers in bioinfo already who fall short on the biology side. Differentiate yourself from those.

jimbob45|2 years ago

Crystal was never able to find traction as a Ruby clone that could compete with C speeds. Why would a Python clone have any better luck? I don’t think anyone would accuse Python of being dramatically more usable than Ruby.

Alifatisk|2 years ago

I think the appeal of Crystal is for users who already know Ruby, so the market was already limited there.

Crystal itself is a gem, but comparing it to Mojo and its relation to Python, while fair, gives the wrong message. Python is by far more popular because of all the packages, so the market is way larger there.

coldtea|2 years ago

Well, for the domains Mojo targets, Python is king. So a faster-Python-like language would have more potential audiences. A fast Ruby-like language, not so much, as Ruby was never that special in those domains, or in most places outside web development, and even for that it kind of lost steam in the past 10 years.

Besides, people opting for closer-to-C speed had Rust, Go, Java, Swift, and other options to go to, all with more momentum and support, before going for a yet-unproven Ruby clone.

akkad33|2 years ago

Crystal is an entirely different language with a similar syntax, whereas valid Python is intended to be valid Mojo.

breather|2 years ago

Crystal didn't have much use in Ruby's sweet spot: being a DSL for some immensely complicated-to-configure framework (e.g. Rails, Chef).

samuell|2 years ago

From someone who would love for Crystal to be the answer here because of its fantastic concurrency features: it is a bit of a non-starter because of excessive compile times for larger projects. Also, they hadn't solved the cross-compilation issue last time I checked.

jdiaz97|2 years ago

I think it's less about the language and more about Modular's product, their MAX supercomputer thingy.

pjmlp|2 years ago

Because of the people and companies behind the project.

zer00eyz|2 years ago

>>> As a bioinformatician who is obsessed with high-performance, high-level programming, that's right in my wheelhouse!... Mojo currently only runs on Ubuntu and MacOS, and I run neither. So, I can't run any Mojo code

1. Back to the Rust vs Mojo article that kicked this off ... this isn't someone who is going to use Rust.

2. Availability, portability, ease of use ... these are the reasons Python is winning.

3. I am baffled that this person has to write code as part of their job and does not know what a VM is! Note: this isn't a slight against the author; I doubt they are an isolated case. I think this is my own cognitive dissonance showing.

jakobnissen|2 years ago

Author here. I do know about VMs. Is it too lazy for me to write that article and not bother to install a VM with Mojo (and Rust and Julia, to benchmark in the same environment)? Maybe. If this was for my work I certainly would have felt compelled to.

On the other hand, the fact that Mojo doesn't run on Windows and most Linux distros is a point in itself. And also, would the blog post really be substantially improved if I had gotten the number of milliseconds right for the Mojo implementation on my computer? Of course not. It should be clear that the implementations are incomparable, and that a similar Julia implementation is very fast which implies that the reason the original Mojo implementation allegedly beat Rust is not because Mojo is faster. It's just a different program.

refulgentis|2 years ago

Got the same general impression. TL;DR: wrote a benchmark article without ... running it? Then concluded with "the language I use is faster!!!" based on a one-off run on their own machine, which surely isn't the same machine Mojo used to run benchmarks for their website copy?

It's odd to read something that's pretty well versed in some relatively complex CS concepts (i.e., it's not just a PhD with a blank text editor) but simultaneously makes egregiously obvious mistakes that I wouldn't expect any college graduate to roll with.

There's a certain type, and I don't know what name to give it, especially because I certainly don't want to give it a condescending name. I call it "data scientist types" when I'm in person with someone who I trust to give me some verbal rope.

Software really feels like it ate everything and everyone. So you end up with insanely bright people who do software engineering as part of their job, but miss some pieces you expect from trad software engineering.