top | item 39699245

slyrus | 1 year ago

And Gene wrote the assembly tools that did shotgun assembly of the human genome for Celera when most folks (except Jim Kent who wrote the _other_ assembler (used by the public sequencing effort (NIH,Broad,UCSC, etc…)) said it couldn’t be done. IMO, he and Jim Kent deserve a Nobel prize for these efforts.

dekhn | 1 year ago

True. Celera had a large TruCluster: machines with 8 (16?) processors, 64 GB of memory, and inter-node cluster networking for a truly shared filesystem. Shotgun assembly at the time required very-large-memory, many-CPU, high-throughput-I/O machines (although 64 GB isn't really a large-memory machine now), while Kent's approach worked fine on clusters but IIRC was tuned for the specific type of scaffold sequencing done by the public project.

From the original Celera paper, an endnote describing what was pretty impressive hardware for the time:

Celera’s computing environment is based on Compaq Computer Corporation’s Alpha system technology running the Tru64 Unix operating system. Celera uses these Alphas as Data Servers and as nodes in a Virtual Compute Farm, all of which are connected to a fully switched network operating at Fast Ethernet speed (for the VCF) and gigabit Ethernet speed (for data servers). Load balancing and scheduling software manages the submission and execution of jobs, based on central processing unit (CPU) speed, memory requirements, and priority. The Virtual Compute Farm is composed of 440 Alpha CPUs, which includes model EV6 running at a clock speed of 400 MHz and EV67 running at 667 MHz. Available memory on these systems ranges from 2 GB to 8 GB. The VCF is used to manage trace file processing, and annotation. Genome assembly was performed on a GS 160 running 16 EV67s (667MHz) and 64 GB of memory, and 10 ES40s running 4 EV6s (500 MHz) and 32 GB of memory. A total of 100 terabytes of physical disk storage was included in a Storage Area Network that was available to systems across the environment. To ensure high availability, file and database servers were configured as 4-node Alpha TruClusters, so that services would fail over in the event of hardware or software failure. Data availability was further enhanced by using hardware- and software-based disk mirroring (RAID-0), disk striping (RAID-1), and disk striping with parity (RAID-5).

inciampati | 1 year ago

What's more, this isn't history. Code from the Celera assembler lives on in a lineage of assembly methods (Canu, HiCanu, Verkko) that have ultimately _completely automated the process of complete genome assembly_ https://doi.org/10.1038/s41587-023-01662-6. The fact that this assembly approach remained relevant until the practical resolution of the assembly problem is a testament to its solid theoretical foundation, the string graph, which relates read length, error rate, and the information-theoretic limits of genome assembly https://doi.org/10.1093/bioinformatics/bti1114.
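For readers unfamiliar with the idea, here is a toy sketch of the overlap principle these assemblers are built on: find suffix-prefix overlaps between reads, then merge reads along the strongest overlaps. This is a hypothetical illustration, not code from Celera/Canu; real string-graph assemblers additionally handle sequencing errors, reverse complements, repeats, and contained reads, all of which this greedy version ignores (it assumes short, exact, error-free reads from a repeat-free sequence).

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    (at least `min_len` characters), or 0 if there is no such overlap."""
    start = 0
    while True:
        # Find the next place in `a` where b's first min_len chars occur.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check whether the rest of `a` from here is a prefix of `b`.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap until
    no overlaps of at least `min_len` remain; returns the contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a in reads:
            for b in reads:
                if a is not b:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            break  # no remaining overlaps; reads stay as separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])  # merge along the overlap
    return reads

# Simulate error-free 10-char "reads" from a tiny repeat-free "genome".
genome = "the_quick_brown_fox_jumps"
reads = [genome[i:i + 10] for i in range(0, len(genome) - 9, 3)]
print(greedy_assemble(reads))  # -> ['the_quick_brown_fox_jumps']
```

The reason real genomes are hard is visible even in this toy: a repeated substring longer than the overlap length makes the greedy merge ambiguous, which is exactly the repeat-resolution problem that read length and error rate bound in the string-graph framework cited above.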