96-core ARM supercomputer using the NanoPi-Fire3

[+] dragontamer|7 years ago|reply

Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.

A practical baseline for anyone interested in ARM compute would be the Thunder X CPU (Cloud rental: https://www.packet.com/cloud/servers/c1-large-arm/). It has 48 cores per socket, or 96 cores in a dual-socket server.

As another commenter said: the primary use of this NanoPi is the ability to emulate a "real" super-computer and really use MPI and such. MPI is a different architecture than a massive node (like a 96-core Thunder X ARM), and you need to practice a bit with it to become proficient.

[+] rwmj|7 years ago|reply

I wonder if you could run distributed QEMU[1] on it and present it as a single (very "NUMA-ish") virtual machine? I know - node to node latency would kill you, but it could be fun to try.

[1] https://events.linuxfoundation.org/wp-content/uploads/2017/1...

[+] eiaoa|7 years ago|reply

> Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.

I read somewhere that some real supercomputer systems programmers actually use toy clusters of Raspberry Pis to test their scheduling software. It helps speed up their development cycle because they can do initial testing on their desktops.

Edit: I think this is what I was thinking of: https://www.youtube.com/watch?v=78H-4KqVvrg

http://www.bitscope.com/blog/FM/?p=GF13L

[+] marmaduke|7 years ago|reply

> emulate a "real" super-computer and really use MPI and such

Wouldn’t containers be an easier way to do that?

[+] nine_k|7 years ago|reply

60 GFlops on 96 cores is not that large.

OTOH if you want to see how your massively parallel algorithm behaves on a 96-node cluster / network, such a box is just $500, and is portable and can work offline.

[+] patrioticaction|7 years ago|reply

The comparisons by GFlops were more or less a lark, especially the ones comparing energy efficiency with a supercomputer from the 90s. This 96-core rig produces 1 GFlop per Watt. Compare that to an i9-9900k (250 GFlop), a z390 chipset and 1 stick of DDR4 (95W + 7W + 2.5W = 104.5W), which does ~2.4 GFlop per Watt.*

* this is back of napkin, real world results will vary
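
The napkin math can be reproduced with a short sketch. The desktop figures are the ones quoted in this comment and the 60 GFlops cluster figure comes from elsewhere in the thread. The 60 W cluster draw is only what the "1 GFlop per Watt" claim implies, not a measurement, and the function name is made up for illustration.

```python
def gflops_per_watt(gflops, watts):
    """Return floating-point throughput per watt of power draw."""
    return gflops / watts

# Desktop figures quoted in the comment: CPU + chipset + one DDR4 stick.
desktop_watts = 95 + 7 + 2.5
desktop_eff = gflops_per_watt(250, desktop_watts)

# Cluster: 60 GFlops is quoted in the thread; 60 W is an ASSUMPTION implied
# by the "1 GFlop per Watt" claim, not a measured value.
cluster_eff = gflops_per_watt(60, 60)

print(f"desktop: {desktop_eff:.2f} GFlop/W")
print(f"cluster: {cluster_eff:.2f} GFlop/W")
print(f"ratio:   {desktop_eff / cluster_eff:.2f}x")
```

Swapping in measured wall-socket power for either machine would give a fairer comparison than TDP figures.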

[+] walterbell|7 years ago|reply

Link or search term?

[+] sannee|7 years ago|reply

Can these NanoPis boot over PXE? I was pleasantly surprised a few weeks ago by the fact that the Raspberry Pi can do network boot without an SD card.

[+] ElBarto|7 years ago|reply

What's fascinating in that article is to see that a Raspberry Pi 3 has about 10% of the floating-point processing power of a Cray C90...

Cue the many forum questions: "I'm planning to use a Raspberry Pi to control a <simple-ish device>. Will it be powerful enough?"

[+] adrianN|7 years ago|reply

The real question is "Will it be powerful enough even though I use a Desktop operating system and a software stack designed for programmer comfort rather than efficiency to control <simple-ish device>".

[+] geezerjay|7 years ago|reply

Truth be told, due to today's wealth of computational resources, some very popular software stacks were not designed to be lean or efficient.

[+] qwerty456127|7 years ago|reply

> The NanoPi Fire3 is a high performance ARM Board developed by FriendlyElec for Hobbyists, Makers and Hackers for IOT projects. It features Samsung's Cortex-A53 Octa Core S5P6818@1.4GHz SoC and 1GB 32bit DDR3 RAM

Who needs such a powerful CPU with so little RAM? The reason I still have not bought any Pi is that all of them have 2 GiB of RAM or less, and I am not interested in buying anything with less than 4.

[+] giancarlostoro|7 years ago|reply

You'd be wanting to look into ARM-64 boards like the ROCKPro64:

https://www.pine64.org/?page_id=61454

There are others that are pricier (> $100) with x86 architecture, such as the UDOO boards, if you really want an SBC with much more RAM.

[+] gnulinux|7 years ago|reply

> Who needs such a powerful CPU with so little RAM?

What do you need that much RAM for? What do you plan to run on this machine?

[+] epanchin|7 years ago|reply

Little RAM needed for processing streaming data.

[+] nightcracker|7 years ago|reply

I do for real-time audio synthesis.

[+] sheepybloke|7 years ago|reply

I've been trying to do something similar with 4 Orange Pi Zero Plus boards (this blog was one of my main inspirations). While I know it's not practical, it's fun to design the case and the stand, work out how everything needs to connect, and route it all together. In the end I hope to host a distributed personal website on it, plus an MQTT server for any IoT tinkering I'd want to do!

[+] floatboth|7 years ago|reply

Significantly cheaper than the 24-core (also A53) SynQuacer Developerbox. But of course you're getting a cluster instead of one machine…

[+] megous|7 years ago|reply

Nice! Distcc-based compilation might be something to try on this. :) One thing I noticed is that the heatsink fins are oriented in the wrong direction. Air should be going through the fins, not across the side of them. But I guess any air movement is enough to cool this.

[+] otherlife35|7 years ago|reply

Here is a simple study on distcc, pump mode and use of the make -j# option on low-end hardware. It seems that the network could be a bottleneck. The compilation time would probably decrease to 1/4. But I think the use of -j# is the best advice.

https://forums.gentoo.org/viewtopic-t-1056580-start-0.html

[+] mschaef|7 years ago|reply

The only supercomputer they compare it to is 27 years old, and it uses Gigabit Ethernet as its interconnect. I think they have a much looser definition of 'Supercomputer' than most people.

[+] geezerjay|7 years ago|reply

It's a few SoCs crammed in a shoebox. Of course the comparison was never meant to be taken seriously.

[+] fluxty|7 years ago|reply

I wonder what topology this has. It definitely seems reminiscent of older supercomputers like the famous Thinking Machines CM-2, which used a hypercube (the later CM-5 used a fat tree).

[+] aepiepaey|7 years ago|reply

Probably nothing interesting.

There are two 8-port ethernet switches.

With 12 nodes, this leaves 4 unused ports (2 in each switch).

From the pictures you can see that the box itself has two jacks, each of which is likely connected to one of the switches.

The switches don't seem to support link aggregation, so the layout likely looks like this:

    switch1
    ├── external
    ├── nano-pi1
    ├── nano-pi2
    ├── nano-pi3
    ├── nano-pi4
    ├── nano-pi5
    └── nano-pi6
    switch2
    ├── external
    ├── nano-pi7
    ├── nano-pi8
    ├── nano-pi9
    ├── nano-pi10
    ├── nano-pi11
    └── nano-pi12

and if you connect both the switches to the same external switch, you'd get something like:

     switch1
    ┌┼── switch2
    │├── nano-pi1
    │├── nano-pi2
    │├── nano-pi3
    │├── nano-pi4
    │├── nano-pi5
    │└── nano-pi6
    │switch2
    └┼── switch1
     ├── nano-pi7
     ├── nano-pi8
     ├── nano-pi9
     ├── nano-pi10
     ├── nano-pi11
     └── nano-pi12
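
One rough way to reason about that second layout is to count the links a packet crosses between two boards. The sketch below is only a model of the guess above: the device names mirror the diagram, the single `external` switch joining the two internal ones is an assumption, and `build_topology` and `hops` are made-up helper names.

```python
from collections import deque

def build_topology():
    """Adjacency sets for the guessed layout: 12 boards, 2 internal switches,
    and one hypothetical external switch joining them."""
    links = {
        "external": {"switch1", "switch2"},
        "switch1": {"external"},
        "switch2": {"external"},
    }
    for i in range(1, 13):
        node = f"nano-pi{i}"
        switch = "switch1" if i <= 6 else "switch2"
        links[node] = {switch}
        links[switch].add(node)
    return links

def hops(links, src, dst):
    """Breadth-first search returning the number of links between two devices."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        device, dist = queue.popleft()
        if device == dst:
            return dist
        for peer in links[device] - seen:
            seen.add(peer)
            queue.append((peer, dist + 1))
    return None  # not connected

links = build_topology()
print(hops(links, "nano-pi1", "nano-pi2"))   # boards on the same switch
print(hops(links, "nano-pi1", "nano-pi12"))  # boards on different switches
```

In this model, traffic between the two halves of the cluster shares the single uplink on each switch, which is where contention would show up first.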

[+] albertgoeswoof|7 years ago|reply

This is cool! But why test on this instead of using a virtual environment locally?

[+] zamadatix|7 years ago|reply

Unless you have 96 physical cores testing it in a virtual environment doesn't tell you the same thing.
