Software Infrastructure 2.0: A Wishlist

[+] jiggawatts|5 years ago|reply

Years and years ago I saw an advertisement by a SAN array vendor where their gimmick was that they supported very cheap read/write snapshots of very large datasets, in combination with a set of transforms such as data masking or anonymisation.

Their target market was same as the OP: developers that need rapid like-for-like clones of production. The SAN could create a full VM server pool with a logical clone of hundreds of terabytes of data in seconds, spin it up, and blow it away.

The idea was that instead of the typical "DEV/TST/PRD" environments, you'd have potentially dozens of numbered test environments. It was cheap enough to deploy a full cluster of something as complex as SAP or a multi-tier Oracle application for a CI/CD integration test!

Meanwhile, in the Azure cloud: Practically nothing has a "copy" option. Deploying basic resources such as SQL Virtual Machines or App Service environments can take hours. Snapshotting a VM takes a full copy. Etc...

It's like the public cloud is in a weird way ahead of the curve, yet living with 1990s limitations.

Speaking of limitations: One reason cheap cloning is "hard" is because of IPv4 addressing. There are few enough addresses (even in the 10.0.0.0/8 private range) that subnets have to be manually preallocated to avoid conflicts, addresses have to be "managed" and carefully "doled out". This makes it completely impossible to deploy many copies of complex multi-tier applications.

The public cloud vendors had their opportunity to use a flat IPv6 address space for everything. It would have made hundreds points of complexity simply vanish. The programming analogy is like going from 16-bit addressing with the complexity of near & far pointers to a flat 64-bit virtual address space. It's not just about more memory! The programming paradigms are different. Same thing with networking. IPv6 simply eliminates entire categories of networking configuration and complexity.

PS: Azure's network team decided to unnecessarily use NAT for IPv6, making it exactly (100%!) as complex as IPv4...

[+] pgwhalen|5 years ago|reply

Your comment makes me realize how lucky I am at my job: we do this production environment replication via “SAN magic” and it is so insanely useful.

[+] mcny|5 years ago|reply

> Meanwhile, in the Azure cloud: Practically nothing has a "copy" option. Deploying basic resources such as SQL Virtual Machines or App Service environments can take hours. Snapshotting a VM takes a full copy. Etc...

To add to the list of etc, I can't query across two databases in azure SQL.

[+] dilyevsky|5 years ago|reply

gcp doesn’t even give you ipv6 at all. Talk about limitations...

[+] znpy|5 years ago|reply

> There are few enough addresses (even in the 10.0.0.0/8 private range) that subnets have to be manually preallocated to avoid conflicts

I hate to be that guy, but it's 2021 and we should start deploying stuff on ivp6 networks to avoid this kind of problems (beside other kind of problems)...

edit: ok I just saw the PS about ipv6 and Azure... sorry.

[+] whoisburbansky|5 years ago|reply

Do you remember the name of the original SAN array provider?

[+] olavgg|5 years ago|reply

I want to a copy of our Azure SQL database dumped to my local dev MS SQL instance very often. At the moment I do this by using SSMS, where I generate a script of the Azure SQL database and use sqlcmd to import that script. It is a really painful process, and sqlcmd sometimes just hang and becomes unresponsive.

Why is there no simple export/import functionality?

[+] otabdeveloper4|5 years ago|reply

> Azure's network team decided to unnecessarily use NAT for IPv6, making it exactly (100%!) as complex as IPv4...

People like the NAT. It's a feature, not a bug. If your selling point for IPv6 is "no more NAT" then no wonder it never went anywhere!

P.S. No, "you're doing it wrong" and "you're not allowed to like the NAT" are not valid responses to user needs.

[+] mikewarot|5 years ago|reply

Virtualization eventually will be seen as the unnecessary layer added to make up for operating systems that lack capability based security.

It's going to take a decade to refactor things to remove that layer. Once done, you'll be able to safely run a process against a list of resources.

[+] philipswood|5 years ago|reply

Agreed. Operating systems abstract resources, the fact that we needed containers and VMs point to the fact that the first set of implementations weren't optimal and needs to be refactored.

For VMs, security is the one concern, the others would be more direct access to lower levels and greater access to the hardware and internals than just a driver model.

For containers I'd say that the abstraction/interface presented to an application was too narrow: clearly network and filesystem abstractions needed to be included as well (not just memory, OS APIs and hardware abstractions).

I imagine that an OS from the far future could perform the functions of a hypervisor and a container engine and would allow richer abilities to add code to what we consider kernel space, one could write a program in Unikernel style, as well as have normal programs look more container-like.

[+] nonameiguess|5 years ago|reply

It's not just security. There's incompatible dependencies. Virtualization allows you to run Windows and Linux on the same server, 40 different C libraries with potentially incompatible ABIs, across X time zones and Y locales, and they'll all work.

You can provide the same level of replication and separation of dependencies in an OS, but at a certain point you're just creating the same thing as a hypervisor and calling it an OS and the distinction becomes academic.

[+] w7|5 years ago|reply

We've had capability based security frameworks aka MAC (ex: AppArmor) in Linux since 1999 or earlier. Containers (which also existed long before docker) have been popularized for convenience, and virtualization would still be useful for running required systems that are not similar to the host. If anything it looks like we're going towards a convergence with "microvms".

[+] js8|5 years ago|reply

It already exists in the IBM mainframe. But nobody wants to write apps for it..

[+] nine_k|5 years ago|reply

Virtualization, no. A hypervisor running a Windows kernel and a Linux kernel side by side is not about capability-based security. You can even see it like cap-based security approach: VMs only see what the hypervisor gave them, and have no way to refer to anything that the hypervisor did not pass to them.

Containers, yes. They are a pure namespacing trick, and can be replaces by cap-based security completely.

[+] sneak|5 years ago|reply

I am fairly certain that virtualization is going to be a core tenet of computing up to and including the point where the atoms of the gas giants and Oort cloud are fused into computronium for the growing Dyson solar swarm to run ever more instances of people and the adobe flash games they love.

[+] tifadg1|5 years ago|reply

I highly doubt that. If so inclined you can do that today with SEL as the granularity and capability is there, but how long would tuning it in would take? What about updates - will it still work as expected? What about edge cases? What about devs requiring more privilege than is needed?

[+] encryptluks2|5 years ago|reply

Agreed 100%, but you still are going to get the old-timers who insist that virtualization and containers just make things more difficult and that their old school approach is the best.

[+] chii|5 years ago|reply

going by how the web is doing today, the refactor you refer to will never be done.

[+] bob1029|5 years ago|reply

I love the test-in-production illustration. This is a fairly accurate map of my journey as well. I vividly remember scolding our customers about how they needed a staging environment that was a perfect copy of production so we could guarantee a push to prod would be perfect every time.

We still have that customer but we have learned a very valuable & expensive lesson.

Being able to test in production is one of the most wonderful things you can ever hope to achieve in the infra arena. No ambiguity about what will happen at 2am while everyone is asleep. Someone physically picked up a device and tested it for real and said "yep its working fine". This is much more confidence inspiring for us.

I used to go into technical calls with customers hoping to assuage their fears with how meticulous we would be with a test=>staging=>production release cycle. Now, I go in and literally just drop the "We prefer to test in production" bomb on them within the first 3 sentences of that conversation. You would probably not be surprised to learn that some of our more "experienced" customers enthusiastically agree with this preference right away.

[+] nine_k|5 years ago|reply

You can test in production by moving fast and breaking things (the clueless guy).

You can test in production by having canaries, filters, etc, and allowing some production traffic to the version under test. This is the "wired" guy.

For many backend things, you can test in production by shadowing the current services, and copying the traffic into your services under test. It has limitations, but can produce zero disruption.

[+] erikerikson|5 years ago|reply

IMO this still keeps production a previous thing. Next comes multi-version concurrent environments on the same replicated data streams. There is only production and delivery to customers is access version 3 instead of version 2. This can be proceeded by an automated comparison of v2's outputs to v3's and a review (or code description of the expected differences).

[+] edoceo|5 years ago|reply

you still have unit tests and similar right? and maybe test in a staging env lightly first? or, right to Live?

[+] zmmmmm|5 years ago|reply

This post mingles different interpretations of "serverless" - the current fad of "lambda" style functions etc and premium / full stack managed application offerings like Heroku. Although they are both "fully managed" they have a lot of differences.

Both though assume the "hard" problem of infrastructure is the one-time setup cost, and that your architecture design should be centered around minimising that cost - even though it's mostly a one-time cost. I really feel like this is a false economy - we should optimise for long term cost, not short term. It may be very seductive that you can deploy your code within 10 minutes without knowing anything, but that doesn't make it the right long term decision. In particular when it goes all the way to architechting large parts of your app out of lambda functions and the like we are making a huge archtitectural sacrifice which is something that should absolutely not be done purely on the base of making some short term pain go away.

[+] ngngngng|5 years ago|reply

> but it takes 45 steps in the console and 12 of them are highly confusing if you never did it before.

I am constantly amazed that software engineering is as difficult as it is in ways like this. Half the time I just figure I'm an idiot because no one else is struggling with the same things but I probably just don't notice when they are.

[+] jcelerier|5 years ago|reply

I'm so happy to just develop desktop apps in C++ / Qt. I commit, I tag a release, this generates mac / windows / linux binaries, users download them, the end.

[+] m_st|5 years ago|reply

I'm very happy in the same camp using C# and SQL (though Windows only).

Thank you for making desktop apps.

[+] abraxas|5 years ago|reply

I want to live in that world again. It ended for me about 19 years ago when I left my last "desktop software" job. How do you manage to avoid "web stacks" in this day and age?

[+] drinchev|5 years ago|reply

There is hardly an app that I use these days that doesn't require internet connection. I guess once your app needs it, you will have the same problems.

[+] XorNot|5 years ago|reply

Some of this I agree with, even if the overall is to me asking for "magic" (I wouldn't mind trying to get there though).

For me the one I really want is "serverless" SQL databases. On every cloud platform at the moment, whatever cloud SQL thing they're offering is really obviously MySQL or Postgres on some VMs under the hood. The provisioning time is the same, the way it works is the same, the outage windows are the same.

But why is any of that still the case?

We've had the concept of shared hosting of SQL DBs for years, which is definitely more efficient resource wise, why have we failed to sufficiently hide the abstraction behind appropriate resource scheduling?

Basically, why isn't there just a Postgres-protocol socket I connect to as a user that will make tables on demand for me? No upfront sizing or provisioning, just bill me for what I'm using at some rate which covers your costs and colocate/shared host/whatever it onto the type of server hardware which can respond to that demand.

This feels like a problem Google in particular should be able to address somehow.

[+] dragonwriter|5 years ago|reply

> Basically, why isn't there just a Postgres-protocol socket I connect to as a user that will make tables on demand for me? No upfront sizing or provisioning, just bill me for what I'm using at some rate which covers your costs and colocate/shared host/whatever it onto the type of server hardware which can respond to that demand.

How isn’t RDS Aurora Serverless from AWS, available in both MySQL-and Postgres-compatible flavors, exactly what you are looking for?

[+] anurag|5 years ago|reply

Testing in 'production' has been possible for a while: Heroku has Review Apps, and Render (where I work) has Preview Environments [1] that spin up a high-fidelity isolated copy of your production stack on every PR and tear it down when you're done.

[1] https://render.com/docs/preview-environments

[+] qaq|5 years ago|reply

Given how much capitalization AWS and Azure account for it's surprising how little progress has being made in many areas. A true serverless RDBMS database priced sanely that basically charged for qry execution time and storage without need to provision some "black box" capacity units and would be suitable for a large scale app. Most of all workflows that require contacting support and arbitrary account limits that force you to split solutions across multiple accounts wtf.

[+] erikerikson|5 years ago|reply

Both Dynamo DB and Cosmos DB offer effectively what you're asking for.

Limits are a risk that must be mitigated but they aren't arbitrary, they exist to avoid interference of one customer on to others. They protect the service which, to providers and reasonably so, is more important than any one customer.

[+] Ericson2314|5 years ago|reply

I'm sympathetic to the motivation, but the author is a bit naive in some of the specific requirements --- we absolutely must spin things up locally to retain our autonomy. Sure, allow testing in the cloud too, but don't make that the only option.

The cloud should be a tiny margin business, that it isn't shows just how untrustworthy it is. The win-win efficiency isn't there because there's so much rentier overhead.

Oh, and use Nix :).

[+] skywhopper|5 years ago|reply

While it’s a good idea to lay out a vision for the future of infrastructure as a service, it’s annoying to read this sort of article that seems to believe it ought to be simple to provide true serverless magical on-demand resources for random code. Fact is it’s a lot more complicated than that unless you’re just a small fry without any really hard problems to solve.

It’s silly to complain that AWS doesn’t have a service that allows any random programmer to deploy a globally scalable app with one click. For one thing, you aren’t AWS’s primary target audience. For two, the class of problems that can be solved with such simple architectures is actually pretty small. But for three, it’s not actually even possible to do what you’re asking for any random code unless you are willing to spend a really fantastic amount of money. And even then once you get into the realm of a serious number of users or a serious scale of problem solving, then you’ll be back into bespoke-zone, making annoying tradeoffs and having to learn about infra details you never wanted to care about. Because, sorry, but abstraction is not free and it’s definitely not perfect.

[+] nine_k|5 years ago|reply

> It’s silly to complain that AWS doesn’t have a service that allows any random programmer to deploy a globally scalable app with one click. For one thing, you aren’t AWS’s primary target audience.

I won't be surprised if AWS makes disproportionately more money off smaller operations which don't want to care about honing their infra, but want something be done quickly and with low effort. They quench the problem with money, while the absolute amounts are not very high. So targeting them even more, and eating a slice of Heroku's pie, looks pretty logical for AWS to do.

[+] pbiggar|5 years ago|reply

So I'm creating something that's exactly like the author is describing: https://darklang.com

He's right that there's no reason that things shouldn't be fully Serverless, and have instant deployment [1]. These are key parts of the design of Dark that I think no one else has managed to solve in any meaningful way. (I'll concede that we haven't solved a lot of other problems had still need to be solved to be usable and delightful though).

[1] https://blog.darklang.com/how-dark-deploys-code-in-50ms/

[+] mathgladiator|5 years ago|reply

That's super cool. There is a lot to be said about simple things, and I think the language approach makes a ton of sense.

I'm building a serverless document store with mini-databases using a programming language I designed for board games: http://www.adama-lang.org/

[+] EdwardDiego|5 years ago|reply

> The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.

Because very often, software that runs on clusters, requires thought when resizing.

In fact, I'm trying to think of a distributed software product where resizing a cluster, doesn't require a human in the loop making calculated decisions.

Even with things like Spark where you can easily add a bunch of worker nodes to a cluster in the cloud, there's still core nodes maintaining a shared filesystem for when you need to spill to disk.

[+] Throaway838921|5 years ago|reply

My biggest problem with modern infrastructure is how much resources they use for simple things. They scale upwards, but they don't scale down.

I tried OpenFAAS in Kubernetes, to see if I could run simple lambda like loads. Default setup took north of 8 Gb of memory!

I just want efficient and simple microservices infrastructure, that doesn't hog 8 Gb of memory to run a simple functions or small microservices.

[+] kazen44|5 years ago|reply

might i recommend running simple (go) binaries on a couple of hosts running freebsd? (running them in a jail if you want).

Proper isolation that is battle tested, rock solid operating system that doesn't try to reinvent the wheel every 3 years and as an added bonus you get ZFS which makes development, upgrades and maintenance a breeze thanks to proper snapshotting.

We did this a simple side project, originally written in PHP, then python and now go and it has been rock solid and easily scalable for atleast a decade or more.

Heck, you can even run linux compatible binaries in linux compatible jails nowadays[0].

0: https://wiki.freebsd.org/LinuxJails

[+] lnsp|5 years ago|reply

I am currently working on a project attempting to achieve something like this called Valar (https://valar.dev). It's still quite experimental and in private beta, but feel free to take a look and put yourself on the waiting list. I'm targeting small, private projects so it's not really suited for full-on production workloads yet.

The core of it is a serverless system that shrinks workload quotas while running them (similar to the Google Autopilot system). This way you don't have to rewrite your entire codebase into lambdas and the end user can save a lot of costs by only specifying a resource cap (but probably using a lot less, therefore paying less).

[+] asim|5 years ago|reply

Been there. Dreamed that. Built my ideal experience -> https://m3o.com

Honestly I've spent the better part of a decade talking about these things and if only someone would solve it or I had the resources to do it myself. I'm sure it'll happen. It's just going to look different than we thought. GitHub worked because it was collaborative and I think the next cloud won't look like an AWS, it'll look more like a social graph that just happens to be about building and running software. This graph only exists inside your company right now. In the future it'll exist as a product.

[+] max_|5 years ago|reply

>Truly serverless

>The beauty of this is that a lot of the configuration stuff goes away magically. The competitive advantage of most startups is to deliver business value through business logic, not capacity planning and capacity management!

Is this exactly what cloudflare's workers is doing? [0]

I love the fact that I only need to think of the business logic.

Development without the need for VMs and obscure configuration scripts feels — and is in fact very productive. I don't think I will want to use anything else for hosting my back ends.

[0]:https://workers.cloudflare.com/

[+] simonpantzare|5 years ago|reply

I'm really excited by Cloudflare Workers. I hope to get a chance to build something tangible with it soon. The developer experience with Wrangler is superb and paired with Durable Objects and KV you got disk and in-memory storage covered. I'm a fan of and have used AWS for 10+ years but wiring everything together still feels like a chore, and Serverless framework and similar tools just hide these details which will inevitable haunt you at some point.

Cloudflare recently announced database partners. Higher-level options for storage and especially streaming would be welcome improvements to me. Macrometa covers a lot of ground but sadly it seems like you need to be on their enterprise tier to use the interesting parts that don't overlap with KV/DO, such as streams. [0]

I played with recently launched support for ES modules and custom builds in Wrangler/Cloudflare Workers the other day together with ClojureScript and found the experience quite delightful. [1]

[0]: https://blog.cloudflare.com/partnership-announcement-db/

[1]: https://dev.to/pilt/clojurescript-on-cloudflare-workers-2d9g

[+] ramoz|5 years ago|reply

I think we assume too much about where infrastructure is headed. Serverless, containers... they're fueling this notion of increased productivity but I don't think they themselves are keeping up with where Software is headed. E.g. "AI: --- in a lot of cases you're going to be breaking through every level of abstraction with any *aaS above the physical cores and attached devices. Traditional CRUD simplified? -- sure, but I feel like these "infrastructure" dev opportunities will be eaten by digital platforms and cloud solutions in the coming years.

[+] nonameiguess|5 years ago|reply

This is a lot harder than the author wants to realize. Elastic infrastructure deployment in sub second latency is not likely to ever happen. The problem is simply a lot harder than serving http requests. It's not receive some bytes of text, send back some bytes of text. VMs are just files and can be resized, but optimal resource allocation on multitenant systems is effectively solving the knapsack problem. It's NP hard, and at least some of the time, it involves actually powering devices on and off and moving physical equipment around. There is still metal and silicon back there. In order to just be able to grow on demand, the infrastructure provider needs to overprovision for worst case, as the year of Covid should have taught us the entire idea of just in time logistics for electronics is a house of cards. Ships still need to cross oceans.

My wife was the QA lead for a data center provisioner for a US government agency a few years back and the prime contractor refused to eat into their own capital and overprovision, instead waiting for a contract to be approved before they would purchase anything. They somehow made it basically work through what may as well have been magic, but damn the horror stories she could tell. This is where Amazon is eating everyone else's lunch by taking the risk and being right on every bet so far so if they build it, the customers will come.

But even so, after all of that to complain that deploying infrastructure changes takes more than a second, my god man! Do you have any idea how far we've come? Try setting up a basic homelab some time or go work in a legacy shop where every server is manually configured at enterprise scale, which is probably every company that wasn't founded in the past 20 years. Contrast that with the current tooling for automated machine image builds, automated replicated server deployment, automated network provisioning, infrastructure as code. Try doing any of this by hand to see how hard it really is.

Modern data centers and networking infrastructure may as well be a 9th wonder of the world. It is unbelievable to see how underappreciated they are sometimes. Also, hardware is not software. Software is governed solely by math. It doesn't exactly what it says it will do, every single time. Failures of software are entirely the fault of bad program logic. Hardware isn't like that at all. It is governed by physics and it is guaranteed to fail. Changes in temperature and air pressure can change results or cause a system to shut down completely. Hiding that from application developers is absolutely the goal, but you're never going to make a complex problem simple. It's always going to be complex.

Remember my wife? Let me tell you one of her better stories. I can't even name the responsible parties because of NDAs, but one day a while back an entire network segment just stopped working. Connectivity was lost for several applications considered "critical" the US defense industrial base, after they had done nothing but pre-provision reserved equipment for a program that hadn't even gone into ops yet and wasn't using the equipment. It turned there was some obscure bug in a particular model of switch that caused it to report a packet exchange as successful even though it wasn't so that failover never happened, even though alternative routes were up and available. Nobody on site could tell why this was happening because the cause ended up being locked away in chip level debug logs that only the vendor could access. The vendor claimed this could only happen one a million times, so they didn't consider it a bug. Guess how often one in a million times happens when you're routing a hundred million packets a second? Maybe your little startup isn't doing that, but the entire US military industrial complex or Amazon is absolutely doing that, and those are the problems they are concerned with addressing, not making the experience better for some one man shop paying a few bucks a month who think yaml dsls make software-defined infrastructure too complicated for his app that can plausibly run on a single laptop.

[+] pm90|5 years ago|reply

It’s true that we’ve made a ton of progress. But I LOVE it when developers demand more and I will bet good money that it’s only a matter of time when what the author describes will actually happen.

We’ve been making tremendous strides in making production software systems easy and accessible for developers (and not sysadmins) to understand and operate. We’re currently in the stage where the devops principle has been somewhat realized; devs love the ability to deploy when they want but absolutely hate that infrastructure provisioning takes so long compared to the tools they use for software development. This is a good thing! I want more devs to demand this; someone may actually solve it.

[+] kazen44|5 years ago|reply

> The vendor claimed this could only happen one a million times, so they didn't consider it a bug. Guess how often one in a million times happens when you're routing a hundred million packets a second?

This reminds of people on HN who claim BGP is outdated and should be replaced.

They have no idea how mind boggling large and complex the global internet routing is. I would even argue that BGP in the default free zone (aka, the public internet) is the largest distributed system in the world.

> VMs are just files and can be resized, but optimal resource allocation on multitenant systems is effectively solving the knapsack problem.

as someone who works in a datacenter/cloud enviroment. Having optimal resource distribution is simply impossible to get 100% correct. Their is always some waste or over/underallocation. The same goes for large scale networks aswell.

[+] InsomniacL|5 years ago|reply

A trainer said to me YAML is like JSON but easier to read, he then said count the whitespace on every line... Sure, An IDE makes it easier to see, but the same can be said about JSON.

195 comments