justinsb | 7 years ago

Hi - I work at Google on GKE - sorry about the problems you're experiencing. There are a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double-checking that and looking into some of the additional things you all have reported here.

antpls|7 years ago

The status dashboard is inaccurate and/or a lie. It only mentions the GKE incident, while in fact the problem also affects Google Compute Engine users. I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

As another comment pointed out, what's the point of having so many zones and so much redundancy around the globe if such a global failure can still happen? I thought the "cloud" was supposed to make this kind of failure impossible

stevehawk|7 years ago

This is unfortunately the norm. Like when AWS S3 went down (and AWS couldn't update its own status images because they were hosted in S3, and we all laughed), and along with it went Alexa, Lambda, and every other service dependent on S3.

carbocation|7 years ago

> I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).

0xbadcafebee|7 years ago

> I thought the "cloud" was supposed to make this kind of failure impossible

You have to remember that you're trying to have access to backend platforms and infrastructure at all times, which almost no public utility does (assuming "the cloud" is "public utility computing"). Power plants go into partial shutdown, water treatment plants stop processing, etc. Utilities are only designed to provide constant reliability for the last mile.

If there's a problem with your power company, they can redirect power from another part of the grid to service customers. But some part of your power company is just... down. Luckily you have no need to operate on all parts of the grid at all times, so you don't notice it's down. But failure will still happen.

Your main concern should be the reliability of the last mile. Getting away from managing infrastructure yourself is the first step in that equation. AppEngine and FaaS should be the only computing resources you use, and only object storage and databases for managing data. This will get you closer to public utility-like computing.

But there's no way to get truly reliable computing today. We would all need to use edge computing, and that means leaning heavily on ISPs and content provider networks. Every cloud computing provider is looking into this right now, but considering who actually owns the last mile, I don't think we're going to see edge computing "take over" for at least a decade.

aiisjustanif|7 years ago

> I thought the "cloud" was supposed to make this kind of failure impossible

If it's set up and used correctly, yeah. But it's not a perfect world.

davemp|7 years ago

I’ll suggest considering whether entities enamored with centralizing ideals are more likely to fail to properly realize the robustness of a distributed system.

aviv|7 years ago

We have created GCE instances in several US regions without any issue today. Last one was 10 minutes ago in west2.

marcinzm|7 years ago

I appreciate all the effort you're putting in, and I understand such situations can be stressful, but users having to depend on someone responding on Hacker News for status updates seems really amateur for an organization the size of google.

NicoJuicy|7 years ago

The default is: https://status.cloud.google.com/incident/container-engine/18...

People who respond here could be Google employees who care about it and respond here because they know about it.

What he can mention (that a lot of people are working on it) is what you'd expect when something goes down. All other cloud providers do the same.

trhway|7 years ago

>really amateur for an organization the size of google.

There is a reason why Google has been having a hard time making inroads in the enterprise cloud: a kind of impedance mismatch between enterprise and the Google style. That roughly two-story-high "We heart API" sign on the Google Enterprise building facing 237 just screams it :)

rdtsc|7 years ago

Strangely and sadly, with Gmail account blocking and other such issues, HN and Twitter are often a better way to get Google's support than contacting support.

ben_jones|7 years ago

As much as I love bashing big corps, I see HN as a supplementary communication channel for products like GCP - it's a luxury we get to access alongside normal customer support channels in the GCP console, Twitter, etc.

tomcam|7 years ago

Thanks for jumping in here on your own time. The following question is not meant to be hostile, it is merely curiosity. Isn’t this supposed to be the kind of thing that monitoring and diagnostics software should find automatically? Serious question, not meant to embarrass you.

rlancer|7 years ago

Creating clusters via the UI is still not working for me.

rlancer|7 years ago

UPDATE: Created a cluster successfully in Australia... Still not able to do so in the US.

zachberger|7 years ago

Have you tried via the gcloud command?

fizzledbits|7 years ago

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is: "Error: The zone <my project> does not have enough resources available to fulfill the request". An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.
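
For anyone scripting around this kind of capacity error, here is a minimal sketch of falling back across zones in order, roughly what was done by hand above. Everything in it is an assumption for illustration: `ZoneExhausted`, `create_with_fallback`, and the `create_instance` callable are hypothetical names, not part of any Google client library, and you would need to map the real "does not have enough resources" error onto the exception yourself.

```python
from typing import Callable, Iterable, Optional


class ZoneExhausted(Exception):
    """Hypothetical error: the zone has no capacity for the request."""


def create_with_fallback(
    zones: Iterable[str],
    create_instance: Callable[[str], str],
) -> Optional[str]:
    """Try each zone in order; return the first instance ID created, else None.

    `create_instance` is a caller-supplied function (an assumption here) that
    takes a zone name and either returns an instance ID or raises ZoneExhausted.
    """
    for zone in zones:
        try:
            return create_instance(zone)
        except ZoneExhausted:
            continue  # capacity error in this zone: move on to the next one
    return None  # every zone in the list was out of capacity
```

This only papers over per-zone exhaustion. It does nothing if the failure is global, and any other error from `create_instance` still propagates to the caller.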

dilyevsky|7 years ago

So, given that I filed this months ago via official support and it’s still not fixed, can you look into the misleading container memory reporting UI bug? It reports memory_total but should report working_set.
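
To make the difference concrete, here is a rough sketch, assuming working_set is defined the way Kubernetes tooling such as cAdvisor usually computes it: total memory usage minus inactive file cache, floored at zero. The function and parameter names are made up for illustration, and this is not a claim about how the GKE UI computes its numbers.

```python
def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Approximate working set: total usage minus reclaimable inactive file cache.

    Assumption: this follows the common cAdvisor-style definition. The result
    is clamped at zero in case the inactive cache exceeds the reported usage.
    """
    return max(usage_bytes - inactive_file_bytes, 0)
```

Under that definition, a container that has read a lot of files can show a high total while its working set, the number that matters for memory pressure, stays much lower.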