[Global][Public Cloud] Autoscaling feature and nodepool CRD management in degraded state

Original incident notification from OVH Data Center

Affected Service: [Global][Public Cloud] Autoscaling feature and nodepool CRD management in degraded state

Jul 7, 09:22 UTC
Resolved - Start Time : 05/07/2023 04:00 UTC
End Time : 05/07/2023 07:07 UTC

On July 5th at 06:03 am CEST, OVHcloud identified a partial unavailability of its services affecting Public Cloud and VPS customers.

During this time, the management of the following services was degraded: Compute, Storage, Network, Containers & orchestration. Access to data on Standard Object Storage Swift & Cloud Archive was not possible. Access to the Public Cloud segment of the OVHcloud manager was also unavailable. No data was lost during the incident.

For VPS, ordering, upgrades, reinstallation and unsubscription were unavailable, as were the Snapshot and Automated backup features.

Resources of both Public Cloud and VPS remained available during the course of the incident, excluding Standard Object Storage Swift and Cloud Archive.

The incident was fixed at 9:07am CEST, thanks to our fully mobilized teams, with a progressive ramp-up to a nominal state up until 11:00am CEST for all affected services.

We sincerely apologize to all affected customers.
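This notice gives start and end times in UTC while the narrative uses CEST. As a minimal sketch for cross-checking the two, assuming CEST here means local time in the Europe/Paris zone (the helper name `utc_to_cest` is hypothetical, not part of any OVHcloud tooling):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: CEST in this notice is Europe/Paris local time (UTC+2 in July).
PARIS = ZoneInfo("Europe/Paris")

def utc_to_cest(year, month, day, hour, minute):
    """Convert a UTC timestamp to Europe/Paris local time."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(PARIS)

# Start and end times as stated in the "Resolved" entry.
start = utc_to_cest(2023, 7, 5, 4, 0)
end = utc_to_cest(2023, 7, 5, 7, 7)

print("Start:", start.strftime("%H:%M"), "local")
print("End:  ", end.strftime("%H:%M"), "local")
print("Duration:", end - start)
```

Under that assumption, 04:00 UTC and 07:07 UTC correspond to 06:00 and 09:07 local time, which lines up with the 9:07am CEST fix time reported above.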

Jul 5, 16:25 UTC
Update - We are continuing to monitor for any further issues.

Jul 5, 16:24 UTC
Monitoring - On July 5th at 06:03 am CEST, OVHcloud identified a partial outage of its services affecting customers of the Public Cloud universe. Services including Control Panel, Kubernetes, Private Registry and VPS were partially unavailable.

At 10:05am CEST, following actions from our fully mobilized technical teams, the Control Panel, Kubernetes, Private Registry and VPS services were back to a nominal state.

At 11:00am CEST, Cold Archive, Object Storage and PCI services were restored to nominal status.
We continue to actively monitor the situation with impacted services. We will communicate more information on the cause of the incident as our investigations progress.

We sincerely apologize to all affected customers.

Jul 5, 14:43 UTC
Identified - Updates : Management of OpenStack resources is back to normal. The cluster autoscaler component and nodepool custom resources management are still in a degraded state.

Our team is still working to stabilize these components.

Jul 5, 09:24 UTC
Monitoring - Start time : 2023/07/05 00:16 UTC
End Time : 05/07/2023 09:00 UTC
Ongoing actions : Monitoring
Our technical teams deployed a solution for the issue. We are monitoring the situation for the time being.

Jul 5, 08:11 UTC
Update - During investigations, Public Cloud shared the following information:
Existing OpenStack resources are not impacted unless a modification request is sent to the OpenStack API (load balancers, volumes, instances, etc.).
Creating new OpenStack resources or deleting existing ones is impossible at the moment.
Our team is actively monitoring all MKS services and will fix them as soon as possible if they are affected by this incident.

Jul 5, 07:08 UTC
Identified - Start time : 2023/07/05 00:16 UTC
End time : In progress
Service impact : The cluster autoscaler component and nodepool custom resources management are currently in a degraded state.
Root cause : Keystone API temporarily unreachable: https://public-cloud.status-ovhcloud.com/incidents/1shkj36zsphs
Ongoing actions : Waiting for outage resolution from the Public Cloud team.
