Increased Failure Rate of Apache Cassandra Backups

Incident Report for Instaclustr

Resolved

This incident has been resolved.
Posted Apr 21, 2025 - 07:15 UTC

Update

We have seen a consistent reduction in error rates from Apache Cassandra backup events over the past 3 days; however, we will continue to monitor closely. We expect to provide another update in the next 24 hours.
Posted Apr 20, 2025 - 21:47 UTC

Update

We are continuing to observe a reduction in error rates from Apache Cassandra backup events; however, we will continue to monitor closely. We expect to provide another update in the next 24 hours.
Posted Apr 19, 2025 - 21:49 UTC

Update

We are continuing to observe a reduction in error rates from Apache Cassandra backup events; however, we will continue to monitor closely. We expect to provide another update in the next 24 hours.
Posted Apr 18, 2025 - 22:29 UTC

Update

A fix has been deployed to all Apache Cassandra nodes. Initial results indicate the problem has been resolved; however, we will continue to monitor closely. We expect to provide another update in the next 24 hours.
Posted Apr 18, 2025 - 02:38 UTC

Monitoring

We have started rolling out a fix, which is expected to take a few hours to be applied to all Apache Cassandra nodes. We will be monitoring the progress and effectiveness of the rollout closely.
Posted Apr 18, 2025 - 00:01 UTC

Identified

The issue has been identified and we are preparing to roll out a fix.
Posted Apr 17, 2025 - 22:58 UTC

Investigating

We are currently seeing an elevated rate of backup failures on AWS nodes for our Apache Cassandra offering. We currently expect these backups to retry continuously and eventually succeed; however, the failed attempts will be visible in the Instaclustr console and APIs as failed backup events.

We are actively monitoring and working on a solution, and will provide more updates as the investigation continues.

If you have any questions or concerns, please reach out via support@instaclustr.com
Posted Apr 17, 2025 - 01:29 UTC
This incident affected: Management Console and Cluster Management API.