February 2024 - Incident Report

In this post, we detail the incidents encountered in February 2024, including high memory usage, database restarts, repeated gateway timeouts, and executor crashes. We share our analysis, user impact, and the resolution steps taken to mitigate these issues, reinforcing our commitment to transparency and continuous improvement.

As Appwrite Cloud continues to grow and scale, it is inevitable that we face challenges and obstacles that test the resilience of our systems and our team's ability to respond swiftly. February 2024 was no exception, as we encountered a series of incidents that put our protocols to the test. While these moments are opportunities for growth, they also remind us of the importance of transparency and communication with our community. This blog post aims to provide a detailed overview of the incidents we experienced throughout the month, including the causes, our responses, and the lessons learned. Our commitment to continuous improvement drives us to share these insights openly, reinforcing our dedication to reliability and trust.

TL;DR for February 2024

Date | Incident | Duration
2024-02-01 | High Memory Usage During Scheduled Backups | 4 hours, 6 mins
2024-02-12 | Database Restart Incident | 1 hour, 34 mins
2024-02-13 | Gateway Timeouts in Traefik Containers | 15 mins
2024-02-14 | Gateway Timeouts in Traefik Containers | 8 mins
2024-02-15 | Gateway Timeouts in Traefik Containers | 8 mins
2024-02-19 | Executor Crash and Bad Webhook | 27 mins
2024-02-21 | Gateway Timeouts in Traefik Containers | 25 mins
2024-02-24 | Gateway Timeouts in Traefik Containers | 30 mins
Total | | 7 hours, 33 mins

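
The totals in the summary above can be cross-checked with a short script. This is only a sketch: the durations are copied from the table, while the variable and function names are illustrative and not part of any Appwrite tooling.

```python
# Durations in minutes, copied from the February 2024 summary table.
INCIDENTS = [
    ("2024-02-01", "High Memory Usage During Scheduled Backups", 4 * 60 + 6),
    ("2024-02-12", "Database Restart Incident", 1 * 60 + 34),
    ("2024-02-13", "Gateway Timeouts in Traefik Containers", 15),
    ("2024-02-14", "Gateway Timeouts in Traefik Containers", 8),
    ("2024-02-15", "Gateway Timeouts in Traefik Containers", 8),
    ("2024-02-19", "Executor Crash and Bad Webhook", 27),
    ("2024-02-21", "Gateway Timeouts in Traefik Containers", 25),
    ("2024-02-24", "Gateway Timeouts in Traefik Containers", 30),
]


def format_duration(minutes: int) -> str:
    """Render a minute count in the same style as the table."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} hours, {mins} mins" if hours else f"{mins} mins"


# Sum across all incidents, then across the Traefik incidents only.
total = sum(minutes for _, _, minutes in INCIDENTS)
traefik_total = sum(minutes for _, name, minutes in INCIDENTS if "Traefik" in name)

print(format_duration(total))  # 7 hours, 33 mins
print(traefik_total)           # 86
```
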
High Memory Usage During Scheduled Backups
  • Date: February 1, 2024

  • Duration: 4 hours and 6 minutes

  • Summary: High memory usage in a database cluster caused widespread impact on cloud components.

  • Description: A high memory usage alert was triggered for one of the database clusters, leading to unresponsive behavior that impacted critical cloud components, including Databases, API, Functions, etc. The root cause was identified as a high memory load during scheduled backups, affecting API performance for all users on the cloud.

  • User Impact: Projects on the affected database were impacted, and API performance was partially degraded during the initial phases of the incident.

  • Resolution and Steps Taken: The problematic database was temporarily disconnected from the connection string, allowing for a necessary restart. After addressing the high memory usage, the database was reconnected, mitigating the issue.

Database Restart Incident
  • Date: February 12, 2024

  • Duration: 1 hour and 34 minutes

  • Summary: Manual restart of a database affected project APIs.

  • Description: An unexpected manual restart of the database_db_fra1_self_hosted_2_0 database was triggered by an engineer, impacting all projects associated with this database. The root cause was identified as high memory usage, compounded by a scheduling error due to time zone confusion.

  • User Impact: Projects on the affected database experienced downtime and service disruption.

  • Resolution and Steps Taken: Transparent communication was maintained with affected users throughout the incident. Post-incident, measures were implemented to ensure such time zone errors are avoided in the future.
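
The report does not describe the exact measures taken against time zone errors. One common safeguard, sketched here under that caveat, is to reject maintenance times that lack an explicit time zone and to convert everything to UTC before scheduling. The helper name, the UTC+1 offset, and the 02:00 restart time below are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC; reject naive values."""
    if local_time.tzinfo is None or local_time.utcoffset() is None:
        raise ValueError("maintenance time must carry an explicit time zone")
    return local_time.astimezone(timezone.utc)


# Hypothetical example: an engineer working in UTC+1 plans a restart
# for 02:00 local time. Stored as UTC, the same instant is 01:00.
ENGINEER_TZ = timezone(timedelta(hours=1), "UTC+01:00")
planned = datetime(2024, 2, 12, 2, 0, tzinfo=ENGINEER_TZ)

print(to_utc(planned).isoformat())  # 2024-02-12T01:00:00+00:00
```

Rejecting naive datetimes forces whoever schedules the operation to state which zone they mean, instead of leaving that to be inferred later.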

Repeated Gateway Timeouts in Traefik Containers
  • Dates: February 13, 14, 15, 21, and 24, 2024

  • Duration: Varied; a total of 86 minutes across all incidents

  • Summary: Multiple incidents of gateway timeouts following cloud deployments.

  • Description: Routine cloud deployments triggered gateway timeout errors due to a specific configuration introduced to Traefik (--serversTransport.maxIdleConnsPerHost=-1), affecting routing post-deployment. This issue was consistent across several incidents, impacting users' ability to access Appwrite Cloud through the API and console.

  • User Impact: Sporadic gateway timeout errors on the Appwrite API and console, affecting user access to Appwrite Cloud.

  • Resolution and Steps Taken: Immediate action involved restarting the Appwrite service to force correct container routing. The root cause was addressed by reverting the Traefik configuration change on February 25, 2024, which resolved the gateway timeout errors.
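
Because these timeouts appeared right after deployments, a post-deployment smoke check is one way such a regression could be caught early. The following is a minimal sketch and not a description of Appwrite's actual tooling: `probe` is a hypothetical callable that performs one request against the gateway and returns the HTTP status code, and the attempt count and threshold are arbitrary.

```python
from typing import Callable

GATEWAY_TIMEOUT = 504  # HTTP status code for "Gateway Timeout"


def count_gateway_timeouts(probe: Callable[[], int], attempts: int) -> int:
    """Call the probe `attempts` times and count 504 responses."""
    return sum(1 for _ in range(attempts) if probe() == GATEWAY_TIMEOUT)


def deployment_is_healthy(
    probe: Callable[[], int], attempts: int = 20, max_timeouts: int = 0
) -> bool:
    """Treat the deployment as healthy if 504s stay within the allowed count."""
    return count_gateway_timeouts(probe, attempts) <= max_timeouts


# Simulated probe results standing in for real requests to the API.
statuses = iter([200, 504, 200, 504, 200])
print(deployment_is_healthy(lambda: next(statuses), attempts=5))  # False
```

In a real pipeline the probe would issue an HTTP request to the API or console endpoint; it is injected as a callable here so the check itself can be exercised without network access.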

Executor Crash and Bad Webhook
  • Date: February 19, 2024

  • Duration: 27 minutes

  • Summary: A crash in the functions executor combined with a faulty webhook led to increased API latency and failures.

  • Description: The system experienced a crash in one of its executors, triggered by a faulty webhook. This was compounded by a high load of function executions that triggered multiple webhooks simultaneously. The webhooks blocked API workers, which ultimately increased latency and triggered our monitors.

  • User Impact: Users experienced API timeouts and failed function executions, leading to service disruption.

  • Resolution and Steps Taken: Immediate actions included restarting the API and executor to recover from the crash. The problematic webhook was removed, and steps were taken to restart the webhook worker, ensuring pending jobs could continue to execute.
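
The failure mode described above, where webhook deliveries hold up API workers, is often mitigated by handing deliveries to a bounded queue that the request path never waits on. The sketch below illustrates that general pattern only; it is not Appwrite's implementation, and the class name, the `send` callable, and the queue size are all hypothetical.

```python
import queue
from typing import Callable, Tuple


class WebhookQueue:
    """Bounded hand-off between the API path and a webhook worker."""

    def __init__(self, max_pending: int) -> None:
        self._jobs: queue.Queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # deliveries refused because the queue was full

    def enqueue(self, url: str, payload: dict) -> bool:
        """Called from the API path: returns immediately, never blocks."""
        try:
            self._jobs.put_nowait((url, payload))
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, send: Callable[[str, dict], None]) -> Tuple[int, int]:
        """Called from the worker: deliver pending jobs, isolating failures."""
        delivered = failed = 0
        while True:
            try:
                url, payload = self._jobs.get_nowait()
            except queue.Empty:
                return delivered, failed
            try:
                send(url, payload)
                delivered += 1
            except Exception:
                # One faulty webhook must not stop the remaining jobs.
                failed += 1
```

With this shape, a burst of function executions fills the queue rather than tying up API workers, and a single failing endpoint is counted as a failure while the remaining pending jobs continue to execute.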
