Fairly serious MariaDB issue with function execution queueing
- Databases
- Functions
- Self Hosted
I encountered what is hopefully a super rare bug over the weekend and had a doozy of a time finding the solution to it. I'd like to document the issue and solution for anyone else who might run into something similar, as well as let the Appwrite team know in case it points to a deeper problem.
The problem: Over the weekend, I noticed that my server had randomly crashed while several users were testing different functionality, and was able to tie it to function executions. That is, whenever a serverless function was being executed, it would result in a 500 error that necessitated a full restart.
The investigation: After diving into the debugging rabbit hole, I found some promising logs in the mariadb:10.7 container.
The logs indicate an assertion failure inside the MariaDB server. In the context of MariaDB, this usually indicates a bug in the server, potential database corruption, or a problematic query.
The specific error message is: InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.7.5/storage/innobase/que/que0que.cc line 728.
This is followed by a message related to a duplicate key on an FTS auxiliary index table (`appwrite._1_executions`).
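For context, the fulltext index on that table is backed by hidden InnoDB FTS auxiliary tables. A rough way to see them (a sketch, assuming the `appwrite` schema name and the PROCESS privilege needed to read `information_schema.INNODB_SYS_TABLES`):

-- Sketch: list the InnoDB FTS auxiliary tables behind the fulltext index
-- on the executions table; their names start with "FTS_" plus internal IDs.
SELECT NAME, TABLE_ID
FROM information_schema.INNODB_SYS_TABLES
WHERE NAME LIKE 'appwrite/FTS%';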
The solution: After backing up my server and database, I decided to re-index the `appwrite._1_executions` table via a suggestion from GPT-4. So, I dropped and recreated all of the indices. This did indeed turn out to be the necessary fix.
My question: Does this point towards an issue with execution queuing whereby multiple functions being queued to execute at the same time could result in a duplication of keys in the executions table? This type of thing SHOULD be an atomic operation by the very nature of SQL databases, but I suppose it might occur if the execution queue were handled manually. I'll add some extra helpful information about my setup in particular for more context in my next message (as I'm running out of message space here).
The most likely culprit I can think of for this bug is that I implemented the proposed solution given in this bug report: https://github.com/appwrite/appwrite/issues/5629
This has allowed me to keep my function executions from freezing and also keeps them non-blocking, and is more or less an absolute necessity. If it is indeed the cause of the execution queueing issue, do you guys have any thoughts on how that might be ameliorated? That GitHub issue has a response from joeyouss stating that it's being looked at, so perhaps a preferable solution is in the works, but until then, is there anything I might be able to do on my end to prevent this type of issue from happening in the future?
Also, I'm happy to provide more of the logs I combed through en route to the above solution.
Hey,
Let's get some basic information first:
- Which solution from my issue have you tried, the semi-solution or the second possible one?
- What are your server specs? The blocking tends to happen mostly on low-budget servers, when the `appwrite` main container receives more requests than it can process because all the workers are already occupied.
- Do you have a `swap` file? Can you share a screenshot of `htop` if you have it?
- Has the error happened before, or is this the first time? Did it start after you changed the functions' logic?
- Can you share the logs from `appwrite`, `appwrite-executor`, and `appwrite-worker-functions`?
You're right, by nature MariaDB (MySQL) is ACID compliant, which means the durability and consistency of the data should hold. But a possible problem is that your server doesn't have sufficient RAM, making the MariaDB service collapse in the middle of inserting data and corrupting the index.
I myself have used the semi-solution for over 1M requests with no interference so far.
Also, were all the function executions done in sync or async? Or a mix?
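As an aside on the ACID point above, here is a minimal sketch (using a hypothetical demo table, not Appwrite's actual schema) of how a unique key normally rejects a duplicate `_uid` at insert time instead of leaving a duplicated index entry:

-- Hypothetical demo table, not Appwrite's real schema.
CREATE TABLE demo_executions (
    _id INT AUTO_INCREMENT PRIMARY KEY,
    _uid VARCHAR(255) NOT NULL,
    UNIQUE KEY _index1 (_uid)
);
INSERT INTO demo_executions (_uid) VALUES ('64d3f0a1b2c3d4e5');
-- Repeating the same insert fails with ERROR 1062 (ER_DUP_ENTRY);
-- under normal operation the duplicate never reaches the index.
INSERT INTO demo_executions (_uid) VALUES ('64d3f0a1b2c3d4e5');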
Thanks for the response!
- I use the semi-solution.
- I'm running a Digital Ocean droplet with 4 vCPUs, 8GB of RAM, and 50 GB of disk space (though I also use a separate Spaces storage container). CPU usage is basically never over 10%, memory sticks around 30% utilization, disk usage at 40%. So, I have a fair bit of headroom at the moment.
- DO's usage graphs are where I got the info above, let me know if you need more granular detail.
- This is the first time I've ever seen the issue happen. I implemented the semi-solution around 3 weeks ago.
- I'll add this in a follow-up message
- All of the function executions are done in sync (I used to run some in async, but swapped everything over to sync when I implemented the semi-solution, as I misunderstood what sync/async were doing before)
Okay, so a 4 vCPU / 8 GB server is pretty strong, and with memory at 30% it means that MariaDB didn't crash because of missing RAM.
Roughly guessing, how many function executions per second do you think you had at most?
Yes, those are already a different matter.
I'm not certain that my logs will be super helpful as some of them may have rolled over, but I'll write in any errors or remotely pertinent stuff I find.
appwrite logs:
[Error] Timestamp: 2023-08-09T22:09:21+00:00
[Error] Method: GET
[Error] URL: /v1/databases/:databaseId/collections/:collectionId/documents
[Error] Type: Appwrite\Extend\Exception
[Error] Message: Collection with the requested ID could not be found.
[Error] File: /usr/src/code/app/controllers/api/databases.php
[Error] Line: 2899
[Error] Timestamp: 2023-08-09T22:09:21+00:00
[Error] Method: GET
[Error] URL: /v1/databases/:databaseId/collections/:collectionId/documents
[Error] Type: Appwrite\Extend\Exception
[Error] Message: Collection with the requested ID could not be found.
[Error] File: /usr/src/code/app/controllers/api/databases.php
[Error] Line: 2899
[Error] Timestamp: 2023-08-09T23:01:43+00:00
[Error] Type: Utopia\Exception
[Error] Message: Not Found
[Error] File: /usr/src/code/vendor/utopia-php/framework/src/App.php
[Error] Line: 803
appwrite-executor:
yeesh, sorry, got a bit messy there
Definitely nothing even remotely close to scratching its limits. This is a pre-production server and I had maybe 10 concurrent users testing out different functions, each of them running a function or sequence of 2-4 functions every minute or so
From what I see of how Appwrite functions relate to the executions table, it's as follows.
Every executions table has only two unique indexes: the `_id`, which is handled by MariaDB, and the `_uid`, which is generated by Appwrite here: https://github.com/appwrite/appwrite/blob/master/app/controllers/api/functions.php#L1083
The only idea I have is that the generation of the ID at row 1080 for two functions occurred at the exact same time (and most random IDs are time based), which is pretty rare and should raise an error.
What I think would be best is to see when it happens again - if it does - then we can run some tests on the data by connecting directly to MariaDB and checking the data integrity.
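For reference, a rough sketch of the kind of checks that could be run directly in MariaDB (assuming the `appwrite` database and the `_1_executions` table discussed above):

USE appwrite;

-- Ask InnoDB to verify the table and its indexes.
CHECK TABLE `_1_executions`;

-- Look for _uid values that appear more than once despite the unique index;
-- a healthy table should return no rows here.
SELECT _uid, COUNT(*) AS occurrences
FROM `_1_executions`
GROUP BY _uid
HAVING COUNT(*) > 1;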
The `appwrite-worker-functions` log is insanely huge, and, actually, was one of the things I thought might be the issue when the bug first arose (i.e., perhaps it had grown too large and was causing read/write issues). The logs have been loading into my terminal for the past 2 minutes or so, and are filled with insane amounts of info that are not really human parsable, a la `"name":"userStatus","enabled":true,"documentSecurity":true,"attributes":[{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"lastActive","type":"integer","status":"available","size":8,"required":true,"default":null,"signed":true,"array":false,"format":"intRange","formatOptions":{"min":-9223372036854775808,"max":9223372036854775807},"filters":[],"options":[],"$id":"1_37_lastActive","$internalId":"550","$createdAt":"2023-01-11T04:36:41.839+00:00","$updatedAt":"2023-06-13T19:59:15.032+00:00","$permissions":[],"min":-9223372036854775808,"max":9223372036854775807,"$collection":"attributes"},{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"devices","type":"string","status":"available","size":666666,"required":false,"default":"{}","signed":true,"array":false,"format":"","formatOptions":[],"filters":[],"options":[],"$id":"1_37_devices","$internalId":"860","$createdAt":"2023-02-18T23:27:33.863+00:00","$updatedAt":"2023-06-13T19:59:16.789+00:00","$permissions":[],"$collection":"attributes"}],"indexes":[{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"lastActive","type":"key","status":"available","attributes":["lastActive"],"lengths":[0],"orders":["DESC"],"$id":"1_37_lastActive","$internalId":"387","$createdAt":"2023-08-02T19:01:04.134+00:00","$updatedAt":"2023-08-02T19:01:04.450+00:00","$permissions":[],"$collection":"indexes"}],"search":"userStatus userStat"`, etc...
mm, very interesting. One potentially useful note: I shut down and restarted the server to confirm the bug a solid 10-15 times (confirming that it was indeed caused by any serverless function execution), so wherever the issue originated from, it seemed to persist until I re-indexed the table.
Yes, that observation looks like the right one.
The best way to tackle a data discrepancy is when we have actual data to run some checks on.
Sounds good. I was definitely relieved to hear your experience of running 1M+ requests with the semi-solution, though. I'll be doing some more stress testing over the coming weeks as I lead into the launch of the platform I'm working on and see if I can reproduce the issue, and I'll try to gather as much information as I can if/when that happens!
Great, will be here
Thanks a ton for your support, though - I highly appreciate it! Have a good one! 😁
You too
Oh, and for any passersby that may read this thread, these were the commands I used to re-index the database:
# Access the SQL command editor
docker-compose exec mariadb sh -c 'exec mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD"'

-- Drop indexes
ALTER TABLE `_1_executions` DROP INDEX `_index1`;
ALTER TABLE `_1_executions` DROP INDEX `_key_function`;
ALTER TABLE `_1_executions` DROP INDEX `_key_trigger`;
ALTER TABLE `_1_executions` DROP INDEX `_key_status`;
ALTER TABLE `_1_executions` DROP INDEX `_key_statusCode`;
ALTER TABLE `_1_executions` DROP INDEX `_key_duration`;
ALTER TABLE `_1_executions` DROP INDEX `_created_at`;
ALTER TABLE `_1_executions` DROP INDEX `_updated_at`;
ALTER TABLE `_1_executions` DROP INDEX `_fulltext_search`;

-- Recreate indexes
ALTER TABLE `_1_executions` ADD UNIQUE KEY `_index1` (`_uid`);
ALTER TABLE `_1_executions` ADD KEY `_key_function` (`functionId`);
ALTER TABLE `_1_executions` ADD KEY `_key_trigger` (`trigger`);
ALTER TABLE `_1_executions` ADD KEY `_key_status` (`status`);
ALTER TABLE `_1_executions` ADD KEY `_key_statusCode` (`statusCode`);
ALTER TABLE `_1_executions` ADD KEY `_key_duration` (`duration`);
ALTER TABLE `_1_executions` ADD KEY `_created_at` (`_createdAt`);
ALTER TABLE `_1_executions` ADD KEY `_updated_at` (`_updatedAt`);
ALTER TABLE `_1_executions` ADD FULLTEXT KEY `_fulltext_search` (`search`);
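And, as a quick read-only sanity check after rebuilding (a sketch, using the same table name as above), you can confirm the indexes all came back:

-- List the indexes now defined on the table.
SHOW INDEX FROM `_1_executions`;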
I wonder if "duplicate key on an FTS auxiliary index table" is a symptom or a cause... I've seen this before, but it was due to either a MariaDB crash or low memory/CPU such that MariaDB wasn't able to keep up with the database inserts 🧐
That was my first guess above, meaning the best way to know whether it was a cause is by examining some data to see the irregularity.