Fairly serious MariaDB issue with function execution queueing
- Databases
- Functions
- Self Hosted
I encountered what is hopefully a super rare bug over the weekend and had a doozy of a time finding the solution to it. I'd like to document the issue and solution for anyone else who might run into something similar, as well as let the Appwrite team know in case it points to a deeper problem.
The problem: Over the weekend, I noticed that my server had randomly crashed while several users were testing different functionality, and was able to tie it to function executions. That is, whenever a serverless function was being executed, it would result in a 500 error that necessitated a full restart.
The investigation: After diving into the debugging rabbit hole, I found some promising logs in the mariadb:10.7 container.
The logs indicate an assertion failure inside the MariaDB server. In the context of MariaDB, this usually indicates a bug in the server, potential database corruption, or a problematic query.
The specific error message is: InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.7.5/storage/innobase/que/que0que.cc line 728.
This is followed by a message related to a duplicate key on an FTS auxiliary index table (`appwrite._1_executions`).
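For context, the fulltext index on that table is backed by hidden InnoDB FTS auxiliary tables. A rough way to see them (a sketch, assuming the `appwrite` schema name and the PROCESS privilege needed to read `information_schema.INNODB_SYS_TABLES`):

-- Sketch: list the InnoDB FTS auxiliary tables behind the fulltext index
-- on the executions table; their names start with "FTS_" plus internal IDs.
SELECT NAME, TABLE_ID
FROM information_schema.INNODB_SYS_TABLES
WHERE NAME LIKE 'appwrite/FTS%';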
The solution: After backing up my server and database, I decided to re-index the `appwrite._1_executions` table via a suggestion from GPT-4. So, I dropped and recreated all of the indices. This did indeed turn out to be the necessary fix.
My question: Does this point towards an issue with execution queuing whereby multiple functions being queued to execute at the same time could result in a duplication of keys in the executions table? This type of thing SHOULD be an atomic operation by the very nature of SQL databases, but I suppose it might occur if the execution queue were handled manually. I'll add some extra helpful information about my setup in particular for more context in my next message (as I'm running out of message space here).
The most likely culprit I can think of for this bug is that I implemented the proposed solution given in this bug report: https://github.com/appwrite/appwrite/issues/5629
This has allowed me to keep my function executions from freezing and also keeps them non-blocking, and is more or less an absolute necessity. If it is indeed the cause of the execution queueing issue, do you guys have any thoughts on how that might be ameliorated? That GitHub issue has a response from joeyouss stating that it's being looked at, so perhaps a preferable solution is in the works, but until then, is there anything I might be able to do on my end to prevent this type of issue from happening in the future?
Also, I'm happy to provide more of the logs I combed through en route to the above solution.
Hey,
Let's get some basic information first:
- Which solution from my issue have you tried, the semi-solution or the second possible one?
- What are your server specs? The blocking tends to happen mostly on low-budget servers, when the `appwrite` main container receives more requests than it can process because all the workers are already occupied.
- Do you have a `swap` file? Can you share a screenshot of `htop` if you have it?
- Has the error happened before, or is this the first time? Did it start after you changed the functions' logic?
- Can you share the logs from `appwrite`, `appwrite-executor`, and `appwrite-worker-functions`?
You're right, by nature MariaDB (MySQL) is ACID compliant, which means the durability and consistency of the data should hold. But a possible problem is that your server doesn't have sufficient RAM, making the MariaDB service collapse in the middle of inserting data and corrupting the index.
I myself have used the semi-solution for over 1M requests with no interference so far.
Also, were all the function executions done in sync or async? Or a mix?
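As an aside on the ACID point above, here is a minimal sketch (using a hypothetical demo table, not Appwrite's actual schema) of how a unique key normally rejects a duplicate `_uid` at insert time instead of leaving a duplicated index entry:

-- Hypothetical demo table, not Appwrite's real schema.
CREATE TABLE demo_executions (
    _id INT AUTO_INCREMENT PRIMARY KEY,
    _uid VARCHAR(255) NOT NULL,
    UNIQUE KEY _index1 (_uid)
);
INSERT INTO demo_executions (_uid) VALUES ('64d3f0a1b2c3d4e5');
-- Repeating the same insert fails with ERROR 1062 (ER_DUP_ENTRY);
-- under normal operation the duplicate never reaches the index.
INSERT INTO demo_executions (_uid) VALUES ('64d3f0a1b2c3d4e5');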
Thanks for the response!
- I use the semi-solution.
- I'm running a Digital Ocean droplet with 4 vCPUs, 8GB of RAM, and 50 GB of disk space (though I also use a separate Spaces storage container). CPU usage is basically never over 10%, memory sticks around 30% utilization, disk usage at 40%. So, I have a fair bit of headroom at the moment.
- DO's usage graphs are where I got the info above, let me know if you need more granular detail.
- This is the first time I've ever seen the issue happen. I implemented the semi-solution around 3 weeks ago.
- I'll add this in a follow-up message
- All of the function executions are done in sync (I used to run some in async, but swapped everything over to sync when I implemented the semi-solution, as I misunderstood what sync/async were doing before)
Okay, so a 4 vCPU / 8 GB server is pretty strong, and with memory at 30% it means that MariaDB didn't crash because of missing RAM.
Roughly guessing, how many function executions per second do you think you had at most?
Yes, those are already a different matter.
I'm not certain that my logs will be super helpful as some of them may have rolled over, but I'll write in any errors or remotely pertinent stuff I find.
appwrite logs:
[Error] Timestamp: 2023-08-09T22:09:21+00:00
[Error] Method: GET
[Error] URL: /v1/databases/:databaseId/collections/:collectionId/documents
[Error] Type: Appwrite\Extend\Exception
[Error] Message: Collection with the requested ID could not be found.
[Error] File: /usr/src/code/app/controllers/api/databases.php
[Error] Line: 2899
[Error] Timestamp: 2023-08-09T22:09:21+00:00
[Error] Method: GET
[Error] URL: /v1/databases/:databaseId/collections/:collectionId/documents
[Error] Type: Appwrite\Extend\Exception
[Error] Message: Collection with the requested ID could not be found.
[Error] File: /usr/src/code/app/controllers/api/databases.php
[Error] Line: 2899
[Error] Timestamp: 2023-08-09T23:01:43+00:00
[Error] Type: Utopia\Exception
[Error] Message: Not Found
[Error] File: /usr/src/code/vendor/utopia-php/framework/src/App.php
[Error] Line: 803
appwrite-executor:
yeesh, sorry, got a bit messy there
Definitely nothing even remotely close to scratching its limits. This is a pre-production server and I had maybe 10 concurrent users testing out different functions, each of them running a function or sequence of 2-4 functions every minute or so
From what I see of how Appwrite functions relate to the executions table, it's as follows.
Every executions table has only two unique indexes: the `_id`, which is handled by MariaDB, and the `_uid`, which is generated by Appwrite here: https://github.com/appwrite/appwrite/blob/master/app/controllers/api/functions.php#L1083
The only idea I have is that the generation of the ID at row 1080 for two functions occurred at the exact same time (and most random IDs are time based), which is pretty rare and should raise an error.
What I think would be best is to see when it happens again - if it does - then we can run some tests on the data by connecting directly to MariaDB and checking the data integrity.
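For reference, a rough sketch of the kind of checks that could be run directly in MariaDB (assuming the `appwrite` database and the `_1_executions` table discussed above):

USE appwrite;

-- Ask InnoDB to verify the table and its indexes.
CHECK TABLE `_1_executions`;

-- Look for _uid values that appear more than once despite the unique index;
-- a healthy table should return no rows here.
SELECT _uid, COUNT(*) AS occurrences
FROM `_1_executions`
GROUP BY _uid
HAVING COUNT(*) > 1;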
The `appwrite-worker-functions` log is insanely huge, and, actually, was one of the things I thought might be the issue when the bug first arose (i.e., perhaps it had grown too large and was causing read/write issues). The logs have been loading into my terminal for the past 2 minutes or so, and are filled with insane amounts of info that are not really human parsable, a la `"name":"userStatus","enabled":true,"documentSecurity":true,"attributes":[{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"lastActive","type":"integer","status":"available","size":8,"required":true,"default":null,"signed":true,"array":false,"format":"intRange","formatOptions":{"min":-9223372036854775808,"max":9223372036854775807},"filters":[],"options":[],"$id":"1_37_lastActive","$internalId":"550","$createdAt":"2023-01-11T04:36:41.839+00:00","$updatedAt":"2023-06-13T19:59:15.032+00:00","$permissions":[],"min":-9223372036854775808,"max":9223372036854775807,"$collection":"attributes"},{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"devices","type":"string","status":"available","size":666666,"required":false,"default":"{}","signed":true,"array":false,"format":"","formatOptions":[],"filters":[],"options":[],"$id":"1_37_devices","$internalId":"860","$createdAt":"2023-02-18T23:27:33.863+00:00","$updatedAt":"2023-06-13T19:59:16.789+00:00","$permissions":[],"$collection":"attributes"}],"indexes":[{"databaseInternalId":"1","databaseId":"633e0e8aabcea10799de","collectionInternalId":"37","collectionId":"userStatus","key":"lastActive","type":"key","status":"available","attributes":["lastActive"],"lengths":[0],"orders":["DESC"],"$id":"1_37_lastActive","$internalId":"387","$createdAt":"2023-08-02T19:01:04.134+00:00","$updatedAt":"2023-08-02T19:01:04.450+00:00","$permissions":[],"$collection":"indexes"}],"search":"userStatus userStat"`, etc...
mm, very interesting. One potentially useful note: I shut down and restarted the server to confirm the bug a solid 10-15 times (confirming that it was indeed caused by any serverless function execution), so wherever the issue originated from, it seemed to persist until I re-indexed the table.
Yes, that observation looks like the right one.
The best way to tackle a data discrepancy is when we have actual data to run some checks on.
Sounds good. I was definitely relieved to hear your experience of running 1M+ requests with the semi-solution, though. I'll be doing some more stress testing over the coming weeks as I lead into the launch of the platform I'm working on and see if I can reproduce the issue, and I'll try to gather as much information as I can if/when that happens!
Great, will be here
Thanks a ton for your support, though - I highly appreciate it! Have a good one! 😁
You too
Oh, and for any passersby that may read this thread, these were the commands I used to re-index the database:
# Access the SQL command editor
docker-compose exec mariadb sh -c 'exec mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD"'

-- Drop indexes
ALTER TABLE `_1_executions` DROP INDEX `_index1`;
ALTER TABLE `_1_executions` DROP INDEX `_key_function`;
ALTER TABLE `_1_executions` DROP INDEX `_key_trigger`;
ALTER TABLE `_1_executions` DROP INDEX `_key_status`;
ALTER TABLE `_1_executions` DROP INDEX `_key_statusCode`;
ALTER TABLE `_1_executions` DROP INDEX `_key_duration`;
ALTER TABLE `_1_executions` DROP INDEX `_created_at`;
ALTER TABLE `_1_executions` DROP INDEX `_updated_at`;
ALTER TABLE `_1_executions` DROP INDEX `_fulltext_search`;

-- Recreate indexes
ALTER TABLE `_1_executions` ADD UNIQUE KEY `_index1` (`_uid`);
ALTER TABLE `_1_executions` ADD KEY `_key_function` (`functionId`);
ALTER TABLE `_1_executions` ADD KEY `_key_trigger` (`trigger`);
ALTER TABLE `_1_executions` ADD KEY `_key_status` (`status`);
ALTER TABLE `_1_executions` ADD KEY `_key_statusCode` (`statusCode`);
ALTER TABLE `_1_executions` ADD KEY `_key_duration` (`duration`);
ALTER TABLE `_1_executions` ADD KEY `_created_at` (`_createdAt`);
ALTER TABLE `_1_executions` ADD KEY `_updated_at` (`_updatedAt`);
ALTER TABLE `_1_executions` ADD FULLTEXT KEY `_fulltext_search` (`search`);
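And, as a quick read-only sanity check after rebuilding (a sketch, using the same table name as above), you can confirm the indexes all came back:

-- List the indexes now defined on the table.
SHOW INDEX FROM `_1_executions`;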
I wonder if "duplicate key on an FTS auxiliary index table" is a symptom or a cause... I've seen this before, but it was due to either a MariaDB crash or low memory/CPU such that MariaDB wasn't able to keep up with the database inserts 🧐
That was my first guess above, meaning the best way to know whether it was a cause is by examining some data to see the irregularity.