A RAID failure has taken the Matrix.org homeserver offline, leaving users of the decentralized messaging service unable to send or receive messages while engineers attempt a 55 TB database restore.
To be clear, those running their own homeservers, such as government organizations, are unaffected. Anyone using Matrix.org as their homeserver, however, has been hearing the sound of silence from the platform while the team works to bring the service back online.
Problems began at 1117 UTC on September 2, when the secondary Matrix.org database lost its file system due to a RAID failure. The primary fell over at 1726 UTC, and a few minutes later, the organization admitted that things were indeed not very healthy.
The Matrix.org homeserver is backed by a large PostgreSQL database, which caused the organization grief in July when a long-gestating corruption of part of a table index caused issues with “rooms” in the system. The result was that attempts to join rooms would fail, messages wouldn’t send, and occasional cryptic error messages would appear.
The team was understandably a little cautious when restoring the database and eventually reported: “We haven’t been able to restore the DB primary filesystem to a state we’re confident in running as a primary (especially given our experiences with slow-burning postgres db corruption).”
The solution is a full 55 TB database snapshot restore followed by a replay of 17 hours’ worth of traffic. At the time of writing, the team had managed to restore the snapshot and subsequent incremental backups and was about to embark on the traffic replay.
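For readers unfamiliar with the approach, the general pattern of "restore a snapshot, apply incremental backups, then replay later changes" can be sketched in a few lines. This is a toy illustration only: it is not PostgreSQL's recovery mechanism and not the Matrix.org team's tooling, and every name in it (`Change`, `restore`, the sequence numbers) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    seq: int    # position in a hypothetical, ordered change log
    key: str
    value: str


def restore(snapshot, snapshot_seq, incrementals, log):
    """Rebuild state from a base snapshot, incremental backups, then logged changes.

    snapshot      -- dict holding the state captured at snapshot_seq
    incrementals  -- list of (seq, delta_dict) backups taken after the snapshot
    log           -- list of Change records covering the period to replay
    """
    state = dict(snapshot)
    applied = snapshot_seq

    # Apply incremental backups in order, skipping any already covered.
    for inc_seq, delta in sorted(incrementals, key=lambda item: item[0]):
        if inc_seq > applied:
            state.update(delta)
            applied = inc_seq

    # Replay only the changes newer than the last backup applied.
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq > applied:
            state[change.key] = change.value
            applied = change.seq

    return state, applied
```

The point of tracking `applied` is that anything already captured by a backup is skipped during replay, so only the gap between the last good backup and the failure has to be reprocessed.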
Neil Johnson, chief engineering officer at Element, a messaging platform by the creators of Matrix, told The Register the trouble started with a routine storage upgrade exercise that went badly wrong. “A whole series of things happened at exactly the wrong time in unison, which then led to the situation that we see,” he said.
It’s not a great look for the organization, as users who rely on the Matrix.org homeserver can’t access it. Messages sent to Matrix.org users will be queued until the service is back up and running. “There’s not going to be any data loss. Eventually your message will get through,” Johnson said.
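The queue-until-it-returns behavior Johnson describes can be pictured with a minimal sketch. It assumes a sending server holds outbound messages in order and retries them later; it is not the code of any actual Matrix homeserver, and `OutboundQueue` and the `send` callback are hypothetical names. A real server would also need persistence and retry backoff, which are omitted here.

```python
from collections import deque


class OutboundQueue:
    """Holds messages for an unreachable destination and retries them in order."""

    def __init__(self, send):
        # send is a hypothetical callable: send(message) -> True if delivered.
        self._send = send
        self._pending = deque()

    def enqueue(self, message):
        self._pending.append(message)

    def flush(self):
        """Try to deliver pending messages; stop at the first failure."""
        delivered = []
        while self._pending:
            message = self._pending[0]
            if not self._send(message):
                break  # destination still down: keep the order, retry later
            delivered.append(self._pending.popleft())
        return delivered

    def __len__(self):
        return len(self._pending)
```

While the destination is down, `flush()` delivers nothing and the queue keeps growing; once `send` starts succeeding, the backlog drains in the order it was written.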
There is no charge for using Matrix.org, and no service level agreement either.
The incident demonstrates the benefits of a decentralized system. Users with their own homeservers aren’t affected, nor are organizations such as Element, which have customer deployments that utilize the underlying technology.
One homeserver going down does not affect the rest, even one as visible as Matrix.org.
Matrix has become increasingly important in recent years as public and private sector organizations seek to reduce their dependency on centralized messaging services that might not meet sovereignty or privacy requirements. The Matrix.org outage, while embarrassing, serves to highlight that a decentralized approach can protect users from whoopsies on the part of those who run the service. ®