Provide a detailed summary of the following web content, including what type of content it is (e.g. news article, essay, technical report, blog post, product documentation, content marketing, etc). If the content looks like an error message, respond 'content unavailable'. If there is anything controversial please highlight the controversy. If there is something surprising, unique, or clever, please highlight that as well:

Title: FlexiRaft: Flexible Quorums with Raft [pdf] Site: FlexiRaft: Flexible Quorums with Raft

Ritwik Yadav
Meta Platforms, Inc.
Menlo Park, California, USA

Anirban Rahut
Meta Platforms, Inc.
Menlo Park, California, USA

Abstract

MySQL is the most popular transactional datastore deployed at Meta, with a storage footprint on the order of petabytes. Over the years, several components have undergone significant changes to meet the demands posed by production workloads. One such effort was to redesign the replication protocol to use a modified version of Raft instead of traditional semisynchronous replication. Even though Raft was a good fit for our requirements, the original algorithm did not offer much flexibility in choosing quorums, which is important for latency-sensitive applications. In this paper, we describe our changes to the original Raft algorithm required for supporting flexible data commit quorums. We discuss the impact of these changes on workload performance, fault tolerance and ease of integration into the existing production setup.

CCS Concepts: • Computing methodologies → Distributed algorithms; • Information systems → Remote replication.

Keywords: flexible quorums, consensus, raft, data replication

1 Introduction

MySQL is the transactional datastore of choice for relational workloads at Meta. The scale of MySQL deployment spans petabytes of data [19]. Over the years, many major improvements have been made to the MySQL stack [1] in order to support requirements stemming from serving production workloads.
Some notable examples include the development of middleware such as Binlog Server to efficiently provide in-region fault tolerance, a new storage engine [19] to reduce write amplification and several features to support multi-tenancy. One such effort was to redesign the replication protocol used by a multi-replica MySQL deployment to use Raft [22]. This paper describes the changes we made to Raft for supporting quorum flexibility and the lessons learned from the production deployment of these changes.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution, provided that you attribute the original work to the authors and CIDR 2023. 13th Annual Conference on Innovative Data Systems Research (CIDR ’23). January 8-11, 2023, Amsterdam, The Netherlands.

Every MySQL database in production has replicas in order to provide low latency reads across geographies. This redundancy also helps with fault tolerance. In the steady state, each database has a strong leader responsible for coordinating write operations to the database. Before the adoption of Raft, the consensus mechanism was split between the MySQL server and supporting automation tools. Modifying the code was error-prone because the logic to support leader elections and data commit was spread across multiple bespoke automation tools. Crash recovery, leader election and disaster readiness exercises were all coordinated externally, making it hard to reason about consistency and correctness of the protocol. During region outages (simulated or otherwise), the problem was exacerbated further, and significant manual effort was required to restore availability. In addition, clients used to rely completely on an external system to discover the primary replica serving write operations, which led to scalability challenges.

A redesign of the replication stack was undertaken to consolidate the logic into the MySQL server using a well defined consensus algorithm called Raft. Raft is an easy to understand consensus algorithm which is equivalent to Paxos [14] in fault-tolerance and performance [22]. It has strong leader semantics with clearly defined phases. There are lots of production grade open source implementations of the algorithm [2]. All of these properties made Raft a suitable candidate for implementing the next generation replication stack for our MySQL deployment.

We had to modify the original Raft algorithm to eliminate performance bottlenecks and support configuration parameters which enabled developers to make the necessary tradeoffs for their applications. FlexiRaft is a direct result of these changes and some of its most important contributions are as follows.

• Data commit quorums were made configurable. The addition of flexible quorums enabled developers to make the necessary tradeoffs between latency, throughput and fault tolerance [7]. Leader election quorums get automatically computed from the specified data commit quorum to ensure correctness.
• Support for dynamic quorums was added wherein both the data commit and leader election quorums get reconfigured after every successful election. This option provides low latency commits with enhanced fault tolerance while restricting quorums to a small group of regionally local servers. Knowledge of the previous data commit quorums is inferred from voting history. More details in subsection 4.2.
• Tail commit latencies became independent of the number of replicas in the cluster.
• Automation tools were significantly simplified since the consensus logic was completely incorporated into the MySQL server.

Section 2 provides some definitions to establish common terminology across different replication protocols.
The pre-existing semisynchronous setup and potential solutions for its replacement are discussed in section 3. Section 4 of the paper describes the feature gaps in Raft and stresses the need for flexible quorums. It also lists the choices we provide to our end users when selecting configurable quorums, followed by the amendments to the algorithm to support this flexibility in section 5. Section 6 discusses the fault tolerance guarantees of FlexiRaft, with experimental validation of its performance presented in section 7 and lessons learned from its deployment in section 8. FlexiRaft is compared to other variants of consensus algorithms in section 9, along with a discussion on avenues for further improvement.

2 Common Concepts & Definitions

Some of the terms used in the paper are unique to the deployment of MySQL at Meta. This section provides definitions for these commonly used terms.

2.1 MySQL binlog server

The MySQL binlog server is a special server which stores only the recent binlogs (the write-ahead log for MySQL) rather than a full copy of the database. These special servers were developed at Meta to provide regional commits without incurring the overhead of extra replicas.

2.2 Replica set

A MySQL replica set is a collection of all the replicas (including the primary) and their corresponding binlog servers.

2.3 Group

Members of a replica set are grouped together into multiple disjoint sets based on physical proximity. Each disjoint set forms a group, and physical proximity can be defined as belonging to the same region, the same datacenter, or sharing the same main switchboard (MSB) within a datacenter, etc. These groups are useful in defining quorums.

2.4 Data commit quorum

The data commit quorum is a minimal set of servers in the replica set (including both MySQL and binlog servers) that must acknowledge a transaction before it can be committed.
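
The definitions of group and data commit quorum above can be illustrated with a minimal Python sketch. It is not the FlexiRaft implementation. The server and group names, the helper functions, and in particular the commit rule (a strict majority of the leader's own group) are assumptions chosen for illustration; the paper describes its actual quorum options in later sections.

```python
from typing import Dict, Set


def validate_groups(groups: Dict[str, Set[str]]) -> None:
    """Raise ValueError unless the groups are pairwise disjoint (section 2.3)."""
    seen: Set[str] = set()
    for name, members in groups.items():
        overlap = seen & members
        if overlap:
            raise ValueError(f"group {name!r} shares members: {sorted(overlap)}")
        seen |= members


def has_group_majority(groups: Dict[str, Set[str]], acked: Set[str], group: str) -> bool:
    """Return True if a strict majority of `group` has acknowledged."""
    members = groups[group]
    return len(members & acked) > len(members) // 2


def is_committed(groups: Dict[str, Set[str]], acked: Set[str], leader_group: str) -> bool:
    """Decide whether a transaction may commit.

    ASSUMPTION: the data commit quorum is a majority of the leader's group.
    This rule is illustrative only and is not taken from the text above.
    """
    return has_group_majority(groups, acked, leader_group)


# Hypothetical replica set: one MySQL server and two binlog servers per region.
REPLICA_SET: Dict[str, Set[str]] = {
    "region_a": {"mysql_a1", "binlog_a1", "binlog_a2"},
    "region_b": {"mysql_b1", "binlog_b1", "binlog_b2"},
}

if __name__ == "__main__":
    validate_groups(REPLICA_SET)
    # Two of the three servers in the leader's region have acknowledged.
    print(is_committed(REPLICA_SET, {"mysql_a1", "binlog_a1"}, "region_a"))
```

Under this assumed rule, acknowledgements from servers outside the leader's group do not count toward the quorum, which is what keeps commit latency tied to regionally local servers.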

2.5 Leader election quorum

The leader election quorum is a minimal set of servers in the replica set (including both MySQL and binlog servers) that must accept a candidate server as a leader for it to safely assert leadership over the entire replica set.

2.6 Pessimistic quorum

A pessimistic leader election quorum consist