How Flipkart Built a Highly Available MySQL Cluster for 150+ Million Users
Published on: October 08, 2025
Tags: #mysql #high-availability #cluster
High-Availability Model
graph TD subgraph "Clients" App1 App2 App3 end subgraph "MySQL Cluster" Primary[("Primary DB")] Replica1[("Replica DB 1")] Replica2[("Replica DB 2")] end App1 -- Writes --> Primary App2 -- Writes --> Primary App3 -- Writes --> Primary Primary -- "Asynchronous Replication" --> Replica1 Primary -- "Asynchronous Replication" --> Replica2 App1 -- Reads --> Replica1 App2 -- Reads --> Primary App3 -- Reads --> Replica2
Failure Detection Architecture
graph TD subgraph "Database Nodes" Agent1[Agent] --> DB1("MySQL Primary") Agent2[Agent] --> DB2("MySQL Replica") end subgraph "Monitoring System" Monitor1[Monitor] Monitor2[Monitor] Orchestrator Monitor1 <--> Monitor2 end subgraph "ZooKeeper" ZK end Agent1 -- "Health Report (10s)" --> Monitor1 Agent2 -- "Health Report (10s)" --> Monitor2 Monitor1 -- "Writes Status (10s)" --> ZK Monitor2 -- "Writes Status (10s)" --> ZK Monitor1 -- "Suspects Failure" --> Orchestrator Monitor2 -- "Suspects Failure" --> Orchestrator Orchestrator -- "Verifies Failure & Initiates F..." --> MySQL_Cluster(("MySQL Cluster"))
Failover Workflow
sequenceDiagram participant Agent participant Monitor participant Orchestrator participant OldPrimary as Old Primary participant Replica participant DNS Agent->>Monitor: Health Report Monitor->>Orchestrator: Suspect Failure Orchestrator->>OldPrimary: Verify Failure Orchestrator->>Replica: Verify Connectivity alt Failure Confirmed Orchestrator->>Orchestrator: Stop Monitoring Orchestrator->>Replica: Allow Relay Log Catch-up Orchestrator->>OldPrimary: Set to Read-Only (if reachable) Orchestrator->>OldPrimary: Stop Old Primary (if possible) Orchestrator->>Replica: Promote to Primary Orchestrator->>DNS: Update DNS Record end
Split-Brain Scenario
graph TD subgraph "Clients" Client1 Client2 end subgraph "Data Center 1" Primary end subgraph "Data Center 2" Replica Orchestrator end subgraph "Network Partition" P((X)) end %% Solid Links Client1 --> Primary Client2 --> Primary Primary -- "Replication" --> Replica Orchestrator -- "Monitors" --> Primary Orchestrator -- "Monitors" --> Replica %% Dotted Red Links Client2 <-.-> Primary Replica <-.-> P P <-.-> Orchestrator %% Styling for Dotted Links linkStyle 5 stroke:red,stroke-width:4px,stroke-dasharray: 5 5 linkStyle 6 stroke:red,stroke-width:4px,stroke-dasharray: 5 5 linkStyle 7 stroke:red,stroke-width:4px,stroke-dasharray: 5 5
Split-Brain Prevention
sequenceDiagram participant Orchestrator participant ClientApps as Client Applications participant OldPrimary as Old Primary participant Replica participant DNS Orchestrator->>Orchestrator: Pause Failover Orchestrator->>ClientApps: Notify to Stop Writes ClientApps-->>Orchestrator: Writes Stopped Orchestrator->>OldPrimary: Fence/Stop Orchestrator->>Replica: Promote to New Primary Orchestrator->>DNS: Update DNS Orchestrator->>ClientApps: Notify to Restart ClientApps->>DNS: Connect to New Primary
Sources: