Failover Engineering: What 30 Microbursts Taught Us About Active-Active
Market Analysis

Marketing claims about 99.9% uptime describe the steady state. The interesting engineering decision is what happens to in-flight FIX traffic in the microseconds while a primary fails over.

22 May 2026 | Drovix Research Desk | 8 min

A trading engine that has never failed is a trading engine that has never been tested. The interesting engineering decision is not whether the engine will eventually drop a session, lose market data, or restart a process — it will. The decision is what happens to the desk's flow in the seconds and microseconds during which the failure is detected and the alternative path is brought up.

This piece is a partial post-mortem of the failover behaviour Drovix observed in its active-active engine across roughly thirty microbursts over the last two years. Each event lasted between 200 microseconds and 90 seconds. None was catastrophic. They were exactly the kind of routine failure that a high-availability system has to handle within a microsecond budget, without leaking risk to the counterparty. They tested whether the engineering held.

The point of the piece is the engineering. Marketing claims about "99.9% uptime" are uninteresting precisely because they describe the steady state. The interesting question is what happens at the boundary.

Fiber-optic patch cables in deep navy darkness — engineered redundancy

Active-active versus active-passive, in microseconds

An active-passive engine maintains one live instance and one warm standby. The standby is replicating state continuously but does not process live flow; on a primary failure, the standby is promoted, and live flow resumes after the promotion completes. The promotion latency — from the moment the primary failure is detected to the moment the standby is processing live flow — is the engine's user-visible recovery time. On a well-engineered active-passive system, this is typically in the hundreds of milliseconds. On a poorly-engineered one, it is in the seconds.

An active-active engine maintains two or more live instances, each processing a non-overlapping share of the flow, with deterministic shared state replicated synchronously. There is no primary: when an instance fails, a surviving instance assumes the failed instance's share without any "promotion" event, because the state is already current and the routing changes silently in microseconds. There is no user-visible recovery; the failure manifests as a brief uptick in the surviving instance's load.
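A minimal sketch of the routing change described above, in Python. The `Router` class, the shard numbering, and the instance names `A` and `B` are hypothetical and not taken from the Drovix engine. The sketch only illustrates that takeover is a routing-table update rather than a promotion, since the survivor already holds current state.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Maps flow shards to live instances (hypothetical model)."""
    owners: dict[int, str]                      # shard id -> owning instance
    live: set[str] = field(default_factory=set)

    def route(self, shard: int) -> str:
        return self.owners[shard]

    def fail(self, instance: str) -> None:
        """Reassign the failed instance's shards across the survivors."""
        self.live.discard(instance)
        if not self.live:
            raise RuntimeError("no surviving instance")
        survivors = sorted(self.live)
        orphaned = sorted(s for s, o in self.owners.items() if o == instance)
        for i, shard in enumerate(orphaned):
            self.owners[shard] = survivors[i % len(survivors)]


# Two instances, each owning a non-overlapping share of four shards.
router = Router(owners={0: "A", 1: "B", 2: "A", 3: "B"}, live={"A", "B"})
router.fail("B")  # after this call, every shard routes to "A"
```

Nothing in `fail` copies state. That is the assumption doing the work: the model is only valid if replication has already made the survivor's state current.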

Active-active is harder to build. The synchronous state replication has to be lossless within the engine's deterministic-state guarantee. That requires a transaction-log protocol with sub-microsecond confirmation latency, which in turn requires the cross-instance link to be co-located with the instances themselves. It is also more expensive operationally: twice the compute, twice the network, twice the monitoring surface.
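The lossless requirement can be illustrated with a toy model: a message is acknowledged only after the peer's log has confirmed it. This is a sketch under stated assumptions, not the Drovix protocol. `ReplicaLog`, `replicate_then_ack`, and the `peer_up` flag are hypothetical names, and a real implementation would run over a network link with timeouts rather than in-process calls.

```python
class ReplicaLog:
    """Append-only transaction log for one instance (hypothetical)."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def append(self, entry: str) -> int:
        self.entries.append(entry)
        return len(self.entries)  # 1-based sequence number


def replicate_then_ack(entry: str, local: ReplicaLog, peer: ReplicaLog,
                       peer_up: bool = True):
    """Return the sequence number to acknowledge, or None to reject.

    The entry is written to the peer first, so nothing is ever
    acknowledged that the peer has not confirmed.
    """
    if not peer_up:
        return None
    peer_seq = peer.append(entry)
    local_seq = local.append(entry)
    if peer_seq != local_seq:
        raise RuntimeError("logs diverged")
    return local_seq


a, b = ReplicaLog(), ReplicaLog()
acked = replicate_then_ack("NewOrderSingle#1", a, b)
rejected = replicate_then_ack("NewOrderSingle#2", a, b, peer_up=False)
```

Rejecting when the peer is unreachable is one possible policy. It is the conservative choice for this sketch only.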

It is, however, the only architecture that survives a microsecond-budget failure without leaking risk. An active-passive engine that takes 300 ms to promote leaves 300 ms of in-flight FIX traffic in an indeterminate state: some messages were acknowledged by the primary before failure but not yet replicated to the standby. The desk's risk during those 300 ms is genuinely undefined.
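The size of that indeterminate window can be estimated with simple arithmetic. The message rate below is a hypothetical figure chosen for illustration, not a Drovix measurement; the 300 ms promotion latency is the figure used in the paragraph above.

```python
def messages_in_window(rate_per_s: int, window_us: int) -> float:
    """Expected number of messages arriving during a recovery window."""
    return rate_per_s * window_us / 1_000_000


HYPOTHETICAL_RATE_PER_S = 20_000   # assumed FIX message rate, illustration only
PROMOTION_US = 300_000             # 300 ms active-passive promotion

exposed = messages_in_window(HYPOTHETICAL_RATE_PER_S, PROMOTION_US)
print(exposed)  # 6000.0
```

At the assumed rate, a 300 ms promotion leaves on the order of thousands of messages whose acknowledgement status has to be reconciled after the fact.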

Microburst 1: a 1.2-second LP feed gap, August 2024

A bank LP's market-data feed dropped for 1.2 seconds during a normal European afternoon. The cause, as later disclosed, was a routine maintenance event at the LP's NY4 colocation. The LP's feed went from streaming to silent, then resumed with a single-tick gap and continued normally.

On the engine, the gap was detected at the input-stage health check within 4 microseconds of the missing expected message — the input stage runs a tight tick-arrival deadline per LP. The LP was de-weighted in the aggregation immediately. Flow that would have routed to the LP was redirected to the remaining seven connected LPs on the same currency pair. The aggregated top-of-book did not change visibly; the depth at deeper levels widened by approximately 8% over the 1.2-second gap.
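A per-LP tick-arrival deadline with immediate de-weighting might look roughly like the following. This is a sketch, not the engine's input stage: `LpFeed`, `routing_shares`, the deadline value, and the timestamps are all hypothetical, and production code at this latency would not be written in Python.

```python
from dataclasses import dataclass


@dataclass
class LpFeed:
    name: str
    deadline_us: int          # hypothetical tick-arrival deadline for this LP
    last_tick_us: int = 0
    weight: float = 1.0       # routing weight in the aggregation

    def on_tick(self, now_us: int) -> None:
        self.last_tick_us = now_us

    def check(self, now_us: int) -> bool:
        """De-weight the LP as soon as its expected tick is overdue."""
        healthy = now_us - self.last_tick_us <= self.deadline_us
        if not healthy:
            self.weight = 0.0
        return healthy


def routing_shares(feeds: list[LpFeed]) -> dict[str, float]:
    """Normalise weights so flow is spread over the healthy LPs."""
    total = sum(f.weight for f in feeds)
    if total == 0:
        raise RuntimeError("no healthy LP")
    return {f.name: f.weight / total for f in feeds}


feeds = [LpFeed(f"LP{i}", deadline_us=100) for i in range(8)]
for f in feeds:
    f.on_tick(now_us=1_000)
for f in feeds[1:]:            # LP0 goes silent; the other seven keep ticking
    f.on_tick(now_us=1_090)
for f in feeds:
    f.check(now_us=1_150)
shares = routing_shares(feeds)  # LP0 gets 0, the other seven split the flow
```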

When the LP feed resumed, the input-stage health check confirmed that the resumption was consistent with the prevailing mid (it was, within 0.3 bp), and the LP was restored to its full routing weight within one full update cycle — about 8 microseconds. No counterparty saw a visible event. No FIX session was disrupted. The aggregated TCA for the affected counterparties in the 1.2-second window showed slightly wider effective spread, fully attributable to the LP de-weighting, and within the historical envelope for similar-volatility periods.
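The resumption check amounts to comparing the returning LP's mid against the prevailing aggregated mid, in basis points. The sketch below assumes a tolerance of 0.5 bp and example prices; both are hypothetical. The source states only that the observed deviation was within 0.3 bp, not what threshold the engine applies.

```python
def deviation_bp(quote_mid: float, prevailing_mid: float) -> float:
    """Absolute deviation of a quote from the prevailing mid, in basis points."""
    return abs(quote_mid - prevailing_mid) / prevailing_mid * 1e4


def may_restore(quote_mid: float, prevailing_mid: float,
                tolerance_bp: float = 0.5) -> bool:
    """Restore full routing weight only if the resumed feed is consistent."""
    return deviation_bp(quote_mid, prevailing_mid) <= tolerance_bp


prevailing = 1.08500                       # hypothetical EUR/USD mid
consistent = may_restore(1.08503, prevailing)   # roughly 0.28 bp away
stale = may_restore(1.09000, prevailing)        # roughly 46 bp away
```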

Microburst 2: a 90-second matching-engine GC pause, January 2025

One of the two active instances of the matching engine experienced an unexpected garbage-collection pause of 90 seconds. The cause was a deferred memory-arena merge that had grown larger than expected over a high-flow week. The pause was unannounced and uncontrolled.

On the engine, the surviving instance detected the missing heartbeat at the inter-instance health check after 40 microseconds (one missed heartbeat at the 25-kHz heartbeat rate). The surviving instance assumed the paused instance's share of the flow within a further 8 microseconds — total recovery time on the order of 50 microseconds. The paused instance's in-flight FIX traffic was failed over to the survivor's session, with the shared state log ensuring no acknowledged message was lost.
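The timing figures follow directly from the heartbeat rate: at 25 kHz, heartbeats are 40 microseconds apart, so one missed heartbeat is detected after 40 microseconds, and adding the 8-microsecond takeover gives roughly 50 microseconds. A small sketch of that arithmetic and of a missed-heartbeat detector follows; the `HeartbeatMonitor` class and its parameter names are hypothetical.

```python
HEARTBEAT_HZ = 25_000
PERIOD_US = 1_000_000 // HEARTBEAT_HZ   # 40 microseconds between heartbeats
TAKEOVER_US = 8                          # takeover time reported in the text


class HeartbeatMonitor:
    """Flags the peer as failed once a heartbeat is overdue (hypothetical)."""

    def __init__(self, period_us: int, missed_to_fail: int = 1) -> None:
        self.timeout_us = period_us * missed_to_fail
        self.last_beat_us = 0

    def beat(self, now_us: int) -> None:
        self.last_beat_us = now_us

    def peer_failed(self, now_us: int) -> bool:
        return now_us - self.last_beat_us > self.timeout_us


monitor = HeartbeatMonitor(PERIOD_US)
monitor.beat(now_us=0)
recovery_us = PERIOD_US + TAKEOVER_US
print(PERIOD_US, recovery_us)  # 40 48
```

Failing on a single missed heartbeat keeps detection inside one period, at the cost of treating any pause longer than 40 microseconds as a failure.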

The paused instance recovered at the 90-second mark, validated its state log against the survivor's, and rejoined the cluster within a further 15 milliseconds. The counterparty saw no service interruption. The GC pause was root-caused, and the relevant memory-arena merge policy was changed to bound the worst-case pause at 250 microseconds. The change shipped to production three days later.
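One way to validate a rejoining instance's log against the survivor's is to check that the local log is an exact prefix of the survivor's log and then replay the missing tail. The source does not describe how Drovix performs this validation, so the digest scheme and the function names below are assumptions made for illustration.

```python
import hashlib


def log_digest(entries: list[bytes]) -> bytes:
    """Digest over length-prefixed entries, so entry boundaries matter."""
    h = hashlib.sha256()
    for e in entries:
        h.update(len(e).to_bytes(4, "big"))
        h.update(e)
    return h.digest()


def entries_to_replay(local: list[bytes], survivor: list[bytes]) -> list[bytes]:
    """Return what the rejoining instance must replay, or raise on divergence."""
    n = len(local)
    if n > len(survivor) or log_digest(local) != log_digest(survivor[:n]):
        raise ValueError("local log is not a prefix of the survivor's log")
    return survivor[n:]


survivor_log = [b"ack:1", b"ack:2", b"ack:3", b"ack:4"]
paused_log = [b"ack:1", b"ack:2"]          # state at the moment of the pause
tail = entries_to_replay(paused_log, survivor_log)  # the two missing entries
```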

Two teal fiber strands converging — active-active redundancy

Microburst 3: a cross-region network partition, June 2025

The link between the LD4 and FR2 cluster nodes experienced a 40-second packet-loss event, peaking at 18% loss, due to a transit-provider issue downstream of both colocations. The cause was outside Drovix's network.

On the engine, the cross-region replication detected the elevated loss within one replication cycle (~200 microseconds) and switched the cross-region link from synchronous to deferred-asynchronous mode. The deferred mode is conservative: each region continues to accept live flow against its locally replicated state, with a documented and counterparty-visible widening of the effective best-execution boundary on cross-region flow during the partition.
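The mode switch can be pictured as a per-cycle decision on observed loss. In the sketch below the loss threshold, the class name `CrossRegionLink`, and the packet counts are hypothetical. The only behaviour taken from the text is that elevated loss moves the link to deferred-asynchronous mode and a clean cycle returns it to synchronous mode.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "synchronous"
    DEFERRED_ASYNC = "deferred-asynchronous"


class CrossRegionLink:
    """Chooses the replication mode once per replication cycle (hypothetical)."""

    def __init__(self, loss_threshold: float = 0.01) -> None:
        self.loss_threshold = loss_threshold   # assumed value, illustration only
        self.mode = Mode.SYNC

    def on_cycle(self, sent: int, acked: int) -> Mode:
        loss = 0.0 if sent == 0 else 1 - acked / sent
        if loss > self.loss_threshold:
            self.mode = Mode.DEFERRED_ASYNC
        else:
            self.mode = Mode.SYNC              # one clean cycle restores sync
        return self.mode


link = CrossRegionLink()
during_partition = link.on_cycle(sent=100, acked=82)   # 18% loss, the peak in the text
after_partition = link.on_cycle(sent=100, acked=100)
```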

Counterparties whose flow was confined to a single region saw no degradation. Counterparties whose flow naturally crossed regions saw effective spread widen by approximately 0.4 pips on the EUR/USD majors during the partition, fully attributable to the deferred-asynchronous mode and disclosed in the day's TCA report. When the partition resolved, the cross-region link returned to synchronous mode within one cycle of cleared traffic, and the day's full reconciliation confirmed zero lost transactions.

What the microbursts taught us

Three things. First: detection is the bottleneck. A failure that the engine sees in microseconds is a failure that can be handled without leaking risk. A failure that the engine sees in milliseconds is a failure that has already affected counterparties before the response runs. Every health check on the engine runs within the input stage's microsecond budget; this is not negotiable.

Second: state is the constraint. An engine that cannot replicate state losslessly at sub-microsecond latency cannot do active-active. The cost of building such a state-replication layer is the engineering cost of running active-active at all. Every active-active claim in the institutional market should be tested against this constraint.

Third: disclosure is the discipline. Every counterparty whose flow was touched by any of the three microbursts described above received the relevant detail in the next day's TCA report. The point of microsecond-budget failover is not to make the failure invisible — it is to make the failure auditable, attributable, and small. When something does affect a counterparty, the counterparty hears about it from us before anyone else.

The structural relationship between failover quality and counterparty diligence sits inside counterparty concentration risk; the engineering of synchronous state replication is the same engineering that supports sub-millisecond queue-position decisions.

Analyst Desk

Drovix Research Desk

Institutional Research

Drovix Research Desk publishes institutional-grade analysis covering macro events, cross-asset correlations, and execution insights for professional market participants.
