The 4 Classes of Faults on Mainnet
Why Tendermint chains can halt and when it’s considered normal
Most existing blockchains favor liveness, that is, high availability. Research has shown, however, that this “longest-chain” model isn’t safe when applied to Proof-of-Stake (PoS). Tendermint-based chains (i.e. BFT-based chains) make a different tradeoff, favoring safety over liveness. This means that BFT-based chains are expected to halt if network conditions become overly asynchronous, if there is a network partition, or if a sufficient number of validators go offline. Furthermore, the Cosmos Hub state machine has been designed, at least in these early stages, to halt preemptively if it detects a possible error. This design decision helps ensure that the PoS chain maintains safety. If you’ve been following along with the Cosmos testnet program, you’ve seen this kind of thing happen many times.
Of course, a network halt doesn’t necessarily mean that a crisis is happening. Below, we outline the four classes of faults that could occur on mainnet, some common scenarios where the network could halt, and general mitigation procedures for each failure mode.
Blockchain Failure Modes
- Liveness fault (network halts)
- Safety fault (blockchain forks)
- Censorship fault (data withholding)
- Hard fork failure (invalid state transitions)
Liveness Fault (Network Halts)
+⅓ of the bonded stake goes offline at once
What: When +⅓ of the voting power drops offline, a Tendermint chain will stop making progress. To start making progress again, the network will need to wait for that +⅓ of voting power to come back online. If those validators don’t come back online, they may need to be forked out via manual intervention from the community.
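The halting condition comes down to simple arithmetic: a Tendermint chain commits a block only when strictly more than ⅔ of the voting power signs it. The sketch below illustrates that rule only; the function name, validator names, and voting powers are hypothetical and are not taken from Tendermint's code.

```python
def can_make_progress(online_power: int, total_power: int) -> bool:
    """Return True if strictly more than 2/3 of the voting power is online."""
    if total_power <= 0:
        raise ValueError("total_power must be positive")
    # Integer comparison avoids floating-point rounding at the 2/3 boundary.
    return 3 * online_power > 2 * total_power


# Hypothetical validator set: name -> voting power.
validators = {"val-a": 40, "val-b": 30, "val-c": 20, "val-d": 10}
total = sum(validators.values())

# val-a drops offline, leaving 60 of 100 voting power online.
online = total - validators["val-a"]
print(can_make_progress(online, total))
```

With 40% of the voting power offline, the remaining 60% is below the commit threshold, so the chain stops until enough power returns.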
Examples where +⅓ of the bonded stake went offline in past testnets:
- Tendermint Core (TMC) Issue #3089: Tangential issue caused by +⅓ going offline
- Liveness failure when validator had +⅔ voting power: Game of Stakes
Invariant checker triggered
What: There are four invariants that are checked at runtime for the Cosmos Hub daemon gaiad: NonnegativeBalanceInvariant, NonnegativeOutstandingInvariant, SupplyInvariants, and NonnegativePowerInvariant. If any one of these invariants is violated, the sanity check triggers an automatic shutdown on all nodes running gaiad. In this scenario, the bugs that caused the crash will need to be fixed. Then the state of the old chain can be dumped and a new chain restarted from it.
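To make the idea of a runtime invariant check concrete, here is a minimal sketch in Python. It is not gaiad's implementation (gaiad is written in Go). The state layout, function names, and exception type are all assumptions for illustration, and only simplified analogues of three of the four invariants are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyState:
    """Hypothetical, heavily simplified stand-in for application state."""
    balances: Dict[str, int]   # account -> coins held
    powers: Dict[str, int]     # validator -> voting power
    total_supply: int          # coins the chain believes exist


class InvariantBroken(Exception):
    """Raised when a sanity check fails; a real node would halt here."""


def nonnegative_balance(state: ToyState) -> Optional[str]:
    for account, balance in state.balances.items():
        if balance < 0:
            return f"account {account} has negative balance {balance}"
    return None


def nonnegative_power(state: ToyState) -> Optional[str]:
    for validator, power in state.powers.items():
        if power < 0:
            return f"validator {validator} has negative power {power}"
    return None


def supply_consistent(state: ToyState) -> Optional[str]:
    held = sum(state.balances.values())
    if held != state.total_supply:
        return f"balances sum to {held}, expected {state.total_supply}"
    return None


INVARIANTS: List[Callable[[ToyState], Optional[str]]] = [
    nonnegative_balance,
    nonnegative_power,
    supply_consistent,
]


def assert_invariants(state: ToyState) -> None:
    """Run every check and stop at the first violation."""
    for check in INVARIANTS:
        problem = check(state)
        if problem is not None:
            raise InvariantBroken(f"{check.__name__}: {problem}")


healthy = ToyState(balances={"alice": 70, "bob": 30},
                   powers={"val-a": 10}, total_supply=100)
assert_invariants(healthy)  # passes silently

buggy = ToyState(balances={"alice": 120, "bob": -20},
                 powers={"val-a": 10}, total_supply=100)
try:
    assert_invariants(buggy)
except InvariantBroken as err:
    print(f"halting: {err}")
```

Because every node runs the same deterministic state machine, a violation detected on one node would be detected on all of them, which is why a broken invariant halts the whole network and not just a single machine.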
Examples where the invariant checker was triggered in past testnets:
- Cosmos SDK Issue #1197: panicked after the wrong key was computed and subsequently left a record that should’ve gotten removed
- Cosmos SDK Issue #3019: Gaia-9002 crash log
- Gaia-7003 Postmortem
Liveness failure from bug
What: There could be a bug in Tendermint Core that, for instance, causes certain messages not to be sent or received, leaving the chain stuck in an abnormal consensus state where it cannot make progress even though it should be able to (i.e. +⅔ of the network is still online). This kind of thing has happened a few times on testnets and has been addressed. If it happens again, it will require a fix to be released and nodes to update their software. Tendermint bugs won’t necessarily require a dump-state-and-import restart of the blockchain, though in rare cases they may.
Examples of Tendermint halts in past testnets:
- TMC Issue #3302
- TMC Issue #3199
- TMC Issue #3003
- Tendermint v0.28.1 that fixed a consensus halt from proposing blocks with too much evidence
Safety Faults (Blockchain Forks)
There are three scenarios in which a Tendermint chain could fork. Tendermint chains aren’t designed to fork, so if a fork occurs, we have a safety fault and extra-protocol manual intervention is required.
+⅓ of the bonded stake equivocates
What: In the event that more than ⅓ of the voting power becomes Byzantine (malicious) and decides to equivocate (double sign), we would need to kick into manual recovery mode. Because this fault is attributable, validators should broadcast all the votes that they’ve seen. To share those votes, they need to retain the consensus write-ahead log (WAL) file. When the blockchain forks, the consensus WAL file may be required to join the new reorganized chain (re-org), and validators who can’t provide their WAL files may be partially slashed. [Note, however, that if ≤ ⅔ of the bonded stake is Byzantine, they cannot get arbitrary changes to state committed.] Once an analysis is complete, and provided there is evidence of which keys double signed, the rogue validators would get forked out and the network would then restart.
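Equivocation is attributable because two signed votes from the same validator for the same height, round, and vote type, but for different blocks, are proof in themselves. The following sketch shows how such conflicts could be picked out of a pile of collected votes. The `Vote` record and its field names are hypothetical simplifications and do not match Tendermint's actual vote format or WAL encoding.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Vote(NamedTuple):
    """Hypothetical, simplified vote record (signatures omitted)."""
    validator: str
    height: int
    round: int
    vote_type: str   # "prevote" or "precommit"
    block_hash: str


def find_equivocators(votes: List[Vote]) -> Set[str]:
    """Return validators that voted for more than one block in the same slot."""
    seen: Dict[Tuple[str, int, int, str], Set[str]] = defaultdict(set)
    for v in votes:
        seen[(v.validator, v.height, v.round, v.vote_type)].add(v.block_hash)
    # More than one distinct block hash in a single slot means a double sign.
    return {slot[0] for slot, hashes in seen.items() if len(hashes) > 1}


votes = [
    Vote("val-a", 10, 0, "precommit", "AAA"),
    Vote("val-a", 10, 0, "precommit", "BBB"),  # conflicts with the vote above
    Vote("val-b", 10, 0, "precommit", "AAA"),
    Vote("val-b", 10, 1, "precommit", "BBB"),  # different round, not a conflict
]
print(find_equivocators(votes))
```

Only `val-a` is flagged here. `val-b` voted for different blocks in different rounds, which the consensus protocol can allow, so this sketch does not treat it as evidence.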
+⅔ of the bonded stake equivocates
What: This is the worst possible situation to be in. If we see this happen on mainnet, then the chain should be considered completely corrupted. To illustrate, let’s say that there are four “generals” (validators). If 3 out of the 4 (+⅔ voting power) are colluding, then they control the source of truth, as they can lie to the last honest general and give him whatever invalid information they want.
As in the “+⅓ of the bonded stake equivocates” scenario above, if 2 out of the 4 generals (+⅓ voting power) are colluding, then they could achieve a similar outcome by giving the other 2 generals conflicting information, causing the chain to fork into two chains (two variations of truth).
However, unlike the “+⅓ of the bonded stake equivocates” scenario, with +⅔ this fault is non-attributable in that we’re facing arbitrary state changes, e.g. changing truth. The path to recovery could follow much the same protocol as in the “+⅓ of the bonded stake equivocates” scenario, slashing those who are responsible for equivocation via a chain re-org. Whether recovery is possible or not would depend largely on “subjectivity” on the extra-protocol social layer, which is extremely suboptimal and should be avoided.
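The four-generals illustration can be checked with a few lines of arithmetic. The sketch below assumes equal voting power per general and a commit rule of strictly more than ⅔ of total power. The function names are hypothetical, and the model ignores timing and network details entirely.

```python
def is_quorum(power: int, total: int) -> bool:
    """Strictly more than 2/3 of total voting power."""
    return 3 * power > 2 * total


def can_fork(byzantine: int, honest_a: int, honest_b: int) -> bool:
    """Byzantine validators sign on both sides of a split between honest groups.

    A fork needs a quorum on each side, and honest validators only sign once.
    """
    total = byzantine + honest_a + honest_b
    return (is_quorum(byzantine + honest_a, total)
            and is_quorum(byzantine + honest_b, total))


def controls_chain(byzantine: int, total: int) -> bool:
    """With a quorum of their own, colluders can commit blocks unaided."""
    return is_quorum(byzantine, total)


# Four generals with one unit of voting power each.
print(can_fork(byzantine=2, honest_a=1, honest_b=1))  # 2 of 4 collude
print(can_fork(byzantine=1, honest_a=2, honest_b=1))  # only 1 of 4 colludes
print(controls_chain(byzantine=3, total=4))           # 3 of 4 collude
print(controls_chain(byzantine=2, total=4))           # 2 of 4 collude
```

Two colluding generals can assemble three signatures on each side of the split, which is enough to commit two conflicting blocks, but they cannot commit anything without at least one honest signature. Three colluding generals need no honest signatures at all.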
Safety failure from bug
What: There could be a bug in Tendermint Core that causes two blocks to be committed at the same block height, or there could be a bug in Cosmos SDK that causes different state roots to be computed for the same block. In this scenario, the bug(s) must be patched and the chain rolled back to a previous block height.
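One way such a bug might first show up is as nodes reporting different state roots for the same block height. The sketch below groups nodes by the root they report. The node names and hash strings are made up, and how the reports are gathered (for example, by polling each node) is left out as an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_state_root(reports: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each reported state root to the sorted list of nodes reporting it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, root in sorted(reports.items()):
        groups[root].append(node)
    return dict(groups)


def has_diverged(reports: Dict[str, str]) -> bool:
    """True if nodes disagree about the state root at the same height."""
    return len(set(reports.values())) > 1


# Hypothetical state roots reported by three nodes for one block height.
reports_at_height = {"node-1": "0xabc", "node-2": "0xabc", "node-3": "0xdef"}

if has_diverged(reports_at_height):
    for root, nodes in group_by_state_root(reports_at_height).items():
        print(root, nodes)
```

Knowing which nodes ended up in which group can help narrow down whether the divergence tracks a software version, a platform, or a particular transaction.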
Censorship Fault
What: When a single validator or a cartel of validators accumulates more than ⅓ of the voting power, we might start to see censorship attacks on the chain. Unlike equivocation attacks, which are attributable, censorship is hard to attribute: Byzantine validators could collude to prevent certain proposals and/or precommits from getting committed. Manual intervention may be needed to mitigate further harm to the chain.
Hard Fork Failures (Invalid State Transitions)
Hard fork upgrades
What: Somewhere between Phases I and III of mainnet, hard forks are expected to be introduced and coordinated through governance as the codebase matures. Such hard forks may include breaking changes to the state machine and Tendermint to provide technical upgrades.
Hard fork upgrades are technically considered invalid state transitions, but due to the “benign” nature of these upgrades, they are considered “consensual” hard forks.
The Takeaway
If any of these scenarios manifest on the Cosmos Hub mainnet, updates and actionable recourse will be communicated on our official communications channels:
- Cosmos Network (twitter.com/cosmos)
- Cosmos GitHub (github.com/cosmos)
- Cosmos Blog (blog.cosmos.network)
Please be aware that the Cosmos forum, Riot chat groups, and Telegram group should not be treated as sources of official news from Cosmos.
That said, responsibility for the network will ultimately rest with the larger community of operators, developers, and users. Please exercise extreme diligence and caution!
Further safety precautions are outlined in detail in this blog post.