SIMD-0046: Optimistic cluster restart automation

* Add repair and restart proposal.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Add protocol overview and lint changes.

* Change threshold value from 47% to 34%.

* Add introduction, and update default slots to send.

* Remove snapshot generation from the new restart protocol and lint changes.

* Change must have block threshold.

* Update the proposal to reflect changes in discussion.

* Add the wait before restart.

* Change Heaviest selection algorithm.

* Make linter happy.

* Shorten title to make linter happy.

* Add details of messages and change command line.

* Fix typos on numbers.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Make linter happy.

* All messages need to keep flowing before restart.

* A snapshot should be generated first in a restart.

* Use Gossip instead of direct messaging in restart.

* Require 80% of the people receive 80% of Heaviest.

* Add security check and some other changes.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Add some terminologies.

* Rewording a few paragraphs to make things clear.

* Fix a few small sentences.

* Remove .bak file.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Fix a few wordings.

* This proposal is actually proposal 46.

* Make linter happy.

* Fixes.

* Add description of when to enter next step.

* Make linter happy.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Try indent some paragraphs.

* Backtick all new terminologies.

* Make linter happy.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Remove unnecessary paragraph.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Change percent from u8 to u16.

* Make linter happy.

* Remove command line reference.

* Revise the threshold for block repair.

* Make linter happy again.

* Remove 80% reference when we mean RESTART_STAKE_THRESHOLD.

* Rename HeaviestFork to RestartHeaviestFork.

* Rename LastVotedForkSlots to RestartLastVotedForkSlots.

* Change format of examples.

* Change format of the bullet list.

* Change reasoning of 81000 slots.

* Replace silent repair with new name "wen restart".

* Try to make linter happy.

* Make linter happy again.

* Back to the title linter likes.

* Add cluster restart slot to the doc.

* Small fixes.

* Add handling for oscillating info.

* Make linter happy.

* Add epoch boundary handling.

* Add cluster wide threshold calculation across Epoch boundary.

* Update cross epoch stake selection.

* Correct mistake in description.

* Make it clear we are generating incremental snapshot.

* Fix typo

* Add more reasoning about how HeaviestFork is picked.

* Make linter happy.

* Change indent.

* Make linter happy.

* Rework the proof.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: mvines <mvines@gmail.com>

* Explain 81000 slots and issue hard fork before snapshot generation.

* Use a hard limit for must-have blocks and accept new
RestartLastVotedForkSlots.

* Reverse the order of bits to be consistent with EpochSlots.

* Update restart descriptions.

* Update 81k to 64k.

* Update the find heaviest algorithm and proof.

* Update the proof for heaviest fork, we don't need to check stakes.

* Update notations in proof.

* Explain the 42% constant.

* Explain 5% as well.

* Small fixes.

* Update stake calculation when crossing Epoch boundaries.

* Update exit criteria when crossing Epoch boundary.

* Add RestartHeaviestFork round 2.

* Make linter happy.

* Use round 0 and round 1 instead of round 1 and 2.

* Replace previous HeaviestFork stage with a leader based design.

* Update the abstract as well.

* Update wording.

* Update company info.

* Update the exit condition of step 2.

* Clarify step 4.

* Fix typo.

* Rename the leader to coordinator. Add the final HeaviestFork aggregation.

* Fix the correctness proof.

* Clarify that we pick the slot first then replay to get hash.

* Change status to Review

* Some small fixes.

* Fix typo.

* Add proof for the 33% limit.

* Make linter happy.

* Change status to implemented, code complete on both Anza and Firedancer.

---------

Co-authored-by: mvines <mvines@gmail.com>
Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
Wen, 10 months ago
Parent commit: 7dbb6c3438
1 file changed, 429 additions, 0 deletions

proposals/0046-optimistic-cluster-restart-automation.md (+429 −0)

@@ -0,0 +1,429 @@
+---
+simd: '0046'
+title: Optimistic cluster restart automation
+authors:
+  - Wen Xu (Anza)
+category: Standard
+type: Core
+status: Implemented
+created: 2023-04-07
+feature: N/A (gated by command line flag instead)
+development:
+  - Anza - implemented
+  - Firedancer - implemented
+---
+
+## Summary
+
+During a cluster restart following an outage, make validators enter a separate
+recovery protocol that uses Gossip to exchange local status and automatically 
+reach consensus on the block to restart from. Proceed to restart if validators
+in the restart can reach agreement, or print debug information and halt
+otherwise. To distinguish the new restart process from other operations, we
+call the new process "Wen restart".
+
+## New Terminology
+
+* `cluster restart`: When there is an outage such that the whole cluster
+stalls, operators may need to restart most of the validators with a sane
+state so that the cluster can continue to function. This is different from a
+sporadic single-validator restart, which does not impact the cluster. See
+[`cluster restart`](https://docs.solana.com/running-validator/restart-cluster)
+for details.
+
+* `cluster restart slot`: In the current `cluster restart` scheme, operators
+normally decide on one block for all validators to restart from. This is very
+often the highest `optimistically confirmed block`, because an
+`optimistically confirmed block` should never be rolled back. But it is also
+okay to start from a child of the highest `optimistically confirmed block` as
+long as consensus can be reached.
+
+* `optimistically confirmed block`: a block which gets votes from validators
+holding a supermajority of the stake in a cluster (> 2/3 stake). Our
+algorithm tries to guarantee that an optimistically confirmed block will
+never be rolled back.
+
+* `wen restart phase`: During the proposed optimistic `cluster restart`
+automation process, the validators in restart will first spend some time
+exchanging information, repairing missing blocks, and finally reaching
+consensus. The validators only continue normal block production and voting
+after consensus is reached. We call this preparation phase, where block
+production and voting are paused, the `wen restart phase`.
+
+* `wen restart shred version`: Currently we update `shred_version` during a
+`cluster restart`; it is used to verify received shreds and filter Gossip
+peers. In the proposed optimistic `cluster restart` plan, we introduce a new
+temporary shred version in the `wen restart phase` so validators in restart
+don't interfere with those not in restart. Currently this `wen restart shred
+version` is calculated using `(current_shred_version + 1) % 0xffff` (see the
+sketch after this list).
+
+* `RESTART_STAKE_THRESHOLD`: We need enough validators to participate in a
+restart so they can make decisions for the whole cluster. If everything works
+perfectly, we only need 2/3 of the total stake. However, validators could die
+or perform abnormally, so we currently set the `RESTART_STAKE_THRESHOLD` at
+80%, which is the same as what we use now for `--wait-for-supermajority`.
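+
+As a minimal illustration of the shred version derivation above (the helper
+name is hypothetical, not the actual validator API):
+
+```rust
+/// Sketch of the `wen restart shred version` derivation described above.
+fn wen_restart_shred_version(current_shred_version: u16) -> u16 {
+    // Compute in u32 to avoid overflow at u16::MAX, then wrap mod 0xffff.
+    ((current_shred_version as u32 + 1) % 0xffff) as u16
+}
+
+fn main() {
+    assert_eq!(wen_restart_shred_version(4858), 4859);
+    assert_eq!(wen_restart_shred_version(0xfffe), 0); // wraps around
+}
+```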
+
+## Motivation
+
+Currently during a `cluster restart`, validator operators need to decide the
+highest optimistically confirmed slot, then restart the validators with new
+command-line arguments.
+
+The current process involves a lot of human intervention; if people make a
+mistake in deciding the highest optimistically confirmed slot, it is
+detrimental to the viability of the ecosystem.
+
+We aim to automate the negotiation of the highest optimistically confirmed
+slot and the distribution of all blocks on that fork, so that we can lower
+the possibility of human mistakes in the `cluster restart` process. This also
+reduces the burden on validator operators, because they don't have to stay
+around while the validators automatically try to reach consensus; the
+validator will halt and print debug information if anything goes wrong, and
+operators can set up their own monitoring accordingly.
+
+However, there are many ways an automatic restart can go wrong, mostly due to
+unforeseen situations or software bugs. To make things really safe, we apply
+multiple checks during the restart; if any check fails, the automatic restart
+is halted and debugging info printed, waiting for human intervention. That is
+why we call this an optimistic cluster restart procedure.
+
+## Alternatives Considered
+
+### Automatically detect outage and perform `cluster restart`
+
+The reaction time of a human in case of emergency is measured in minutes,
+while a `cluster restart` where humans initiate validator restarts takes
+hours. We considered various approaches to automatically detect an outage and
+perform a `cluster restart`, which could reduce recovery time to minutes or
+even seconds.
+
+However, automatically restarting the whole cluster seems risky: if the
+recovery process itself doesn't work, it might be some time before we can get
+a human's attention, and it doesn't solve the cases where a new binary is
+needed. So for now we still plan to keep humans in the loop.
+
+After we gain more experience with the restart approach in this proposal, we
+may slowly try to make the process more automatic to improve reliability.
+
+### Use Gossip and consensus to figure out restart slot before the restart
+
+The main difference between this and the current restart proposal is that
+this alternative tries to make the cluster automatically enter the restart
+preparation phase without human intervention.
+
+While getting humans out of the loop improves recovery speed, there are
+concerns about recovery Gossip messages interfering with normal Gossip
+messages, and automatically starting a new message in Gossip seems risky.
+
+### Automatically reduce block production in an outage
+
+Right now we have vote-only mode: a validator will only pack vote
+transactions into new blocks if the tower distance (last_vote - local_root)
+is greater than 400 slots.
+
+Unfortunately in previous outages vote-only mode wasn't enough to save the
+cluster. There are proposals for more aggressive block production reduction
+to save the cluster. For example, a leader could produce only one block in
+four consecutive slots allocated to it.
+
+However, this only solves the problem in specific types of outages, and it
+seems risky to aggressively reduce block production, so we are not proceeding
+with this proposal for now.
+
+## Detailed Design
+
+The new protocol tries to make all restarting validators get the same data
+blocks and the same set of last votes, so that they will with high probability
+converge on the same canonical fork and proceed.
+
+When the cluster is in need of a restart, we assume validators holding at
+least the `RESTART_STAKE_THRESHOLD` percentage of stake will enter restart
+mode. Then the following steps will happen:
+
+1. The operator restarts the validator into the `wen restart phase` at boot,
+where it will not make new blocks or vote. The validator propagates its local
+voted fork information to all other validators in restart.
+
+2. While aggregating local vote information from all others in restart, the
+validator repairs all blocks which could potentially have been optimistically
+confirmed.
+
+3. After enough validators are in restart and repair is complete, the
+validator counts votes on each fork and computes its local heaviest fork.
+
+4. A coordinator which is configured on everyone's command line sends out its
+heaviest fork to everyone.
+
+5. Each validator verifies that the coordinator's choice is reasonable:
+
+   1. If yes, proceed and restart
+
+   2. If no, print out what it thinks is wrong, halt and wait for human
+
+See each step explained in detail below.
+
+We assume that at most 5% of the validators in restart can be malicious or
+contain bugs; this number is consistent with other algorithms in the
+consensus protocol. We call these `non-conforming` validators.
+
+### Wen restart phase
+
+1. **Gossip last vote and ancestors on that fork**
+
+   The main goal of this step is to propagate the most recent ancestors on
+   the last voted fork to all others in restart.
+
+   We use a new Gossip message `RestartLastVotedForkSlots`; its fields are:
+
+   * `last_voted_slot`: `u64` the last voted slot; this also serves as the
+   last slot for the bit vector.
+   * `last_voted_hash`: `Hash` the bank hash of the last voted slot.
+   * `ancestors`: `Run-length encoding` compressed bit vector representing
+   the slots on the sender's last voted fork. The least significant bit
+   always represents `last_voted_slot`, the most significant bit
+   `last_voted_slot-65535`.
+
+   The max distance between the oldest ancestor slot and the last voted slot
+   is hard coded at 65535, because 400ms * 65535 = 7.3 hours; we assume that
+   most validator administrators would have noticed an outage within 7
+   hours, and the optimistic confirmation must have halted within 64k slots
+   of the last confirmed block. Also, bit offsets up to 65535 fit nicely
+   into a u16, which makes the encoding more compact. If a validator
+   restarts more than 7 hours past the outage, it cannot join the restart
+   this way. If enough validators fail to restart within 7 hours, we fall
+   back to the manual, interactive `cluster restart` method.
+
+   When a validator enters restart, it uses `wen restart shred version` to
+   avoid interfering with those outside the restart. To be extra cautious, we
+   will also filter out `RestartLastVotedForkSlots` and `RestartHeaviestFork`
+   (described later) in Gossip if a validator is not in `wen restart phase`.
+   There is a slight chance that the `wen restart shred version` would collide
+   with the shred version after the `wen restart phase`, but with the filtering
+   described above it should not be a problem.
+
+   When a validator receives `RestartLastVotedForkSlots` from someone else,
+   it will discard all slots smaller than its local root. Because the local
+   root should be an `optimistically confirmed` slot, there is no need to
+   keep any slot older than the local root.
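+
+   The sketch below shows one possible shape for this message in Rust; the
+   type names and the run-length encoding details are illustrative
+   assumptions, not the validator's actual wire format.
+
+   ```rust
+   /// 32-byte bank hash (stand-in for the validator's `Hash` type).
+   pub type Hash = [u8; 32];
+
+   /// Alternating run lengths of set and unset bits; illustrative only.
+   pub struct RunLengthEncoding(pub Vec<u16>);
+
+   /// Illustrative layout of `RestartLastVotedForkSlots`.
+   pub struct RestartLastVotedForkSlots {
+       /// Slot of the sender's last vote; bit 0 of `ancestors` refers to it.
+       pub last_voted_slot: u64,
+       /// Bank hash of the last voted slot.
+       pub last_voted_hash: Hash,
+       /// Bit i set means slot `last_voted_slot - i` is on the sender's
+       /// last voted fork, for i in 0..=65535.
+       pub ancestors: RunLengthEncoding,
+   }
+   ```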
+
+2. **Repair ledgers up to the restart slot**
+
+   The main goal of this step is to repair all blocks which could potentially
+   be optimistically confirmed.
+
+   We need to prevent false negatives at all costs, because we can't roll
+   back an `optimistically confirmed block`. However, false positives are
+   okay: when we select the heaviest fork in the next step, we will see all
+   the potential candidates for optimistically confirmed slots, and there we
+   can count the votes and remove some false positive cases.
+
+   However, it's also overkill to repair every block presented by others. When
+   `RestartLastVotedForkSlots` messages are being received and aggregated, a
+   validator can categorize blocks missing locally into 2 categories: must-have
+   and ignored.
+
+   We repair all blocks with no less than 42% stake. This number is
+   `67% - 5% - stake_on_validators_not_in_restart`: since we require that at
+   least 80% of the stake joins the restart, any block with less than
+   `67% - 5% - (100 - 80)% = 42%` stake can never have been optimistically
+   confirmed before the restart.
+
+   It's possible that different validators see a different 80%, so their
+   must-have blocks might differ, but there will be another repair round in
+   the final step, so this is fine. Whenever some block reaches 42%, repair
+   can be started, because as more validators join the restart this number
+   will only go up, never down.
+
+   When a validator has received `RestartLastVotedForkSlots` from 80% of the
+   stake, and all those "must-have" blocks are repaired, it can proceed to
+   the next step.
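+
+   A sketch of the must-have bar as percentages of total stake (the constant
+   and function names are ours, not from the implementation):
+
+   ```rust
+   /// Stake share assumed non-conforming, per the proposal.
+   const NON_CONFORMING_PERCENT: f64 = 5.0;
+   /// Optimistic confirmation needs a >2/3 supermajority, written as 67%.
+   const OPTIMISTIC_CONFIRMED_PERCENT: f64 = 67.0;
+
+   /// Minimum observed stake (percent of total) for a block to be
+   /// "must-have": `67% - 5% - stake_on_validators_not_in_restart`.
+   fn must_have_threshold(stake_in_restart_percent: f64) -> f64 {
+       let not_in_restart = 100.0 - stake_in_restart_percent;
+       OPTIMISTIC_CONFIRMED_PERCENT - NON_CONFORMING_PERCENT - not_in_restart
+   }
+
+   fn main() {
+       // With exactly 80% of stake in restart the bar is 42%; it only
+       // rises as more validators join, so early repairs stay valid.
+       assert_eq!(must_have_threshold(80.0), 42.0);
+       assert_eq!(must_have_threshold(90.0), 52.0);
+   }
+   ```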
+
+3. **Calculate heaviest fork**
+
+   After receiving `RestartLastVotedForkSlots` from validators holding more
+   stake than `RESTART_STAKE_THRESHOLD`, and repairing all slots in the
+   "must-have" category, pick the heaviest fork like this:
+
+   1. Calculate the threshold for a block to be on the heaviest fork; the
+   heaviest fork should contain every block with a possibility of being
+   optimistically confirmed. The number is
+   `67% - 5% - stake_on_validators_not_in_restart`.
+
+   For example, if 80% of the validators are in restart, the number would be
+   `67% - 5% - (100-80)% = 42%`. If 90% of the validators are in restart,
+   the number would be `67% - 5% - (100-90)% = 52%`.
+
+   2. Sort all blocks over the threshold by slot number, and verify that they
+   form a single chain. The first block in the list should be the local root.
+
+   If any block does not satisfy the above constraint, print the first
+   offending block and exit.
+
+   The list should not be empty; it should contain at least the local root.
+
+   To see why the above algorithm is safe, we will prove that:
+
+   1. Any block optimistically confirmed before the restart will always be
+   on the list:
+
+   Assume block A is one such block; it would have `67%` stake. Discounting
+   the `5%` non-conforming stake and people not participating in wen
+   restart, it should still show at least
+   `67% - 5% - stake_on_validators_not_in_restart` stake, so it passes the
+   threshold and is on the list.
+
+   2. Any block in the list should only have at most one child in the list:
+
+   Let's use `X` to denote `stake_on_validators_not_in_restart` for brevity.
+   Assume a block has children `A` and `B`, both on the list; the children's
+   combined stake would then be at least `2 * (67% - 5% - X)`. Because we
+   only allow one `RestartLastVotedForkSlots` per pubkey, and the slots in
+   it form a single fork, every validator contributes stake to either `A` or
+   `B`; it's easy to find and filter out violators who claimed both. So the
+   children's combined stake must be less than `100% - X`. For both children
+   to fit we would need `124% - 2 * X <= 100% - X`, i.e. `X >= 24%`, which
+   is impossible when at least 80% of the validators are in restart
+   (`X <= 20%`). So we prove, by contradiction, that any block on the list
+   has at most one child on the list.
+
+   3. If a block not optimistically confirmed before the restart is on the
+   list, it can only be at the end of the list and none of its siblings are
+   on the list.
+
+   Let's say block D is the first block on the list that was not
+   optimistically confirmed; its parent E is confirmed and on the list. We
+   know from the point above that E can have only one child on the list,
+   therefore D must be at the end of the list while its siblings are not on
+   the list.
+
+   Even though the last block D on the list may not be optimistically
+   confirmed, it already has at least `42% - 5% = 37%` honest stake. Say F
+   is its sibling with the most stake; F can only have less than `42%`
+   stake, because it's not on the list. So picking D over F is equivalent to
+   the case where `5%` of the stake switched from fork F to fork D, and 80%
+   of the cluster can switch to fork D if that turns out to be the heaviest
+   fork.
+
+   After picking the appropriate slot, replay the block and all its ancestors
+   to get the bankhash for the picked slot.
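+
+   A sketch of the selection described at the start of this step, assuming
+   per-slot stake and parent links were already aggregated from
+   `RestartLastVotedForkSlots` (the helper shapes are ours):
+
+   ```rust
+   use std::collections::HashMap;
+
+   /// All slots at or above `threshold_percent` must form a single chain
+   /// starting at `local_root`; returns the chain, or the first offending
+   /// slot to print before exiting.
+   fn pick_heaviest_fork(
+       stake_percent_by_slot: &HashMap<u64, f64>,
+       parent_by_slot: &HashMap<u64, u64>,
+       local_root: u64,
+       threshold_percent: f64,
+   ) -> Result<Vec<u64>, u64> {
+       // Keep every block that could have been optimistically confirmed.
+       let mut slots: Vec<u64> = stake_percent_by_slot
+           .iter()
+           .filter(|(_, pct)| **pct >= threshold_percent)
+           .map(|(slot, _)| *slot)
+           .collect();
+       slots.sort_unstable();
+
+       // The list contains at least the local root, which must come first.
+       if slots.first() != Some(&local_root) {
+           return Err(*slots.first().unwrap_or(&local_root));
+       }
+       // Each following block must be the child of the previous one.
+       for pair in slots.windows(2) {
+           if parent_by_slot.get(&pair[1]) != Some(&pair[0]) {
+               return Err(pair[1]);
+           }
+       }
+       Ok(slots)
+   }
+   ```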
+
+4. **Verify the heaviest fork of the coordinator**
+
+   There will be one coordinator specified on everyone's command line. Even
+   though every validator calculates its own heaviest fork in the previous
+   step, only the coordinator's heaviest fork will be checked and optionally
+   accepted by others.
+
+   We use a new Gossip message `RestartHeaviestFork`; its fields are:
+
+   * `slot`: `u64` slot of the picked block.
+   * `hash`: `Hash` bank hash of the picked block.
+
+   After deciding the heaviest block, the coordinator Gossips out
+   `RestartHeaviestFork(X.slot, X.hash)`, where X is the block the
+   coordinator picked locally in the previous step. The coordinator will
+   stay up until manually restarted by its operator.
+
+   Every non-coordinator validator will perform the following actions on the
+   heaviest fork sent by the coordinator:
+
+   1. If the bank selected is missing locally, repair this slot and all slots
+   with higher stake.
+
+   2. Check that the bankhash of the selected slot matches the bankhash
+   computed locally.
+
+   3. Verify that the selected fork contains the local root, and that the
+   local heaviest fork slot is on the same fork as the coordinator's choice.
+
+   If any of the above repairs or checks fails, exit with an error message;
+   the coordinator may have made a mistake and this needs manual
+   intervention.
+
+   When exiting this step, no matter what a non-coordinator validator
+   chooses, it will send a `RestartHeaviestFork` back to the coordinator to
+   report its status. This reporting just makes it easy to aggregate the
+   cluster's status at the coordinator; it has no other effect.
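+
+   A sketch of the message and the non-coordinator checks (the error type
+   and callbacks are our assumptions, not the validator's API):
+
+   ```rust
+   /// Illustrative layout of `RestartHeaviestFork`.
+   pub struct RestartHeaviestFork {
+       pub slot: u64,      // slot of the picked block
+       pub hash: [u8; 32], // bank hash of the picked block
+   }
+
+   pub enum VerifyError {
+       HashMismatch { slot: u64 },
+       MissingLocalRoot { local_root: u64 },
+       DifferentFork { local_heaviest_slot: u64 },
+   }
+
+   /// Run after the chosen slot is repaired and replayed locally.
+   /// `is_on_fork(a, tip)` answers whether `a` is `tip` or its ancestor.
+   pub fn verify_coordinator_choice(
+       choice: &RestartHeaviestFork,
+       local_bank_hash: impl Fn(u64) -> [u8; 32],
+       is_on_fork: impl Fn(u64, u64) -> bool,
+       local_root: u64,
+       local_heaviest_slot: u64,
+   ) -> Result<(), VerifyError> {
+       // The coordinator's bank hash must match our local replay result.
+       if local_bank_hash(choice.slot) != choice.hash {
+           return Err(VerifyError::HashMismatch { slot: choice.slot });
+       }
+       // The chosen fork must contain our local root ...
+       if !is_on_fork(local_root, choice.slot) {
+           return Err(VerifyError::MissingLocalRoot { local_root });
+       }
+       // ... and our own heaviest pick must be on the same fork.
+       if !is_on_fork(local_heaviest_slot, choice.slot)
+           && !is_on_fork(choice.slot, local_heaviest_slot)
+       {
+           return Err(VerifyError::DifferentFork { local_heaviest_slot });
+       }
+       Ok(())
+   }
+   ```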
+
+5. **Generate incremental snapshot and exit**
+
+If the previous step succeeds, the validator immediately adds a hard fork at
+the designated slot and performs `set_root`. Then it starts generating an
+incremental snapshot at the agreed upon `cluster restart slot`. After
+snapshot generation completes, the `--wait-for-supermajority` args with the
+correct shred version, restart slot, and expected bankhash are printed to the
+logs, and a non-coordinator validator then exits with exit code `200` to
+indicate its work is complete.
+
+The coordinator will stay up until restarted by the operator, to make sure
+any latecomers get the `RestartHeaviestFork` message. It also aggregates the
+`RestartHeaviestFork` messages sent by the non-coordinators to report on the
+status of the cluster.
+
+## Impact
+
+This proposal adds a new `wen restart` mode to validators; in this mode the
+validators will not participate in normal cluster activities. Compared to
+today's `cluster restart`, the new mode may use more network bandwidth and
+memory on the restarting validators, but it guarantees the safety of
+optimistically confirmed user transactions, and validator operators don't
+need to manually generate and download snapshots during a `cluster restart`.
+
+## Security Considerations
+
+The two added Gossip messages `RestartLastVotedForkSlots` and
+`RestartHeaviestFork` will only be sent and processed when the validator is
+restarted in `wen restart` mode. So a random validator restarting in the new
+mode will not clutter the Gossip CRDS table of a normal system.
+
+Non-conforming validators could send out wrong `RestartLastVotedForkSlots`
+messages to mess with `cluster restart`s; these should be covered by the
+slashing rules in the future.
+
+### Handling oscillating votes
+
+Non-conforming validators could change their last votes back and forth, which
+could lead to instability in the system. We forbid any change of slot or hash
+in `RestartLastVotedForkSlots` or `RestartHeaviestFork`: everyone sticks with
+the first value received, and discrepancies are recorded in the proto file
+for later slashing.
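+
+A sketch of the first-value-wins rule (the store shape is ours; the real
+implementation records conflicts in the wen restart proto file):
+
+```rust
+use std::collections::HashMap;
+
+type Pubkey = [u8; 32];
+
+/// First-value-wins store for a restart message keyed by sender; `V` is
+/// the (slot, hash) payload of `RestartLastVotedForkSlots` or
+/// `RestartHeaviestFork`.
+struct FirstValueStore<V: PartialEq + Clone> {
+    seen: HashMap<Pubkey, V>,
+    /// Conflicting (sender, first, later) triples kept as evidence.
+    conflicts: Vec<(Pubkey, V, V)>,
+}
+
+impl<V: PartialEq + Clone> FirstValueStore<V> {
+    fn insert(&mut self, from: Pubkey, value: V) {
+        if let Some(first) = self.seen.get(&from) {
+            if *first != value {
+                // Stick with the first value; record the discrepancy.
+                self.conflicts.push((from, first.clone(), value));
+            }
+        } else {
+            self.seen.insert(from, value);
+        }
+    }
+}
+```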
+
+### Handling multiple epochs
+
+Even though it's not very common that an outage happens across an epoch
+boundary, we do need to prepare for this rare case. Because the main purpose
+of `wen restart` is to make everyone reach agreement, the following choices
+are made:
+
+* Every validator only handles 2 epochs; any validator will discard slots
+which belong to an epoch that is more than 1 epoch away from its root. If a
+validator has a root so old that it can't proceed, it will exit and report an
+error. Since we assume an outage will be discovered within 7 hours and one
+epoch is roughly two days, handling 2 epochs should be enough.
+
+* The stake weight of each slot is calculated using the epoch the slot is in.
+Because right now epoch stakes are calculated 1 epoch ahead of time, and we
+only handle 2 epochs, the local root bank should have the epoch stakes for all
+epochs we need.
+
+* When aggregating `RestartLastVotedForkSlots`: for any epoch in which the
+validators voting for at least one slot of that epoch hold at least 33% of
+the epoch's stake, calculate the stake of active validators in that epoch.
+Only exit this stage if every epoch reaching the above bar has > 80% of its
+stake active (see the sketch after this list). This is a bit restrictive, but
+it guarantees that whichever slot we select for the HeaviestFork, we have
+enough validators in the restart. Note that the epoch containing the local
+root is always considered, because the root should have > 33% stake.
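+
+A sketch of this exit condition (the shapes and the thresholds expressed as
+percentages are ours; stake bookkeeping in the implementation may differ):
+
+```rust
+use std::collections::HashMap;
+
+/// Percent-of-epoch-stake tallies built while aggregating
+/// `RestartLastVotedForkSlots`.
+struct EpochStakes {
+    /// Stake whose last voted fork contains a slot in this epoch.
+    voted_percent: f64,
+    /// Stake of validators active in the restart for this epoch.
+    active_percent: f64,
+}
+
+/// Exit the aggregation stage only if every epoch crossing the 33%
+/// voting bar (which always includes the root's epoch) has more than
+/// 80% of its stake active in the restart.
+fn can_exit_aggregation(by_epoch: &HashMap<u64, EpochStakes>) -> bool {
+    by_epoch
+        .values()
+        .filter(|s| s.voted_percent >= 33.0)
+        .all(|s| s.active_percent > 80.0)
+}
+```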
+
+Now we prove this is safe: whenever a slot is optimistically confirmed in the
+new epoch, we will only exit the `RestartLastVotedForkSlots` aggregation
+stage if > 80% of the stake in the new epoch joined the restart:
+
+1. Assume slot `X` is optimistically confirmed in the new epoch; it has >67%
+of the stake in the new epoch.
+
+2. Our stake warmup/cooldown limit is currently at 9%, so at least
+67% - 9% = 58% of that stake was also staked in the old epoch.
+
+3. We always have >80% of the old epoch's stake in restart, so at least
+58% - 20% = 38% of the stake was in restart. Excluding non-conforming
+stake, at least 38% - 5% = 33% should be in the restart, and they
+should at least report that they voted for `X`, which is in the new epoch.
+
+4. According to the above rule we will require >80% stake in the new
+epoch as well.
+
+## Backwards Compatibility
+
+This change is backward compatible with previous versions, because validators
+only enter the new mode when restarted with a new command line argument. All
+current restart arguments like `--wait-for-supermajority` and
+`--expected-bank-hash` will be kept as is.