SIMD-0046: Optimistic cluster restart automation

* Add repair and restart proposal.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Add protocol overview and lint changes.

* Change threshold value from 47% to 34%.

* Add introduction, and update default slots to send.

* Remove snapshot generation from the new restart protocol and lint changes.

* Change must have block threshold.

* Update the proposal to reflect changes in discussion.

* Add the wait before restart.

* Change Heaviest selection algorithm.

* Make linter happy.

* Shorten title to make linter happy.

* Add details of messages and change command line.

* Fix typos on numbers.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Make linter happy.

* All messages need to keep flowing before restart.

* A snapshot should be generated first in a restart.

* Use Gossip instead of direct messaging in restart.

* Require 80% of the people receive 80% of Heaviest.

* Add security check and some other changes.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Add some terminologies.

* Rewording a few paragraphs to make things clear.

* Fix a few small sentences.

* Remove .bak file.

* Update proposals/0024-repair-and-restart.md

Co-authored-by: mvines <mvines@gmail.com>

* Fix a few wordings.

* This proposal is actually proposal 46.

* Make linter happy.

* Fixes.

* Add description of when to enter next step.

* Make linter happy.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Try indent some paragraphs.

* Backtick all new terminologies.

* Make linter happy.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Remove unnecessary paragraph.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>

* Change percent from u8 to u16.

* Make linter happy.

* Remove command line reference.

* Revise the threshold for block repair.

* Make linter happy again.

* Remove 80% reference when we mean RESTART_STAKE_THRESHOLD.

* Rename HeaviestFork to RestartHeaviestFork.

* Rename LastVotedForkSlots to RestartLastVotedForkSlots.

* Change format of examples.

* Change format of the bullet list.

* Change reasoning of 81000 slots.

* Replace silent repair with new name "wen restart".

* Try to make linter happy.

* Make linter happy again.

* Back to the title linter likes.

* Add cluster restart slot to the doc.

* Small fixes.

* Add handling for oscillating info.

* Make linter happy.

* Add epoch boundary handling.

* Add cluster wide threshold calculation across Epoch boundary.

* Update cross epoch stake selection.

* Correct mistake in description.

* Make it clear we are generating incremental snapshot.

* Fix typo

* Add more reasoning about how HeaviestFork is picked.

* Make linter happy.

* Change indent.

* Make linter happy.

* Rework the proof.

* Update proposals/0046-optimistic-cluster-restart-automation.md

Co-authored-by: mvines <mvines@gmail.com>

* Explain 81000 slots and issue hard fork before snapshot generation.

* Use a hard limit for must-have blocks and accept new
RestartLastVotedForkSlots.

* Reverse the order of bits to be consistent with EpochSlots.

* Update restart descriptions.

* Update 81k to 64k.

* Update the find heaviest algorithm and proof.

* Update the proof for heaviest fork, we don't need to check stakes.

* Update notations in proof.

* Explain the 42% constant.

* Explain 5% as well.

* Small fixes.

* Update stake calculation when crossing Epoch boundaries.

* Update exit criteria when crossing Epoch boundary.

* Add RestartHeaviestFork round 2.

* Make linter happy.

* Use round 0 and round 1 instead of round 1 and 2.

* Replace previous HeaviestFork stage with a leader based design.

* Update the abstract as well.

* Update wording.

* Update company info.

* Update the exit condition of step 2.

* Clarify step 4.

* Fix typo.

* Rename the leader to coordinator. Add the final HeaviestFork aggregation.

* Fix the correctness proof.

* Clarify that we pick the slot first then replay to get hash.

* Change status to Review

* Some small fixes.

* Fix typo.

* Add proof for the 33% limit.

* Make linter happy.

* Change status to implemented, code complete on both Anza and Firedancer.

---------

Co-authored-by: mvines <mvines@gmail.com>
Co-authored-by: Trent Nelson <trent.a.b.nelson@gmail.com>
Wen, 10 months ago
Parent commit: 7dbb6c3438
1 file changed, 429 additions, 0 deletions

proposals/0046-optimistic-cluster-restart-automation.md (+429 −0)

@@ -0,0 +1,429 @@
+---
+simd: '0046'
+title: Optimistic cluster restart automation
+authors:
+  - Wen Xu (Anza)
+category: Standard
+type: Core
+status: Implemented
+created: 2023-04-07
+feature: N/A (gated by command line flag instead)
+development:
+  - Anza - implemented
+  - Firedancer - implemented
+---
+
+## Summary
+
+During a cluster restart following an outage, make validators enter a separate
+recovery protocol that uses Gossip to exchange local status and automatically 
+reach consensus on the block to restart from. Proceed to restart if validators
+in the restart can reach agreement, or print debug information and halt
+otherwise. To distinguish the new restart process from other operations, we
+call the new process "Wen restart".
+
+## New Terminology
+
+* `cluster restart`: When there is an outage such that the whole cluster
+stalls, operators may need to restart most of the validators with a sane
+state so that the cluster can continue to function. This is different from a
+sporadic single-validator restart, which does not impact the cluster. See
+[`cluster restart`](https://docs.solana.com/running-validator/restart-cluster)
+for details.
+
+* `cluster restart slot`: In the current `cluster restart` scheme, operators
+normally decide on one block for all validators to restart from. This is very
+often the highest `optimistically confirmed block`, because an
+`optimistically confirmed block` should never be rolled back. But it is also
+okay to start from a child of the highest `optimistically confirmed block` as
+long as consensus can be reached.
+
+* `optimistically confirmed block`: a block which gets votes from validators
+holding a supermajority of the stake in a cluster (> 2/3 stake). Our
+algorithm tries to guarantee that an optimistically confirmed block will
+never be rolled back.
+
+* `wen restart phase`: During the proposed optimistic `cluster restart`
+automation process, the validators in restart will first spend some time
+exchanging information, repairing missing blocks, and finally reaching
+consensus. The validators only continue normal block production and voting
+after consensus is reached. We call this preparation phase, where block
+production and voting are paused, the `wen restart phase`.
+
+* `wen restart shred version`: Currently we update `shred_version` during a
+`cluster restart`; it is used to verify received shreds and filter Gossip
+peers. In the proposed optimistic `cluster restart` plan, we introduce a new
+temporary shred version in the `wen restart phase` so validators in restart
+don't interfere with those not in restart. Currently this `wen restart shred
+version` is calculated using `(current_shred_version + 1) % 0xffff` (see the
+sketch after this list).
+
+* `RESTART_STAKE_THRESHOLD`: We need enough validators to participate in a
+restart so they can make decisions for the whole cluster. If everything works
+perfectly, we only need 2/3 of the total stake. However, validators could die
+or perform abnormally, so we currently set the `RESTART_STAKE_THRESHOLD` at
+80%, which is the same as what we use now for `--wait-for-supermajority`.
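+
+As a minimal illustration of the shred version derivation above (the helper
+name is hypothetical, not the actual validator API):
+
+```rust
+/// Sketch of the `wen restart shred version` derivation described above.
+fn wen_restart_shred_version(current_shred_version: u16) -> u16 {
+    // Compute in u32 to avoid overflow at u16::MAX, then wrap mod 0xffff.
+    ((current_shred_version as u32 + 1) % 0xffff) as u16
+}
+
+fn main() {
+    assert_eq!(wen_restart_shred_version(4858), 4859);
+    assert_eq!(wen_restart_shred_version(0xfffe), 0); // wraps around
+}
+```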
+
+## Motivation
+
+Currently during a `cluster restart`, validator operators need to decide the
+highest optimistically confirmed slot, then restart the validators with new
+command-line arguments.
+
+The current process involves a lot of human intervention; if people make a
+mistake in deciding the highest optimistically confirmed slot, it is
+detrimental to the viability of the ecosystem.
+
+We aim to automate the negotiation of the highest optimistically confirmed
+slot and the distribution of all blocks on that fork, so that we can lower
+the possibility of human mistakes in the `cluster restart` process. This also
+reduces the burden on validator operators, because they don't have to stay
+around while the validators automatically try to reach consensus; the
+validator will halt and print debug information if anything goes wrong, and
+operators can set up their own monitoring accordingly.
+
+However, there are many ways an automatic restart can go wrong, mostly due to
+unforeseen situations or software bugs. To make things really safe, we apply
+multiple checks during the restart; if any check fails, the automatic restart
+is halted and debugging info printed, waiting for human intervention. That is
+why we call this an optimistic cluster restart procedure.
+
+## Alternatives Considered
+
+### Automatically detect outage and perform `cluster restart`
+
+The reaction time of a human in case of emergency is measured in minutes,
+while a `cluster restart` where humans initiate validator restarts takes
+hours. We considered various approaches to automatically detect an outage and
+perform a `cluster restart`, which could reduce recovery time to minutes or
+even seconds.
+
+However, automatically restarting the whole cluster seems risky: if the
+recovery process itself doesn't work, it might be some time before we can get
+a human's attention, and it doesn't solve the cases where a new binary is
+needed. So for now we still plan to keep humans in the loop.
+
+After we gain more experience with the restart approach in this proposal, we
+may slowly try to make the process more automatic to improve reliability.
+
+### Use Gossip and consensus to figure out restart slot before the restart
+
+The main difference between this and the current restart proposal is that
+this alternative tries to make the cluster automatically enter the restart
+preparation phase without human intervention.
+
+While getting humans out of the loop improves recovery speed, there are
+concerns about recovery Gossip messages interfering with normal Gossip
+messages, and automatically starting a new message in Gossip seems risky.
+
+### Automatically reduce block production in an outage
+
+Right now we have vote-only mode: a validator will only pack vote
+transactions into new blocks if the tower distance (last_vote - local_root)
+is greater than 400 slots.
+
+Unfortunately in previous outages vote-only mode wasn't enough to save the
+cluster. There are proposals for more aggressive block production reduction
+to save the cluster. For example, a leader could produce only one block in
+four consecutive slots allocated to it.
+
+However, this only solves the problem in specific types of outages, and it
+seems risky to aggressively reduce block production, so we are not proceeding
+with this proposal for now.
+
+## Detailed Design
+
+The new protocol tries to make all restarting validators get the same data
+blocks and the same set of last votes, so that they will with high probability
+converge on the same canonical fork and proceed.
+
+When the cluster is in need of a restart, we assume validators holding at
+least the `RESTART_STAKE_THRESHOLD` percentage of stake will enter restart
+mode. Then the following steps will happen:
+
+1. The operator restarts the validator into the `wen restart phase` at boot,
+where it will not make new blocks or vote. The validator propagates its local
+voted fork information to all other validators in restart.
+
+2. While aggregating local vote information from all others in restart, the
+validator repairs all blocks which could potentially have been optimistically
+confirmed.
+
+3. After enough validators are in restart and repair is complete, the
+validator counts votes on each fork and computes its local heaviest fork.
+
+4. A coordinator which is configured on everyone's command line sends out its
+heaviest fork to everyone.
+
+5. Each validator verifies that the coordinator's choice is reasonable:
+
+   1. If yes, proceed and restart
+
+   2. If no, print out what it thinks is wrong, halt and wait for human
+
+See each step explained in detail below.
+
+We assume that at most 5% of the validators in restart can be malicious or
+contain bugs; this number is consistent with other algorithms in the
+consensus protocol. We call these `non-conforming` validators.
+
+### Wen restart phase
+
+1. **Gossip last vote and ancestors on that fork**
+
+   The main goal of this step is to propagate the most recent ancestors on
+   the last voted fork to all others in restart.
+
+   We use a new Gossip message `RestartLastVotedForkSlots`; its fields are:
+
+   * `last_voted_slot`: `u64` the last voted slot; this also serves as the
+   last slot for the bit vector.
+   * `last_voted_hash`: `Hash` the bank hash of the last voted slot.
+   * `ancestors`: `Run-length encoding` compressed bit vector representing
+   the slots on the sender's last voted fork. The least significant bit
+   always represents `last_voted_slot`, the most significant bit
+   `last_voted_slot-65535`.
+
+   The max distance between the oldest ancestor slot and the last voted slot
+   is hard coded at 65535, because 400ms * 65535 = 7.3 hours; we assume that
+   most validator administrators would have noticed an outage within 7
+   hours, and the optimistic confirmation must have halted within 64k slots
+   of the last confirmed block. Also, bit offsets up to 65535 fit nicely
+   into a u16, which makes the encoding more compact. If a validator
+   restarts more than 7 hours past the outage, it cannot join the restart
+   this way. If enough validators fail to restart within 7 hours, we fall
+   back to the manual, interactive `cluster restart` method.
+
+   When a validator enters restart, it uses `wen restart shred version` to
+   avoid interfering with those outside the restart. To be extra cautious, we
+   will also filter out `RestartLastVotedForkSlots` and `RestartHeaviestFork`
+   (described later) in Gossip if a validator is not in `wen restart phase`.
+   There is a slight chance that the `wen restart shred version` would collide
+   with the shred version after the `wen restart phase`, but with the filtering
+   described above it should not be a problem.
+
+   When a validator receives `RestartLastVotedForkSlots` from someone else,
+   it will discard all slots smaller than its local root. Because the local
+   root should be an `optimistically confirmed` slot, there is no need to
+   keep any slot older than the local root.
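+
+   The sketch below shows one possible shape for this message in Rust; the
+   type names and the run-length encoding details are illustrative
+   assumptions, not the validator's actual wire format.
+
+   ```rust
+   /// 32-byte bank hash (stand-in for the validator's `Hash` type).
+   pub type Hash = [u8; 32];
+
+   /// Alternating run lengths of set and unset bits; illustrative only.
+   pub struct RunLengthEncoding(pub Vec<u16>);
+
+   /// Illustrative layout of `RestartLastVotedForkSlots`.
+   pub struct RestartLastVotedForkSlots {
+       /// Slot of the sender's last vote; bit 0 of `ancestors` refers to it.
+       pub last_voted_slot: u64,
+       /// Bank hash of the last voted slot.
+       pub last_voted_hash: Hash,
+       /// Bit i set means slot `last_voted_slot - i` is on the sender's
+       /// last voted fork, for i in 0..=65535.
+       pub ancestors: RunLengthEncoding,
+   }
+   ```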
+
+2. **Repair ledgers up to the restart slot**
+
+   The main goal of this step is to repair all blocks which could potentially
+   be optimistically confirmed.
+
+   We need to prevent false negatives at all costs, because we can't roll
+   back an `optimistically confirmed block`. However, false positives are
+   okay: when we select the heaviest fork in the next step, we will see all
+   the potential candidates for optimistically confirmed slots, and there we
+   can count the votes and remove some false positive cases.
+
+   However, it's also overkill to repair every block presented by others. When
+   `RestartLastVotedForkSlots` messages are being received and aggregated, a
+   validator can categorize blocks missing locally into 2 categories: must-have
+   and ignored.
+
+   We repair all blocks with no less than 42% stake. This number is
+   `67% - 5% - stake_on_validators_not_in_restart`: since we require that at
+   least 80% of the stake joins the restart, any block with less than
+   `67% - 5% - (100 - 80)% = 42%` stake can never have been optimistically
+   confirmed before the restart.
+
+   It's possible that different validators see a different 80%, so their
+   must-have blocks might differ, but there will be another repair round in
+   the final step, so this is fine. Whenever some block reaches 42%, repair
+   can be started, because as more validators join the restart this number
+   will only go up, never down.
+
+   When a validator has received `RestartLastVotedForkSlots` from 80% of the
+   stake, and all those "must-have" blocks are repaired, it can proceed to
+   the next step.
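+
+   A sketch of the must-have bar as percentages of total stake (the constant
+   and function names are ours, not from the implementation):
+
+   ```rust
+   /// Stake share assumed non-conforming, per the proposal.
+   const NON_CONFORMING_PERCENT: f64 = 5.0;
+   /// Optimistic confirmation needs a >2/3 supermajority, written as 67%.
+   const OPTIMISTIC_CONFIRMED_PERCENT: f64 = 67.0;
+
+   /// Minimum observed stake (percent of total) for a block to be
+   /// "must-have": `67% - 5% - stake_on_validators_not_in_restart`.
+   fn must_have_threshold(stake_in_restart_percent: f64) -> f64 {
+       let not_in_restart = 100.0 - stake_in_restart_percent;
+       OPTIMISTIC_CONFIRMED_PERCENT - NON_CONFORMING_PERCENT - not_in_restart
+   }
+
+   fn main() {
+       // With exactly 80% of stake in restart the bar is 42%; it only
+       // rises as more validators join, so early repairs stay valid.
+       assert_eq!(must_have_threshold(80.0), 42.0);
+       assert_eq!(must_have_threshold(90.0), 52.0);
+   }
+   ```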
+
+3. **Calculate heaviest fork**
+
+   After receiving `RestartLastVotedForkSlots` from validators holding more
+   stake than `RESTART_STAKE_THRESHOLD`, and repairing all slots in the
+   "must-have" category, pick the heaviest fork like this:
+
+   1. Calculate the threshold for a block to be on the heaviest fork; the
+   heaviest fork should contain every block with a possibility of being
+   optimistically confirmed. The number is
+   `67% - 5% - stake_on_validators_not_in_restart`.
+
+   For example, if 80% of the validators are in restart, the number would be
+   `67% - 5% - (100-80)% = 42%`. If 90% of the validators are in restart,
+   the number would be `67% - 5% - (100-90)% = 52%`.
+
+   2. Sort all blocks over the threshold by slot number, and verify that they
+   form a single chain. The first block in the list should be the local root.
+
+   If any block does not satisfy the above constraint, print the first
+   offending block and exit.
+
+   The list should not be empty; it should contain at least the local root.
+
+   To see why the above algorithm is safe, we will prove that:
+
+   1. Any block optimistically confirmed before the restart will always be
+   on the list:
+
+   Assume block A is one such block; it would have `67%` stake. Discounting
+   the `5%` non-conforming stake and people not participating in wen
+   restart, it should still show at least
+   `67% - 5% - stake_on_validators_not_in_restart` stake, so it passes the
+   threshold and is on the list.
+
+   2. Any block in the list should only have at most one child in the list:
+
+   Let's use `X` to denote `stake_on_validators_not_in_restart` for brevity.
+   Assume a block has children `A` and `B`, both on the list; the children's
+   combined stake would then be at least `2 * (67% - 5% - X)`. Because we
+   only allow one `RestartLastVotedForkSlots` per pubkey, and the slots in
+   it form a single fork, every validator contributes stake to either `A` or
+   `B`; it's easy to find and filter out violators who claimed both. So the
+   children's combined stake must be less than `100% - X`. For both children
+   to fit we would need `124% - 2 * X <= 100% - X`, i.e. `X >= 24%`, which
+   is impossible when at least 80% of the validators are in restart
+   (`X <= 20%`). So we prove, by contradiction, that any block on the list
+   has at most one child on the list.
+
+   3. If a block not optimistically confirmed before the restart is on the
+   list, it can only be at the end of the list and none of its siblings are
+   on the list.
+
+   Let's say block D is the first block on the list that was not
+   optimistically confirmed; its parent E is confirmed and on the list. We
+   know from the point above that E can have only one child on the list,
+   therefore D must be at the end of the list while its siblings are not on
+   the list.
+
+   Even though the last block D on the list may not be optimistically
+   confirmed, it already has at least `42% - 5% = 37%` honest stake. Say F
+   is its sibling with the most stake; F can only have less than `42%`
+   stake, because it's not on the list. So picking D over F is equivalent to
+   the case where `5%` of the stake switched from fork F to fork D, and 80%
+   of the cluster can switch to fork D if that turns out to be the heaviest
+   fork.
+
+   After picking the appropriate slot, replay the block and all its ancestors
+   to get the bankhash for the picked slot.
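+
+   A sketch of the selection described at the start of this step, assuming
+   per-slot stake and parent links were already aggregated from
+   `RestartLastVotedForkSlots` (the helper shapes are ours):
+
+   ```rust
+   use std::collections::HashMap;
+
+   /// All slots at or above `threshold_percent` must form a single chain
+   /// starting at `local_root`; returns the chain, or the first offending
+   /// slot to print before exiting.
+   fn pick_heaviest_fork(
+       stake_percent_by_slot: &HashMap<u64, f64>,
+       parent_by_slot: &HashMap<u64, u64>,
+       local_root: u64,
+       threshold_percent: f64,
+   ) -> Result<Vec<u64>, u64> {
+       // Keep every block that could have been optimistically confirmed.
+       let mut slots: Vec<u64> = stake_percent_by_slot
+           .iter()
+           .filter(|(_, pct)| **pct >= threshold_percent)
+           .map(|(slot, _)| *slot)
+           .collect();
+       slots.sort_unstable();
+
+       // The list contains at least the local root, which must come first.
+       if slots.first() != Some(&local_root) {
+           return Err(*slots.first().unwrap_or(&local_root));
+       }
+       // Each following block must be the child of the previous one.
+       for pair in slots.windows(2) {
+           if parent_by_slot.get(&pair[1]) != Some(&pair[0]) {
+               return Err(pair[1]);
+           }
+       }
+       Ok(slots)
+   }
+   ```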
+
+4. **Verify the heaviest fork of the coordinator**
+
+   There will be one coordinator specified on everyone's command line. Even
+   though every validator calculates its own heaviest fork in the previous
+   step, only the coordinator's heaviest fork will be checked and optionally
+   accepted by others.
+
+   We use a new Gossip message `RestartHeaviestFork`; its fields are:
+
+   * `slot`: `u64` slot of the picked block.
+   * `hash`: `Hash` bank hash of the picked block.
+
+   After deciding the heaviest block, the coordinator Gossips out
+   `RestartHeaviestFork(X.slot, X.hash)`, where X is the block the
+   coordinator picked locally in the previous step. The coordinator will
+   stay up until manually restarted by its operator.
+
+   Every non-coordinator validator will perform the following actions on the
+   heaviest fork sent by the coordinator:
+
+   1. If the bank selected is missing locally, repair this slot and all slots
+   with higher stake.
+
+   2. Check that the bankhash of the selected slot matches the bankhash
+   computed locally.
+
+   3. Verify that the selected fork contains the local root, and that the
+   local heaviest fork slot is on the same fork as the coordinator's choice.
+
+   If any of the above repairs or checks fails, exit with an error message;
+   the coordinator may have made a mistake and this needs manual
+   intervention.
+
+   When exiting this step, no matter what a non-coordinator validator
+   chooses, it will send a `RestartHeaviestFork` back to the coordinator to
+   report its status. This reporting just makes it easy to aggregate the
+   cluster's status at the coordinator; it has no other effect.
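+
+   A sketch of the message and the non-coordinator checks (the error type
+   and callbacks are our assumptions, not the validator's API):
+
+   ```rust
+   /// Illustrative layout of `RestartHeaviestFork`.
+   pub struct RestartHeaviestFork {
+       pub slot: u64,      // slot of the picked block
+       pub hash: [u8; 32], // bank hash of the picked block
+   }
+
+   pub enum VerifyError {
+       HashMismatch { slot: u64 },
+       MissingLocalRoot { local_root: u64 },
+       DifferentFork { local_heaviest_slot: u64 },
+   }
+
+   /// Run after the chosen slot is repaired and replayed locally.
+   /// `is_on_fork(a, tip)` answers whether `a` is `tip` or its ancestor.
+   pub fn verify_coordinator_choice(
+       choice: &RestartHeaviestFork,
+       local_bank_hash: impl Fn(u64) -> [u8; 32],
+       is_on_fork: impl Fn(u64, u64) -> bool,
+       local_root: u64,
+       local_heaviest_slot: u64,
+   ) -> Result<(), VerifyError> {
+       // The coordinator's bank hash must match our local replay result.
+       if local_bank_hash(choice.slot) != choice.hash {
+           return Err(VerifyError::HashMismatch { slot: choice.slot });
+       }
+       // The chosen fork must contain our local root ...
+       if !is_on_fork(local_root, choice.slot) {
+           return Err(VerifyError::MissingLocalRoot { local_root });
+       }
+       // ... and our own heaviest pick must be on the same fork.
+       if !is_on_fork(local_heaviest_slot, choice.slot)
+           && !is_on_fork(choice.slot, local_heaviest_slot)
+       {
+           return Err(VerifyError::DifferentFork { local_heaviest_slot });
+       }
+       Ok(())
+   }
+   ```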
+
+5. **Generate incremental snapshot and exit**
+
+If the previous step succeeds, the validator immediately adds a hard fork at
+the designated slot and performs `set_root`. Then it starts generating an
+incremental snapshot at the agreed upon `cluster restart slot`. After
+snapshot generation completes, the `--wait-for-supermajority` args with the
+correct shred version, restart slot, and expected bankhash are printed to the
+logs, and a non-coordinator validator then exits with exit code `200` to
+indicate its work is complete.
+
+The coordinator will stay up until restarted by the operator, to make sure
+any latecomers get the `RestartHeaviestFork` message. It also aggregates the
+`RestartHeaviestFork` messages sent by the non-coordinators to report on the
+status of the cluster.
+
+## Impact
+
+This proposal adds a new `wen restart` mode to validators; in this mode the
+validators will not participate in normal cluster activities. Compared to
+today's `cluster restart`, the new mode may use more network bandwidth and
+memory on the restarting validators, but it guarantees the safety of
+optimistically confirmed user transactions, and validator operators don't
+need to manually generate and download snapshots during a `cluster restart`.
+
+## Security Considerations
+
+The two added Gossip messages `RestartLastVotedForkSlots` and
+`RestartHeaviestFork` will only be sent and processed when the validator is
+restarted in `wen restart` mode. So a random validator restarting in the new
+mode will not clutter the Gossip CRDS table of a normal system.
+
+Non-conforming validators could send out wrong `RestartLastVotedForkSlots`
+messages to mess with `cluster restart`s; these should be covered by the
+slashing rules in the future.
+
+### Handling oscillating votes
+
+Non-conforming validators could change their last votes back and forth, which
+could lead to instability in the system. We forbid any change of slot or hash
+in `RestartLastVotedForkSlots` or `RestartHeaviestFork`: everyone sticks with
+the first value received, and discrepancies are recorded in the proto file
+for later slashing.
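+
+A sketch of the first-value-wins rule (the store shape is ours; the real
+implementation records conflicts in the wen restart proto file):
+
+```rust
+use std::collections::HashMap;
+
+type Pubkey = [u8; 32];
+
+/// First-value-wins store for a restart message keyed by sender; `V` is
+/// the (slot, hash) payload of `RestartLastVotedForkSlots` or
+/// `RestartHeaviestFork`.
+struct FirstValueStore<V: PartialEq + Clone> {
+    seen: HashMap<Pubkey, V>,
+    /// Conflicting (sender, first, later) triples kept as evidence.
+    conflicts: Vec<(Pubkey, V, V)>,
+}
+
+impl<V: PartialEq + Clone> FirstValueStore<V> {
+    fn insert(&mut self, from: Pubkey, value: V) {
+        if let Some(first) = self.seen.get(&from) {
+            if *first != value {
+                // Stick with the first value; record the discrepancy.
+                self.conflicts.push((from, first.clone(), value));
+            }
+        } else {
+            self.seen.insert(from, value);
+        }
+    }
+}
+```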
+
+### Handling multiple epochs
+
+Even though it's not very common that an outage happens across an epoch
+boundary, we do need to prepare for this rare case. Because the main purpose
+of `wen restart` is to make everyone reach agreement, the following choices
+are made:
+
+* Every validator only handles 2 epochs; any validator will discard slots
+which belong to an epoch that is more than 1 epoch away from its root. If a
+validator has a root so old that it can't proceed, it will exit and report an
+error. Since we assume an outage will be discovered within 7 hours and one
+epoch is roughly two days, handling 2 epochs should be enough.
+
+* The stake weight of each slot is calculated using the epoch the slot is in.
+Because right now epoch stakes are calculated 1 epoch ahead of time, and we
+only handle 2 epochs, the local root bank should have the epoch stakes for all
+epochs we need.
+
+* When aggregating `RestartLastVotedForkSlots`: for any epoch in which the
+validators voting for at least one slot of that epoch hold at least 33% of
+the epoch's stake, calculate the stake of active validators in that epoch.
+Only exit this stage if every epoch reaching the above bar has > 80% of its
+stake active (see the sketch after this list). This is a bit restrictive, but
+it guarantees that whichever slot we select for the HeaviestFork, we have
+enough validators in the restart. Note that the epoch containing the local
+root is always considered, because the root should have > 33% stake.
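+
+A sketch of this exit condition (the shapes and the thresholds expressed as
+percentages are ours; stake bookkeeping in the implementation may differ):
+
+```rust
+use std::collections::HashMap;
+
+/// Percent-of-epoch-stake tallies built while aggregating
+/// `RestartLastVotedForkSlots`.
+struct EpochStakes {
+    /// Stake whose last voted fork contains a slot in this epoch.
+    voted_percent: f64,
+    /// Stake of validators active in the restart for this epoch.
+    active_percent: f64,
+}
+
+/// Exit the aggregation stage only if every epoch crossing the 33%
+/// voting bar (which always includes the root's epoch) has more than
+/// 80% of its stake active in the restart.
+fn can_exit_aggregation(by_epoch: &HashMap<u64, EpochStakes>) -> bool {
+    by_epoch
+        .values()
+        .filter(|s| s.voted_percent >= 33.0)
+        .all(|s| s.active_percent > 80.0)
+}
+```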
+
+Now we prove this is safe: whenever a slot is optimistically confirmed in the
+new epoch, we will only exit the `RestartLastVotedForkSlots` aggregation
+stage if > 80% of the stake in the new epoch joined the restart:
+
+1. Assume slot `X` is optimistically confirmed in the new epoch; it has >67%
+of the stake in the new epoch.
+
+2. Our stake warmup/cooldown limit is currently at 9%, so at least
+67% - 9% = 58% of that stake was also staked in the old epoch.
+
+3. We always have >80% of the old epoch's stake in restart, so at least
+58% - 20% = 38% of the stake was in restart. Excluding non-conforming
+stake, at least 38% - 5% = 33% should be in the restart, and they
+should at least report that they voted for `X`, which is in the new epoch.
+
+4. According to the above rule we will require >80% stake in the new
+epoch as well.
+
+## Backwards Compatibility
+
+This change is backward compatible with previous versions, because validators
+only enter the new mode when restarted with a new command line argument. All
+current restart arguments like `--wait-for-supermajority` and
+`--expected-bank-hash` will be kept as is.