The ./net/ directory in the monorepo contains scripts useful for creation and manipulation of a test network.
The test network allows you to run a fully isolated set of validators and clients on a configurable hardware setup.
It's intended to be both dev and CD friendly.
The test networks can run in GCP, AWS or colo. Whichever cloud provider you choose, you will need the credentials set up on your machine.
For GCP, you will need a working `gcloud` command from the Google Cloud SDK; if you do not have it, follow the guide at https://cloud.google.com/sdk?hl=en

Before running any scripts, authenticate with:

```bash
$ gcloud auth login
```
If you are running the scripts on a headless machine, you can use curl to issue requests to confirm your auth.
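For instance, a minimal sketch of a headless login and check (the tokeninfo endpoint shown is just one way to confirm the token works):

```bash
# Complete the login without a local browser
gcloud auth login --no-launch-browser

# Confirm the credentials work by asking Google to validate the access token
curl -s "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=$(gcloud auth print-access-token)"
```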
If you are doing this for the first time, you might need to set up the project:

```bash
gcloud config set project principal-lane-200702
```
For AWS, obtain your credentials from the AWS IAM Console and configure the AWS CLI with:

```bash
$ aws configure
```

More information on AWS CLI configuration can be found in the AWS documentation.
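As a quick sanity check that the credentials are picked up, you can ask the CLI which identity it is using:

```bash
# Prints the account ID, user ID and ARN of the configured credentials
aws sts get-caller-identity
```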
Metrics collection relies on two environment variables that are patched to the remote nodes by net.sh:

* `RUST_LOG` to enable metrics reporting in principle
* `SOLANA_METRICS_CONFIG` to tell agave where to log the metrics

> [!NOTE]
> Anza employees should follow the guide in Notion to set up the InfluxDB account.

To configure a metrics database you will need:

* `${host}`: the host name of the InfluxDB instance you can access, for example `https://internal-metrics.solana.com:8086`
* `${user}`: the name of an InfluxDB user account with enough rights to create a new InfluxDB database, for example `solana`

You will normally only need to do this once. Once this is done, you will be able to save the metrics configuration and load it later from the environment. To create the database, run:

```bash
./init-metrics.sh -c testnet-dev-${user} ${user}
```

This updates the `net/config/config` file with a line like the following:

```bash
export SOLANA_METRICS_CONFIG="host=${host},db=testnet-dev-${user},u=${user},p=some_secret"
```

* You can keep this value in your shell environment and use `./init-metrics.sh -e` to quickly load it, or run `./init-metrics.sh` with appropriate arguments every time you set up a new cluster.
* Keeping `SOLANA_METRICS_CONFIG` in your env is appropriate, but you may want to use different databases for different runs of net.sh.
* Alternatively, set `SOLANA_METRICS_CONFIG` directly in the `./net/config/config` file.

By default, metrics are only logged by agave if `RUST_LOG` is set to `info` or higher. You can provide it as environment for the `./net.sh start` command, or set it in your shell environment, for example:

```bash
RUST_LOG="info,solana_runtime=debug"
```

Note: this only works if you store `SOLANA_METRICS_CONFIG` in your shell environment.
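For example, if you want the configuration to persist across shells, you could export it from your shell profile; the values below are placeholders following the format shown above:

```bash
# e.g. in ~/.bashrc or ~/.profile; replace host, db, user and password with your own
export RUST_LOG=info
export SOLANA_METRICS_CONFIG="host=https://internal-metrics.solana.com:8086,db=testnet-dev-${USER},u=${USER},p=some_secret"
```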
You can check the active metrics settings and write a test datapoint using the helper scripts in `./scripts/` (sourcing `configure-metrics.sh` yields `INFLUX_*` values like the ones shown):

```bash
cd ./scripts/
source ./configure-metrics.sh
INFLUX_HOST=https://internal-metrics.solana.com:8086
INFLUX_DATABASE=testnet-dev-solana
INFLUX_USERNAME=solana
INFLUX_PASSWORD=********
./metrics-write-datapoint.sh "testnet-deploy net-create-begin=1"
```
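If you want to confirm the datapoint actually landed, you can query InfluxDB directly over its HTTP API. This is only a sketch: it assumes an InfluxDB 1.x endpoint like the one above and reuses the `INFLUX_*` variables from `configure-metrics.sh`.

```bash
# Ask for the most recent points in the testnet-deploy measurement
curl -sG "${INFLUX_HOST}/query" \
  -u "${INFLUX_USERNAME}:${INFLUX_PASSWORD}" \
  --data-urlencode "db=${INFLUX_DATABASE}" \
  --data-urlencode "q=SELECT * FROM \"testnet-deploy\" ORDER BY time DESC LIMIT 5"
```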
Remember that metrics are only reported if `RUST_LOG` is set to `info` or `debug`.

NOTE: This example uses GCE. If you are using AWS EC2, replace `./gce.sh` with `./ec2.sh` in the commands.
```bash
# In Agave repo
cd net/
# Create a GCE testnet with 4 additional validator nodes (beyond the bootstrap node) and 1 client (billing starts here)
./gce.sh create -n 4 -c 1
# Configure the metrics database and validate credentials using environment variable `SOLANA_METRICS_CONFIG` (skip this if you are not using metrics)
./init-metrics.sh -c testnet-dev-${USER} ${USER}
# Deploy the network from the local workspace and start processes on all nodes including bench-tps on the client node
RUST_LOG=info ./net.sh start
# Show help on how to ssh into any testnet node to access logs/etc
./ssh.sh
# Stop running processes on all nodes
./net.sh stop
# Dispose of the network (billing stops here)
./gce.sh delete
```
All the commands below are run from the `./net/` directory in the agave repo.

The `./gce.sh` command controls creation and destruction of the nodes in the test net. It does not actually run any software.

* `./gce.sh create -n 4 -c 2` creates a cluster with 4 validators plus client node(s) for load generation; this is a minimal viable setup for all solana features to work.
* `net/config/config` will contain the config file of the testnet just created.
* If you do not have `SOLANA_METRICS_CONFIG` set in your shell env, gce.sh may complain about metrics not being configured; this is perfectly fine.
* `./gce.sh info` lists active test cluster nodes, which allows you to get their IP addresses for SSH access and/or debugging.
* `./gce.sh delete` destroys the nodes (save the electricity and $$$: destroy your test nets the moment you no longer need them). Alternatively, pass the `--self-destruct-hours=N` argument to gce.sh so the cluster cleans itself up automatically.
* `./init-metrics.sh -c testnet-dev-${user} ${user}` creates a new metrics database from scratch and sets `SOLANA_METRICS_CONFIG` in `./net/config/config` (which is exactly what init-metrics.sh does for you); alternatively, `./init-metrics.sh -e` loads the metrics config from `SOLANA_METRICS_CONFIG` into the testnet config file.

The `./net.sh` script controls the payload on the testnet nodes, i.e. bootstrapping, the validators and bench-tps. In principle, you can run everything by hand, but `./net.sh` makes it easier.
Run `./net.sh start` to actually run the test network. The script will take 5-10 minutes to run; at the end it should print something like:
```
--- Deployment Successful
Bootstrap validator deployment took 164 seconds
Additional validator deployment (5 validators, 0 blockstreamer nodes) took 120 seconds
Client deployment (1 instances) took 11 seconds
Network start logs in /home/sol/agave/net/log
```
You can also check the logs to make sure successful test transfers are being reported.

Other useful commands and options:

* `./net.sh start bench-tps=2="--tx_count 2500"` will start 2 clients with a bench-tps workload sending 2500 transactions per batch.
* The `--tx_count` argument is passed to the bench-tps program; see its manual for more options.
* `./net.sh sanity` tests the deployment; it is also run by the start command.
* `./net.sh stop` stops the validators and client. This does not kill the machines, so you can still study the logs etc.
* `./net.sh start --nobuild` will skip the source compilation; you will generally want that if you are only changing configuration files rather than code, or just want to re-run the last test.
* To connect to the nodes:
  * `./gce.sh info` to get the public IPs
  * `./ssh.sh <IP>` to get a shell on the node
  * `sudo su` will give you root access on the nodes
  * Nodes run the latest Ubuntu LTS image
* You can also interact with the nodes using the solana cli:
```bash
source net/config/config
../target/release/solana -u http://${validatorIpList[1]}:8899 airdrop 1
../target/release/solana -u http://${validatorIpList[1]}:8899 feature status
../target/release/solana -u http://${validatorIpList[1]}:8899 feature activate
../target/release/solana --url http://${validatorIpList[0]}:8899 validators
```
## Tips
### Automation
You will want to have a script like this pretty much immediately to avoid making mistakes in the init process:
```bash
./gce.sh create -n4 -c2 --custom-machine-type "--machine-type n1-standard-16" --client-machine-type "--machine-type n1-standard-4"
./init-metrics.sh -e
RUST_LOG=info ./net.sh start -c bench-tps=2="--tx_count 25000"
```
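For example, a sketch of a small wrapper script around the same commands; the shebang, `set -ex` and the `cd` line are additions so a failure at any step aborts the run:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: adjust node counts and machine types to your needs.
set -ex

# Assumes this file is saved inside the net/ directory of the repo.
cd "$(dirname "$0")"

./gce.sh create -n4 -c2 \
  --custom-machine-type "--machine-type n1-standard-16" \
  --client-machine-type "--machine-type n1-standard-4"
./init-metrics.sh -e
RUST_LOG=info ./net.sh start -c bench-tps=2="--tx_count 25000"
```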
### Inscrutable "nothing works, everything times out" state
Note that net.sh and `gce.sh info` do not actually check whether the nodes are still alive in gcloud;
they just assume the information in the config file is correct. So if your nodes got killed or timed out, these commands will lie to you. In that case, just use `gce.sh delete` to reset. To see what gcloud actually has, use the sketch below.
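A minimal way to cross-check against gcloud directly (gce.sh-created instance names typically include your username, but adjust the grep to your setup):

```bash
# List all instances gcloud actually knows about in the current project,
# then narrow down to the ones that look like yours
gcloud compute instances list | grep "$USER"
```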
### Running the network over public IP addresses
By default, private IP addresses are used with all instances in the same
availability zone to avoid GCE network egress charges. However, to run the
network over public IP addresses:
```bash
$ ./gce.sh create -P ...
```
or
```bash
$ ./ec2.sh create -P ...
```
### Deploying a tarball-based network
To deploy the latest pre-built `edge` channel tarball (i.e., the latest from the `master`
branch), once the testnet has been created, run:
```bash
$ ./net.sh start -t edge
```
### Enabling CUDA
> [!NOTE]
> CUDA is currently not available on GCE
First ensure the network instances are created with GPU enabled:
```bash
$ ./gce.sh create -g ...
```
or
```bash
$ ./ec2.sh create -g ...
```
If deploying a tarball-based network, nothing further is required, as GPU presence
is detected at runtime and the CUDA build is auto-selected.
### Partition testing
To induce a partition: `net.sh netem --config-file <config file path>`
To remove the partition: `net.sh netem --config-file <config file path> --netem-cmd cleanup`
The partitioning is also removed if you do `net.sh stop` or `restart`.
An example config that produces 3 almost equal partitions:
```json
{
  "partitions": [
    34,
    33,
    33
  ],
  "interconnects": [
    {
      "a": 0,
      "b": 1,
      "config": "loss 15% delay 25ms"
    },
    {
      "a": 1,
      "b": 0,
      "config": "loss 15% delay 25ms"
    },
    {
      "a": 0,
      "b": 2,
      "config": "loss 10% delay 15ms"
    },
    {
      "a": 2,
      "b": 0,
      "config": "loss 10% delay 15ms"
    },
    {
      "a": 2,
      "b": 1,
      "config": "loss 5% delay 5ms"
    },
    {
      "a": 1,
      "b": 2,
      "config": "loss 5% delay 5ms"
    }
  ]
}
```
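Assuming you save the JSON above as, say, `partition.json` (the filename is arbitrary), applying and clearing the partition could look like:

```bash
# Apply the partitions and interconnect impairments described in partition.json
./net.sh netem --config-file partition.json

# Remove the partition again
./net.sh netem --config-file partition.json --netem-cmd cleanup
```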