The table below adds experiments to answer additional questions about various design choices. The first row uses the same settings as the main chapter and serves as a reference.
|    | Model | Weights | Trainable token position | Trainable layers | Context length | Training acc | Validation acc | Test acc | Training time | CPU/GPU |
|----|-------|---------|--------------------------|------------------|----------------|--------------|----------------|----------|---------------|---------|
| 1 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) | 96.63% | 99.33% | 95.00% | 0.28 min | A100 |
| 2 | gpt2-small (124M) | pretrained | first | last_block | longest train ex. (120) | 78.46% | 80.54% | 75.00% | 0.28 min | A100 |
| 3 | gpt2-small (124M) | pretrained | last | last_layer | longest train ex. (120) | 78.65% | 79.87% | 72.00% | 0.25 min | A100 |
| 4 | gpt2-small (124M) | pretrained | last | last_two_blocks | longest train ex. (120) | 98.85% | 98.66% | 98.33% | 0.33 min | A100 |
| 5 | gpt2-small (124M) | pretrained | last | all | longest train ex. (120) | 99.62% | 96.64% | 96.67% | 0.69 min | A100 |
| 6 | gpt2-medium (355M) | pretrained | last | last_block | longest train ex. (120) | 87.50% | 91.28% | 84.67% | 0.75 min | A100 |
| 7 | gpt2-large (774M) | pretrained | last | last_block | longest train ex. (120) | 99.52% | 98.66% | 96.67% | 1.50 min | A100 |
| 8 | gpt2-xl (1558M) | pretrained | last | last_block | longest train ex. (120) | 99.81% | 99.81% | 98.33% | 2.83 min | A100 |
| 9 | gpt2-xl (1558M) | pretrained | last | all | longest train ex. (120) | 100.00% | 98.66% | 98.67% | 8.12 min | A100 |
| 10 | gpt2-small (124M) | random | last | all | longest train ex. (120) | 100.00% | 96.64% | 93.67% | 0.69 min | A100 |
| 11 | gpt2-small (124M) | pretrained | last | LoRA | longest train ex. (120) | 100.00% | 97.32% | 96.67% | 0.75 min | A100 |
| 12 | gpt2-xl (1558M) | pretrained | last | LoRA | longest train ex. (120) | 100.00% | 98.66% | 98.33% | 5.79 min | A100 |
| 13 | gpt2-small (124M) | pretrained | last | last_block | context length (1024) | 83.08% | 87.92% | 78.33% | 2.46 min | A100 |
| 14 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 1) | 100.00% | 98.66% | 98.00% | 1.75 min | A100 |
| 15 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 8) | 99.33% | 98.66% | 98.33% | 1.70 min | A100 |
| 16 | gpt2-small (124M) | pretrained | last | last_block | flexible (last non-padding position) | 99.42% | 98.66% | 98.33% | 0.30 min | A100 |
| 17 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120); but no causal mask | 99.23% | 98.66% | 95.33% | 0.29 min | A100 |
| 18 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) and ignore_index for padding | 96.63% | 99.33% | 95.00% | 0.28 min | A100 |
| 19 | gpt2-small (124M) | pretrained | last + pooled embeddings | last_block | longest train ex. (120) | 97.79% | 99.33% | 96.33% | 0.32 min | A100 |
You can use the following code to reproduce the experiments:
- Row 1: `python additional_experiments.py`
- Row 2: `python additional_experiments.py --trainable_token_pos first`
- Row 3: `python additional_experiments.py --trainable_layers last_layer`
- Row 4: `python additional_experiments.py --trainable_layers last_two_blocks`
- Row 5: `python additional_experiments.py --trainable_layers all`
- Row 6: `python additional_experiments.py --model_size "gpt2-medium (355M)"`
- Row 7: `python additional_experiments.py --model_size "gpt2-large (774M)"`
- Row 8: `python additional_experiments.py --model_size "gpt2-xl (1558M)"`
- Row 9: `python additional_experiments.py --model_size "gpt2-xl (1558M)" --trainable_layers all`
- Row 10: `python additional_experiments.py --weights random --trainable_layers all`
- Row 11: `python additional_experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 16` (see the LoRA sketch below)
- Row 12: `python additional_experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 8 --model_size "gpt2-xl (1558M)"`
- Row 13: `python additional_experiments.py --context_length "model_context_length"`
- Row 14: `python additional_experiments.py --no_padding --batch_size 1`
- Row 15: `python additional_experiments.py --no_padding --batch_size 1 --accumulation_steps 8`
- Row 16: `python additional_experiments.py --trainable_token_pos "flexible"`
- Row 17: `python additional_experiments.py --disable_causal_mask`
- Row 18: `python additional_experiments.py --ignore_index 50256`
- Row 19: `python additional_experiments.py --average_embeddings`

I've kept the LLM and dataset small on purpose, so you can run the training on a regular laptop like a MacBook Air M3 in about 15 minutes (for the default setting) in case you don't have access to a GPU.
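Rows 11 and 12 train LoRA adapters instead of updating the original weight matrices. The snippet below is a minimal sketch of the underlying idea, not the exact implementation in `additional_experiments.py`; the class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: freezes the original linear layer and
    adds a trainable low-rank update, W x + (alpha / rank) * x A B."""
    def __init__(self, linear: nn.Linear, rank: int = 16, alpha: int = 16):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)   # original weights stay frozen
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(linear.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, linear.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + low-rank trainable path
        return self.linear(x) + self.scaling * (x @ self.A @ self.B)

# Example: wrap an existing projection layer
layer = nn.Linear(768, 768)
lora_layer = LoRALinear(layer, rank=16, alpha=16)
out = lora_layer(torch.randn(2, 6, 768))   # works on (batch, seq_len, emb_dim) inputs
print(out.shape)                            # torch.Size([2, 6, 768])
```

Because only `A` and `B` are trainable, the number of updated parameters is a small fraction of the full model, which is why the LoRA runs train almost as fast as the last-block-only runs.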
- The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 15, we additionally enable gradient accumulation with 8 steps to achieve the same effective batch size as in the other experiments, which helps reduce overfitting and slightly boosts the test set accuracy (a sketch of this accumulation loop follows this list). In row 16, padding is applied, but the token position is selected based on the last non-padding token. Row 16 should be mathematically similar to row 15, which uses gradient accumulation. However, due to some challenges with gradient accumulation in cases of unequal token counts, there may be small discrepancies (this is discussed in a separate blog post).
- The `--ignore_index 50256` option excludes the `<|endoftext|>` padding tokens from the `cross_entropy` loss function in PyTorch. In this case, it does not have any effect because we replaced the output layer so that the target labels are either 0 or 1 for the binary classification example. However, this setting is useful when instruction finetuning models in chapter 7 (see the sketch after this list).
- The `--average_embeddings` option will average the embeddings over all tokens. If this option is not used (the default), only the output embeddings at the chosen token position (specified by `--trainable_token_pos`) are considered; for example, the embeddings of the last token. Enabling `--average_embeddings` will mean-pool the embeddings of all tokens into the position chosen by `--trainable_token_pos` (the last token by default). As we can see, this improves the performance from 95.00% to 96.33% with only a minimal increase in run time (0.28 min to 0.32 min) and might be worth considering in practice (see the pooling sketch after this list).
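For the no-padding runs (rows 14 and 15), the training loop processes one variable-length example at a time; row 15 then accumulates gradients over 8 such steps before updating the weights. The following is a minimal, self-contained sketch of this accumulation pattern; the toy model and random data are stand-ins for the real GPT model and spam dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(123)

# Toy stand-ins for the real GPT model and spam dataset (illustration only)
class ToyClassifier(nn.Module):
    def __init__(self, vocab_size=50257, emb_dim=32, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, input_ids):                # input_ids: (batch, seq_len)
        hidden = self.emb(input_ids)             # (batch, seq_len, emb_dim)
        return self.head(hidden)                 # (batch, seq_len, num_classes)

model = ToyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Variable-length examples (no padding), processed with batch size 1
train_data = [
    (torch.randint(0, 50257, (1, torch.randint(5, 20, (1,)).item())),
     torch.randint(0, 2, (1,)))
    for _ in range(16)
]

accumulation_steps = 8                           # mimic an effective batch size of 8

optimizer.zero_grad()
for step, (input_ids, label) in enumerate(train_data):
    logits = model(input_ids)[:, -1, :]          # logits at the last token position
    loss = F.cross_entropy(logits, label)
    (loss / accumulation_steps).backward()       # scale so gradients average over the accumulated steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one weight update per 8 examples
        optimizer.zero_grad()
```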
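The effect of `ignore_index` can be illustrated directly with PyTorch's `cross_entropy`; the logits and targets below are made up for illustration and mirror the chapter 7 setting, where the targets are token IDs rather than class labels.

```python
import torch
import torch.nn.functional as F

# Made-up token-level logits and targets for illustration
logits = torch.randn(4, 50257)                     # 4 token positions, vocabulary of 50,257
targets = torch.tensor([318, 617, 50256, 50256])   # last two positions are <|endoftext|> padding

loss_all = F.cross_entropy(logits, targets)                         # padding positions contribute to the loss
loss_no_pad = F.cross_entropy(logits, targets, ignore_index=50256)  # padding positions are skipped

print(loss_all, loss_no_pad)
```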
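Finally, the difference between using the last-token output, the "flexible" last non-padding position, and mean-pooling (`--average_embeddings`) comes down to how the model's output tensor of shape `(batch, num_tokens, emb_dim)` is reduced before the classification head. The tensor names below are illustrative, and the sketch assumes right-padded inputs as in the chapter.

```python
import torch

# Made-up model outputs for illustration: (batch_size, num_tokens, emb_dim)
outputs = torch.randn(8, 120, 768)

last_token = outputs[:, -1, :]        # default: output at the last token position -> (8, 768)
mean_pooled = outputs.mean(dim=1)     # --average_embeddings: mean over all token positions -> (8, 768)

# With padding, the "flexible" option instead picks the last *non-padding* position per sequence
pad_token_id = 50256
input_ids = torch.randint(0, 50255, (8, 120))
input_ids[:, 100:] = pad_token_id                          # pretend the last 20 positions are padding
last_non_pad = (input_ids != pad_token_id).sum(dim=1) - 1  # index of the last real token per sequence
flexible = outputs[torch.arange(outputs.shape[0]), last_non_pad, :]  # -> (8, 768)

print(last_token.shape, mean_pooled.shape, flexible.shape)
```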