Note that the book is written for educational purposes, meaning the original code is kept purposefully simple. This aids readability and ensures compatibility across different hardware, including CPUs and GPUs. However, you might be curious about more advanced PyTorch and GPU features that make LLM training more performant.
This folder contains three code files that demonstrate performance optimizations for the LLM and the training function introduced in Chapter 5:
00_orig.py: The original Chapter 5 code for CPU and single-GPU training.
➤ Run via: python 00_orig.py
01_opt_single_gpu.py: An optimized version for single-GPU training.
➤ Run via: python 01_opt_single_gpu.py
02_opt_multi_gpu_ddp.py: An optimized version for multi-GPU training using Distributed Data Parallel (DDP).
➤ Run via: torchrun --nproc_per_node=4 02_opt_multi_gpu_ddp.py
(Note: To keep the changes minimal compared to 01_opt_single_gpu.py, this script supports multi-processing only via torchrun as shown above. This means multi-GPU training is not available when running python 02_opt_multi_gpu_ddp.py.)
Note that these modifications take the training speed from 12,525 tokens per second (single A100) to 142,156 tokens per second (single A100) and 419,259 tokens per second (4x A100s).
I plan to expand on the differences in a more detailed write-up sometime in the future. For now, the easiest way to see what improvements have been added to the code is to open the files in Visual Studio Code and look at the differences via the "Compare Selected" feature.
As mentioned above, I plan to elaborate more on the changes in the future. For now, this section contains a simple performance overview in terms of tokens/second for each modification. All experiments were run on A100 GPUs.
Note that 00_orig.py serves as the baseline: it contains no significant modifications and uses the code from Chapter 5 essentially as is, apart from a few minor changes.
The hyperparameters are not very optimized for minimizing loss and reducing overfitting, and the text generated by the LLM at the very end may not be super sophisticated; however, this shouldn't matter, as the main takeaway is the tok/sec metric, which serves as a speed reference here (higher is better).
ubuntu@159-13-52-60:~$ python 00_orig.py
PyTorch version: 2.6.0+cu124
Using cuda
CUDA version: 12.4
Ep 1, Step 000000, Train: 9.535, Val: 9.609, Step tok/sec: 7238, Avg tok/sec: 0
Ep 1, Step 000015, Train: 6.201, Val: 6.152, Step tok/sec: 12545, Avg tok/sec: 12545
Ep 1, Step 000030, Train: 5.663, Val: 5.688, Step tok/sec: 12490, Avg tok/sec: 12517
Ep 1, Step 000045, Train: 5.316, Val: 5.362, Step tok/sec: 12541, Avg tok/sec: 12525
Every effort moves you, and's, and I am not be a
...
Ep 15, Step 000735, Train: 0.227, Val: 6.818, Step tok/sec: 11599, Avg tok/sec: 12248
Ep 15, Step 000750, Train: 0.300, Val: 6.895, Step tok/sec: 12530, Avg tok/sec: 12253
Ep 15, Step 000765, Train: 0.150, Val: 6.914, Step tok/sec: 12532, Avg tok/sec: 12259
Every effort moves you like best to think which he held in the room in him, the interest was the night, the realities of the affairs Bulstrode's duty, now!' the fact is another man, conquests
Allocated memory: 2.5069 GB
Reserved memory: 26.2617 GB
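For reference, the "Allocated memory" and "Reserved memory" values printed above can be obtained from PyTorch's CUDA memory statistics. The snippet below is a minimal sketch of how such numbers can be queried; whether 00_orig.py reports current or peak values is an assumption here:

```python
import torch

# Illustrative only: prints memory statistics similar to the log output above.
if torch.cuda.is_available():
    allocated_gb = torch.cuda.max_memory_allocated() / 1024**3  # memory occupied by tensors
    reserved_gb = torch.cuda.max_memory_reserved() / 1024**3    # memory held by the caching allocator
    print(f"Allocated memory: {allocated_gb:.4f} GB")
    print(f"Reserved memory: {reserved_gb:.4f} GB")
```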
Note that 01_opt_single_gpu.py contains all the modifications listed sequentially below.
The comparison is always based on the average tok/sec and the reserved memory after the first epoch, starting from the baseline numbers in the previous section.
Before:
Avg tok/sec: 12525
Reserved memory: 26.2617 GB
After:
Avg tok/sec: 12526
Reserved memory: 26.2422 GB
Before:
Avg tok/sec: 12526
Reserved memory: 26.2422 GB
After:
Avg tok/sec: 27648
Reserved memory: 26.2422 GB
Using the fused implementation of AdamW by setting fused=True (see the sketch below).
Before:
Avg tok/sec: 27648
Reserved memory: 26.2422 GB
After:
Avg tok/sec: 28399
Reserved memory: 26.2422 GB
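As a minimal sketch (not the exact code from 01_opt_single_gpu.py), enabling the fused kernels only requires passing fused=True when constructing the optimizer; the model, learning rate, and weight decay below are placeholders:

```python
import torch

# Placeholder model; 01_opt_single_gpu.py uses the GPT model from Chapter 5.
model = torch.nn.Linear(1024, 1024, device="cuda")

# fused=True selects the fused CUDA implementation of AdamW, which combines the
# element-wise parameter-update operations into fewer GPU kernel launches.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, weight_decay=0.1, fused=True
)
```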
Setting pin_memory=True in the data loaders to pre-allocate and re-use pinned (page-locked) memory for faster transfers to the GPU (see the sketch below).
Before:
Avg tok/sec: 28399
Reserved memory: 26.2422 GB
After:
Avg tok/sec: 28402
Reserved memory: 26.2422 GB
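A minimal sketch of this setting, using random token IDs as a stand-in for the actual training data pipeline in the scripts:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy (input, target) token-ID pairs standing in for the real training dataset.
inputs = torch.randint(0, 50257, (1024, 256))
targets = torch.randint(0, 50257, (1024, 256))
dataset = TensorDataset(inputs, targets)

# pin_memory=True places fetched batches in page-locked (pinned) host memory,
# which enables faster, optionally asynchronous, transfers to the GPU.
train_loader = DataLoader(
    dataset, batch_size=4, shuffle=True, drop_last=True, pin_memory=True
)
```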
Before:
Avg tok/sec: 28402
Reserved memory: 26.2422 GB
After:
Avg tok/sec: 45486
Reserved memory: 13.7871 GB
Before:
Avg tok/sec: 45486
Reserved memory: 13.7871 GB
After:
Avg tok/sec: 55256
Reserved memory: 11.5645 GB
Before:
Avg tok/sec: 55256
Reserved memory: 11.5645 GB
After:
Avg tok/sec: 91901
Reserved memory: 5.9004 GB
Using torch.compile via torch.compile(model) (see the sketch below). Note that the first iterations are always slow before compilation picks up speed; since the Avg tok/sec measurement would average in these slow initial steps, we now use the Step tok/sec at the end of epoch 1.
Before:
Avg tok/sec: 91901
Reserved memory: 5.9004 GB
After:
Step tok/sec: 112046
Reserved memory: 6.1875 GB
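Applying it is essentially a one-liner; the sketch below uses a placeholder model and omits the rest of the training setup:

```python
import torch

# Placeholder model; in 01_opt_single_gpu.py this would be the Chapter 5 GPT model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to("cuda")

# torch.compile captures the model and JIT-compiles it into optimized kernels;
# the first few training iterations are slow because compilation happens lazily.
model = torch.compile(model)
```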
Padding the vocabulary size to a more GPU-friendly multiple, which further benefits torch.compile, as mentioned by Bertrand Maher (see the sketch below). A good resource for this is NVIDIA's guidelines on tensor shapes, where batch sizes and linear layer dimensions are commonly chosen as multiples of certain values. Furthermore, the vocab-padding trick was described by NVIDIA's Megatron team a long time ago (see the 2019 Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper).
Before:
Step tok/sec: 112046
Reserved memory: 6.1875 GB
After:
Step tok/sec: 127345
Reserved memory: 5.8906 GB
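As a rough sketch of the vocab-padding idea (the helper name and the multiple of 64 below are illustrative assumptions, not taken from the script): the GPT-2 vocabulary of 50,257 tokens is rounded up to the next multiple before building the embedding and output layers:

```python
import torch

def pad_vocab_size(vocab_size, multiple=64):
    # Hypothetical helper: round the vocabulary size up to the next multiple
    # so the embedding/output matrix dimensions align well with GPU tensor cores.
    return ((vocab_size + multiple - 1) // multiple) * multiple

orig_vocab_size = 50257                                # GPT-2 BPE vocabulary size
padded_vocab_size = pad_vocab_size(orig_vocab_size)    # -> 50304

# The extra token IDs are never produced by the tokenizer; they only give the
# weight matrices a more hardware-friendly shape.
emb_dim = 768
tok_emb = torch.nn.Embedding(padded_vocab_size, emb_dim)
out_head = torch.nn.Linear(emb_dim, padded_vocab_size, bias=False)
```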
Before:
Step tok/sec: 127345
Reserved memory: 5.8906 GB
After:
Step tok/sec: 142156
Reserved memory: 22.5078 GB
This may not be an entirely fair comparison, as we now use 4 GPUs instead of 1. However, using distributed data parallelism (DDP), the fastest multi-GPU technique when training is not bottlenecked by limited GPU memory, of course results in noticeable speed-ups (see the sketch after the numbers below):
Before (single GPU):
Step tok/sec: 142156
Reserved memory: 22.5078 GB
After (4 GPUs):
Step tok/sec: 419259
Reserved memory: 22.7969 GB
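For readers unfamiliar with DDP, the sketch below shows the core pattern for a script launched via torchrun, using a placeholder model and dataset; it is a simplified outline under these assumptions, not the exact code from 02_opt_multi_gpu_ddp.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; the real script uses the Chapter 5 GPT model and dataset.
    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])   # synchronizes gradients across GPUs

    dataset = TensorDataset(torch.randn(2048, 1024), torch.randn(2048, 1024))
    sampler = DistributedSampler(dataset)         # each rank sees a distinct shard of the data
    loader = DataLoader(dataset, batch_size=4, sampler=sampler, pin_memory=True, drop_last=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, fused=True)
    for epoch in range(1):
        sampler.set_epoch(epoch)                  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                       # DDP all-reduces the gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would be launched analogously to the command above, e.g., torchrun --nproc_per_node=4 ddp_sketch.py (the filename here is just a placeholder).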