Updated devcontainer, .gitignore and README for gutenberg project (#107)

* added ch05/03_bonus_pretraining_on_gutenberg model checkpoints and preprocessing output folders to .gitignore

* removed prettier extension, added github alerts markdown extension

* specified download instructions and fixed code markdown

* Update ch05/03_bonus_pretraining_on_gutenberg/README.md

* Update ch05/03_bonus_pretraining_on_gutenberg/README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Daniel Kleine 1 year ago
parent
commit
44c0494406
3 changed files with 57 additions and 37 deletions
  1. .devcontainer/devcontainer.json (+1 -1)
  2. .gitignore (+3 -0)
  3. ch05/03_bonus_pretraining_on_gutenberg/README.md (+53 -36)

+ 1 - 1
.devcontainer/devcontainer.json

@@ -11,7 +11,7 @@
         "ms-python.python",
         "ms-azuretools.vscode-docker",
         "ms-toolsai.jupyter",
-        "esbenp.prettier-vscode"
+        "yahyabatulu.vscode-markdown-alert"
       ]
     }
   }

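For orientation, the hunk above shows only the tail of the dev container's VS Code extension list. Below is a minimal sketch of how such a `devcontainer.json` is commonly structured; the `name` and `build` fields and the surrounding nesting are assumptions, since only the `extensions` entries appear in this diff (devcontainer.json is parsed as JSONC, so comments are allowed):

```json
{
  // "name" and "build" are assumed for illustration; they are not part of the hunk above
  "name": "LLMs-from-scratch",
  "build": { "dockerfile": "Dockerfile" },
  "customizations": {
    "vscode": {
      // VS Code extensions installed automatically inside the dev container
      "extensions": [
        "ms-python.python",
        "ms-azuretools.vscode-docker",
        "ms-toolsai.jupyter",
        "yahyabatulu.vscode-markdown-alert" // replaces "esbenp.prettier-vscode"
      ]
    }
  }
}
```

The newly added `yahyabatulu.vscode-markdown-alert` extension renders GitHub-style alert blocks such as the `> [!NOTE]` callout that appears in the README diff below.
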
+ 3 - 0
.gitignore

@@ -12,7 +12,10 @@ ch05/01_main-chapter-code/gpt2/
 ch05/02_alternative_weight_loading/checkpoints
 ch05/01_main-chapter-code/model.pth
 ch05/01_main-chapter-code/model_and_optimizer.pth
+ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints
 
+# Preprocessing output folders
+ch05/03_bonus_pretraining_on_gutenberg/gutenberg_preprocessed
 
 # Temporary OS-related files
 .DS_Store

+ 53 - 36
ch05/03_bonus_pretraining_on_gutenberg/README.md

@@ -23,16 +23,35 @@ As of this writing, this will require approximately 50 GB of disk space, but it
 
 Linux and macOS users can follow these steps to download the dataset (if you are a Windows user, please see the note below):
 
+1. Set the `03_bonus_pretraining_on_gutenberg` folder as the working directory to clone the `gutenberg` repository locally into this folder (this is necessary to run the provided scripts `prepare_dataset.py` and `pretraining_simple.py`). For instance, when in the `LLMs-from-scratch` repository's folder, navigate into the *03_bonus_pretraining_on_gutenberg* folder via:
+```bash
+cd ch05/03_bonus_pretraining_on_gutenberg
+```
 
-1. `git clone https://github.com/pgcorpus/gutenberg.git`
-
-2. `cd gutenberg`
+2. Clone the `gutenberg` repository into this folder:
+```bash
+git clone https://github.com/pgcorpus/gutenberg.git
+```
 
-3. `pip install -r requirements.txt`
+3. Navigate into the locally cloned `gutenberg` repository's folder:
+```bash
+cd gutenberg
+```
 
-4. `python get_data.py`
+4. Install the required packages defined in *requirements.txt* from the `gutenberg` repository's folder:
+```bash
+pip install -r requirements.txt
+```
+ 
+5. Download the data:
+```bash
+python get_data.py
+```
 
-5. `cd ..`
+6. Go back into the `03_bonus_pretraining_on_gutenberg` folder:
+```bash
+cd ..
+```
 
 &nbsp;
 #### Special instructions for Windows users
@@ -54,14 +73,14 @@ sudo apt-get install -y rsync && \
 > [!NOTE]
 > Instructions about how to set up Python and installing packages can be found in  [Appendix A: Optional Python Setup Preferences](../../appendix-A/01_optional-python-setup-preferences/README.md) and [Appendix A: Installing Python Libraries](../../appendix-A/02_installing-python-libraries/README.md).
 >
-> Optionally, a Docker image running Ubuntu is provided with this repository. When having cloned the [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) GitHub repository, copy the *.devcontainer* folder of this `LLMs-from-scratch` repository and paste it to the locally cloned `gutenberg` repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
+> Optionally, a Docker image running Ubuntu is provided with this repository. Instructions about how to run a container with the provided Docker image can be found in [Appendix A: Optional Docker Environment](../../appendix-A/04_optional-docker-environment/README.md).
 
 &nbsp;
 ### 2) Prepare the dataset
 
 Next, run the `prepare_dataset.py` script, which concatenates the (as of this writing, 60,173) text files into fewer larger files so that they can be more efficiently transferred and accessed:
 
-```
+```bash
 python prepare_dataset.py \
   --data_dir gutenberg/data \
   --max_size_mb 500 \
@@ -90,34 +109,32 @@ python pretraining_simple.py \
 
 The output will be formatted in the following way:
 
-```
-Total files: 3
-Tokenizing file 1 of 3: data_small/combined_1.txt
-Training ...
-Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
-Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
-Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
-Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
-Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
-Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
-Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
-Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
-Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
-Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
-...
-Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
-Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
-Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
-Saved model_checkpoints/model_pg_32188.pth
-Book processed 3h 46m 55s 
-Total time elapsed 3h 46m 55s 
-ETA for remaining books: 7h 33m 50s
-Tokenizing file 2 of 3: data_small/combined_2.txt
-Training ...
-Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
-Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
-...
-```
+> Total files: 3  
+> Tokenizing file 1 of 3: data_small/combined_1.txt  
+> Training ...  
+> Ep 1 (Step 0): Train loss 9.694, Val loss 9.724  
+> Ep 1 (Step 100): Train loss 6.672, Val loss 6.683  
+> Ep 1 (Step 200): Train loss 6.543, Val loss 6.434  
+> Ep 1 (Step 300): Train loss 5.772, Val loss 6.313  
+> Ep 1 (Step 400): Train loss 5.547, Val loss 6.249  
+> Ep 1 (Step 500): Train loss 6.182, Val loss 6.155  
+> Ep 1 (Step 600): Train loss 5.742, Val loss 6.122  
+> Ep 1 (Step 700): Train loss 6.309, Val loss 5.984  
+> Ep 1 (Step 800): Train loss 5.435, Val loss 5.975  
+> Ep 1 (Step 900): Train loss 5.582, Val loss 5.935  
+> ...  
+> Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946  
+> Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939  
+> Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961  
+> Saved model_checkpoints/model_pg_32188.pth  
+> Book processed 3h 46m 55s   
+> Total time elapsed 3h 46m 55s   
+> ETA for remaining books: 7h 33m 50s  
+> Tokenizing file 2 of 3: data_small/combined_2.txt  
+> Training ...  
+> Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094  
+> Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097  
+> ...
 
 
 &nbsp;