@@ -2,9 +2,9 @@
This directory contains code for training a small GPT model on the free books provided by Project Gutenberg.
-As the Project Gutenberg website states, "the vast majority of Project Gutenberg eBooks are in the public domain in the US."
+As the Project Gutenberg website states, "the vast majority of Project Gutenberg eBooks are in the public domain in the US."
-Please read the [Project Gutenberg Permissions, Licensing and other Common Requests](https://www.gutenberg.org/policy/permission.html) page for more information about using the resources provided by Project Gutenberg.
+Please read the [Project Gutenberg Permissions, Licensing and other Common Requests](https://www.gutenberg.org/policy/permission.html) page for more information about using the resources provided by Project Gutenberg.
## How to Use This Code
@@ -56,9 +56,9 @@ cd ..

#### Special instructions for Windows users
-The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users would have to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`.
+The [`pgcorpus/gutenberg`](https://github.com/pgcorpus/gutenberg) code is compatible with both Linux and macOS. However, Windows users would have to make small adjustments, such as adding `shell=True` to the `subprocess` calls and replacing `rsync`.
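To make the kind of adjustment mentioned above more concrete, here is a minimal, hypothetical sketch (the command and paths are placeholders, not the actual calls in the `pgcorpus/gutenberg` code):

```python
import subprocess

# As used on Linux/macOS, a download step might shell out roughly like this
# (hypothetical command; the real scripts differ):
subprocess.run(["rsync", "-av", "rsync://<mirror>/gutenberg", "data/raw"], check=True)

# The Windows adjustment amounts to routing such calls through the shell
# (shell=True) and substituting rsync with a tool available on Windows,
# e.g. robocopy (which copies from a local or network path, not an rsync URL):
subprocess.run("robocopy <source_dir> data\\raw /E", shell=True)
```

In practice, running the unmodified code inside WSL (see the next paragraph) is usually simpler than patching the scripts.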
-Alternatively, an easier way to run this code on Windows is by using the "Windows Subsystem for Linux" (WSL) feature, which allows users to run a Linux environment using Ubuntu in Windows. For more information, please read [Microsoft's official installation instruction](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).
+Alternatively, an easier way to run this code on Windows is to use the "Windows Subsystem for Linux" (WSL) feature, which allows users to run a Linux environment with Ubuntu inside Windows. For more information, please read [Microsoft's official installation instructions](https://learn.microsoft.com/en-us/windows/wsl/install) and [tutorial](https://learn.microsoft.com/en-us/training/modules/wsl-introduction/).

When using WSL, please make sure you have Python 3 installed (check via `python3 --version`, or install it, for instance, with `sudo apt-get install -y python3.10` for Python 3.10) and install the following packages there:
@@ -70,7 +70,7 @@ sudo apt-get install -y python-is-python3 && \
sudo apt-get install -y rsync
```

-> [!NOTE]
+> **Note:**
> Instructions about how to set up Python and install packages can be found in [Optional Python Setup Preferences](../../setup/01_optional-python-setup-preferences/README.md) and [Installing Python Libraries](../../setup/02_installing-python-libraries/README.md).
>
> Optionally, a Docker image running Ubuntu is provided with this repository. Instructions about how to run a container with the provided Docker image can be found in [Optional Docker Environment](../../setup/03_optional-docker-environment/README.md).
@@ -94,10 +94,10 @@ Skipping gutenberg/data/raw/PG29836_raw.txt as it does not contain primarily Eng
```
-> [!TIP]
+> **Tip:**
> Note that, for simplicity, the produced files are stored in plaintext format and are not pre-tokenized. However, you may want to update the code to store the dataset in a pre-tokenized form to save computation time if you plan to use the dataset more often or to train for multiple epochs (a sketch of this idea follows below). See the *Design Decisions and Improvements* section at the bottom of this page for more information.
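The following is a minimal sketch of what such a pre-tokenization step could look like; it is not part of the provided scripts, assumes the GPT-2 tokenizer from the `tiktoken` package, and uses a hypothetical output file name:

```python
import numpy as np
import tiktoken

# Hypothetical one-time pre-tokenization step (not included in the provided scripts):
# encode a combined text file once and store the token IDs so that repeated training
# runs or multiple epochs do not have to re-tokenize the raw text.
tokenizer = tiktoken.get_encoding("gpt2")

with open("data_small/combined_1.txt", "r", encoding="utf-8") as f:
    text = f.read()

token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

# GPT-2's vocabulary (50,257 tokens) fits into uint16, which keeps the file compact.
np.save("data_small/combined_1_tokens.npy", np.array(token_ids, dtype=np.uint16))

# A data loader could then memory-map the saved IDs instead of re-tokenizing:
ids = np.load("data_small/combined_1_tokens.npy", mmap_mode="r")
```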
-> [!TIP]
+> **Tip:**
> You can choose smaller file sizes, for example, 50 MB. This will result in more files but might be useful for quicker pretraining runs on a small number of files for testing purposes.
@@ -116,36 +116,36 @@ python pretraining_simple.py \
The output will be formatted as follows:
-> Total files: 3
-> Tokenizing file 1 of 3: data_small/combined_1.txt
-> Training ...
-> Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
-> Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
-> Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
-> Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
-> Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
-> Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
-> Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
-> Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
-> Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
-> Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
-> ...
-> Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
-> Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
-> Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
-> Saved model_checkpoints/model_pg_32188.pth
-> Book processed 3h 46m 55s
-> Total time elapsed 3h 46m 55s
-> ETA for remaining books: 7h 33m 50s
-> Tokenizing file 2 of 3: data_small/combined_2.txt
-> Training ...
-> Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
-> Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
+> Total files: 3
+> Tokenizing file 1 of 3: data_small/combined_1.txt
+> Training ...
+> Ep 1 (Step 0): Train loss 9.694, Val loss 9.724
+> Ep 1 (Step 100): Train loss 6.672, Val loss 6.683
+> Ep 1 (Step 200): Train loss 6.543, Val loss 6.434
+> Ep 1 (Step 300): Train loss 5.772, Val loss 6.313
+> Ep 1 (Step 400): Train loss 5.547, Val loss 6.249
+> Ep 1 (Step 500): Train loss 6.182, Val loss 6.155
+> Ep 1 (Step 600): Train loss 5.742, Val loss 6.122
+> Ep 1 (Step 700): Train loss 6.309, Val loss 5.984
+> Ep 1 (Step 800): Train loss 5.435, Val loss 5.975
+> Ep 1 (Step 900): Train loss 5.582, Val loss 5.935
+> ...
+> Ep 1 (Step 31900): Train loss 3.664, Val loss 3.946
+> Ep 1 (Step 32000): Train loss 3.493, Val loss 3.939
+> Ep 1 (Step 32100): Train loss 3.940, Val loss 3.961
+> Saved model_checkpoints/model_pg_32188.pth
+> Book processed 3h 46m 55s
+> Total time elapsed 3h 46m 55s
+> ETA for remaining books: 7h 33m 50s
+> Tokenizing file 2 of 3: data_small/combined_2.txt
+> Training ...
+> Ep 1 (Step 32200): Train loss 2.982, Val loss 4.094
+> Ep 1 (Step 32300): Train loss 3.920, Val loss 4.097
> ...
-> [!TIP]
+> **Tip:**
> In practice, if you are using macOS or Linux, I recommend using the `tee` command to save the log outputs to a `log.txt` file in addition to printing them on the terminal:
```bash
@@ -153,8 +153,8 @@ python -u pretraining_simple.py | tee log.txt
```
-> [!WARNING]
-> Note that training on 1 of the ~500 Mb text files in the `gutenberg_preprocessed` folder will take approximately 4 hours on a V100 GPU.
+> **Warning:**
+> Note that training on one of the ~500 MB text files in the `gutenberg_preprocessed` folder will take approximately 4 hours on a V100 GPU.
> The folder contains 47 files, and training on all of them will take approximately 200 hours (more than 1 week) to complete. You may want to run it on a smaller number of files.