11 years ago · 90278470de
--- a/paper/gfs.tex
+++ b/paper/gfs.tex
@@ -1,179 +1,262 @@
 
				-\documentclass{sig-alternate}

			
 
				-

			
 
				-\begin{document}

			
 
				-

			
 
				-\conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}

			
 
				-

			
 
				-\title{Gallactic File System}

			
 
				-\subtitle{}

			
 
				-

			
 
				-\numberofauthors{1}

			
 
				-

			
 
				-\author{

			
 
				-% You can go ahead and credit any number of authors here,

			
 
				-% e.g. one 'row of three' or two rows (consisting of one row of three

			
 
				-% and a second row of one, two or three).

			
 
				-%

			
 
				-% The command \alignauthor (no curly braces needed) should

			
 
				-% precede each author name, affiliation/snail-mail address and

			
 
				-% e-mail address. Additionally, tag each line of

			
 
				-% affiliation/address with \affaddr, and tag the

			
 
				-% e-mail address with \email.

			
 
				-%

			
 
				-% 1st. author

			
 
				-\alignauthor

			
 
				-Juan Benet\\

			
 
				-       \affaddr{athena.ai}\\

			
 
				-       \affaddr{498 Walsh Rd}\\

			
 
				-       \affaddr{Atherton, CA, USA}\\

			
 
				-       \email{juan@benet.ai}

			
 
				-}

			
 
				-

			
 
				-\maketitle

			
 
				-\begin{abstract}

			
 
				-The Gallactic File System is a peer-to-peer distributed file system capable of

			
 
				-sharing the same files with millions of nodes. GFS combines a distributed

			
 
				-hashtable, cryptographic techniques, merkle trees, content-addressable

			
 
				-storage, bittorrent, and tag-based filesystems to build a single massive

			
 
				-file system shared between peers. GFS has no single point of failure, and

			
 
				-nodes do not need to trust each other.

			
 
				-\end{abstract}

			
 
				-

			
 
				-\section{Introduction}

			
 
				-

			
 
				-\section{GFS Overview}

			
 
				-

			
 
				-GFS is a distributed file system where all nodes are the same. Together, the

			
 
				-nodes store the GFS files in local storage, and send the files to each other.

			
 
				-GFS implements its features by combining three well-known systems:

			
 
				-

			
 
				-\begin{enumerate}

			
 
				-  \item A Git-like \textbf{Object Model} to represent the filesystem.

			
 
				-  \item A Kademlia-based \textbf{Distributed Hash Table} to coordinate the retrieval of files.

			
 
				-  \item A Bittorrent-like peer-to-peer data \textbf{Chunk Exchange}.

			
 
				-\end{enumerate}

			
 
				-

			
 
				-

			
 
				-\subsection{Object Model}

			
 
				-

			
 
				-Files are represented as a collection of inter-related objects, like in the

			
 
				-version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:

			
 
				-

			
 
				-\begin{enumerate}

			
 
				-  \item \texttt{chunk}: a variable-size block of data.

			
 
				-  \item \texttt{list}: a collection of chunks or other lists.

			
 
				-  \item \texttt{tree}: a collection of chunks, lists, or other trees.

			
 
				-\end{enumerate}

			
 
				-

			
 
				-\subsubsection{Block Object}

			
 
				-

			
 
				-The \texttt{Block} object contains an addressable unit of data, and

			
 
				-represents a file.

			
 
				-GFS Blocks are like Git blobs or filesystem data blocks. They store the

			
 
				-users' data. (The name \textit{block} is preferred over \textit{blob}, as the

			
 
				-Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.

			
 
				-GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)

			
 
				-Format:

			
 
				-\begin{verbatim}

			
 
				-block <size>

			
 
				-<block data bytes>

			
 
				-...

			
 
				-\end{verbatim}

			
 
				-

			
 
				-

			
 
				-\subsubsection{List Object}

			
 
				-

			
 
				-The \texttt{List} object represents a (large) file made up of several

			
 
				-GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain

			
 
				-an ordered sequence of \texttt{block} or \texttt{list} objects.

			
 
				-In a sense, the GFS \texttt{List} functions like a filesystem file with

			
 
				-indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).

			
 
				-Format:

			
 
				-\begin{verbatim}

			
 
				-blob <num objects> <size>

			
 
				-<list or block> <checksum> <size>

			
 
				-<list or block> <checksum> <size>

			
 
				-...

			
 
				-\end{verbatim}

			
 
				-

			
 
				-

			
 
				-\subsubsection{Tree Object}

			
 
				-

			
 
				-The \texttt{tree} object in GFS is similar to Git trees: it represents a

			
 
				-directory, a list of checksums and names. The checksums reference \texttt{blob}

			
 
				-or other \texttt{tree} objects. Note that traditional path naming

			
 
				-is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and

			
 
				-\texttt{lists} are only addressed by their \texttt{checksums}.

			
 
				-% Unlike in Git, GFS trees include file-system metadata such as file

			
 
				-%permissions.

			
 
				-Format:

			
 
				-\begin{verbatim}

			
 
				-tree <num objects> <size>

			
 
				-<tree or list or block> <checksum> <size> <name>

			
 
				-<tree or list or block> <checksum> <size> <name>

			
 
				-...

			
 
				-\end{verbatim}

			
 
				-

			
 
				-\subsubsection{Commit Object}

			
 
				-

			
 
				-The \texttt{commit} object in GFS is similar to Git's. It represents a

			
 
				-snapshot in the version history of a \texttt{tree}.

			
 
				-

			
 
				-\begin{verbatim}

			
 
				-commit <size>

			
 
				-parent <commit checksum>

			
 
				-tree <tree checksum>

			
 
				-author Full Name <email@address.com> <ISO UTC date>

			
 
				-committer Full Name <email@address.com> <ISO UTC date>

			
 
				-<commit message>

			
 
				-\end{verbatim}

			
 
				-

			
 
				-\subsubsection{Version control}

			
 
				-

			
 
				-\subsubsection{Signed Objects}

			
 
				-

			
 
				-All objects can be signed. Add signature to bottom of object.

			
 
				-(yes, this changes the hash, as it should)

			
 
				-

			
 
				-\subsubsection{Merkle Trees}

			
 
				-

			
 
				-The object model in GFS forms a \textit{Merkle Tree}, where every object

			
 
				-contains hashes of its children. This provides GFS with the useful properties

			
 
				-of merkle trees:

			
 
				-

			
 
				-\begin{enumerate}

			
 
				-  \item Tamper resistance

			
 
				-\end{enumerate}

			
 
				-

			
 
				-\subsubsection{Published Branches}

			
 
				-

			
 
				-Users can publish branches (filesystems) with:

			
 
				-publickey -> signed tree of branches

			
 
				-

			
 
				-

			
 
				-\subsection{Distributed Hash Table}

			
 
				-\subsection{Chunk Exchange}

			
 
				-

			
 
				-\subsection{Object Distribution}

			
 
				-

			
 
				-\subsubsection{Spreading Objects}

			
 
				-

			
 
				-DHash spread along the DHT nodes?

			
 
				-Mainline DHT peer registry?

			
 
				-

			
 
				-\subsubsection{Pinning Objects}

			
 
				-

			
 
				-

			
 
				-\section{Conclusions}

			
 
				-

			
 
				-

			
 
				-

			
 
				-

			
 
				-%\section{Acknowledgments}

			
 
				-

			
 
				-

			
 
				-%\bibliographystyle{abbrv}

			
 
				-%\bibliography{gfs}

			
 
				-%\balancecolumns

			
 
				-%\subsection{References}

			
 
				-\end{document}

			
 
				+\documentclass{sig-alternate}
			
 
				+
			
 
				+\usepackage{mathtools}
			
 
				+\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
			
 
				+
			
 
				+\begin{document}
			
 
				+
			
 
				+% \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
			
 
				+
			
 
				+\title{Galactic File System}
			
 
				+\subtitle{}
			
 
				+
			
 
				+\numberofauthors{1}
			
 
				+
			
 
				+\author{
			
 
				+% You can go ahead and credit any number of authors here,
			
 
				+% e.g. one 'row of three' or two rows (consisting of one row of three
			
 
				+% and a second row of one, two or three).
			
 
				+%
			
 
				+% The command \alignauthor (no curly braces needed) should
			
 
				+% precede each author name, affiliation/snail-mail address and
			
 
				+% e-mail address. Additionally, tag each line of
			
 
				+% affiliation/address with \affaddr, and tag the
			
 
				+% e-mail address with \email.
			
 
				+%
			
 
				+% 1st. author
			
 
				+\alignauthor
			
 
				+  Juan Benet\\
			
 
				+  \email{juan@benet.ai}
			
 
				+}
			
 
				+
			
 
				+\maketitle
			
 
				+\begin{abstract}
			
 
				+The Galactic File System is a peer-to-peer distributed file system capable of
			
 
				+sharing the same files with millions of nodes. GFS combines a distributed
			
 
				+hashtable, cryptographic techniques, merkle trees, content-addressable
			
 
				+storage, bittorrent, and tag-based filesystems to build a single massive
			
 
				+file system shared between peers. GFS has no single point of failure, and
			
 
				+nodes do not need to trust each other.
			
 
				+\end{abstract}
			
 
				+
			
 
				+\section{Introduction}
			
 
				+
			
 
				+Cite:
			
 
				+CFS
			
 
				+Kademlia
			
 
				+Bittorrent
			
 
				+Chord
			
 
				+DHash
			
 
				+SFS
			
 
				+Ori
			
 
				+
			
 
				+\section{GFS Overview}
			
 
				+
			
 
				+GFS is a distributed file system where all nodes are the same. Together, the
			
 
				+nodes store the GFS files in local storage, and send the files to each other.
			
 
				+GFS implements its features by combining several subsystems with many
			
 
				+desirable properties:
			
 
				+
			
 
				+\begin{enumerate}
			
 
				+  \item A Coral-based \textbf{Distributed Sloppy Hash Table} (DSHT) to link and
			
 
				+        coordinate peer-to-peer nodes.
			
 
				+  \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) to distribute
			
 
				+        Blocks efficiently, and to incentivize replication.
			
 
				+  \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
			
 
				+  \item An SFS-based self-certifying name system.
			
 
				+\end{enumerate}
			
 
				+
			
 
				+
			
 
				+These subsystems are not independent. They are well integrated and leverage
			
 
				+their blended properties. However, it is useful to describe them separately,
			
 
				+building the system from the bottom up. Note that all GFS nodes are identical,
			
 
				+and run the same program.
			
 
				+
			
 
				+\subsection{Distributed Sloppy Hash Table}
			
 
				+
			
 
				+First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
			
 
				+and identify which nodes can serve a particular block of data.
			
 
				+
			
 
				+\subsubsection{Kademlia DHT}
			
 
				+
			
 
				+Kademlia is a DHT that provides:
			
 
				+
			
 
				+\begin{enumerate}
			
 
				+
			
 
				+  \item Efficient lookup through massive networks:
			
 
				+        queries on average contact $ \ceil{\log_2 (n)} $ nodes.
			
 
				+        (e.g. $24$ hops for a network of $10{,}000{,}000$ nodes).
			
 
				+
			
 
				+  \item Low coordination overhead: it optimizes the number of
			
 
				+        control messages it sends to other nodes.
			
 
				+
			
 
				+  \item Resistance to various attacks, by preferring nodes who have been
			
 
				+        part of the DHT longer.
			
 
				+
			
 
				+  \item Wide usage in peer-to-peer applications, including Gnutella and
			
 
				+        Bittorrent, forming networks of over 100 million nodes.
			
 
				+
			
 
				+ \end{enumerate}
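As a rough illustration of the lookup cost and of XOR-distance, here is a minimal Python sketch. The function names (\texttt{xor\_distance}, \texttt{closest\_nodes}, \texttt{lookup\_bound}) are hypothetical and not part of GFS or Kademlia; node ids are modeled as plain integers, and the bound is simply $\ceil{\log_2 (n)}$ evaluated with integer arithmetic.

```python
def xor_distance(a: int, b: int) -> int:
    """Distance between two ids, interpreted as the integer value of a XOR b."""
    return a ^ b


def closest_nodes(key: int, node_ids, k: int = 2):
    """Return the k node ids nearest to `key` under XOR-distance."""
    return sorted(node_ids, key=lambda node_id: xor_distance(node_id, key))[:k]


def lookup_bound(n: int) -> int:
    """Evaluate ceil(log2(n)) exactly, without floating point."""
    if n < 1:
        raise ValueError("network size must be at least 1")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (n - 1).bit_length()
```

Under these assumptions, \texttt{lookup\_bound(1\_000\_000)} evaluates to 20 and \texttt{lookup\_bound(10\_000\_000)} to 24.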
			
 
				+
			
 
				+While some peer-to-peer filesystems store data blocks directly in DHTs,
			
 
				+this ``wastes storage and bandwidth, as data must be stored at nodes where it
			
 
				+is not needed''. Instead, GFS stores a list of peers that can provide the data block.
			
 
				+
			
 
				+\subsubsection{Coral DSHT}
			
 
				+
			
 
				+Coral extends Kademlia in three particularly important ways:
			
 
				+
			
 
				+\begin{enumerate}
			
 
				+
			
 
				+  \item Kademlia stores values in nodes whose ids are ``nearest'' (using
			
 
				+        XOR-distance) to the key. This does not take into account application
			
 
				+        data locality, ignores ``far'' nodes who may already have the data, and
			
 
				+        forces ``nearest'' nodes to store it, whether they need it or not.
			
 
				+        This wastes significant storage and bandwidth. Instead, Coral stores
			
 
				+        the addresses of peers who can provide the data blocks.
			
 
				+
			
 
				+  \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
			
 
				+        \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
			
 
				+        This still works since Coral users only need a single (working) peer,
			
 
				+        not the complete list. In return, Coral can distribute only subsets of
			
 
				+        the values to the ``nearest'' nodes, avoiding hot-spots (overloading
			
 
				+        \textit{all the nearest nodes} when a key becomes popular).
			
 
				+
			
 
				+  \item Additionally, Coral organizes a hierarchy of separate DSHTs called
			
 
				+        \textit{clusters} depending on region and size. This enables nodes to
			
 
				+        query peers in their region first, ``finding nearby data without
			
 
				+        querying distant nodes'' and greatly reducing the latency of
			
 
				+        lookups.
			
 
				+
			
 
				+\end{enumerate}
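The ``sloppy'' storage idea can be sketched with a toy in-memory model. This is a deliberate simplification and not Coral's actual insertion algorithm: \texttt{SloppyNode}, \texttt{sloppy\_put}, \texttt{sloppy\_get}, and the per-key \texttt{capacity} are all hypothetical names and parameters chosen for illustration. Each node stores provider addresses rather than data, caps how many it keeps per key, and a reader is satisfied by any non-empty subset.

```python
class SloppyNode:
    """A node that stores provider addresses (not data) for each key."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity   # assumed cap on providers kept per key
        self.providers = {}        # key -> list of peer addresses

    def put(self, key, peer) -> bool:
        """Record `peer` as a provider of `key`; return False if the node is full."""
        peers = self.providers.setdefault(key, [])
        if peer in peers:
            return True
        if len(peers) >= self.capacity:
            return False
        peers.append(peer)
        return True

    def get_any_values(self, key, limit: int = 1):
        """Return up to `limit` providers: any subset will do."""
        return self.providers.get(key, [])[:limit]


def sloppy_put(nodes_nearest_first, key, peer) -> bool:
    """Store at the nearest node that still has room, spilling outward otherwise."""
    for node in nodes_nearest_first:
        if node.put(key, peer):
            return True
    return False


def sloppy_get(nodes_nearest_first, key, limit: int = 1):
    """Return providers from the first node that knows any."""
    for node in nodes_nearest_first:
        values = node.get_any_values(key, limit)
        if values:
            return values
    return []
```

Because no single node has to hold the complete provider list, a popular key spreads over several nodes instead of overloading only the nearest ones.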
			
 
				+
			
 
				+
			
 
				+\subsubsection{GFS DSHT}
			
 
				+
			
 
				+The GFS DSHT supports four RPC calls:
			
 
				+
			
 
				+
			
 
				+
			
 
				+
			
 
				+\subsection{Object Model}
			
 
				+
			
 
				+Files are represented as a collection of inter-related objects, like in the
			
 
				+version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
			
 
				+
			
 
				+\begin{enumerate}
			
 
				+  \item \texttt{block}: a variable-size block of data.
			
 
				+  \item \texttt{list}: a collection of blocks or other lists.
			
 
				+  \item \texttt{tree}: a collection of blocks, lists, or other trees.
			
 
				+\end{enumerate}
			
 
				+
			
 
				+\subsubsection{Block Object}
			
 
				+
			
 
				+The \texttt{Block} object contains an addressable unit of data, and
			
 
				+represents a file.
			
 
				+GFS Blocks are like Git blobs or filesystem data blocks. They store the
			
 
				+users' data. (The name \textit{block} is preferred over \textit{blob}, as the
			
 
				+Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
			
 
				+GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
			
 
				+Format:
			
 
				+\begin{verbatim}
			
 
				+block <size>
			
 
				+<block data bytes>
			
 
				+...
			
 
				+\end{verbatim}
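A minimal Python sketch of how a \texttt{Block} could be serialized and addressed by its checksum. The draft does not name a hash function or fix the exact byte layout, so SHA-256 and the single newline after the header are assumptions made here for illustration; \texttt{encode\_block} and \texttt{checksum} are hypothetical helper names.

```python
import hashlib


def encode_block(data: bytes) -> bytes:
    """Serialize a block as 'block <size>' followed by the raw data bytes.

    The newline separating header and data is an assumption.
    """
    return b"block %d\n" % len(data) + data


def checksum(obj: bytes) -> str:
    """Content address of a serialized object (SHA-256 is an assumption)."""
    return hashlib.sha256(obj).hexdigest()
```

Since the address is derived only from the serialized bytes, two nodes holding identical data compute the same checksum independently.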
			
 
				+
			
 
				+
			
 
				+\subsubsection{List Object}
			
 
				+
			
 
				+The \texttt{List} object represents a (large) file made up of several
			
 
				+GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
			
 
				+an ordered sequence of \texttt{block} or \texttt{list} objects.
			
 
				+In a sense, the GFS \texttt{List} functions like a filesystem file with
			
 
				+indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addressing).
			
 
				+Format:
			
 
				+\begin{verbatim}
			
 
				+list <num objects> <size>
			
 
				+<list or block> <checksum> <size>
			
 
				+<list or block> <checksum> <size>
			
 
				+...
			
 
				+\end{verbatim}
			
 
				+
			
 
				+
			
 
				+\subsubsection{Tree Object}
			
 
				+
			
 
				+The \texttt{tree} object in GFS is similar to Git trees: it represents a
			
 
				+directory, a list of checksums and names. The checksums reference
			
 
				+\texttt{block}, \texttt{list}, or other \texttt{tree} objects. Note that traditional path naming
			
 
				+is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
			
 
				+\texttt{lists} are only addressed by their \texttt{checksums}.
			
 
				+% Unlike in Git, GFS trees include file-system metadata such as file
			
 
				+%permissions.
			
 
				+Format:
			
 
				+\begin{verbatim}
			
 
				+tree <num objects> <size>
			
 
				+<tree or list or block> <checksum> <size> <name>
			
 
				+<tree or list or block> <checksum> <size> <name>
			
 
				+...
			
 
				+\end{verbatim}
			
 
				+
			
 
				+\subsubsection{Commit Object}
			
 
				+
			
 
				+The \texttt{commit} object in GFS is similar to Git's. It represents a
			
 
				+snapshot in the version history of a \texttt{tree}.
			
 
				+
			
 
				+\begin{verbatim}
			
 
				+commit <size>
			
 
				+parent <commit checksum>
			
 
				+tree <tree checksum>
			
 
				+author Full Name <email@address.com> <ISO UTC date>
			
 
				+committer Full Name <email@address.com> <ISO UTC date>
			
 
				+<commit message>
			
 
				+\end{verbatim}
			
 
				+
			
 
				+\subsubsection{Version control}
			
 
				+
			
 
				+\subsubsection{Signed Objects}
			
 
				+
			
 
				+All objects can be signed: the signature is appended to the bottom of the object.
			
 
				+(Yes, this changes the hash, as it should.)
			
 
				+
			
 
				+\subsubsection{Merkle Trees}
			
 
				+
			
 
				+The object model in GFS forms a \textit{Merkle Tree}, where every object
			
 
				+contains hashes of its children. This provides GFS with the useful properties
			
 
				+of merkle trees:
			
 
				+
			
 
				+\begin{enumerate}
			
 
				+  \item Tamper resistance
			
 
				+\end{enumerate}
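Tamper resistance can be illustrated with a small Python sketch that builds a one-level \texttt{list} object over several \texttt{blocks} and compares root checksums. SHA-256, the newline-separated layout, and the \texttt{list} header keyword are assumptions; \texttt{encode\_list} and \texttt{root\_of} are hypothetical helpers, not part of GFS.

```python
import hashlib


def checksum(obj: bytes) -> str:
    """Content address of a serialized object (SHA-256 is an assumption)."""
    return hashlib.sha256(obj).hexdigest()


def encode_block(data: bytes) -> bytes:
    """Serialize a block as 'block <size>' plus the raw data bytes."""
    return b"block %d\n" % len(data) + data


def encode_list(children) -> bytes:
    """Serialize a list from (kind, checksum, size) entries, one per line."""
    total = sum(size for _, _, size in children)
    lines = ["list %d %d" % (len(children), total)]
    lines += ["%s %s %d" % child for child in children]
    return "\n".join(lines).encode()


def root_of(chunks) -> str:
    """Checksum of a list object whose children are the given data chunks."""
    entries = [
        ("block", checksum(encode_block(data)), len(data)) for data in chunks
    ]
    return checksum(encode_list(entries))
```

Changing any byte of any chunk, or reordering chunks, changes a child checksum and therefore the root, so a reader who trusts the root checksum can verify everything beneath it.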
			
 
				+
			
 
				+\subsubsection{Published Branches}
			
 
				+
			
 
				+Users can publish branches (filesystems) with:
			
 
				+publickey -> signed tree of branches
			
 
				+
			
 
				+
			
 
				+\subsection{Block Exchange}
			
 
				+
			
 
				+\subsection{Object Distribution}
			
 
				+
			
 
				+\subsubsection{Spreading Objects}
			
 
				+
			
 
				+DHash spread along the DHT nodes?
			
 
				+Mainline DHT peer registry?
			
 
				+
			
 
				+\subsubsection{Pinning Objects}
			
 
				+
			
 
				+
			
 
				+\section{Conclusions}
			
 
				+
			
 
				+
			
 
				+
			
 
				+
			
 
				+%\section{Acknowledgments}
			
 
				+
			
 
				+
			
 
				+%\bibliographystyle{abbrv}
			
 
				+%\bibliography{gfs}
			
 
				+%\balancecolumns
			
 
				+%\subsection{References}
			
 
				+\end{document}