|
@@ -1,179 +1,262 @@
|
|
|
-\documentclass{sig-alternate}
|
|
|
-
|
|
|
-\begin{document}
|
|
|
-
|
|
|
-\conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
|
|
|
-
|
|
|
-\title{Gallactic File System}
|
|
|
-\subtitle{}
|
|
|
-
|
|
|
-\numberofauthors{1}
|
|
|
-
|
|
|
-\author{
|
|
|
-% You can go ahead and credit any number of authors here,
|
|
|
-% e.g. one 'row of three' or two rows (consisting of one row of three
|
|
|
-% and a second row of one, two or three).
|
|
|
-%
|
|
|
-% The command \alignauthor (no curly braces needed) should
|
|
|
-% precede each author name, affiliation/snail-mail address and
|
|
|
-% e-mail address. Additionally, tag each line of
|
|
|
-% affiliation/address with \affaddr, and tag the
|
|
|
-% e-mail address with \email.
|
|
|
-%
|
|
|
-% 1st. author
|
|
|
-\alignauthor
|
|
|
-Juan Benet\\
|
|
|
- \affaddr{athena.ai}\\
|
|
|
- \affaddr{498 Walsh Rd}\\
|
|
|
- \affaddr{Atherton, CA, USA}\\
|
|
|
- \email{juan@benet.ai}
|
|
|
-}
|
|
|
-
|
|
|
-\maketitle
|
|
|
-\begin{abstract}
|
|
|
-The Gallactic File System is a peer-to-peer distributed file system capable of
|
|
|
-sharing the same files with millions of nodes. GFS combines a distributed
|
|
|
-hashtable, cryptographic techniques, merkle trees, content-addressable
|
|
|
-storage, bittorrent, and tag-based filesystems to build a single massive
|
|
|
-file system shared between peers. GFS has no single point of failure, and
|
|
|
-nodes do not need to trust each other.
|
|
|
-\end{abstract}
|
|
|
-
|
|
|
-\section{Introduction}
|
|
|
-
|
|
|
-\section{GFS Overview}
|
|
|
-
|
|
|
-GFS is a distributed file system where all nodes are the same. Together, the
|
|
|
-nodes store the GFS files in local storage, and send the files to each other.
|
|
|
-GFS implements its features by combining three well-known systems:
|
|
|
-
|
|
|
-\begin{enumerate}
|
|
|
- \item A Git-like \textbf{Object Model} to represent the filesystem.
|
|
|
- \item A Kademlia-based \textbf{Distributed Hash Table} to coordinate the retrieval of files.
|
|
|
- \item A Bittorrent-like peer-to-peer data \textbf{Chunk Exchange}.
|
|
|
-\end{enumerate}
|
|
|
-
|
|
|
-
|
|
|
-\subsection{Object Model}
|
|
|
-
|
|
|
-Files are represented as a collection of inter-related objects, like in the
|
|
|
-version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
|
|
|
-
|
|
|
-\begin{enumerate}
|
|
|
- \item \texttt{chunk}: a variable-size block of data.
|
|
|
- \item \texttt{list}: a collection of chunks or other lists.
|
|
|
- \item \texttt{tree}: a collection of chunks, lists, or other trees.
|
|
|
-\end{enumerate}
|
|
|
-
|
|
|
-\subsubsection{Block Object}
|
|
|
-
|
|
|
-The \texttt{Block} object contains an addressable unit of data, and
|
|
|
-represents a file.
|
|
|
-GFS Blocks are like Git blobs or filesystem data blocks. They store the
|
|
|
-users' data. (The name \textit{block} is preferred over \textit{blob}, as the
|
|
|
-Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
|
|
|
-GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
|
|
|
-Format:
|
|
|
-\begin{verbatim}
|
|
|
-block <size>
|
|
|
-<block data bytes>
|
|
|
-...
|
|
|
-\end{verbatim}
|
|
|
-
|
|
|
-
|
|
|
-\subsubsection{List Object}
|
|
|
-
|
|
|
-The \texttt{List} object represents a (large) file made up of several
|
|
|
-GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
|
|
|
-an ordered sequence of \texttt{block} or \texttt{list} objects.
|
|
|
-In a sense, the GFS \texttt{List} functions like a filesystem file with
|
|
|
-indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
|
|
|
-Format:
|
|
|
-\begin{verbatim}
|
|
|
-blob <num objects> <size>
|
|
|
-<list or block> <checksum> <size>
|
|
|
-<list or block> <checksum> <size>
|
|
|
-...
|
|
|
-\end{verbatim}
|
|
|
-
|
|
|
-
|
|
|
-\subsubsection{Tree Object}
|
|
|
-
|
|
|
-The \texttt{tree} object in GFS is similar to Git trees: it represents a
|
|
|
-directory, a list of checksums and names. The checksums reference \texttt{blob}
|
|
|
-or other \texttt{tree} objects. Note that traditional path naming
|
|
|
-is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
|
|
|
-\texttt{lists} are only addressed by their \texttt{checksums}.
|
|
|
-% Unlike in Git, GFS trees include file-system metadata such as file
|
|
|
-%permissions.
|
|
|
-Format:
|
|
|
-\begin{verbatim}
|
|
|
-tree <num objects> <size>
|
|
|
-<tree or list or block> <checksum> <size> <name>
|
|
|
-<tree or list or block> <checksum> <size> <name>
|
|
|
-...
|
|
|
-\end{verbatim}
|
|
|
-
|
|
|
-\subsubsection{Commit Object}
|
|
|
-
|
|
|
-The \texttt{commit} object in GFS is similar to Git's. It represents a
|
|
|
-snapshot in the version history of a \texttt{tree}.
|
|
|
-
|
|
|
-\begin{verbatim}
|
|
|
-commit <size>
|
|
|
-parent <commit checksum>
|
|
|
-tree <tree checksum>
|
|
|
-author Full Name <email@address.com> <ISO UTC date>
|
|
|
-committer Full Name <email@address.com> <ISO UTC date>
|
|
|
-<commit message>
|
|
|
-\end{verbatim}
|
|
|
-
|
|
|
-\subsubsection{Version control}
|
|
|
-
|
|
|
-\subsubsection{Signed Objects}
|
|
|
-
|
|
|
-All objects can be signed. Add signature to bottom of object.
|
|
|
-(yes, this changes the hash, as it should)
|
|
|
-
|
|
|
-\subsubsection{Merkle Trees}
|
|
|
-
|
|
|
-The object model in GFS forms a \textit{Merkle Tree}, where every object
|
|
|
-contains hashes of its children. This provides GFS with the useful properties
|
|
|
-of merkle trees:
|
|
|
-
|
|
|
-\begin{enumerate}
|
|
|
- \item Tamper resistance
|
|
|
-\end{enumerate}
|
|
|
-
|
|
|
-\subsubsection{Published Branches}
|
|
|
-
|
|
|
-Users can publish branches (filesystems) with:
|
|
|
-publickey -> signed tree of branches
|
|
|
-
|
|
|
-
|
|
|
-\subsection{Distributed Hash Table}
|
|
|
-\subsection{Chunk Exchange}
|
|
|
-
|
|
|
-\subsection{Object Distribution}
|
|
|
-
|
|
|
-\subsubsection{Spreading Objects}
|
|
|
-
|
|
|
-DHash spread along the DHT nodes?
|
|
|
-Mainline DHT peer registry?
|
|
|
-
|
|
|
-\subsubsection{Pinning Objects}
|
|
|
-
|
|
|
-
|
|
|
-\section{Conclusions}
|
|
|
-
|
|
|
-
|
|
|
-
|
|
|
-
|
|
|
-%\section{Acknowledgments}
|
|
|
-
|
|
|
-
|
|
|
-%\bibliographystyle{abbrv}
|
|
|
-%\bibliography{gfs}
|
|
|
-%\balancecolumns
|
|
|
-%\subsection{References}
|
|
|
-\end{document}
|
|
|
+\documentclass{sig-alternate}
|
|
|
+
|
|
|
+\usepackage{mathtools}
|
|
|
+\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
|
|
|
+
|
|
|
+\begin{document}
|
|
|
+
|
|
|
+% \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
|
|
|
+
|
|
|
+\title{Galactic File System}
|
|
|
+\subtitle{}
|
|
|
+
|
|
|
+\numberofauthors{1}
|
|
|
+
|
|
|
+\author{
|
|
|
+% You can go ahead and credit any number of authors here,
|
|
|
+% e.g. one 'row of three' or two rows (consisting of one row of three
|
|
|
+% and a second row of one, two or three).
|
|
|
+%
|
|
|
+% The command \alignauthor (no curly braces needed) should
|
|
|
+% precede each author name, affiliation/snail-mail address and
|
|
|
+% e-mail address. Additionally, tag each line of
|
|
|
+% affiliation/address with \affaddr, and tag the
|
|
|
+% e-mail address with \email.
|
|
|
+%
|
|
|
+% 1st. author
|
|
|
+\alignauthor
|
|
|
+ Juan Benet\\
|
|
|
+ \email{juan@benet.ai}
|
|
|
+}
|
|
|
+
|
|
|
+\maketitle
|
|
|
+\begin{abstract}
|
|
|
+The Galactic File System is a peer-to-peer distributed file system capable of
|
|
|
+sharing the same files with millions of nodes. GFS combines a distributed
|
|
|
+hashtable, cryptographic techniques, merkle trees, content-addressable
|
|
|
+storage, bittorrent, and tag-based filesystems to build a single massive
|
|
|
+file system shared between peers. GFS has no single point of failure, and
|
|
|
+nodes do not need to trust each other.
|
|
|
+\end{abstract}
|
|
|
+
|
|
|
+\section{Introduction}
|
|
|
+
|
|
|
+Cite:
|
|
|
+CFS
|
|
|
+Kademlia
|
|
|
+Bittorrent
|
|
|
+Chord
|
|
|
+DHash
|
|
|
+SFS
|
|
|
+Ori
|
|
|
+
|
|
|
+\section{GFS Overview}
|
|
|
+
|
|
|
+GFS is a distributed file system where all nodes are the same. Together, the
|
|
|
+nodes store the GFS files in local storage, and send the files to each other.
|
|
|
+GFS implements its features by combining several subsystems with many
|
|
|
+desirable properties:
|
|
|
+
|
|
|
+\begin{enumerate}
|
|
|
+ \item A Coral-based \textbf{Distributed Sloppy Hash Table} (DSHT) to link and
|
|
|
+ coordinate peer-to-peer nodes.
|
|
|
+ \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
|
|
|
+ Blocks efficiently, and to incentivize replication.
|
|
|
+ \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
|
|
|
+ \item An SFS-based self-certifying name system.
|
|
|
+\end{enumerate}
|
|
|
+
|
|
|
+
|
|
|
+These subsystems are not independent. They are well integrated and leverage
|
|
|
+their blended properties. However, it is useful to describe them separately,
|
|
|
+building the system from the bottom up. Note that all GFS nodes are identical,
|
|
|
+and run the same program.
|
|
|
+
|
|
|
+\subsection{Distributed Sloppy Hash Table}
|
|
|
+
|
|
|
+First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
|
|
|
+and identify which nodes can serve a particular block of data.
|
|
|
+
|
|
|
+\subsubsection{Kademlia DHT}
|
|
|
+
|
|
|
+Kademlia is a DHT that provides:
|
|
|
+
|
|
|
+\begin{enumerate}
|
|
|
+
|
|
|
+ \item Efficient lookup through massive networks:
|
|
|
+ queries on average contact $ \ceil{log_2 (n)} $ nodes.
|
|
|
+ (e.g. $20$ hops for a network of $10000000$ nodes).
|
|
|
+
|
|
|
+ \item Low coordination overhead: it optimizes the number of
|
|
|
+ control messages it sends to other nodes.
|
|
|
+
|
|
|
+ \item Resistance to various attacks, by preferring nodes who have been
|
|
|
+ part of the DHT longer.
|
|
|
+
|
|
|
+ \item wide useage in peer-to-peer applications, including Gnutella and
|
|
|
+ Bittorrent, forming networks of over 100 million nodes.
|
|
|
+
|
|
|
+ \end{enumerate}
|
|
|
+
|
|
|
+While some peer-to-peer filesystems store data blocks directly in DHTs,
|
|
|
+this ``wastes storage and bandwidth, as data must be stored at nodes where it
|
|
|
+is not needed''. Instead, GFS stores a list of peers that can provide the data block.
|
|
|
+
|
|
|
+\subsubsection{Coral DSHT}
|
|
|
+
|
|
|
+Coral extends Kademlia in three particularly important ways:
|
|
|
+
|
|
|
+\begin{enumerate}
|
|
|
+
|
|
|
+ \item Kademlia stores values in nodes whose ids are ``nearest'' (using
|
|
|
+ XOR-distance) to the key. This does not take into account application
|
|
|
+ data locality, ignores ``far'' nodes who may already have the data, and
|
|
|
+ forces ``nearest'' nodes to store it, whether they need it or not.
|
|
|
+ This wastes significant storage and bandwith. Instead, Coral stores
|
|
|
+ addresses to peers who can provide the data blocks.
|
|
|
+
|
|
|
+ \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
|
|
|
+ \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
|
|
|
+ This still works since Coral users only need a single (working) peer,
|
|
|
+ not the complete list. In return, Coral can distribute only subsets of
|
|
|
+ the values to the ``nearest'' nodes, avoiding hot-spots (overloading
|
|
|
+ \textit{all the nearest nodes} when a key becomes popular).
|
|
|
+
|
|
|
+ \item Additionally, Coral organizes a hierarchy of separate DSHTs called
|
|
|
+ \textit{clusters} depending on region and size. This enables nodes to
|
|
|
+ query peers in their region first, ``finding nearby data without
|
|
|
+ querying distant nodes'' and greatly reducing the latency of
|
|
|
+ lookups.
|
|
|
+
|
|
|
+\end{enumerate}
|
|
|
+
|
|
|
+
|
|
|
+\subsubsection{GFS DSHT}
|
|
|
+
|
|
|
+The GFS DSHT supports four RPC calls:
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+\subsection{Object Model}
|
|
|
+
|
|
|
+Files are represented as a collection of inter-related objects, like in the
|
|
|
+version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
|
|
|
+
|
|
|
+\begin{enumerate}
|
|
|
+ \item \texttt{chunk}: a variable-size block of data.
|
|
|
+ \item \texttt{list}: a collection of chunks or other lists.
|
|
|
+ \item \texttt{tree}: a collection of chunks, lists, or other trees.
|
|
|
+\end{enumerate}
|
|
|
+
|
|
|
+\subsubsection{Block Object}
|
|
|
+
|
|
|
+The \texttt{Block} object contains an addressable unit of data, and
|
|
|
+represents a file.
|
|
|
+GFS Blocks are like Git blobs or filesystem data blocks. They store the
|
|
|
+users' data. (The name \textit{block} is preferred over \textit{blob}, as the
|
|
|
+Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
|
|
|
+GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
|
|
|
+Format:
|
|
|
+\begin{verbatim}
|
|
|
+block <size>
|
|
|
+<block data bytes>
|
|
|
+...
|
|
|
+\end{verbatim}
|
|
|
+
|
|
|
+
|
|
|
+\subsubsection{List Object}
|
|
|
+
|
|
|
+The \texttt{List} object represents a (large) file made up of several
|
|
|
+GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
|
|
|
+an ordered sequence of \texttt{block} or \texttt{list} objects.
|
|
|
+In a sense, the GFS \texttt{List} functions like a filesystem file with
|
|
|
+indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
|
|
|
+Format:
|
|
|
+\begin{verbatim}
|
|
|
+blob <num objects> <size>
|
|
|
+<list or block> <checksum> <size>
|
|
|
+<list or block> <checksum> <size>
|
|
|
+...
|
|
|
+\end{verbatim}
|
|
|
+
|
|
|
+
|
|
|
+\subsubsection{Tree Object}
|
|
|
+
|
|
|
+The \texttt{tree} object in GFS is similar to Git trees: it represents a
|
|
|
+directory, a list of checksums and names. The checksums reference \texttt{blob}
|
|
|
+or other \texttt{tree} objects. Note that traditional path naming
|
|
|
+is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
|
|
|
+\texttt{lists} are only addressed by their \texttt{checksums}.
|
|
|
+% Unlike in Git, GFS trees include file-system metadata such as file
|
|
|
+%permissions.
|
|
|
+Format:
|
|
|
+\begin{verbatim}
|
|
|
+tree <num objects> <size>
|
|
|
+<tree or list or block> <checksum> <size> <name>
|
|
|
+<tree or list or block> <checksum> <size> <name>
|
|
|
+...
|
|
|
+\end{verbatim}
|
|
|
+
|
|
|
+\subsubsection{Commit Object}
|
|
|
+
|
|
|
+The \texttt{commit} object in GFS is similar to Git's. It represents a
|
|
|
+snapshot in the version history of a \texttt{tree}.
|
|
|
+
|
|
|
+\begin{verbatim}
|
|
|
+commit <size>
|
|
|
+parent <commit checksum>
|
|
|
+tree <tree checksum>
|
|
|
+author Full Name <email@address.com> <ISO UTC date>
|
|
|
+committer Full Name <email@address.com> <ISO UTC date>
|
|
|
+<commit message>
|
|
|
+\end{verbatim}
|
|
|
+
|
|
|
+\subsubsection{Version control}
|
|
|
+
|
|
|
+\subsubsection{Signed Objects}
|
|
|
+
|
|
|
+All objects can be signed. Add signature to bottom of object.
|
|
|
+(yes, this changes the hash, as it should)
|
|
|
+
|
|
|
+\subsubsection{Merkle Trees}
|
|
|
+
|
|
|
+The object model in GFS forms a \textit{Merkle Tree}, where every object
|
|
|
+contains hashes of its children. This provides GFS with the useful properties
|
|
|
+of merkle trees:
|
|
|
+
|
|
|
+\begin{enumerate}
|
|
|
+ \item Tamper resistance
|
|
|
+\end{enumerate}
|
|
|
+
|
|
|
+\subsubsection{Published Branches}
|
|
|
+
|
|
|
+Users can publish branches (filesystems) with:
|
|
|
+publickey -> signed tree of branches
|
|
|
+
|
|
|
+
|
|
|
+\subsection{Chunk Exchange}
|
|
|
+
|
|
|
+\subsection{Object Distribution}
|
|
|
+
|
|
|
+\subsubsection{Spreading Objects}
|
|
|
+
|
|
|
+DHash spread along the DHT nodes?
|
|
|
+Mainline DHT peer registry?
|
|
|
+
|
|
|
+\subsubsection{Pinning Objects}
|
|
|
+
|
|
|
+
|
|
|
+\section{Conclusions}
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+%\section{Acknowledgments}
|
|
|
+
|
|
|
+
|
|
|
+%\bibliographystyle{abbrv}
|
|
|
+%\bibliography{gfs}
|
|
|
+%\balancecolumns
|
|
|
+%\subsection{References}
|
|
|
+\end{document}
|