123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262 |
- \documentclass{sig-alternate}
- \usepackage{mathtools}
- \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
- \begin{document}
- % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
- \title{Galactic File System}
- \subtitle{}
- \numberofauthors{1}
- \author{
- % You can go ahead and credit any number of authors here,
- % e.g. one 'row of three' or two rows (consisting of one row of three
- % and a second row of one, two or three).
- %
- % The command \alignauthor (no curly braces needed) should
- % precede each author name, affiliation/snail-mail address and
- % e-mail address. Additionally, tag each line of
- % affiliation/address with \affaddr, and tag the
- % e-mail address with \email.
- %
- % 1st. author
- \alignauthor
- Juan Benet\\
- \email{juan@benet.ai}
- }
- \maketitle
- \begin{abstract}
- The Galactic File System is a peer-to-peer distributed file system capable of
- sharing the same files with millions of nodes. GFS combines a distributed
- hashtable, cryptographic techniques, merkle trees, content-addressable
- storage, bittorrent, and tag-based filesystems to build a single massive
- file system shared between peers. GFS has no single point of failure, and
- nodes do not need to trust each other.
- \end{abstract}
- \section{Introduction}
- Cite:
- CFS
- Kademlia
- Bittorrent
- Chord
- DHash
- SFS
- Ori
- \section{GFS Overview}
- GFS is a distributed file system where all nodes are the same. Together, the
- nodes store the GFS files in local storage, and send the files to each other.
- GFS implements its features by combining several subsystems with many
- desirable properties:
- \begin{enumerate}
- \item A Coral-based \textbf{Distributed Sloppy Hash Table} (DSHT) to link and
- coordinate peer-to-peer nodes.
- \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
- Blocks efficiently, and to incentivize replication.
- \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
- \item An SFS-based self-certifying name system.
- \end{enumerate}
- These subsystems are not independent. They are well integrated and leverage
- their blended properties. However, it is useful to describe them separately,
- building the system from the bottom up. Note that all GFS nodes are identical,
- and run the same program.
- \subsection{Distributed Sloppy Hash Table}
- First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
- and identify which nodes can serve a particular block of data.
- \subsubsection{Kademlia DHT}
- Kademlia is a DHT that provides:
- \begin{enumerate}
- \item Efficient lookup through massive networks:
- queries on average contact $ \ceil{log_2 (n)} $ nodes.
- (e.g. $20$ hops for a network of $10000000$ nodes).
- \item Low coordination overhead: it optimizes the number of
- control messages it sends to other nodes.
- \item Resistance to various attacks, by preferring nodes who have been
- part of the DHT longer.
- \item wide useage in peer-to-peer applications, including Gnutella and
- Bittorrent, forming networks of over 100 million nodes.
- \end{enumerate}
- While some peer-to-peer filesystems store data blocks directly in DHTs,
- this ``wastes storage and bandwidth, as data must be stored at nodes where it
- is not needed''. Instead, GFS stores a list of peers that can provide the data block.
- \subsubsection{Coral DSHT}
- Coral extends Kademlia in three particularly important ways:
- \begin{enumerate}
- \item Kademlia stores values in nodes whose ids are ``nearest'' (using
- XOR-distance) to the key. This does not take into account application
- data locality, ignores ``far'' nodes who may already have the data, and
- forces ``nearest'' nodes to store it, whether they need it or not.
- This wastes significant storage and bandwith. Instead, Coral stores
- addresses to peers who can provide the data blocks.
- \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
- \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
- This still works since Coral users only need a single (working) peer,
- not the complete list. In return, Coral can distribute only subsets of
- the values to the ``nearest'' nodes, avoiding hot-spots (overloading
- \textit{all the nearest nodes} when a key becomes popular).
- \item Additionally, Coral organizes a hierarchy of separate DSHTs called
- \textit{clusters} depending on region and size. This enables nodes to
- query peers in their region first, ``finding nearby data without
- querying distant nodes'' and greatly reducing the latency of
- lookups.
- \end{enumerate}
- \subsubsection{GFS DSHT}
- The GFS DSHT supports four RPC calls:
- \subsection{Object Model}
- Files are represented as a collection of inter-related objects, like in the
- version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
- \begin{enumerate}
- \item \texttt{chunk}: a variable-size block of data.
- \item \texttt{list}: a collection of chunks or other lists.
- \item \texttt{tree}: a collection of chunks, lists, or other trees.
- \end{enumerate}
- \subsubsection{Block Object}
- The \texttt{Block} object contains an addressable unit of data, and
- represents a file.
- GFS Blocks are like Git blobs or filesystem data blocks. They store the
- users' data. (The name \textit{block} is preferred over \textit{blob}, as the
- Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
- GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
- Format:
- \begin{verbatim}
- block <size>
- <block data bytes>
- ...
- \end{verbatim}
- \subsubsection{List Object}
- The \texttt{List} object represents a (large) file made up of several
- GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
- an ordered sequence of \texttt{block} or \texttt{list} objects.
- In a sense, the GFS \texttt{List} functions like a filesystem file with
- indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
- Format:
- \begin{verbatim}
- blob <num objects> <size>
- <list or block> <checksum> <size>
- <list or block> <checksum> <size>
- ...
- \end{verbatim}
- \subsubsection{Tree Object}
- The \texttt{tree} object in GFS is similar to Git trees: it represents a
- directory, a list of checksums and names. The checksums reference \texttt{blob}
- or other \texttt{tree} objects. Note that traditional path naming
- is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
- \texttt{lists} are only addressed by their \texttt{checksums}.
- % Unlike in Git, GFS trees include file-system metadata such as file
- %permissions.
- Format:
- \begin{verbatim}
- tree <num objects> <size>
- <tree or list or block> <checksum> <size> <name>
- <tree or list or block> <checksum> <size> <name>
- ...
- \end{verbatim}
- \subsubsection{Commit Object}
- The \texttt{commit} object in GFS is similar to Git's. It represents a
- snapshot in the version history of a \texttt{tree}.
- \begin{verbatim}
- commit <size>
- parent <commit checksum>
- tree <tree checksum>
- author Full Name <email@address.com> <ISO UTC date>
- committer Full Name <email@address.com> <ISO UTC date>
- <commit message>
- \end{verbatim}
- \subsubsection{Version control}
- \subsubsection{Signed Objects}
- All objects can be signed. Add signature to bottom of object.
- (yes, this changes the hash, as it should)
- \subsubsection{Merkle Trees}
- The object model in GFS forms a \textit{Merkle Tree}, where every object
- contains hashes of its children. This provides GFS with the useful properties
- of merkle trees:
- \begin{enumerate}
- \item Tamper resistance
- \end{enumerate}
- \subsubsection{Published Branches}
- Users can publish branches (filesystems) with:
- publickey -> signed tree of branches
- \subsection{Chunk Exchange}
- \subsection{Object Distribution}
- \subsubsection{Spreading Objects}
- DHash spread along the DHT nodes?
- Mainline DHT peer registry?
- \subsubsection{Pinning Objects}
- \section{Conclusions}
- %\section{Acknowledgments}
- %\bibliographystyle{abbrv}
- %\bibliography{gfs}
- %\balancecolumns
- %\subsection{References}
- \end{document}
|