123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382 |
- \documentclass{sig-alternate}
- \usepackage{array}
- \usepackage{amstext}
- \usepackage{mathtools}
- \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
- \begin{document}
- % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
- \title{Galactic File System}
- \subtitle{}
- \numberofauthors{1}
- \author{
- % You can go ahead and credit any number of authors here,
- % e.g. one 'row of three' or two rows (consisting of one row of three
- % and a second row of one, two or three).
- %
- % The command \alignauthor (no curly braces needed) should
- % precede each author name, affiliation/snail-mail address and
- % e-mail address. Additionally, tag each line of
- % affiliation/address with \affaddr, and tag the
- % e-mail address with \email.
- %
- % 1st. author
- \alignauthor
- Juan Benet\\
- \email{juan@benet.ai}
- }
- \maketitle
- \begin{abstract}
- The Galactic File System is a peer-to-peer distributed file system capable of
- sharing the same files with millions of nodes. GFS combines a distributed
- hashtable, cryptographic techniques, merkle trees, content-addressable
- storage, bittorrent, and tag-based filesystems to build a single massive
- file system shared between peers. GFS has no single point of failure, and
- nodes do not need to trust each other.
- \end{abstract}
- \section{Introduction}
- Cite:
- CFS
- Kademlia
- Bittorrent
- Chord
- DHash
- SFS
- Ori
- \section{Design}
- \subsection{GFS Nodes}
- GFS is a distributed file system where all nodes are the same. They are
- identified by a \texttt{NodeId}, the cryptographic hash of a public-key
- (note that \textit{checksum} will henceforth refer specifically to crypographic
- hashes of an object). Nodes also store their public + private keys. Clients are
- free to instatiate a new node on every launch, though that means losing any
- accrued benefits. It is recommended that nodes remain the same.
- \begin{verbatim}
- type Node struct {
- id NodeID
- pubkey PublicKey
- prikey PrivateKey
- }
- \end{verbatim}
- Together, the
- nodes store the GFS files in local storage, and send files to each other.
- GFS implements its features by combining several subsystems with many
- desirable properties:
- \begin{enumerate}
- \item A Coral-based \textbf{Distributed Sloppy Hash Table}\\
- (DSHT) to link and coordinate peer-to-peer nodes.
- \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
- Blocks efficiently, and to incentivize replication.
- \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
- \item An SFS-based self-certifying name system.
- \end{enumerate}
- These subsystems are not independent. They are well integrated and leverage
- their blended properties. However, it is useful to describe them separately,
- building the system from the bottom up. Note that all GFS nodes are identical,
- and run the same program.
- \subsection{Distributed Sloppy Hash Table}
- First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
- and identify which nodes can serve a particular block of data.
- \subsubsection{Kademlia DHT}
- Kademlia is a DHT that provides:
- \begin{enumerate}
- \item Efficient lookup through massive networks:
- queries on average contact $ \ceil{log_2 (n)} $ nodes.
- (e.g. $20$ hops for a network of $10000000$ nodes).
- \item Low coordination overhead: it optimizes the number of
- control messages it sends to other nodes.
- \item Resistance to various attacks, by preferring nodes who have been
- part of the DHT longer.
- \item wide useage in peer-to-peer applications, including Gnutella and
- Bittorrent, forming networks of over 100 million nodes.
- \end{enumerate}
- While some peer-to-peer filesystems store data blocks directly in DHTs,
- this ``wastes storage and bandwidth, as data must be stored at nodes where it
- is not needed''. Instead, GFS stores a list of peers that can provide the data block.
- \subsubsection{Coral DSHT}
- Coral extends Kademlia in three particularly important ways:
- \begin{enumerate}
- \item Kademlia stores values in nodes whose ids are ``nearest'' (using
- XOR-distance) to the key. This does not take into account application
- data locality, ignores ``far'' nodes who may already have the data, and
- forces ``nearest'' nodes to store it, whether they need it or not.
- This wastes significant storage and bandwith. Instead, Coral stores
- addresses to peers who can provide the data blocks.
- \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
- \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
- This still works since Coral users only need a single (working) peer,
- not the complete list. In return, Coral can distribute only subsets of
- the values to the ``nearest'' nodes, avoiding hot-spots (overloading
- \textit{all the nearest nodes} when a key becomes popular).
- \item Additionally, Coral organizes a hierarchy of separate DSHTs called
- \textit{clusters} depending on region and size. This enables nodes to
- query peers in their region first, ``finding nearby data without
- querying distant nodes'' and greatly reducing the latency of
- lookups.
- \end{enumerate}
- \subsubsection{GFS DSHT}
- The GFS DSHT supports four RPC calls:
- \subsection{Block Exchange - BitSwap Protocol}
- The exchange of data in GFS happens by exchanging blocks with peers using a
- BitTorrent inspired protocol: BitSwap. Like BitTorrent, BitSwap peers are
- looking to acquire a set of blocks, and have blocks to offer in exchange.
- Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent.
- BitSwap operates as a persistent marketplace where node can acquire the
- blocks they need, regardless of what files the blocks are part of. The
- blocks could come from completely unrelated files in the filesystem.
- But nodes come together to barter in the marketplace.
- While the notion of a barter system implies a virtual currency could be
- created, this would require a global ledger (blockchain) to track ownership
- and transfer of the currency. This will be explored in a future paper.
- Instead, BitSwap nodes have to provide direct value to each other
- in the form of blocks. This works fine when the distribution of blocks across
- nodes is such that they have the complements, what each other wants. This will
- seldom be the case. Instead, it is more likely that nodes must \textit{work}
- for their blocks. In the case that a node has nothing that its peers want (or
- nothing at all), it seeks the pieces its peers might want, with lower
- priority. This incentivizes nodes to cache and disseminate rare pieces, even
- if they are not interested in them directly.
- \subsubsection{BitSwap Credit}
- The protocol must also incentivize nodes to seed when they do not need
- anything in particular, as they might have the blocks others want. Thus,
- BitFlow nodes send blocks to their peers, optimistically expecting the debt to
- be repaid. But, leeches (free-loading nodes that never share) must be avoided. A simple credit-like system solves the problem:
- \begin{enumerate}
- \item Peers track their balance (in bytes verified) with other nodes.
- \item Peers send blocks to each other probabilistically, according to
- a function, that falls when owed and rises when owing.
- \item The sigmoid (scaled by a comparison of the ownership) provides such a
- function:
- \[ P(send) = \dfrac{1}{1 + exp(-r)} \]
- where the \textit{debt ratio} $ r $ is
- \[ r = \dfrac{\texttt{bytes\_recv} - \texttt{bytes\_sent}}{\texttt{bytes\_sent}} \]
- \end{enumerate}
- \begin{center}
- \begin{tabular}{ >{$}c<{$} >{$}c<{$}}
- P_{send}(\;\;\;r) =& likelihood \\
- \hline
- \hline
- P_{send}(-5) =& 0.01 \\
- P_{send}(-4) =& 0.02 \\
- P_{send}(-3) =& 0.05 \\
- P_{send}(-2) =& 0.12 \\
- P_{send}(-1) =& 0.27 \\
- P_{send}(\;\;\;0) =& 0.50 \\
- P_{send}(\;\;\;1) =& 0.73 \\
- P_{send}(\;\;\;2) =& 0.88 \\
- P_{send}(\;\;\;3) =& 0.95 \\
- P_{send}(\;\;\;4) =& 0.98 \\
- \end{tabular}
- \end{center}
- As you can see in Table 1, this function drops off quickly as the nodes' \
- \textit{debt ratio} surpasses twice the established credit.
- This \textit{debt ratio} is a measure of trust:
- lenient to debts between nodes that have previously exchanged lots of data
- successfully, and merciless to unknown, untrusted nodes. This
- (a) provides resistane to attackers who would create lots of new nodes,
- (b) protects previously successful trade relationships, even if one of the
- nodes is temporarily unable to provide value, and
- (c) eventually chokes relationships that have deteriorated until they
- improve.
- \subsubsection{BitSwap Ledger}
- BitSwap nodes keep ledgers accounting the transfers with other nodes.
- A ledger snapshot contains a pointer to the previous snapshot (its checksum),
- forming a hash-chain. This allows nodes to keep track of history, and to avoid
- tampering. At initializing, BitSwap nodes exchange their ledger information.
- If it does not match exactly, the ledger is reinitialized from scratch,
- loosing the accrued credit or debt. It is possible for malicious nodes to
- purposefully ``loose'' the Ledger, hoping the erase debts. It is unlikely that
- nodes will have accrued enough debt to warrant also losing the accrued trust,
- however the partner node is free to count it as \textit{misconduct} (discussed
- later).
- \begin{verbatim}
- var Ledgers = map[NodeId]Ledger
- type Ledger struct {
- parent Checksum
- owner NodeId
- partner NodeId
- bytes_sent int
- bytes_recv int
- }
- \end{verbatim}
- Nodes are free to keep the ledger history, though it is not necessary for
- correct operation. Only the current ledger entries are useful.
- \subsubsection{Protocol Specification}
- \subsection{Object Model}
- Files are represented as a collection of inter-related objects, like in the
- version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
- \begin{enumerate}
- \item \texttt{chunk}: a variable-size block of data.
- \item \texttt{list}: a collection of chunks or other lists.
- \item \texttt{tree}: a collection of chunks, lists, or other trees.
- \end{enumerate}
- \subsubsection{Block Object}
- The \texttt{Block} object contains an addressable unit of data, and
- represents a file.
- GFS Blocks are like Git blobs or filesystem data blocks. They store the
- users' data. (The name \textit{block} is preferred over \textit{blob}, as the
- Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
- GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
- Format:
- \begin{verbatim}
- block <size>
- <block data bytes>
- ...
- \end{verbatim}
- \subsubsection{List Object}
- The \texttt{List} object represents a (large) file made up of several
- GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
- an ordered sequence of \texttt{block} or \texttt{list} objects.
- In a sense, the GFS \texttt{List} functions like a filesystem file with
- indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
- Format:
- \begin{verbatim}
- blob <num objects> <size>
- <list or block> <checksum> <size>
- <list or block> <checksum> <size>
- ...
- \end{verbatim}
- \subsubsection{Tree Object}
- The \texttt{tree} object in GFS is similar to Git trees: it represents a
- directory, a list of checksums and names. The checksums reference \texttt{blob}
- or other \texttt{tree} objects. Note that traditional path naming
- is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
- \texttt{lists} are only addressed by their \texttt{checksums}.
- % Unlike in Git, GFS trees include file-system metadata such as file
- %permissions.
- Format:
- \begin{verbatim}
- tree <num objects> <size>
- <tree or list or block> <checksum> <size> <name>
- <tree or list or block> <checksum> <size> <name>
- ...
- \end{verbatim}
- \subsubsection{Commit Object}
- The \texttt{commit} object in GFS is similar to Git's. It represents a
- snapshot in the version history of a \texttt{tree}.
- \begin{verbatim}
- commit <size>
- parent <commit checksum>
- tree <tree checksum>
- author Full Name <email@address.com> <ISO UTC date>
- committer Full Name <email@address.com> <ISO UTC date>
- <commit message>
- \end{verbatim}
- \subsubsection{Version control}
- \subsubsection{Signed Objects}
- All objects can be signed. Add signature to bottom of object.
- (yes, this changes the hash, as it should)
- \subsubsection{Merkle Trees}
- The object model in GFS forms a \textit{Merkle Tree}, where every object
- contains hashes of its children. This provides GFS with the useful properties
- of merkle trees:
- \begin{enumerate}
- \item Tamper resistance
- \end{enumerate}
- \subsubsection{Published Branches}
- Users can publish branches (filesystems) with:
- publickey -> signed tree of branches
- \subsection{Object Distribution}
- \subsubsection{Spreading Objects}
- DHash spread along the DHT nodes?
- Mainline DHT peer registry?
- \subsubsection{Pinning Objects}
- \section{Conclusions}
- %\section{Acknowledgments}
- %\bibliographystyle{abbrv}
- %\bibliography{gfs}
- %\balancecolumns
- %\subsection{References}
- \end{document}
|