gfs.tex 13 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382
  1. \documentclass{sig-alternate}
  2. \usepackage{array}
  3. \usepackage{amstext}
  4. \usepackage{mathtools}
  5. \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
  6. \begin{document}
  7. % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
  8. \title{Galactic File System}
  9. \subtitle{}
  10. \numberofauthors{1}
  11. \author{
  12. % You can go ahead and credit any number of authors here,
  13. % e.g. one 'row of three' or two rows (consisting of one row of three
  14. % and a second row of one, two or three).
  15. %
  16. % The command \alignauthor (no curly braces needed) should
  17. % precede each author name, affiliation/snail-mail address and
  18. % e-mail address. Additionally, tag each line of
  19. % affiliation/address with \affaddr, and tag the
  20. % e-mail address with \email.
  21. %
  22. % 1st. author
  23. \alignauthor
  24. Juan Benet\\
  25. \email{juan@benet.ai}
  26. }
  27. \maketitle
  28. \begin{abstract}
  29. The Galactic File System is a peer-to-peer distributed file system capable of
  30. sharing the same files with millions of nodes. GFS combines a distributed
  31. hashtable, cryptographic techniques, merkle trees, content-addressable
  32. storage, bittorrent, and tag-based filesystems to build a single massive
  33. file system shared between peers. GFS has no single point of failure, and
  34. nodes do not need to trust each other.
  35. \end{abstract}
  36. \section{Introduction}
  37. Cite:
  38. CFS
  39. Kademlia
  40. Bittorrent
  41. Chord
  42. DHash
  43. SFS
  44. Ori
  45. \section{Design}
  46. \subsection{GFS Nodes}
  47. GFS is a distributed file system where all nodes are the same. They are
  48. identified by a \texttt{NodeId}, the cryptographic hash of a public-key
  49. (note that \textit{checksum} will henceforth refer specifically to crypographic
  50. hashes of an object). Nodes also store their public + private keys. Clients are
  51. free to instatiate a new node on every launch, though that means losing any
  52. accrued benefits. It is recommended that nodes remain the same.
  53. \begin{verbatim}
  54. type Node struct {
  55. id NodeID
  56. pubkey PublicKey
  57. prikey PrivateKey
  58. }
  59. \end{verbatim}
  60. Together, the
  61. nodes store the GFS files in local storage, and send files to each other.
  62. GFS implements its features by combining several subsystems with many
  63. desirable properties:
  64. \begin{enumerate}
  65. \item A Coral-based \textbf{Distributed Sloppy Hash Table}\\
  66. (DSHT) to link and coordinate peer-to-peer nodes.
  67. \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
  68. Blocks efficiently, and to incentivize replication.
  69. \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
  70. \item An SFS-based self-certifying name system.
  71. \end{enumerate}
  72. These subsystems are not independent. They are well integrated and leverage
  73. their blended properties. However, it is useful to describe them separately,
  74. building the system from the bottom up. Note that all GFS nodes are identical,
  75. and run the same program.
  76. \subsection{Distributed Sloppy Hash Table}
  77. First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
  78. and identify which nodes can serve a particular block of data.
  79. \subsubsection{Kademlia DHT}
  80. Kademlia is a DHT that provides:
  81. \begin{enumerate}
  82. \item Efficient lookup through massive networks:
  83. queries on average contact $ \ceil{log_2 (n)} $ nodes.
  84. (e.g. $20$ hops for a network of $10000000$ nodes).
  85. \item Low coordination overhead: it optimizes the number of
  86. control messages it sends to other nodes.
  87. \item Resistance to various attacks, by preferring nodes who have been
  88. part of the DHT longer.
  89. \item wide useage in peer-to-peer applications, including Gnutella and
  90. Bittorrent, forming networks of over 100 million nodes.
  91. \end{enumerate}
  92. While some peer-to-peer filesystems store data blocks directly in DHTs,
  93. this ``wastes storage and bandwidth, as data must be stored at nodes where it
  94. is not needed''. Instead, GFS stores a list of peers that can provide the data block.
  95. \subsubsection{Coral DSHT}
  96. Coral extends Kademlia in three particularly important ways:
  97. \begin{enumerate}
  98. \item Kademlia stores values in nodes whose ids are ``nearest'' (using
  99. XOR-distance) to the key. This does not take into account application
  100. data locality, ignores ``far'' nodes who may already have the data, and
  101. forces ``nearest'' nodes to store it, whether they need it or not.
  102. This wastes significant storage and bandwith. Instead, Coral stores
  103. addresses to peers who can provide the data blocks.
  104. \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
  105. \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
  106. This still works since Coral users only need a single (working) peer,
  107. not the complete list. In return, Coral can distribute only subsets of
  108. the values to the ``nearest'' nodes, avoiding hot-spots (overloading
  109. \textit{all the nearest nodes} when a key becomes popular).
  110. \item Additionally, Coral organizes a hierarchy of separate DSHTs called
  111. \textit{clusters} depending on region and size. This enables nodes to
  112. query peers in their region first, ``finding nearby data without
  113. querying distant nodes'' and greatly reducing the latency of
  114. lookups.
  115. \end{enumerate}
  116. \subsubsection{GFS DSHT}
  117. The GFS DSHT supports four RPC calls:
  118. \subsection{Block Exchange - BitSwap Protocol}
  119. The exchange of data in GFS happens by exchanging blocks with peers using a
  120. BitTorrent inspired protocol: BitSwap. Like BitTorrent, BitSwap peers are
  121. looking to acquire a set of blocks, and have blocks to offer in exchange.
  122. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent.
  123. BitSwap operates as a persistent marketplace where node can acquire the
  124. blocks they need, regardless of what files the blocks are part of. The
  125. blocks could come from completely unrelated files in the filesystem.
  126. But nodes come together to barter in the marketplace.
  127. While the notion of a barter system implies a virtual currency could be
  128. created, this would require a global ledger (blockchain) to track ownership
  129. and transfer of the currency. This will be explored in a future paper.
  130. Instead, BitSwap nodes have to provide direct value to each other
  131. in the form of blocks. This works fine when the distribution of blocks across
  132. nodes is such that they have the complements, what each other wants. This will
  133. seldom be the case. Instead, it is more likely that nodes must \textit{work}
  134. for their blocks. In the case that a node has nothing that its peers want (or
  135. nothing at all), it seeks the pieces its peers might want, with lower
  136. priority. This incentivizes nodes to cache and disseminate rare pieces, even
  137. if they are not interested in them directly.
  138. \subsubsection{BitSwap Credit}
  139. The protocol must also incentivize nodes to seed when they do not need
  140. anything in particular, as they might have the blocks others want. Thus,
  141. BitFlow nodes send blocks to their peers, optimistically expecting the debt to
  142. be repaid. But, leeches (free-loading nodes that never share) must be avoided. A simple credit-like system solves the problem:
  143. \begin{enumerate}
  144. \item Peers track their balance (in bytes verified) with other nodes.
  145. \item Peers send blocks to each other probabilistically, according to
  146. a function, that falls when owed and rises when owing.
  147. \item The sigmoid (scaled by a comparison of the ownership) provides such a
  148. function:
  149. \[ P(send) = \dfrac{1}{1 + exp(-r)} \]
  150. where the \textit{debt ratio} $ r $ is
  151. \[ r = \dfrac{\texttt{bytes\_recv} - \texttt{bytes\_sent}}{\texttt{bytes\_sent}} \]
  152. \end{enumerate}
  153. \begin{center}
  154. \begin{tabular}{ >{$}c<{$} >{$}c<{$}}
  155. P_{send}(\;\;\;r) =& likelihood \\
  156. \hline
  157. \hline
  158. P_{send}(-5) =& 0.01 \\
  159. P_{send}(-4) =& 0.02 \\
  160. P_{send}(-3) =& 0.05 \\
  161. P_{send}(-2) =& 0.12 \\
  162. P_{send}(-1) =& 0.27 \\
  163. P_{send}(\;\;\;0) =& 0.50 \\
  164. P_{send}(\;\;\;1) =& 0.73 \\
  165. P_{send}(\;\;\;2) =& 0.88 \\
  166. P_{send}(\;\;\;3) =& 0.95 \\
  167. P_{send}(\;\;\;4) =& 0.98 \\
  168. \end{tabular}
  169. \end{center}
  170. As you can see in Table 1, this function drops off quickly as the nodes' \
  171. \textit{debt ratio} surpasses twice the established credit.
  172. This \textit{debt ratio} is a measure of trust:
  173. lenient to debts between nodes that have previously exchanged lots of data
  174. successfully, and merciless to unknown, untrusted nodes. This
  175. (a) provides resistane to attackers who would create lots of new nodes,
  176. (b) protects previously successful trade relationships, even if one of the
  177. nodes is temporarily unable to provide value, and
  178. (c) eventually chokes relationships that have deteriorated until they
  179. improve.
  180. \subsubsection{BitSwap Ledger}
  181. BitSwap nodes keep ledgers accounting the transfers with other nodes.
  182. A ledger snapshot contains a pointer to the previous snapshot (its checksum),
  183. forming a hash-chain. This allows nodes to keep track of history, and to avoid
  184. tampering. At initializing, BitSwap nodes exchange their ledger information.
  185. If it does not match exactly, the ledger is reinitialized from scratch,
  186. loosing the accrued credit or debt. It is possible for malicious nodes to
  187. purposefully ``loose'' the Ledger, hoping the erase debts. It is unlikely that
  188. nodes will have accrued enough debt to warrant also losing the accrued trust,
  189. however the partner node is free to count it as \textit{misconduct} (discussed
  190. later).
  191. \begin{verbatim}
  192. var Ledgers = map[NodeId]Ledger
  193. type Ledger struct {
  194. parent Checksum
  195. owner NodeId
  196. partner NodeId
  197. bytes_sent int
  198. bytes_recv int
  199. }
  200. \end{verbatim}
  201. Nodes are free to keep the ledger history, though it is not necessary for
  202. correct operation. Only the current ledger entries are useful.
  203. \subsubsection{Protocol Specification}
  204. \subsection{Object Model}
  205. Files are represented as a collection of inter-related objects, like in the
  206. version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
  207. \begin{enumerate}
  208. \item \texttt{chunk}: a variable-size block of data.
  209. \item \texttt{list}: a collection of chunks or other lists.
  210. \item \texttt{tree}: a collection of chunks, lists, or other trees.
  211. \end{enumerate}
  212. \subsubsection{Block Object}
  213. The \texttt{Block} object contains an addressable unit of data, and
  214. represents a file.
  215. GFS Blocks are like Git blobs or filesystem data blocks. They store the
  216. users' data. (The name \textit{block} is preferred over \textit{blob}, as the
  217. Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
  218. GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
  219. Format:
  220. \begin{verbatim}
  221. block <size>
  222. <block data bytes>
  223. ...
  224. \end{verbatim}
  225. \subsubsection{List Object}
  226. The \texttt{List} object represents a (large) file made up of several
  227. GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
  228. an ordered sequence of \texttt{block} or \texttt{list} objects.
  229. In a sense, the GFS \texttt{List} functions like a filesystem file with
  230. indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
  231. Format:
  232. \begin{verbatim}
  233. blob <num objects> <size>
  234. <list or block> <checksum> <size>
  235. <list or block> <checksum> <size>
  236. ...
  237. \end{verbatim}
  238. \subsubsection{Tree Object}
  239. The \texttt{tree} object in GFS is similar to Git trees: it represents a
  240. directory, a list of checksums and names. The checksums reference \texttt{blob}
  241. or other \texttt{tree} objects. Note that traditional path naming
  242. is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
  243. \texttt{lists} are only addressed by their \texttt{checksums}.
  244. % Unlike in Git, GFS trees include file-system metadata such as file
  245. %permissions.
  246. Format:
  247. \begin{verbatim}
  248. tree <num objects> <size>
  249. <tree or list or block> <checksum> <size> <name>
  250. <tree or list or block> <checksum> <size> <name>
  251. ...
  252. \end{verbatim}
  253. \subsubsection{Commit Object}
  254. The \texttt{commit} object in GFS is similar to Git's. It represents a
  255. snapshot in the version history of a \texttt{tree}.
  256. \begin{verbatim}
  257. commit <size>
  258. parent <commit checksum>
  259. tree <tree checksum>
  260. author Full Name <email@address.com> <ISO UTC date>
  261. committer Full Name <email@address.com> <ISO UTC date>
  262. <commit message>
  263. \end{verbatim}
  264. \subsubsection{Version control}
  265. \subsubsection{Signed Objects}
  266. All objects can be signed. Add signature to bottom of object.
  267. (yes, this changes the hash, as it should)
  268. \subsubsection{Merkle Trees}
  269. The object model in GFS forms a \textit{Merkle Tree}, where every object
  270. contains hashes of its children. This provides GFS with the useful properties
  271. of merkle trees:
  272. \begin{enumerate}
  273. \item Tamper resistance
  274. \end{enumerate}
  275. \subsubsection{Published Branches}
  276. Users can publish branches (filesystems) with:
  277. publickey -> signed tree of branches
  278. \subsection{Object Distribution}
  279. \subsubsection{Spreading Objects}
  280. DHash spread along the DHT nodes?
  281. Mainline DHT peer registry?
  282. \subsubsection{Pinning Objects}
  283. \section{Conclusions}
  284. %\section{Acknowledgments}
  285. %\bibliographystyle{abbrv}
  286. %\bibliography{gfs}
  287. %\balancecolumns
  288. %\subsection{References}
  289. \end{document}