gfs.tex 21 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590
  1. \documentclass{sig-alternate}
  2. \usepackage{array}
  3. \usepackage{amstext}
  4. \usepackage{mathtools}
  5. \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
  6. \begin{document}
  7. % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
  8. \title{Galactic File System}
  9. \subtitle{}
  10. \numberofauthors{1}
  11. \author{
  12. % You can go ahead and credit any number of authors here,
  13. % e.g. one 'row of three' or two rows (consisting of one row of three
  14. % and a second row of one, two or three).
  15. %
  16. % The command \alignauthor (no curly braces needed) should
  17. % precede each author name, affiliation/snail-mail address and
  18. % e-mail address. Additionally, tag each line of
  19. % affiliation/address with \affaddr, and tag the
  20. % e-mail address with \email.
  21. %
  22. % 1st. author
  23. \alignauthor
  24. Juan Benet\\
  25. \email{juan@benet.ai}
  26. }
  27. \maketitle
  28. \begin{abstract}
  29. The Galactic File System is a peer-to-peer distributed file system capable of
  30. sharing the same files with millions of nodes. GFS combines a distributed
  31. hashtable, cryptographic techniques, merkle trees, content-addressable
  32. storage, bittorrent, and tag-based filesystems to build a single massive
  33. file system shared between peers. GFS has no single point of failure, and
  34. nodes do not need to trust each other.
  35. \end{abstract}
  36. \section{Introduction}
  37. [Motivate GFS. Introduce problems. Describe BitTorrent existing problems (
  38. multiple files. one swarm. sloppy dht implementation.) Describe version
  39. control efforts. Propose potential combinations of good ideas.]
  40. [Cite:
  41. CFS,
  42. Kademlia,
  43. Bittorrent,
  44. Chord,
  45. DHash,
  46. SFS,
  47. Ori,
  48. Coral]
  49. This paper introduces
  50. GFS, a novel peer-to-peer version-controlled filesystem;
  51. and BitSwap, the novel peer-to-peer block exchange protocol serving GFS.
  52. The rest of the paper is organized as follows.
  53. Section 2 describes the design of the filesystem.
  54. Section 3 evaluates various facets of the system under benchmark and common
  55. workloads.
  56. Section 4 presents and evaluates a world-wide deployment of GFS.
  57. Section 5 describes existing and potential applications of GFS.
  58. Section 6 discusses related and future work.
  59. Notation Notes:
  60. (a) data structures are specified in Go syntax,
  61. (b) rpc protocols are specified in capnp interface,
  62. (c) object formats are specified in text with <fields>.
  63. \section{Design}
  64. \subsection{GFS Nodes}
  65. GFS is a distributed file system where all nodes are the same. They are
  66. identified by a \texttt{NodeId}, the cryptographic hash of a public-key
  67. (note that \textit{checksum} will henceforth refer specifically to crypographic
  68. hashes of an object). Nodes also store their public and private keys. Clients
  69. are free to instatiate a new node on every launch, though that means losing any
  70. accrued benefits. It is recommended that nodes remain the same.
  71. \begin{verbatim}
  72. type Checksum string
  73. type PublicKey string
  74. type PrivateKey string
  75. type NodeId Checksum
  76. type Node struct {
  77. nodeid NodeID
  78. pubkey PublicKey
  79. prikey PrivateKey
  80. }
  81. \end{verbatim}
  82. Together, the
  83. nodes store the GFS files in local storage, and send files to each other.
  84. GFS implements its features by combining several subsystems with many
  85. desirable properties:
  86. \begin{enumerate}
  87. \item A Coral-based \textbf{Distributed Sloppy Hash Table}\\
  88. (DSHT) to link and coordinate peer-to-peer nodes.
  89. Described in Section 2.2.
  90. \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
  91. Blocks efficiently, and to incentivize replication.
  92. Described in Section 2.3.
  93. \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
  94. Described in Section 2.4.
  95. \item An SFS-based self-certifying name system.
  96. Described in Section 2.5.
  97. \end{enumerate}
  98. These subsystems are not independent. They are well integrated and leverage
  99. their blended properties. However, it is useful to describe them separately,
  100. building the system from the bottom up. Note that all GFS nodes are identical,
  101. and run the same program.
  102. \subsection{Distributed Sloppy Hash Table}
  103. First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
  104. and identify which nodes can serve a particular block of data.
  105. \subsubsection{Kademlia DHT}
  106. Kademlia is a DHT that provides:
  107. \begin{enumerate}
  108. \item Efficient lookup through massive networks:
  109. queries on average contact $ \ceil{log_2 (n)} $ nodes.
  110. (e.g. $20$ hops for a network of $10000000$ nodes).
  111. \item Low coordination overhead: it optimizes the number of
  112. control messages it sends to other nodes.
  113. \item Resistance to various attacks, by preferring nodes who have been
  114. part of the DHT longer.
  115. \item wide useage in peer-to-peer applications, including \\
  116. Gnutella and Bittorrent, forming networks of over 100 million nodes.
  117. \end{enumerate}
  118. While some peer-to-peer filesystems store data blocks directly in DHTs,
  119. this ``wastes storage and bandwidth, as data must be stored at nodes where it
  120. is not needed''. Instead, GFS stores a list of peers that can provide the data block.
  121. \subsubsection{Coral DSHT}
  122. Coral extends Kademlia in three particularly important ways:
  123. \begin{enumerate}
  124. \item Kademlia stores values in nodes whose ids are ``nearest'' (using
  125. XOR-distance) to the key. This does not take into account application
  126. data locality, ignores ``far'' nodes who may already have the data, and
  127. forces ``nearest'' nodes to store it, whether they need it or not.
  128. This wastes significant storage and bandwith. Instead, Coral stores
  129. addresses to peers who can provide the data blocks.
  130. \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
  131. \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
  132. This still works since Coral users only need a single (working) peer,
  133. not the complete list. In return, Coral can distribute only subsets of
  134. the values to the ``nearest'' nodes, avoiding hot-spots (overloading
  135. \textit{all the nearest nodes} when a key becomes popular).
  136. \item Additionally, Coral organizes a hierarchy of separate DSHTs called
  137. \textit{clusters} depending on region and size. This enables nodes to
  138. query peers in their region first, ``finding nearby data without
  139. querying distant nodes'' and greatly reducing the latency of
  140. lookups.
  141. \end{enumerate}
  142. \subsubsection{GFS DSHT}
  143. The GFS DSHT supports four RPC calls:
  144. \subsection{Block Exchange - BitSwap Protocol}
  145. The exchange of data in GFS happens by exchanging blocks with peers using a
  146. BitTorrent inspired protocol: BitSwap. Like BitTorrent, BitSwap peers are
  147. looking to acquire a set of blocks, and have blocks to offer in exchange.
  148. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent.
  149. BitSwap operates as a persistent marketplace where node can acquire the
  150. blocks they need, regardless of what files the blocks are part of. The
  151. blocks could come from completely unrelated files in the filesystem.
  152. But nodes come together to barter in the marketplace.
  153. While the notion of a barter system implies a virtual currency could be
  154. created, this would require a global ledger (blockchain) to track ownership
  155. and transfer of the currency. This will be explored in a future paper.
  156. Instead, BitSwap nodes have to provide direct value to each other
  157. in the form of blocks. This works fine when the distribution of blocks across
  158. nodes is such that they have the complements, what each other wants. This will
  159. seldom be the case. Instead, it is more likely that nodes must \textit{work}
  160. for their blocks. In the case that a node has nothing that its peers want (or
  161. nothing at all), it seeks the pieces its peers might want, with lower
  162. priority. This incentivizes nodes to cache and disseminate rare pieces, even
  163. if they are not interested in them directly.
  164. \subsubsection{BitSwap Credit}
  165. The protocol must also incentivize nodes to seed when they do not need
  166. anything in particular, as they might have the blocks others want. Thus,
  167. BitFlow nodes send blocks to their peers, optimistically expecting the debt to
  168. be repaid. But, leeches (free-loading nodes that never share) must be avoided. A simple credit-like system solves the problem:
  169. \begin{enumerate}
  170. \item Peers track their balance (in bytes verified) with other nodes.
  171. \item Peers send blocks to debtor peers probabilistically, according to
  172. a function that falls as debt increases.
  173. \end{enumerate}
  174. Note that if a peer decides not to send, the peer subsequently ignores the
  175. other node for an \texttt{ignore\_cooldown} timeout. This prevents senders
  176. from trying to game the probability by just causing more dice-rolls.
  177. (Default BitSwap is 10 seconds).
  178. \subsubsection{BitSwap Strategy}
  179. The differing strategies that BitSwap peers might employ have wildly different
  180. effects on the performance of the exchange as a whole. In BitTorrent,
  181. while a standard strategy is specified (tit-for-tat), a variety of others have
  182. been implemented, ranging from BitTyrant (sharing the least-possible),
  183. to BitThief (exploiting a vulnerability and never share), to PropShare (
  184. sharing proportionally). A range of strategies (good and malicious) could
  185. similarly be implemented by BitSwap peers. The choice of function, then, should
  186. aim to:
  187. \begin{enumerate}
  188. \item maximize the trade performance for the node, and the whole exchange
  189. \item prevent freeloaders from exploiting and degrading the exchange
  190. \item be effective with and resistant to other, unknown strategies
  191. \item be lenient to trusted peers
  192. \end{enumerate}
  193. The exploration of the space of such strategies is future work.
  194. One choice of function that works in practice is the sigmoid, scaled by a
  195. \textit{debt retio}:
  196. Let the \textit{debt ratio} $ r $ between a node and its peer be:
  197. \[ r = \dfrac{\texttt{bytes\_sent}}{\texttt{bytes\_recv}} \]
  198. Given $r$, let the probability of sending to a debtor be:
  199. \[ 1 - P\Big( \; send \; | \; r \;\Big) = \dfrac{1}{1 + exp(6-3r)} \]
  200. As you can see in Table 1, this function drops off quickly as the nodes' \
  201. \textit{debt ratio} surpasses twice the established credit.
  202. The \textit{debt ratio} is a measure of trust:
  203. lenient to debts between nodes that have previously exchanged lots of data
  204. successfully, and merciless to unknown, untrusted nodes. This
  205. (a) provides resistance to attackers who would create lots of new nodes
  206. (sybill attacks),
  207. (b) protects previously successful trade relationships, even if one of the
  208. nodes is temporarily unable to provide value, and
  209. (c) eventually chokes relationships that have deteriorated until they
  210. improve.
  211. \begin{center}
  212. \begin{tabular}{ >{$}c<{$} >{$}c<{$}}
  213. P(\;send\;|\quad r) =& likelihood \\
  214. \hline
  215. \hline
  216. P(\;send\;|\;0.0) =& 1.00 \\
  217. P(\;send\;|\;0.5) =& 1.00 \\
  218. P(\;send\;|\;1.0) =& 0.98 \\
  219. P(\;send\;|\;1.5) =& 0.92 \\
  220. P(\;send\;|\;2.0) =& 0.73 \\
  221. P(\;send\;|\;2.5) =& 0.38 \\
  222. P(\;send\;|\;3.0) =& 0.12 \\
  223. P(\;send\;|\;3.5) =& 0.03 \\
  224. P(\;send\;|\;4.0) =& 0.01 \\
  225. P(\;send\;|\;4.5) =& 0.00 \\
  226. \end{tabular}
  227. \end{center}
  228. % TODO look into computing share of the bandwidth, as described in propshare.
  229. \subsubsection{BitSwap Ledger}
  230. BitSwap nodes keep ledgers accounting the transfers with other nodes.
  231. A ledger snapshot contains a pointer to the previous snapshot (its checksum),
  232. forming a hash-chain. This allows nodes to keep track of history, and to avoid
  233. tampering. When activating a connection, BitSwap nodes exchange their ledger
  234. information.
  235. If it does not match exactly, the ledger is reinitialized from scratch,
  236. loosing the accrued credit or debt. It is possible for malicious nodes to
  237. purposefully ``loose'' the Ledger, hoping the erase debts. It is unlikely that
  238. nodes will have accrued enough debt to warrant also losing the accrued trust,
  239. however the partner node is free to count it as \textit{misconduct} (discussed
  240. later).
  241. \begin{verbatim}
  242. type Ledger struct {
  243. parent Checksum
  244. owner NodeId
  245. partner NodeId
  246. bytes_sent int
  247. bytes_recv int
  248. timestamp Timestamp
  249. }
  250. \end{verbatim}
  251. Nodes are free to keep the ledger history, though it is not necessary for
  252. correct operation. Only the current ledger entries are useful. Nodes are
  253. also free to garbage collect ledgers as necessary, starting with the less
  254. useful ledgers: the old (peers may not exist anymore) and small.
  255. \subsubsection{BitSwap Specification}
  256. BitSwap nodes follow a simple protocol.
  257. \begin{verbatim}
  258. # Additional state kept:
  259. type BitSwap struct {
  260. ledgers map[NodeId]Ledger
  261. // Ledgers known to this node, inc inactive
  262. active map[NodeId]Peer
  263. // currently open connections to other nodes
  264. need_list []Checksum
  265. // checksums of blocks this node needs
  266. have_list []Checksum
  267. // checksums of blocks this node has
  268. }
  269. type Peer struct {
  270. nodeid NodeId
  271. ledger Ledger
  272. // Ledger between the node and this peer
  273. last_seen Timestamp
  274. // timestamp of last received message
  275. want_list []Checksum
  276. // checksums of all blocks wanted by peer
  277. // includes blocks wanted by peer's peers
  278. }
  279. # Protocol interface:
  280. interface Peer {
  281. open (nodeid :NodeId, ledger :Ledger);
  282. send_want_list (want_list :WantList);
  283. send_block (block :Block) -> (complete :Bool);
  284. close (final :Bool);
  285. }
  286. \end{verbatim}
  287. Sketch of the lifetime of a peer connection:
  288. \begin{enumerate}
  289. \item Open: peers send \texttt{ledgers} until they agree.
  290. \item Sending: peers exchange \texttt{want\_lists} and \texttt{blocks}.
  291. \item Close: peers deactivate a connection.
  292. \item Ignored: (special) a peer is ignored (for the duration of a timeout)
  293. if a node's strategy avoids sending
  294. \end{enumerate}
  295. \paragraph{Peer.open(NodeId, Ledger)}
  296. When connecting, a node initializes a connection with a
  297. \texttt{Ledger}, either stored from a connection in the past or a new one
  298. zeroed out. Then, sends an Open message with the \texttt{Ledger} to the peer.
  299. Upon receiving an \texttt{Open} message, a peer chooses whether to activate
  300. the connection. If -- acording to the receiver's \texttt{Ledger} -- the sender
  301. is not a trusted agent (transmission below zero, or large outstanding debt) the
  302. receiver may opt to ignore the request. This should be done probabilistically
  303. with an \texttt{ignore\_cooldown} timeout, as to allow errors to be corrected
  304. and attackers to be thwarted. BitSwap
  305. If activating the connection, the receiver initializes a Peer object, with the
  306. local version of the \texttt{Ledger}, and setting the \texttt{last\_seen}
  307. timestamp). Then, it compares the received
  308. \texttt{Ledger} with its own. If they match exactly, the connections have
  309. opened. If they do not match, the peer creates a new zeroed out
  310. \texttt{Ledger}, and sends it.
  311. \paragraph{Peer.send\_want\_list(WantList)}
  312. While the connection is open, nodes advertise their
  313. \texttt{want\_list} to all connected peers. This is done (a) upon opening the
  314. connection, (b) after a randomized periodic timeout, (c) after a change in
  315. the \texttt{want\_list} and (d) after receiving a new block.
  316. Upon receiving a \texttt{want\_list}, a node stores it. Then, it checks whether
  317. it has any of the wanted blocks. If so, it sends them according to the
  318. \textit{BitSwap Strategy} above.
  319. \paragraph{Peer.send\_block(Block)}
  320. Sending a block is straightforward. The node simply transmits the block of
  321. data. Upon receiving all the data, the receiver computes the Checksum to
  322. verify it matches the expected one, and returns confirmation.
  323. Upon finalizing the correct transmission of a block, the receiver moves the
  324. block from \texttt{need\_list} to \texttt{have\_list}, and both the receiver
  325. and sender update their ledgers to reflect the additional bytes transmitted.
  326. If a transmission verfication fails, the receiver instead \textit{penalizes}
  327. the sender. Both receiver and sender should update their ledgers accordingly,
  328. though the sender is either malfunctioning or attacking the receiver. Note that
  329. BitSwap expects to operate on a reliable transmission channel, so data errors
  330. -- which could lead to incorrect penalization of an honest sender -- are
  331. expected to be caught before the data is given to BitSwap. GFS uses the uTP
  332. protocol.
  333. \paragraph{Peer.close(Bool)}
  334. The \texttt{final} parameter to \texttt{close} signals whether the intention
  335. to tear down the connection is the sender's or not. If false, the receiver
  336. may opt to re-open the connection immediatelty. This avoids premature
  337. closes.
  338. A peer connection should be closed under two conditions:
  339. \begin{itemize}
  340. \item a \texttt{silence\_wait} timeout has expired without receiving any
  341. messages from the peer (default BitSwap uses 30 seconds).
  342. In this case, the node issues a \texttt{Peer.close(false)} message.
  343. \item the node is exiting and BitSwap is being shut down.
  344. In this case, the node issues a \texttt{Peer.close(true)} message.
  345. \end{itemize}
  346. After a \texttt{close} message, both receiver and sender tear down the
  347. connection, clearing any state stored. The \texttt{Ledger} may be stored for
  348. the future, if it is useful to do so.
  349. \paragraph{Notes}
  350. \begin{itemize}
  351. \item Non-\texttt{open} messages on an inactive connection should be ignored.
  352. In case of a \texttt{send\_block} message, the receiver may check
  353. the block to see if it is needed and correct, and if so, use it.
  354. Regardless, all such out-of-order messages trigger a
  355. \texttt{close(false)} message from the receiver, to force re-
  356. initialization of the connection.
  357. \end{itemize}
  358. % TODO: Rate Limiting / Node Silencing
  359. \subsection{Object Model}
  360. The DHT and BitSwap allow GFS to form a massive peer-to-peer system for storing
  361. and distributing blocks quickly and robustly to users.
  362. GFS builds a filesystem out of this efficient block distribution system,
  363. constructing files and directories out of blocks.
  364. Files in GFS are represented as a collection of inter-related objects, like in
  365. the version control system Git. Each object is addressed by the cryptographic
  366. hash of its contents (\texttt{Checksum}). The file objects are:
  367. \begin{enumerate}
  368. \item \texttt{block}: a variable-size block of data.
  369. \item \texttt{list}: a collection of blocks or other lists.
  370. \item \texttt{tree}: a collection of blocks, lists, or other trees.
  371. \item \texttt{commit}: a snapshot in the version history of a tree.
  372. \item \texttt{ref}: a reference to any another object (symlink).
  373. \end{enumerate}
  374. \subsubsection{Block Object}
  375. The \texttt{Block} object contains an addressable unit of data, and
  376. represents a file.
  377. GFS Blocks are like Git blobs or filesystem data blocks. They store the
  378. users' data. (The name \textit{block} is preferred over \textit{blob}, as the
  379. Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
  380. GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
  381. Format:
  382. \begin{verbatim}
  383. block <size>
  384. <block data bytes>
  385. ...
  386. \end{verbatim}
  387. \subsubsection{List Object}
  388. The \texttt{List} object represents a (large) file made up of several
  389. GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
  390. an ordered sequence of \texttt{block} or \texttt{list} objects.
  391. In a sense, the GFS \texttt{List} functions like a filesystem file with
  392. indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
  393. Format:
  394. \begin{verbatim}
  395. blob <num objects> <size>
  396. <list or block> <checksum> <size>
  397. <list or block> <checksum> <size>
  398. ...
  399. \end{verbatim}
  400. \subsubsection{Tree Object}
  401. The \texttt{tree} object in GFS is similar to Git trees: it represents a
  402. directory, a list of checksums and names. The checksums reference \texttt{blob}
  403. or other \texttt{tree} objects. Note that traditional path naming
  404. is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
  405. \texttt{lists} are only addressed by their \texttt{checksums}.
  406. % Unlike in Git, GFS trees include file-system metadata such as file
  407. %permissions.
  408. Format:
  409. \begin{verbatim}
  410. tree <num objects> <size>
  411. <tree or list or block> <checksum> <size> <name>
  412. <tree or list or block> <checksum> <size> <name>
  413. ...
  414. \end{verbatim}
  415. \subsubsection{Commit Object}
  416. The \texttt{commit} object in GFS is similar to Git's. It represents a
  417. snapshot in the version history of a \texttt{tree}.
  418. \begin{verbatim}
  419. commit <size>
  420. parent <commit checksum>
  421. tree <tree checksum>
  422. author Full Name <email@address.com> <ISO UTC date>
  423. committer Full Name <email@address.com> <ISO UTC date>
  424. <commit message>
  425. \end{verbatim}
  426. \subsubsection{Version control}
  427. \subsubsection{Signed Objects}
  428. All objects can be signed. Add signature to bottom of object.
  429. (yes, this changes the hash, as it should)
  430. \subsubsection{Merkle Trees}
  431. The object model in GFS forms a \textit{Merkle Tree}, where every object
  432. contains hashes of its children. This provides GFS with the useful properties
  433. of merkle trees:
  434. \begin{enumerate}
  435. \item Tamper resistance
  436. \end{enumerate}
  437. \subsubsection{Published Branches}
  438. Users can publish branches (filesystems) with:
  439. publickey -> signed tree of branches
  440. \subsection{Object Distribution}
  441. \subsubsection{Spreading Objects}
  442. DHash spread along the DHT nodes?
  443. Mainline DHT peer registry?
  444. \subsubsection{Pinning Objects}
  445. \section{Conclusions}
  446. %\section{Acknowledgments}
  447. %\bibliographystyle{abbrv}
  448. %\bibliography{gfs}
  449. %\balancecolumns
  450. %\subsection{References}
  451. \end{document}