gfs.tex 31 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840
  1. \documentclass{sig-alternate}
  2. \usepackage{tikz}
  3. \usetikzlibrary{arrows}
  4. \usetikzlibrary{trees}
  5. \usetikzlibrary{positioning}
  6. \usepackage{array}
  7. \usepackage{amstext}
  8. \usepackage{mathtools}
  9. \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
  10. \begin{document}
  11. % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
  12. \title{Galactic File System}
  13. \subtitle{}
  14. \numberofauthors{1}
  15. \author{
  16. % You can go ahead and credit any number of authors here,
  17. % e.g. one 'row of three' or two rows (consisting of one row of three
  18. % and a second row of one, two or three).
  19. %
  20. % The command \alignauthor (no curly braces needed) should
  21. % precede each author name, affiliation/snail-mail address and
  22. % e-mail address. Additionally, tag each line of
  23. % affiliation/address with \affaddr, and tag the
  24. % e-mail address with \email.
  25. %
  26. % 1st. author
  27. \alignauthor
  28. Juan Benet\\
  29. \email{juan@benet.ai}
  30. }
  31. \maketitle
  32. \begin{abstract}
  33. The Galactic File System is a peer-to-peer distributed file system capable of
  34. sharing the same files with millions of nodes. GFS combines a distributed
  35. hashtable, cryptographic techniques, merkle trees, content-addressable
  36. storage, bittorrent, and tag-based filesystems to build a single massive
  37. file system shared between peers. GFS has no single point of failure, and
  38. nodes do not need to trust each other.
  39. \end{abstract}
  40. \section{Introduction}
  41. [Motivate GFS. Introduce problems. Describe BitTorrent existing problems (
  42. multiple files. one swarm. sloppy dht implementation.) Describe version
  43. control efforts. Propose potential combinations of good ideas.]
  44. [Cite:
  45. CFS,
  46. Kademlia,
  47. Bittorrent,
  48. Chord,
  49. DHash,
  50. SFS,
  51. Ori,
  52. Coral]
  53. This paper introduces
  54. GFS, a novel peer-to-peer version-controlled filesystem;
  55. and BitSwap, the novel peer-to-peer block exchange protocol serving GFS.
  56. The rest of the paper is organized as follows.
  57. Section 2 describes the design of the filesystem.
  58. Section 3 evaluates various facets of the system under benchmark and common
  59. workloads.
  60. Section 4 presents and evaluates a world-wide deployment of GFS.
  61. Section 5 describes existing and potential applications of GFS.
  62. Section 6 discusses related and future work.
  63. Notation Notes:
  64. (a) data structures are specified in Go syntax,
  65. (b) rpc protocols are specified in capnp interface,
  66. (c) object formats are specified in text with <fields>.
  67. \section{Design}
  68. \subsection{GFS Nodes}
  69. GFS is a distributed file system where all nodes are the same. They are
  70. identified by a \texttt{NodeId}, the cryptographic hash of a public-key
  71. (note that \textit{checksum} will henceforth refer specifically to crypographic
  72. hashes of an object). Nodes also store their public and private keys. Clients
  73. are free to instatiate a new node on every launch, though that means losing any
  74. accrued benefits. It is recommended that nodes remain the same.
  75. \begin{verbatim}
  76. type Checksum string
  77. type PublicKey string
  78. type PrivateKey string
  79. type NodeId Checksum
  80. type Node struct {
  81. nodeid NodeID
  82. pubkey PublicKey
  83. prikey PrivateKey
  84. }
  85. \end{verbatim}
  86. Together, the
  87. nodes store the GFS files in local storage, and send files to each other.
  88. GFS implements its features by combining several subsystems with many
  89. desirable properties:
  90. \begin{enumerate}
  91. \item A Coral-based \textbf{Distributed Sloppy Hash Table}\\
  92. (DSHT) to link and coordinate peer-to-peer nodes.
  93. Described in Section 2.2.
  94. \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
  95. Blocks efficiently, and to incentivize replication.
  96. Described in Section 2.3.
  97. \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
  98. Described in Section 2.4.
  99. \item An SFS-based self-certifying name system.
  100. Described in Section 2.5.
  101. \end{enumerate}
  102. These subsystems are not independent. They are well integrated and leverage
  103. their blended properties. However, it is useful to describe them separately,
  104. building the system from the bottom up. Note that all GFS nodes are identical,
  105. and run the same program.
  106. \subsection{Distributed Sloppy Hash Table}
  107. First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
  108. and identify which nodes can serve a particular block of data.
  109. \subsubsection{Kademlia DHT}
  110. Kademlia is a DHT that provides:
  111. \begin{enumerate}
  112. \item Efficient lookup through massive networks:
  113. queries on average contact $ \ceil{log_2 (n)} $ nodes.
  114. (e.g. $20$ hops for a network of $10000000$ nodes).
  115. \item Low coordination overhead: it optimizes the number of
  116. control messages it sends to other nodes.
  117. \item Resistance to various attacks, by preferring nodes who have been
  118. part of the DHT longer.
  119. \item wide useage in peer-to-peer applications, including \\
  120. Gnutella and Bittorrent, forming networks of over 100 million nodes.
  121. \end{enumerate}
  122. While some peer-to-peer filesystems store data blocks directly in DHTs,
  123. this ``wastes storage and bandwidth, as data must be stored at nodes where it
  124. is not needed''. Instead, GFS stores a list of peers that can provide the data block.
  125. \subsubsection{Coral DSHT}
  126. Coral extends Kademlia in three particularly important ways:
  127. \begin{enumerate}
  128. \item Kademlia stores values in nodes whose ids are ``nearest'' (using
  129. XOR-distance) to the key. This does not take into account application
  130. data locality, ignores ``far'' nodes who may already have the data, and
  131. forces ``nearest'' nodes to store it, whether they need it or not.
  132. This wastes significant storage and bandwith. Instead, Coral stores
  133. addresses to peers who can provide the data blocks.
  134. \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
  135. \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
  136. This still works since Coral users only need a single (working) peer,
  137. not the complete list. In return, Coral can distribute only subsets of
  138. the values to the ``nearest'' nodes, avoiding hot-spots (overloading
  139. \textit{all the nearest nodes} when a key becomes popular).
  140. \item Additionally, Coral organizes a hierarchy of separate DSHTs called
  141. \textit{clusters} depending on region and size. This enables nodes to
  142. query peers in their region first, ``finding nearby data without
  143. querying distant nodes'' and greatly reducing the latency of
  144. lookups.
  145. \end{enumerate}
  146. \subsubsection{GFS DSHT}
  147. The GFS DSHT supports four RPC calls:
  148. \subsection{Block Exchange - BitSwap Protocol}
  149. The exchange of data in GFS happens by exchanging blocks with peers using a
  150. BitTorrent inspired protocol: BitSwap. Like BitTorrent, BitSwap peers are
  151. looking to acquire a set of blocks, and have blocks to offer in exchange.
  152. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent.
  153. BitSwap operates as a persistent marketplace where node can acquire the
  154. blocks they need, regardless of what files the blocks are part of. The
  155. blocks could come from completely unrelated files in the filesystem.
  156. But nodes come together to barter in the marketplace.
  157. While the notion of a barter system implies a virtual currency could be
  158. created, this would require a global ledger (blockchain) to track ownership
  159. and transfer of the currency. This will be explored in a future paper.
  160. Instead, BitSwap nodes have to provide direct value to each other
  161. in the form of blocks. This works fine when the distribution of blocks across
  162. nodes is such that they have the complements, what each other wants. This will
  163. seldom be the case. Instead, it is more likely that nodes must \textit{work}
  164. for their blocks. In the case that a node has nothing that its peers want (or
  165. nothing at all), it seeks the pieces its peers might want, with lower
  166. priority. This incentivizes nodes to cache and disseminate rare pieces, even
  167. if they are not interested in them directly.
  168. \subsubsection{BitSwap Credit}
  169. The protocol must also incentivize nodes to seed when they do not need
  170. anything in particular, as they might have the blocks others want. Thus,
  171. BitSwap nodes send blocks to their peers, optimistically expecting the debt to
  172. be repaid. But, leeches (free-loading nodes that never share) must be avoided. A simple credit-like system solves the problem:
  173. \begin{enumerate}
  174. \item Peers track their balance (in bytes verified) with other nodes.
  175. \item Peers send blocks to debtor peers probabilistically, according to
  176. a function that falls as debt increases.
  177. \end{enumerate}
  178. Note that if a peer decides not to send, the peer subsequently ignores the
  179. other node for an \texttt{ignore\_cooldown} timeout. This prevents senders
  180. from trying to game the probability by just causing more dice-rolls.
  181. (Default BitSwap is 10 seconds).
  182. \subsubsection{BitSwap Strategy}
  183. The differing strategies that BitSwap peers might employ have wildly different
  184. effects on the performance of the exchange as a whole. In BitTorrent,
  185. while a standard strategy is specified (tit-for-tat), a variety of others have
  186. been implemented, ranging from BitTyrant (sharing the least-possible),
  187. to BitThief (exploiting a vulnerability and never share), to PropShare
  188. (sharing proportionally). A range of strategies (good and malicious) could
  189. similarly be implemented by BitSwap peers. The choice of function, then, should
  190. aim to:
  191. \begin{enumerate}
  192. \item maximize the trade performance for the node, and the whole exchange
  193. \item prevent freeloaders from exploiting and degrading the exchange
  194. \item be effective with and resistant to other, unknown strategies
  195. \item be lenient to trusted peers
  196. \end{enumerate}
  197. The exploration of the space of such strategies is future work.
  198. One choice of function that works in practice is the sigmoid, scaled by a
  199. \textit{debt retio}:
  200. Let the \textit{debt ratio} $ r $ between a node and its peer be:
  201. \[ r = \dfrac{\texttt{bytes\_sent}}{\texttt{bytes\_recv}} \]
  202. Given $r$, let the probability of sending to a debtor be:
  203. \[ P\Big( \; send \; | \; r \;\Big) = \dfrac{1}{1 + exp(6-3r)} \]
  204. As you can see in Table 1, this function drops off quickly as the nodes' \
  205. \textit{debt ratio} surpasses twice the established credit.
  206. The \textit{debt ratio} is a measure of trust:
  207. lenient to debts between nodes that have previously exchanged lots of data
  208. successfully, and merciless to unknown, untrusted nodes. This
  209. (a) provides resistance to attackers who would create lots of new nodes
  210. (sybill attacks),
  211. (b) protects previously successful trade relationships, even if one of the
  212. nodes is temporarily unable to provide value, and
  213. (c) eventually chokes relationships that have deteriorated until they
  214. improve.
  215. \begin{center}
  216. \begin{tabular}{ >{$}c<{$} >{$}c<{$}}
  217. P(\;send\;|\quad r) =& likelihood \\
  218. \hline
  219. \hline
  220. P(\;send\;|\;0.0) =& 1.00 \\
  221. P(\;send\;|\;0.5) =& 1.00 \\
  222. P(\;send\;|\;1.0) =& 0.98 \\
  223. P(\;send\;|\;1.5) =& 0.92 \\
  224. P(\;send\;|\;2.0) =& 0.73 \\
  225. P(\;send\;|\;2.5) =& 0.38 \\
  226. P(\;send\;|\;3.0) =& 0.12 \\
  227. P(\;send\;|\;3.5) =& 0.03 \\
  228. P(\;send\;|\;4.0) =& 0.01 \\
  229. P(\;send\;|\;4.5) =& 0.00 \\
  230. \end{tabular}
  231. \end{center}
  232. % TODO look into computing share of the bandwidth, as described in propshare.
  233. \subsubsection{BitSwap Ledger}
  234. BitSwap nodes keep ledgers accounting the transfers with other nodes.
  235. A ledger snapshot contains a pointer to the previous snapshot (its checksum),
  236. forming a hash-chain. This allows nodes to keep track of history, and to avoid
  237. tampering. When activating a connection, BitSwap nodes exchange their ledger
  238. information.
  239. If it does not match exactly, the ledger is reinitialized from scratch,
  240. loosing the accrued credit or debt. It is possible for malicious nodes to
  241. purposefully ``loose'' the Ledger, hoping the erase debts. It is unlikely that
  242. nodes will have accrued enough debt to warrant also losing the accrued trust,
  243. however the partner node is free to count it as \textit{misconduct} (discussed
  244. later).
  245. \begin{verbatim}
  246. type Ledger struct {
  247. parent Checksum
  248. owner NodeId
  249. partner NodeId
  250. bytes_sent int
  251. bytes_recv int
  252. timestamp Timestamp
  253. }
  254. \end{verbatim}
  255. Nodes are free to keep the ledger history, though it is not necessary for
  256. correct operation. Only the current ledger entries are useful. Nodes are
  257. also free to garbage collect ledgers as necessary, starting with the less
  258. useful ledgers: the old (peers may not exist anymore) and small.
  259. \subsubsection{BitSwap Specification}
  260. BitSwap nodes follow a simple protocol.
  261. \begin{verbatim}
  262. # Additional state kept:
  263. type BitSwap struct {
  264. ledgers map[NodeId]Ledger
  265. // Ledgers known to this node, inc inactive
  266. active map[NodeId]Peer
  267. // currently open connections to other nodes
  268. need_list []Checksum
  269. // checksums of blocks this node needs
  270. have_list []Checksum
  271. // checksums of blocks this node has
  272. }
  273. type Peer struct {
  274. nodeid NodeId
  275. ledger Ledger
  276. // Ledger between the node and this peer
  277. last_seen Timestamp
  278. // timestamp of last received message
  279. want_list []Checksum
  280. // checksums of all blocks wanted by peer
  281. // includes blocks wanted by peer's peers
  282. }
  283. # Protocol interface:
  284. interface Peer {
  285. open (nodeid :NodeId, ledger :Ledger);
  286. send_want_list (want_list :WantList);
  287. send_block (block :Block) -> (complete :Bool);
  288. close (final :Bool);
  289. }
  290. \end{verbatim}
  291. Sketch of the lifetime of a peer connection:
  292. \begin{enumerate}
  293. \item Open: peers send \texttt{ledgers} until they agree.
  294. \item Sending: peers exchange \texttt{want\_lists} and \texttt{blocks}.
  295. \item Close: peers deactivate a connection.
  296. \item Ignored: (special) a peer is ignored (for the duration of a timeout)
  297. if a node's strategy avoids sending
  298. \end{enumerate}
  299. \paragraph{Peer.open(NodeId, Ledger)}
  300. When connecting, a node initializes a connection with a
  301. \texttt{Ledger}, either stored from a connection in the past or a new one
  302. zeroed out. Then, sends an Open message with the \texttt{Ledger} to the peer.
  303. Upon receiving an \texttt{Open} message, a peer chooses whether to activate
  304. the connection. If -- acording to the receiver's \texttt{Ledger} -- the sender
  305. is not a trusted agent (transmission below zero, or large outstanding debt) the
  306. receiver may opt to ignore the request. This should be done probabilistically
  307. with an \texttt{ignore\_cooldown} timeout, as to allow errors to be corrected
  308. and attackers to be thwarted.
  309. If activating the connection, the receiver initializes a Peer object, with the
  310. local version of the \texttt{Ledger}, and setting the \texttt{last\_seen}
  311. timestamp). Then, it compares the received
  312. \texttt{Ledger} with its own. If they match exactly, the connections have
  313. opened. If they do not match, the peer creates a new zeroed out
  314. \texttt{Ledger}, and sends it.
  315. \paragraph{Peer.send\_want\_list(WantList)}
  316. While the connection is open, nodes advertise their
  317. \texttt{want\_list} to all connected peers. This is done (a) upon opening the
  318. connection, (b) after a randomized periodic timeout, (c) after a change in
  319. the \texttt{want\_list} and (d) after receiving a new block.
  320. Upon receiving a \texttt{want\_list}, a node stores it. Then, it checks whether
  321. it has any of the wanted blocks. If so, it sends them according to the
  322. \textit{BitSwap Strategy} above.
  323. \paragraph{Peer.send\_block(Block)}
  324. Sending a block is straightforward. The node simply transmits the block of
  325. data. Upon receiving all the data, the receiver computes the Checksum to
  326. verify it matches the expected one, and returns confirmation.
  327. Upon finalizing the correct transmission of a block, the receiver moves the
  328. block from \texttt{need\_list} to \texttt{have\_list}, and both the receiver
  329. and sender update their ledgers to reflect the additional bytes transmitted.
  330. If a transmission verfication fails, the receiver instead \textit{penalizes}
  331. the sender. Both receiver and sender should update their ledgers accordingly,
  332. though the sender is either malfunctioning or attacking the receiver. Note that
  333. BitSwap expects to operate on a reliable transmission channel, so data errors
  334. -- which could lead to incorrect penalization of an honest sender -- are
  335. expected to be caught before the data is given to BitSwap. GFS uses the uTP
  336. protocol.
  337. \paragraph{Peer.close(Bool)}
  338. The \texttt{final} parameter to \texttt{close} signals whether the intention
  339. to tear down the connection is the sender's or not. If false, the receiver
  340. may opt to re-open the connection immediatelty. This avoids premature
  341. closes.
  342. A peer connection should be closed under two conditions:
  343. \begin{itemize}
  344. \item a \texttt{silence\_wait} timeout has expired without receiving any
  345. messages from the peer (default BitSwap uses 30 seconds).
  346. In this case, the node issues a \texttt{Peer.close(false)} message.
  347. \item the node is exiting and BitSwap is being shut down.
  348. In this case, the node issues a \texttt{Peer.close(true)} message.
  349. \end{itemize}
  350. After a \texttt{close} message, both receiver and sender tear down the
  351. connection, clearing any state stored. The \texttt{Ledger} may be stored for
  352. the future, if it is useful to do so.
  353. \paragraph{Notes}
  354. \begin{itemize}
  355. \item Non-\texttt{open} messages on an inactive connection should be ignored.
  356. In case of a \texttt{send\_block} message, the receiver may check
  357. the block to see if it is needed and correct, and if so, use it.
  358. Regardless, all such out-of-order messages trigger a
  359. \texttt{close(false)} message from the receiver, to force
  360. re-initialization of the connection.
  361. \end{itemize}
  362. % TODO: Rate Limiting / Node Silencing
  363. \subsection{Object Model}
  364. The DHT and BitSwap allow GFS to form a massive peer-to-peer system for storing
  365. and distributing blocks quickly and robustly to users.
  366. GFS builds a filesystem out of this efficient block distribution system,
  367. constructing files and directories out of blocks.
  368. Files in GFS are represented as a collection of inter-related objects, like in
  369. the version control system Git. Each object is addressed by the cryptographic
  370. hash of its contents (\texttt{Checksum}). The file objects are:
  371. \begin{enumerate}
  372. \item \texttt{block}: a variable-size block of data.
  373. \item \texttt{list}: a collection of blocks or other lists.
  374. \item \texttt{tree}: a collection of blocks, lists, or other trees.
  375. \item \texttt{commit}: a snapshot in the version history of a tree.
  376. \end{enumerate}
  377. We hoped to use the Git object formats exactly, but had to depart to introduce
  378. certain features useful in a distributed filesystem, for example fast size
  379. lookups (aggregate byte sizes have been added to objects), large file
  380. deduplication and versioning (adding a \texttt{list} object), and more.
  381. However, our objects are perfectly compatible with Git and
  382. conversion between the two does not lose any information.
  383. Notes:
  384. \begin{itemize}
  385. \item \texttt{varint} is a variable size int, as in protocol-buffers.
  386. \item objects are serialized using \texttt{capnp}.
  387. \end{itemize}
  388. \subsubsection{Block Object}
  389. The \texttt{Block} object contains an addressable unit of data, and
  390. represents a file.
  391. GFS Blocks are like Git blobs or filesystem data blocks. They store the
  392. users' data. (The name \textit{block} is preferred over \textit{blob}, as the
  393. Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
  394. GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
  395. Format:
  396. \begin{verbatim}
  397. block <size>
  398. <block data bytes>
  399. ...
  400. \end{verbatim}
  401. \subsubsection{List Object}
  402. The \texttt{List} object represents a large or de-duplicated file made up of
  403. several GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
  404. an ordered sequence of \texttt{block} or \texttt{list} objects.
  405. In a sense, the GFS \texttt{List} functions like a filesystem file with
  406. indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
  407. Format:
  408. \begin{verbatim}
  409. list <num objects> <size varint>
  410. <list or block> <checksum> <size varint>
  411. <list or block> <checksum> <size varint>
  412. ...
  413. \end{verbatim}
  414. \subsubsection{Tree Object}
  415. The \texttt{tree} object in GFS is similar to Git trees: it represents a
  416. directory, a list of checksums and names. The checksums reference \texttt{blob}
  417. or other \texttt{tree} objects. Note that traditional path naming
  418. is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
  419. \texttt{lists} are only addressed by their \texttt{checksums}.
  420. Format:
  421. \begin{verbatim}
  422. tree <num objects> <size varint>
  423. <tree or list or block> <checksum> <size varint> <name>
  424. <tree or list or block> <checksum> <size varint> <name>
  425. ...
  426. \end{verbatim}
  427. \subsubsection{Commit Object}
  428. The \texttt{commit} object in GFS is similar to Git's. It represents a
  429. snapshot in the version history of a \texttt{tree}. Note that user
  430. addresses are NodeIds (the hash of the public key).
  431. \begin{verbatim}
  432. commit <size varint>
  433. parent <commit checksum>
  434. tree <tree checksum>
  435. author <author public key> <ISO UTC date>
  436. committer <committer public key> <ISO UTC date>
  437. <commit message>
  438. \end{verbatim}
  439. \subsubsection{Version control}
  440. The \texttt{commit} object represents a particular snapshot in the version
  441. history of a tree. Comparing the \texttt{trees} and children objects of two
  442. different commits reveals the differences between two versions of the
  443. filesystem. As long as a single \texttt{commit} and all the children objects
  444. it references are accessible, all preceding versions are retrievable and the
  445. full history of the filesystem changes can be accessed. This is a consequence
  446. of the \texttt{Git} object model and the graph it forms.
  447. The full power of the \texttt{Git} version control tools is available to GFS
  448. users. The object model is compatible (though not the same). The standard
  449. \texttt{Git} tools can be used on the \texttt{GFS} object graph after a
  450. conversion. Additionally, a fork of the tools is under development that will
  451. allow users to use them directly without conversion.
  452. \subsubsection{Object-level Cryptoraphy}
  453. GFS is equipped to handle object-level cryptographic operations. Any additional
  454. bytes are appended to the bottom of the object. This changes the object's hash
  455. (defining a different object, as it should). GFS exposes an API that
  456. automatically verifies signatures or decrypts data.
  457. \begin{itemize}
  458. \item \texttt{Signing}. Signature appended.
  459. \item \texttt{Encryption}. Optional recipient's public key appended.
  460. \end{itemize}
  461. \subsubsection{Merkle Trees}
  462. The object model in GFS forms a \textit{Merkle Tree}, which provides GFS with
  463. useful properties:
  464. \begin{enumerate}
  465. \item \textbf{Content Addressing:} all content is uniquely identified by its
  466. \texttt{checksum}, \textbf{including child checksums}. This means a
  467. particular \texttt{tree} object points to \textit{specific} children.
  468. Committing changes to a \texttt{block} also commits changes to the
  469. containing \texttt{tree}.
  470. \item \textbf{Tamper resistance:} all content is verified with its Checksum.
  471. If data is tampered with, before being delivered, the client
  472. detects and discards it.
  473. \item \textbf{Deduplication:} all objects who hold the exact same content
  474. are the same, and only stored once. This is particularly useful with
  475. parent objects, such as lists, trees, and commits.
  476. \end{enumerate}
  477. \subsection{The Filesystem}
  478. \subsubsection{Filesystem Paths}
  479. GFS exposes a slash-delimited path-based API. Paths work the same as in any
  480. traditional UNIX filesystem. Path subcomponents have different meanings per
  481. object:
  482. \begin{center}
  483. \begin{tabular}{ll}
  484. \texttt{object} & subcomponent meaning \\
  485. \hline
  486. \hline
  487. \texttt{block} & N/A (no children) \\
  488. \texttt{list} & integer index \\
  489. \texttt{tree} & string name \\
  490. \texttt{commit} & string name (in tree) \\
  491. \end{tabular}
  492. \end{center}
  493. \begin{figure}
  494. \centering
  495. \begin{tikzpicture}[->,>=stealth',auto,thick,
  496. minimum height=2em,minimum width=5em]
  497. \tikzstyle{ghost}=[rectangle,rounded corners=.8ex];
  498. \tikzstyle{block}=[rectangle,draw,fill=blue!20,rounded corners=.8ex];
  499. \tikzstyle{list}=[rectangle,draw,fill=cyan!20,rounded corners=.8ex];
  500. \tikzstyle{tree}=[rectangle,draw,fill=green!20,rounded corners=.8ex];
  501. \tikzstyle{commit}=[rectangle,draw,fill=magenta!20,rounded corners=.8ex];
  502. \tikzstyle{every path}=[draw]
  503. \node[commit] (ccc111) {ccc111};
  504. \node[tree] (ttt111) [below=3em of ccc111] {ttt111};
  505. \node[tree] (ttt222) [below left=3em and 3em of ttt111] {ttt222};
  506. \node[tree] (ttt333) [below=3em of ttt111] {ttt333};
  507. \node[ghost] (ghost1) [below right=3em and 3em of ttt111] {};
  508. \node[list] (lll111) [below=3em of ttt333] {lll111};
  509. \node[block] (bbb111) [below=3em of ttt222] {bbb111};
  510. \node[block] (bbb222) [below right=3em and 3em of ttt333] {bbb222};
  511. \node[block] (bbb333) [below left=3em and 3em of lll111] {bbb333};
  512. \node[block] (bbb444) [below=3em of lll111] {bbb444};
  513. \node[block] (bbb555) [below right=3em and 3em of lll111] {bbb555};
  514. \path[every node/.style={font=\sffamily\small}]
  515. (ccc111) edge[out=-90,in=90] (ttt111)
  516. (ttt111) edge[out=-90,in=90] (ttt222)
  517. edge[out=-90,in=90] (ttt333)
  518. to [out=-90,in=90] (ghost1)
  519. to [out=-90,in=90] (bbb222)
  520. (ttt222) edge[out=-90,in=90] (bbb111)
  521. (ttt333) edge[out=-90,in=90] (lll111)
  522. edge[out=-90,in=90] (bbb222)
  523. (lll111) edge[out=-90,in=90] (bbb333)
  524. edge[out=-90,in=90] (bbb444)
  525. edge[out=-90,in=90] (bbb555)
  526. ;
  527. \end{tikzpicture}
  528. \caption{Sample Object Graph} \label{fig:sample-object-graph}
  529. \begin{verbatim}
  530. # ccc111 contents
  531. commit 313
  532. tree ttt111
  533. author <author public key> <ISO UTC date>
  534. committer <committer public key> <ISO UTC date>
  535. # ttt111 contents
  536. tree 3 250
  537. tree ttt222 46 ttt222-name
  538. tree ttt333 166 ttt333-name
  539. block bbb222 11 bbb222-name
  540. # ttt222 contents
  541. tree 1 10
  542. block bbb111 10 bbb111-name
  543. # ttt333 contents
  544. tree 2 104
  545. list lll111 93 lll111-name
  546. block bbb222 11 bbb222-eman
  547. # lll111 contents
  548. list 3 39
  549. block bbb333 12
  550. block bbb444 13
  551. block bbb555 14
  552. # bbb111 contents # block bbb222 contents
  553. block 1 block 2
  554. 1 22
  555. # bbb333 contents # block bbb444 contents
  556. block 3 block 4
  557. 333 4444
  558. # bbb555 contents
  559. block 5
  560. 55555
  561. \end{verbatim}
  562. \caption{Sample Objects} \label{fig:sample-objects}
  563. \end{figure}
  564. For example, given the sample objects in Figures \ref{fig:sample-object-graph} and \ref{fig:sample-objects}:
  565. \begin{verbatim}
  566. # to access tree ttt333:
  567. ccc111/ttt333-name
  568. # to access block bbb222:
  569. ccc111/bbb222-name
  570. ccc111/ttt333-name/bbb222-eman
  571. # to access list lll111:
  572. ccc111/ttt333-name/lll111-name
  573. # to access block bbb555:
  574. ccc111/ttt333-name/lll111-name/2
  575. \end{verbatim}
  576. Note that:
  577. \begin{itemize}
  578. \item[(a)] blocks have no children \\
  579. \texttt{.../<block>/<child>} is impossible
  580. \item[(b)] commits implicitly access their trees \\
  581. \texttt{.../<commit>/name}
  582. looks up \texttt{"name"} in \texttt{<commit>}'s \texttt{<tree>}
  583. \item[(c)] \texttt{list} children are accessed by their index \\
  584. \texttt{.../<list>/4} looks up the fifth block.
  585. \end{itemize}
  586. \paragraph{Path Lookup Performance}
  587. Path-based access traverses the object graph. Retrieving
  588. each object requires potentially looking up its key in the DHT,
  589. connecting to peers, and retrieving its blocks. This is considerable
  590. overhead, particularly when looking up paths with many components.
  591. This is mitigated by:
  592. \begin{itemize}
  593. \item \textbf{tree caching}: since all objects are hash-addressed, they
  594. can be cached indefinitely. Additionally, \texttt{trees} tend to be
  595. small in size so GFS prioritizes caching them over \texttt{blocks}.
  596. \item \textbf{flattened trees}: for any given \texttt{tree}, a special
  597. \texttt{flattened tree} can be constructed to list all objects
  598. reachable from the \texttt{tree}. Figure \ref{flattened-ttt111} shows
  599. an example of a flattened tree. While GFS does not construct flattened
  600. trees by default, it provides a function for users. For example,
  601. \end{itemize}
  602. \begin{figure}
  603. \begin{verbatim}
  604. tree 5 250
  605. tree ttt222 46 ttt222-name
  606. block bbb111 10 ttt222-name/bbb111-name
  607. tree ttt333 166 ttt333-name
  608. list lll111 93 ttt222-name/lll111-name
  609. block bbb222 11 ttt333-name/bbb222-eman
  610. block bbb222 11 bbb222-name
  611. \end{verbatim}
  612. \caption{Flattened Tree for \texttt{ttt111}} \label{fig:flattened-ttt111}
  613. \end{figure}
  614. \subsubsection{Publishing Objects}
  615. GFS is globally distributed. It is designed to allow the files of millions
  616. of users to coexist together. The \textbf{DHT} with content-hash addressing
  617. allows publishing objects in a fair, secure, and entirely distributed way.
  618. Anyone can publish an object by simply adding its key to the DHT, adding
  619. themselves as a peer, and giving other users the object's hash.
  620. Additionally, the GFS root directory supports special functionality to
  621. allow namespacing and naming objects in a fair, secure, and distributed
  622. manner.
  623. \begin{itemize}
  624. \item[(a)] All objects are accessible by their hash. Thus, users can
  625. always reference an object (and its children) using
  626. \texttt{/<object\_hash>}.
  627. \item[(b)] \texttt{/<node\_id>} provides a self-certifying filesystem
  628. for user \texttt{node\_id}. If it exists, the object returned is a
  629. special \texttt{tree} signed by \texttt{node\_id}'s private key. Thus,
  630. a user can publish a \texttt{tree} or \texttt{commit} under their
  631. name, and others can verify it by checking the signature matches.
  632. \item[(c)] If \texttt{/<domain>} is a valid domain name, GFS
  633. looks up key \texttt{gfs} in its \texttt{DNS TXT} record. GFS
  634. interprets the value as either an object hash or another GFS path:
  635. \begin{verbatim}
  636. # this DNS TXT record
  637. fs.benet.ai. TXT "gfs=/aabbccddeeffgg ..."
  638. # behaves as symlink
  639. ln -s /aabbccddeeffgg /fs.benet.ai
  640. \end{verbatim}
  641. \end{itemize}
  642. \subsection{Local Objects}
  643. GFS clients require some \textit{local storage}, an external system
  644. on which to store and retrieve local raw data for the objects GFS manages.
  645. The type of storage depends on the node's use case.
  646. In most cases, this is simply a portion of disk space (either managed by
  647. the native filesystem, or directly by the GFS client). In others, non-
  648. persistent caches for example, this storage is just a portion of RAM.
  649. Ultimately, all blocks available in GFS are in some node's
  650. \textit{local storage}. And when nodes open files with GFS, the objects are
  651. downloaded and stored locally, at least temporarily. This provides
  652. fast lookup for some configurable amount of time thereafter.
  653. \subsubsection{Object Pinning}
  654. Nodes who wish to ensure the survival of particular objects can do so by
  655. \texttt{pinning} the objects. This ensures the objects are kept in the node's
  656. \textit{local storage}. Pinning can be done recursively, to pin down all
  657. descendent objects as well. For example, recursively pinning a \texttt{tree}
  658. or \texttt{commit} ensures \textit{all} objects pointed to are stored locally
  659. too. This is particularly useful for nodes wishing to keep all their own files.
  660. %\section{Acknowledgments}
  661. %\bibliographystyle{abbrv}
  662. %\bibliography{gfs}
  663. %\balancecolumns
  664. %\subsection{References}
  665. \end{document}