gfs.tex 8.2 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262
  1. \documentclass{sig-alternate}
  2. \usepackage{mathtools}
  3. \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
  4. \begin{document}
  5. % \conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
  6. \title{Galactic File System}
  7. \subtitle{}
  8. \numberofauthors{1}
  9. \author{
  10. % You can go ahead and credit any number of authors here,
  11. % e.g. one 'row of three' or two rows (consisting of one row of three
  12. % and a second row of one, two or three).
  13. %
  14. % The command \alignauthor (no curly braces needed) should
  15. % precede each author name, affiliation/snail-mail address and
  16. % e-mail address. Additionally, tag each line of
  17. % affiliation/address with \affaddr, and tag the
  18. % e-mail address with \email.
  19. %
  20. % 1st. author
  21. \alignauthor
  22. Juan Benet\\
  23. \email{juan@benet.ai}
  24. }
  25. \maketitle
  26. \begin{abstract}
  27. The Galactic File System is a peer-to-peer distributed file system capable of
  28. sharing the same files with millions of nodes. GFS combines a distributed
  29. hashtable, cryptographic techniques, merkle trees, content-addressable
  30. storage, bittorrent, and tag-based filesystems to build a single massive
  31. file system shared between peers. GFS has no single point of failure, and
  32. nodes do not need to trust each other.
  33. \end{abstract}
  34. \section{Introduction}
  35. Cite:
  36. CFS
  37. Kademlia
  38. Bittorrent
  39. Chord
  40. DHash
  41. SFS
  42. Ori
  43. \section{GFS Overview}
  44. GFS is a distributed file system where all nodes are the same. Together, the
  45. nodes store the GFS files in local storage, and send the files to each other.
  46. GFS implements its features by combining several subsystems with many
  47. desirable properties:
  48. \begin{enumerate}
  49. \item A Coral-based \textbf{Distributed Sloppy Hash Table} (DSHT) to link and
  50. coordinate peer-to-peer nodes.
  51. \item A Bittorrent-like peer-to-peer \textbf{Block Exchange} (BE) distribute
  52. Blocks efficiently, and to incentivize replication.
  53. \item A Git-inspired \textbf{Object Model} (OM) to represent the filesystem.
  54. \item An SFS-based self-certifying name system.
  55. \end{enumerate}
  56. These subsystems are not independent. They are well integrated and leverage
  57. their blended properties. However, it is useful to describe them separately,
  58. building the system from the bottom up. Note that all GFS nodes are identical,
  59. and run the same program.
  60. \subsection{Distributed Sloppy Hash Table}
  61. First, GFS nodes implement a DSHT based on Kademlia and Coral to coordinate
  62. and identify which nodes can serve a particular block of data.
  63. \subsubsection{Kademlia DHT}
  64. Kademlia is a DHT that provides:
  65. \begin{enumerate}
  66. \item Efficient lookup through massive networks:
  67. queries on average contact $ \ceil{log_2 (n)} $ nodes.
  68. (e.g. $20$ hops for a network of $10000000$ nodes).
  69. \item Low coordination overhead: it optimizes the number of
  70. control messages it sends to other nodes.
  71. \item Resistance to various attacks, by preferring nodes who have been
  72. part of the DHT longer.
  73. \item wide useage in peer-to-peer applications, including Gnutella and
  74. Bittorrent, forming networks of over 100 million nodes.
  75. \end{enumerate}
  76. While some peer-to-peer filesystems store data blocks directly in DHTs,
  77. this ``wastes storage and bandwidth, as data must be stored at nodes where it
  78. is not needed''. Instead, GFS stores a list of peers that can provide the data block.
  79. \subsubsection{Coral DSHT}
  80. Coral extends Kademlia in three particularly important ways:
  81. \begin{enumerate}
  82. \item Kademlia stores values in nodes whose ids are ``nearest'' (using
  83. XOR-distance) to the key. This does not take into account application
  84. data locality, ignores ``far'' nodes who may already have the data, and
  85. forces ``nearest'' nodes to store it, whether they need it or not.
  86. This wastes significant storage and bandwith. Instead, Coral stores
  87. addresses to peers who can provide the data blocks.
  88. \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
  89. \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
  90. This still works since Coral users only need a single (working) peer,
  91. not the complete list. In return, Coral can distribute only subsets of
  92. the values to the ``nearest'' nodes, avoiding hot-spots (overloading
  93. \textit{all the nearest nodes} when a key becomes popular).
  94. \item Additionally, Coral organizes a hierarchy of separate DSHTs called
  95. \textit{clusters} depending on region and size. This enables nodes to
  96. query peers in their region first, ``finding nearby data without
  97. querying distant nodes'' and greatly reducing the latency of
  98. lookups.
  99. \end{enumerate}
  100. \subsubsection{GFS DSHT}
  101. The GFS DSHT supports four RPC calls:
  102. \subsection{Object Model}
  103. Files are represented as a collection of inter-related objects, like in the
  104. version control system Git. Each object is addressed by the cryptographic hash of its contents (unless otherwise specified, \textit{checksum} will henceforth refer to this cryptographic file content hash). The file objects are:
  105. \begin{enumerate}
  106. \item \texttt{chunk}: a variable-size block of data.
  107. \item \texttt{list}: a collection of chunks or other lists.
  108. \item \texttt{tree}: a collection of chunks, lists, or other trees.
  109. \end{enumerate}
  110. \subsubsection{Block Object}
  111. The \texttt{Block} object contains an addressable unit of data, and
  112. represents a file.
  113. GFS Blocks are like Git blobs or filesystem data blocks. They store the
  114. users' data. (The name \textit{block} is preferred over \textit{blob}, as the
  115. Git-inspired view of a \textit{blob} as a \textit{file} breaks down in GFS.
  116. GFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
  117. Format:
  118. \begin{verbatim}
  119. block <size>
  120. <block data bytes>
  121. ...
  122. \end{verbatim}
  123. \subsubsection{List Object}
  124. The \texttt{List} object represents a (large) file made up of several
  125. GFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
  126. an ordered sequence of \texttt{block} or \texttt{list} objects.
  127. In a sense, the GFS \texttt{List} functions like a filesystem file with
  128. indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addessing).
  129. Format:
  130. \begin{verbatim}
  131. blob <num objects> <size>
  132. <list or block> <checksum> <size>
  133. <list or block> <checksum> <size>
  134. ...
  135. \end{verbatim}
  136. \subsubsection{Tree Object}
  137. The \texttt{tree} object in GFS is similar to Git trees: it represents a
  138. directory, a list of checksums and names. The checksums reference \texttt{blob}
  139. or other \texttt{tree} objects. Note that traditional path naming
  140. is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
  141. \texttt{lists} are only addressed by their \texttt{checksums}.
  142. % Unlike in Git, GFS trees include file-system metadata such as file
  143. %permissions.
  144. Format:
  145. \begin{verbatim}
  146. tree <num objects> <size>
  147. <tree or list or block> <checksum> <size> <name>
  148. <tree or list or block> <checksum> <size> <name>
  149. ...
  150. \end{verbatim}
  151. \subsubsection{Commit Object}
  152. The \texttt{commit} object in GFS is similar to Git's. It represents a
  153. snapshot in the version history of a \texttt{tree}.
  154. \begin{verbatim}
  155. commit <size>
  156. parent <commit checksum>
  157. tree <tree checksum>
  158. author Full Name <email@address.com> <ISO UTC date>
  159. committer Full Name <email@address.com> <ISO UTC date>
  160. <commit message>
  161. \end{verbatim}
  162. \subsubsection{Version control}
  163. \subsubsection{Signed Objects}
  164. All objects can be signed. Add signature to bottom of object.
  165. (yes, this changes the hash, as it should)
  166. \subsubsection{Merkle Trees}
  167. The object model in GFS forms a \textit{Merkle Tree}, where every object
  168. contains hashes of its children. This provides GFS with the useful properties
  169. of merkle trees:
  170. \begin{enumerate}
  171. \item Tamper resistance
  172. \end{enumerate}
  173. \subsubsection{Published Branches}
  174. Users can publish branches (filesystems) with:
  175. publickey -> signed tree of branches
  176. \subsection{Chunk Exchange}
  177. \subsection{Object Distribution}
  178. \subsubsection{Spreading Objects}
  179. DHash spread along the DHT nodes?
  180. Mainline DHT peer registry?
  181. \subsubsection{Pinning Objects}
  182. \section{Conclusions}
  183. %\section{Acknowledgments}
  184. %\bibliographystyle{abbrv}
  185. %\bibliography{gfs}
  186. %\balancecolumns
  187. %\subsection{References}
  188. \end{document}