  1. \documentclass{sig-alternate}
  2. \usepackage{tikz}
  3. \usetikzlibrary{arrows}
  4. \usetikzlibrary{trees}
  5. \usetikzlibrary{positioning}
  6. \usepackage{array}
  7. \usepackage{amstext}
  8. \usepackage{mathtools}
  9. \DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
  10. \begin{document}
  11. \title{IPFS - Content Addressed, Versioned, P2P File System (DRAFT 2)}
  12. \subtitle{}
  13. \numberofauthors{1}
  14. \author{
  15. \alignauthor
  16. Juan Benet\\
  17. \email{juan@benet.ai}
  18. }
  19. \maketitle
  20. \begin{abstract}
  21. The InterPlanetary File System is a peer-to-peer distributed file system
  22. capable of sharing the same files with millions of nodes. IPFS combines a
distributed hashtable, cryptographic techniques, Merkle trees,
content-addressable storage, BitTorrent, and tag-based filesystems to build a single
  25. massive file system shared between peers. IPFS has no single point of failure,
  26. and nodes do not need to trust each other.
  27. \end{abstract}
  28. \section{Introduction}
  29. [Motivate IPFS. Introduce problems. Describe BitTorrent existing problems (
  30. multiple files. one swarm. sloppy dht implementation.) Describe version
  31. control efforts. Propose potential combinations of good ideas.]
  32. [Cite:
  33. CFS,
  34. Kademlia,
  35. Bittorrent,
  36. Chord,
  37. DHash,
  38. SFS,
  39. Ori,
  40. Coral]
  41. This paper introduces
  42. IPFS, a novel peer-to-peer version-controlled filesystem;
  43. and BitSwap, the novel peer-to-peer block exchange protocol serving IPFS.
  44. %The rest of the paper is organized as follows.
  45. %Section 2 describes the design of the filesystem.
  46. %Section 3 evaluates various facets of the system under benchmark and common
  47. %workloads.
  48. %Section 4 presents and evaluates a world-wide deployment of IPFS.
  49. %Section 5 describes existing and potential applications of IPFS.
  50. %Section 6 discusses related and future work.
  51. Notation Notes:
  52. (a) data structures are specified in Go syntax,
  53. (b) rpc protocols are specified in capnp interfaces,
  54. (c) wire protocols are specified in capnp schemas.
  55. \section{Background}
  56. This section reviews important properties of successful peer-to-peer systems, which IPFS combines.
  57. \subsection{Distributed Hash Tables}
Distributed Hash Tables (DHTs) are widely used to coordinate and maintain metadata about peer-to-peer systems. For example, the BitTorrent Mainline DHT tracks the sets of peers that are part of a torrent swarm.
  59. \subsubsection{Kademlia DHT}
  60. Kademlia \cite{Kademlia} is a popular DHT that provides:
  61. \begin{enumerate}
  62. \item Efficient lookup through massive networks:
queries on average contact $\ceil{\log_2 (n)}$ nodes
(e.g. $24$ hops for a network of $10{,}000{,}000$ nodes).
  65. \item Low coordination overhead: it optimizes the number of
  66. control messages it sends to other nodes.
  67. \item Resistance to various attacks, by preferring nodes who have been
  68. part of the DHT longer.
\item Wide usage in peer-to-peer applications, including \\
Gnutella and BitTorrent, forming networks of over 100 million nodes.
  71. \end{enumerate}
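The XOR metric and the logarithmic lookup bound listed above can be illustrated with a minimal Go sketch. It assumes 64-bit node ids for brevity (Kademlia itself uses 160-bit ids), and the function names \texttt{xorDistance}, \texttt{bucketIndex}, and \texttt{lookupBound} are hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// xorDistance is the Kademlia distance metric between two node ids.
// Assumption: ids are 64 bits wide here; real deployments use 160 bits.
func xorDistance(a, b uint64) uint64 {
	return a ^ b
}

// bucketIndex returns the position of the highest bit in which two ids
// differ, which selects the routing-table bucket for a peer.
// It returns -1 when the ids are identical.
func bucketIndex(a, b uint64) int {
	return bits.Len64(a^b) - 1
}

// lookupBound returns ceil(log2(n)), the average number of nodes a query
// contacts in a network of n nodes.
func lookupBound(n uint64) int {
	if n <= 1 {
		return 0
	}
	return bits.Len64(n - 1)
}

func main() {
	fmt.Println("distance:", xorDistance(5, 3))
	fmt.Println("bucket:", bucketIndex(5, 3))
	fmt.Println("lookup bound for 2^20 nodes:", lookupBound(1<<20))
}
```

Each step of a lookup moves to a peer that shares at least one more leading bit with the target, which is why the bound grows with the logarithm of the network size.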
  72. \subsubsection{Coral DSHT}
  73. While some peer-to-peer filesystems store data blocks directly in DHTs,
  74. this ``wastes storage and bandwidth, as data must be stored at nodes where it
  75. is not needed'' \cite{Coral}. Coral extends Kademlia in three particularly important ways:
  76. \begin{enumerate}
  77. \item Kademlia stores values in nodes whose ids are ``nearest'' (using
  78. XOR-distance) to the key. This does not take into account application
  79. data locality, ignores ``far'' nodes who may already have the data, and
  80. forces ``nearest'' nodes to store it, whether they need it or not.
This wastes significant storage and bandwidth. Instead, Coral stores
  82. addresses to peers who can provide the data blocks.
  83. \item Coral relaxes the DHT API from \texttt{get\_value(key)} to
  84. \texttt{get\_any\_values(key)} (the ``sloppy'' in DSHT).
  85. This still works since Coral users only need a single (working) peer,
  86. not the complete list. In return, Coral can distribute only subsets of
  87. the values to the ``nearest'' nodes, avoiding hot-spots (overloading
  88. \textit{all the nearest nodes} when a key becomes popular).
  89. \item Additionally, Coral organizes a hierarchy of separate DSHTs called
  90. \textit{clusters} depending on region and size. This enables nodes to
  91. query peers in their region first, ``finding nearby data without
  92. querying distant nodes'' and greatly reducing the latency of
  93. lookups.
  94. \end{enumerate}
  95. \subsubsection{S/Kademlia DHT}
  96. S/Kademlia extends Kademlia to protect against malicious attacks:
  97. \begin{enumerate}
\item S/Kademlia provides schemes to secure \texttt{NodeId} generation,
and prevent Sybil attacks. It requires nodes to create a PKI key pair, derive their identity from it, and sign their messages to each other. One scheme includes a proof-of-work crypto puzzle to make generating Sybils expensive.
\item S/Kademlia nodes look up values over disjoint paths, in order to
  101. ensure honest nodes can connect to each other in the presence of a large fraction of adversaries in the network. S/Kademlia achieves a success rate of 0.85 even with an adversarial fraction as large as half of the nodes.
  102. \end{enumerate}
  103. \subsection{Block Exchanges - BitTorrent}
  104. BitTorrent \cite{BitTorrent} is a widely successful peer-to-peer filesharing system, which succeeds in coordinating networks of untrusting peers (swarms) to cooperate in distributing pieces of files to each other. Key BitTorrent features that inform IPFS design:
  105. \begin{enumerate}
  106. \item BitTorrent's data exchange protocol uses a quasi tit-for-tat strategy
  107. which rewards nodes that contribute to each other, and punishes nodes who would only leech others' resources.
  108. \item BitTorrent peers track the availability of file pieces, prioritizing
  109. sending rarest-first. This takes load off seeds, making non-seed peers capable of trading with each other.
  110. \item BitTorrent's standard tit-for-tat is vulnerable to some exploitative
  111. bandwidth sharing strategies. PropShare \cite{propshare} is a different peer bandwidth allocation strategy that better resists exploitative strategies, and improves the performance of swarms.
  112. \end{enumerate}
  113. \subsection{Version Control Systems - Git}
Version Control Systems provide facilities to model files changing over time and distribute different versions efficiently. The popular version control system Git provides a powerful Merkle DAG\footnote{Merkle Directed Acyclic Graph: a similar but more general construction than a Merkle tree. It is deduplicated, does not need to be balanced, and its non-leaf nodes contain data.} object model that captures changes to a filesystem tree in a distributed-friendly way.
  115. \begin{enumerate}
  116. \item Immutable objects represent Files (\texttt{blob}), Directories (\texttt{tree}), and Changes (\texttt{commit}).
  117. \item Objects are content-addressed, by the cryptographic hash of their contents.
  118. \item Links to other objects are embedded, forming a Merkle DAG. This
  119. provides many useful integrity and workflow properties.
  120. \item Most versioning metadata (branches, tags, etc) are simply pointer references, and thus inexpensive to create and update.
  121. \item Version changes only update references or add objects.
  122. \item Distributing version changes to other users is simply transferring objects and updating remote references.
  123. \end{enumerate}
  124. \section{Design}
IPFS is a distributed file system which synthesizes successful ideas from previous peer-to-peer systems, including DHTs, BitTorrent, Git, and SFS. The contribution of IPFS is simplifying, evolving, and connecting proven techniques into a single cohesive system, greater than the sum of its parts. IPFS presents a new platform for writing and deploying applications, a new system for distributing and versioning large data, and could evolve the web itself.
  126. IPFS is peer-to-peer; no nodes are privileged. IPFS nodes store IPFS objects in local storage. Nodes connect to each other and transfer objects. These objects represent files and other data structures. The IPFS Protocol is divided into a stack of sub-protocols responsible for different functionality:
  127. \begin{enumerate}
  128. \item \textbf{Identities} - manages node identity generation and verification. Described in Section 3.1.
  129. \item \textbf{Network} - manages connections to other peers, using various underlying network protocols. Configurable. Described in Section 3.2.
  130. \item \textbf{Routing} - maintains information to locate specific peers and objects. Responds to both local and remote queries. Defaults to a DHT, but is swappable. Described in Section 3.3.
\item \textbf{Exchange} - a block exchange protocol (BitSwap) that governs efficient block distribution. Modelled as a market, it weakly incentivizes replication. Trade strategies are swappable. Described in Section 3.4.
\item \textbf{Objects} - a Merkle DAG of content-addressed immutable objects with links. Used to represent arbitrary data structures, e.g. file hierarchies and communication systems. Described in Section 3.5.
  133. \item \textbf{Files} - versioned file system hierarchy, inspired by Git. Described in Section 3.6.
  134. \item \textbf{Naming} - A self-certifying mutable name system. Described in Section 3.7.
  135. \end{enumerate}
  136. These subsystems are not independent; they are integrated and leverage
  137. blended properties. However, it is useful to describe them separately,
  138. building the protocol stack from the bottom up.
  139. \subsection{Identities}
Nodes are identified by a \texttt{NodeId}, the cryptographic hash\footnote{Throughout this document, \textit{hash} and \textit{checksum} refer specifically to cryptographic hash checksums of data.} of a public key, created as in \cite{skademlia}. Nodes store their public and private keys (encrypted with a passphrase). Users are free to instantiate a ``new'' node identity on every launch, though that loses accrued network benefits. Nodes are incentivized to remain the same.
  141. \begin{verbatim}
type Checksum string   // cryptographic hash of some data
type PublicKey string
type PrivateKey string
type NodeId Checksum   // hash of the node's public key

type Node struct {
    nodeid NodeId
    pubkey PublicKey
    prikey PrivateKey  // stored encrypted with a passphrase
}
  151. \end{verbatim}
  152. Upon first connecting, peers exchange public keys, and check: \texttt{hash(other.PublicKey) equals other.NodeId}. If not, the connection is terminated.
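The check performed on first contact can be sketched in Go as follows. This is only an illustration under stated assumptions: the text does not fix a hash function or key type, so SHA-256 and Ed25519 are stand-ins, and the helpers \texttt{nodeIdFor}, \texttt{newIdentity}, and \texttt{verifyPeer} are hypothetical names.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// NodeId is the hash of a node's public key.
type NodeId []byte

// nodeIdFor derives a NodeId from a public key.
// Assumption: SHA-256 stands in for the unspecified cryptographic hash.
func nodeIdFor(pub []byte) NodeId {
	sum := sha256.Sum256(pub)
	return sum[:]
}

// newIdentity generates a fresh key pair and the NodeId derived from it.
// Assumption: Ed25519 stands in for the unspecified key type.
func newIdentity() (pub, priv []byte, id NodeId, err error) {
	p, s, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return nil, nil, nil, err
	}
	return p, s, nodeIdFor(p), nil
}

// verifyPeer reports whether hash(other.PublicKey) equals other.NodeId.
// A false result means the connection should be terminated.
func verifyPeer(claimed NodeId, pub []byte) bool {
	return bytes.Equal(nodeIdFor(pub), claimed)
}

func main() {
	pub, _, id, err := newIdentity()
	if err != nil {
		panic(err)
	}
	fmt.Println("honest peer accepted:", verifyPeer(id, pub))

	// A peer that claims an id not derived from its key is rejected.
	forged := append(NodeId{}, id...)
	forged[0] ^= 0xff
	fmt.Println("forged id accepted:", verifyPeer(forged, pub))
}
```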
  153. \subsection{Network}
IPFS nodes communicate regularly with hundreds of other nodes in the network, across the wide Internet. IPFS can use any reliable transport protocol, and is best suited for WebRTC DataChannels \cite{WebRTC} (for browser connectivity) or uTP \cite{uTP} (LEDBAT \cite{LEDBAT}). IPFS also uses the ICE NAT traversal techniques \cite{ICE} to increase connectivity between peers.
  155. \begin{itemize}
  156. \item \textbf{Reliability:} IPFS can provide reliability if underlying networks do not provide it, using uTP or SCTP.
  157. \item \textbf{Integrity:} optionally checks integrity of messages using a hash checksum.
  158. \item \textbf{Authenticity:} optionally checks authenticity of messages using HMAC with sender's public key.
  159. \end{itemize}
  160. \subsection{Routing}
IPFS nodes require a routing system that can find (a) other peers' network addresses and (b) peers who can serve particular objects. IPFS achieves this using a DSHT based on S/Kademlia and Coral, using the properties discussed in Section 2.1. The size of objects and use patterns of IPFS are similar to Coral \cite{Coral} and Mainline \cite{Mainline}, so references are stored in the DHT instead of entire blocks. References are the \texttt{NodeIds} of peers who can serve the block.
  162. The interface of this DSHT is:
  163. \begin{verbatim}
routing.findPeer(NodeId)
    // gets a particular peer's network address
routing.findValuePeers(Checksum, int)
    // gets a number of peers serving a value.
routing.provideValue(Checksum)
    // announces that this node can serve a value.
  170. \end{verbatim}
  171. Note: different use cases will call for substantially different routing systems (e.g. DHT in wide network, static HT in local network). Thus the IPFS routing system can be swapped for one to fit the users' needs. As long as the interface above is met, the rest of the system will continue to function.
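To illustrate the swappability described above, the following Go sketch expresses the three calls as an interface and satisfies it with an in-memory table, in the spirit of the ``static HT in local network'' case. The type \texttt{staticTable}, its constructor, and the exact method signatures are assumptions made for this example, not part of the protocol.

```go
package main

import "fmt"

type NodeId string
type Checksum string

// Routing mirrors the three calls of the routing interface.
// Assumption: the return types are illustrative choices.
type Routing interface {
	FindPeer(id NodeId) (addr string, ok bool)
	FindValuePeers(key Checksum, count int) []NodeId
	ProvideValue(key Checksum)
}

// staticTable is a hypothetical routing system backed by local maps.
type staticTable struct {
	self      NodeId
	addrs     map[NodeId]string
	providers map[Checksum][]NodeId
}

func newStaticTable(self NodeId, addr string) *staticTable {
	return &staticTable{
		self:      self,
		addrs:     map[NodeId]string{self: addr},
		providers: map[Checksum][]NodeId{},
	}
}

// FindPeer gets a particular peer's network address.
func (t *staticTable) FindPeer(id NodeId) (string, bool) {
	addr, ok := t.addrs[id]
	return addr, ok
}

// FindValuePeers gets up to count peers serving a value.
func (t *staticTable) FindValuePeers(key Checksum, count int) []NodeId {
	peers := t.providers[key]
	if count < 0 {
		count = 0
	}
	if count < len(peers) {
		peers = peers[:count]
	}
	return peers
}

// ProvideValue announces that this node can serve a value.
func (t *staticTable) ProvideValue(key Checksum) {
	for _, p := range t.providers[key] {
		if p == t.self {
			return // already announced
		}
	}
	t.providers[key] = append(t.providers[key], t.self)
}

func main() {
	var r Routing = newStaticTable("node-a", "10.0.0.1:4001")
	r.ProvideValue("abc")
	fmt.Println(r.FindValuePeers("abc", 5))
}
```

Any other implementation of the same interface, such as a DSHT client, could be substituted without changing the callers.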
  172. \subsection{Block Exchange - BitSwap Protocol}
  173. The exchange of data in IPFS happens by exchanging blocks with peers using a
  174. BitTorrent inspired protocol: BitSwap. Like BitTorrent, BitSwap peers are
  175. looking to acquire a set of blocks, and have blocks to offer in exchange.
  176. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent.
BitSwap operates as a persistent marketplace where nodes can acquire the
  178. blocks they need, regardless of what files the blocks are part of. The
  179. blocks could come from completely unrelated files in the filesystem.
  180. But nodes come together to barter in the marketplace.
  181. While the notion of a barter system implies a virtual currency could be
  182. created, this would require a global ledger to track ownership
  183. and transfer of the currency. This can be implemented as a BitSwap Strategy, and will be explored in a future paper.
In the base case, BitSwap nodes have to provide direct value to each other
in the form of blocks. This works fine when the distribution of blocks across
nodes is complementary, so that each node has what the other wants. This will
seldom be the case. Instead, it is more likely that nodes must \textit{work}
  188. for their blocks. In the case that a node has nothing that its peers want (or
  189. nothing at all), it seeks the pieces its peers want, with lower
  190. priority than what the node wants itself. This incentivizes nodes to cache and
  191. disseminate rare pieces, even if they are not interested in them directly.
  192. \subsubsection{BitSwap Credit}
  193. The protocol must also incentivize nodes to seed when they do not need
  194. anything in particular, as they might have the blocks others want. Thus,
BitSwap nodes send blocks to their peers optimistically, expecting the debt to
be repaid. However, the exchange must protect against leeches (free-loading nodes that never share). A simple credit-like system solves the problem:
  197. \begin{enumerate}
  198. \item Peers track their balance (in bytes verified) with other nodes.
  199. \item Peers send blocks to debtor peers probabilistically, according to
  200. a function that falls as debt increases.
  201. \end{enumerate}
  202. Note that if a peer decides not to send, the peer subsequently ignores the
  203. other node for an \texttt{ignore\_cooldown} timeout. This prevents senders
  204. from trying to game the probability by just causing more dice-rolls.
(The default BitSwap \texttt{ignore\_cooldown} is 10 seconds.)
  206. \subsubsection{BitSwap Strategy}
The differing strategies that BitSwap peers might employ have wildly different effects on the performance of the exchange as a whole. In BitTorrent, while a standard strategy is specified (tit-for-tat), a variety of others have been implemented, ranging from BitTyrant \cite{BitTyrant} (sharing the least possible), to BitThief \cite{BitThief} (exploiting a vulnerability and never sharing), to PropShare \cite{PropShare} (sharing proportionally). A range of strategies (good and malicious) could similarly be implemented by BitSwap peers. The choice of function, then, should aim to:
  208. \begin{enumerate}
  209. \item maximize the trade performance for the node, and the whole exchange
  210. \item prevent freeloaders from exploiting and degrading the exchange
  211. \item be effective with and resistant to other, unknown
  212. strategies
  213. \item be lenient to trusted peers
  214. \end{enumerate}
  215. The exploration of the space of such strategies is future work.
  216. One choice of function that works in practice is a sigmoid, scaled by a
\textit{debt ratio}.
  218. Let the \textit{debt ratio} $ r $ between a node and its peer be:
  219. \[ r = \dfrac{\texttt{bytes\_sent}}{\texttt{bytes\_recv} + 1} \]
  220. Given $r$, let the probability of sending to a debtor be:
\[ P\Big( \; \text{send} \; | \; r \;\Big) = 1 - \dfrac{1}{1 + \exp(6-3r)} \]
  222. \begin{figure}
  223. \centering
  224. \begin{tikzpicture}[domain=0:4]
  225. \draw[->] (-0,0) -- (4.2,0) node[right] {$r$};
  226. \draw[->] (0,-0) -- (0,1.20) node[above] {$P(\;send\;|\;r\;)$};
  227. %ticks
  228. \foreach \x in {0,...,4}
  229. \draw (\x,1pt) -- (\x,-3pt)
  230. node[anchor=north] {\x};
  231. \foreach \y in {1,...,1}
  232. \draw (1pt,\y) -- (-3pt,\y)
  233. node[anchor=east] {\y};
  234. \draw[color=red] plot[] function{1 - 1/(1+exp(6-3*x))};
  235. \end{tikzpicture}
  236. \caption{Probability of Sending as $r$ increases}
  237. \label{fig:psending-graph}
  238. \end{figure}
As Figure~\ref{fig:psending-graph} shows, this function drops off quickly as the nodes'
\textit{debt ratio} surpasses twice the established credit.
  241. The \textit{debt ratio} is a measure of trust:
  242. lenient to debts between nodes that have previously exchanged lots of data
  243. successfully, and merciless to unknown, untrusted nodes. This
  244. (a) provides resistance to attackers who would create lots of new nodes
(Sybil attacks),
  246. (b) protects previously successful trade relationships, even if one of the
  247. nodes is temporarily unable to provide value, and
  248. (c) eventually chokes relationships that have deteriorated until they
  249. improve.
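The debt ratio and the sigmoid send probability defined above translate directly into code. The Go sketch below is a minimal illustration; the function names are hypothetical, and \texttt{shouldSend} takes the random draw as a parameter (an assumption made here so the example stays deterministic).

```go
package main

import (
	"fmt"
	"math"
)

// debtRatio computes r = bytes_sent / (bytes_recv + 1) for one peer.
func debtRatio(bytesSent, bytesRecv uint64) float64 {
	return float64(bytesSent) / float64(bytesRecv+1)
}

// sendProbability computes P(send | r) = 1 - 1/(1 + exp(6 - 3r)).
func sendProbability(r float64) float64 {
	return 1 - 1/(1+math.Exp(6-3*r))
}

// shouldSend decides whether to send to a debtor, given a uniform random
// draw roll in [0, 1). A caller would supply roll from a random source.
func shouldSend(r, roll float64) bool {
	return roll < sendProbability(r)
}

func main() {
	for _, r := range []float64{0, 1, 2, 3, 4} {
		fmt.Printf("r=%.1f  P(send|r)=%.3f\n", r, sendProbability(r))
	}
	// A peer that has been sent 300 bytes and returned 99 has ratio 3.
	fmt.Println("ratio:", debtRatio(300, 99))
}
```

The probability is exactly one half at $r = 2$, the midpoint of the sigmoid, which matches the drop-off visible in the figure.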
  250. % \begin{center}
  251. % \begin{tabular}{ >{$}c<{$} >{$}c<{$}}
  252. % P(\;send\;|\quad r) \;\;\;\;\;& \\
  253. % \hline
  254. % \hline
  255. % P(\;send\;|\;0.0) =& 1.00 \\
  256. % P(\;send\;|\;0.5) =& 1.00 \\
  257. % P(\;send\;|\;1.0) =& 0.98 \\
  258. % P(\;send\;|\;1.5) =& 0.92 \\
  259. % P(\;send\;|\;2.0) =& 0.73 \\
  260. % P(\;send\;|\;2.5) =& 0.38 \\
  261. % P(\;send\;|\;3.0) =& 0.12 \\
  262. % P(\;send\;|\;3.5) =& 0.03 \\
  263. % P(\;send\;|\;4.0) =& 0.01 \\
  264. % P(\;send\;|\;4.5) =& 0.00 \\
  265. % \end{tabular}
  266. % \end{center}
  267. % TODO look into computing share of the bandwidth, as described in propshare.
  268. \subsubsection{BitSwap Ledger}
BitSwap nodes keep ledgers accounting the transfers with other nodes. This allows nodes to keep track of history, and to avoid tampering. When activating a connection, BitSwap nodes exchange their ledger information. If it does not match exactly, the ledger is reinitialized from scratch, losing the accrued credit or debt. It is possible for malicious nodes to purposefully ``lose'' the ledger, hoping to erase debts. It is unlikely that nodes will have accrued enough debt to warrant also losing the accrued trust; however, the partner node is free to count it as \textit{misconduct} (discussed later).
  270. \begin{verbatim}
type Ledger struct {
    owner      NodeId
    partner    NodeId
    bytes_sent int
    bytes_recv int
    timestamp  Timestamp
}
  278. \end{verbatim}
  279. Nodes are free to keep the ledger history, though it is not necessary for
  280. correct operation. Only the current ledger entries are useful. Nodes are
  281. also free to garbage collect ledgers as necessary, starting with the less
  282. useful ledgers: the old (peers may not exist anymore) and small.
  283. \subsubsection{BitSwap Specification}
  284. BitSwap nodes follow a simple protocol.
  285. \begin{verbatim}
// Additional state kept
type BitSwap struct {
    ledgers map[NodeId]Ledger
    // Ledgers known to this node, inc inactive
    active map[NodeId]Peer
    // currently open connections to other nodes
    need_list []Checksum
    // checksums of blocks this node needs
    have_list []Checksum
    // checksums of blocks this node has
}

type Peer struct {
    nodeid NodeId
    ledger Ledger
    // Ledger between the node and this peer
    last_seen Timestamp
    // timestamp of last received message
    want_list []Checksum
    // checksums of all blocks wanted by peer
    // includes blocks wanted by peer's peers
}

// Protocol interface:
interface Peer {
    open (nodeid :NodeId, ledger :Ledger);
    send_want_list (want_list :WantList);
    send_block (block :Block) -> (complete :Bool);
    close (final :Bool);
}
  314. \end{verbatim}
  315. Sketch of the lifetime of a peer connection:
  316. \begin{enumerate}
  317. \item Open: peers send \texttt{ledgers} until they agree.
  318. \item Sending: peers exchange \texttt{want\_lists} and \texttt{blocks}.
  319. \item Close: peers deactivate a connection.
\item Ignored: (special) a peer is ignored (for the duration of a timeout)
if a node's strategy avoids sending.
  322. \end{enumerate}
  323. \paragraph{Peer.open(NodeId, Ledger)}
When connecting, a node initializes a connection with a
\texttt{Ledger}, either stored from a connection in the past or a new one
zeroed out. It then sends an \texttt{Open} message with the \texttt{Ledger} to the peer.
Upon receiving an \texttt{Open} message, a peer chooses whether to activate
the connection. If, according to the receiver's \texttt{Ledger}, the sender
is not a trusted agent (transmission below zero, or large outstanding debt), the
receiver may opt to ignore the request. This should be done probabilistically
with an \texttt{ignore\_cooldown} timeout, so as to allow errors to be corrected
and attackers to be thwarted.
If activating the connection, the receiver initializes a \texttt{Peer} object with the
local version of the \texttt{Ledger} and sets the \texttt{last\_seen}
timestamp. Then, it compares the received
\texttt{Ledger} with its own. If they match exactly, the connection is
open. If they do not match, the peer creates a new zeroed out
\texttt{Ledger}, and sends it.
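The ledger comparison performed when opening a connection can be sketched in Go as below. The text only says the ledgers must ``match exactly''; this sketch assumes that two ledgers match when they mirror each other (one side's bytes sent equal the other side's bytes received). The timestamp field is omitted, and \texttt{mirrors} and \texttt{reconcile} are hypothetical names.

```go
package main

import "fmt"

type NodeId string

// Ledger is a reduced form of the ledger structure (timestamp omitted).
type Ledger struct {
	Owner, Partner       NodeId
	BytesSent, BytesRecv uint64
}

// mirrors reports whether two ledgers describe the same relationship as
// seen from opposite sides. Assumption: this is what "match exactly" means.
func mirrors(local, remote Ledger) bool {
	return local.Owner == remote.Partner &&
		local.Partner == remote.Owner &&
		local.BytesSent == remote.BytesRecv &&
		local.BytesRecv == remote.BytesSent
}

// reconcile returns the ledger to use for the connection and whether the
// received ledger matched. On a mismatch it returns a zeroed ledger, which
// the node would send back to its peer.
func reconcile(local, received Ledger) (Ledger, bool) {
	if mirrors(local, received) {
		return local, true
	}
	return Ledger{Owner: local.Owner, Partner: local.Partner}, false
}

func main() {
	mine := Ledger{Owner: "a", Partner: "b", BytesSent: 100, BytesRecv: 40}
	theirs := Ledger{Owner: "b", Partner: "a", BytesSent: 40, BytesRecv: 100}
	fmt.Println(reconcile(mine, theirs))

	// A peer that "lost" its ledger forces both sides back to zero.
	lost := Ledger{Owner: "b", Partner: "a"}
	fmt.Println(reconcile(mine, lost))
}
```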
  339. \paragraph{Peer.send\_want\_list(WantList)}
  340. While the connection is open, nodes advertise their
  341. \texttt{want\_list} to all connected peers. This is done (a) upon opening the
  342. connection, (b) after a randomized periodic timeout, (c) after a change in
  343. the \texttt{want\_list} and (d) after receiving a new block.
  344. Upon receiving a \texttt{want\_list}, a node stores it. Then, it checks whether
  345. it has any of the wanted blocks. If so, it sends them according to the
  346. \textit{BitSwap Strategy} above.
  347. \paragraph{Peer.send\_block(Block)}
  348. Sending a block is straightforward. The node simply transmits the block of
  349. data. Upon receiving all the data, the receiver computes the Checksum to
  350. verify it matches the expected one, and returns confirmation.
  351. Upon finalizing the correct transmission of a block, the receiver moves the
  352. block from \texttt{need\_list} to \texttt{have\_list}, and both the receiver
  353. and sender update their ledgers to reflect the additional bytes transmitted.
If a transmission verification fails, the receiver instead \textit{penalizes}
  355. the sender. Both receiver and sender should update their ledgers accordingly,
  356. though the sender is either malfunctioning or attacking the receiver. Note that
  357. BitSwap expects to operate on a reliable transmission channel, so data errors
  358. -- which could lead to incorrect penalization of an honest sender -- are
  359. expected to be caught before the data is given to BitSwap. IPFS uses the uTP
  360. protocol.
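The receiver's side of this exchange can be sketched in Go as follows. SHA-256 is an assumed stand-in for the unspecified checksum function, the names \texttt{node}, \texttt{newNode}, \texttt{checksumOf}, and \texttt{receiveBlock} are hypothetical, and the form of the penalty is left to the caller because the text does not define it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type Checksum string

type Ledger struct{ BytesSent, BytesRecv uint64 }

// node holds the receiver-side state touched by a block transfer.
type node struct {
	needList map[Checksum]bool
	haveList map[Checksum][]byte
	ledger   Ledger
}

func newNode(needs ...Checksum) *node {
	n := &node{needList: map[Checksum]bool{}, haveList: map[Checksum][]byte{}}
	for _, c := range needs {
		n.needList[c] = true
	}
	return n
}

// checksumOf hashes block data.
// Assumption: SHA-256, hex encoded, stands in for the real checksum.
func checksumOf(data []byte) Checksum {
	sum := sha256.Sum256(data)
	return Checksum(hex.EncodeToString(sum[:]))
}

// receiveBlock verifies a received block against the expected checksum.
// On success it moves the block from the need list to the have list and
// credits the bytes in the ledger. On failure it changes nothing and
// returns false, so the caller can penalize the sender.
func (n *node) receiveBlock(expected Checksum, data []byte) bool {
	if checksumOf(data) != expected {
		return false
	}
	delete(n.needList, expected)
	n.haveList[expected] = data
	n.ledger.BytesRecv += uint64(len(data))
	return true
}

func main() {
	data := []byte("hello ipfs")
	sum := checksumOf(data)
	n := newNode(sum)
	fmt.Println("tampered accepted:", n.receiveBlock(sum, []byte("tampered")))
	fmt.Println("correct accepted:", n.receiveBlock(sum, data))
	fmt.Println("bytes received:", n.ledger.BytesRecv)
}
```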
  361. \paragraph{Peer.close(Bool)}
  362. The \texttt{final} parameter to \texttt{close} signals whether the intention
  363. to tear down the connection is the sender's or not. If false, the receiver
may opt to re-open the connection immediately. This avoids premature
  365. closes.
  366. A peer connection should be closed under two conditions:
  367. \begin{itemize}
  368. \item a \texttt{silence\_wait} timeout has expired without receiving any
  369. messages from the peer (default BitSwap uses 30 seconds).
  370. In this case, the node issues a \\
  371. \texttt{Peer.close(false)} message.
  372. \item the node is exiting and BitSwap is being shut down.
  373. In this case, the node issues a \texttt{Peer.close(true)} message.
  374. \end{itemize}
  375. After a \texttt{close} message, both receiver and sender tear down the
  376. connection, clearing any state stored. The \texttt{Ledger} may be stored for
  377. the future, if it is useful to do so.
  378. \paragraph{Notes}
  379. \begin{itemize}
  380. \item Non-\texttt{open} messages on an inactive connection should be ignored.
  381. In case of a \texttt{send\_block} message, the receiver may check
  382. the block to see if it is needed and correct, and if so, use it.
  383. Regardless, all such out-of-order messages trigger a
  384. \texttt{close(false)} message from the receiver, to force
  385. re-initialization of the connection.
  386. \end{itemize}
  387. % TODO: Rate Limiting / Node Silencing
  388. \subsection{Object Model}
  389. The DHT and BitSwap allow IPFS to form a massive peer-to-peer system for storing
  390. and distributing blocks quickly and robustly to users.
  391. IPFS builds a filesystem out of this efficient block distribution system,
  392. constructing files and directories out of blocks.
  393. Files in IPFS are represented as a collection of inter-related objects, like in
  394. the version control system Git. Each object is addressed by the cryptographic
  395. hash of its contents (\texttt{Checksum}). The file objects are:
  396. \begin{enumerate}
  397. \item \texttt{block}: a variable-size block of data.
  398. \item \texttt{list}: a collection of blocks or other lists.
  399. \item \texttt{tree}: a collection of blocks, lists, or other trees.
  400. \item \texttt{commit}: a snapshot in the version history of a tree.
  401. \end{enumerate}
  402. We hoped to use the Git object formats exactly, but had to depart to introduce
  403. certain features useful in a distributed filesystem, for example fast size
  404. lookups (aggregate byte sizes have been added to objects), large file
  405. deduplication and versioning (adding a \texttt{list} object), and more.
  406. However, our objects are perfectly compatible with Git and
  407. conversion between the two does not lose any information.
  408. Notes:
  409. \begin{itemize}
  410. \item \texttt{varint} is a variable size int, as in protocol-buffers.
  411. \item objects are serialized using \texttt{capnp}.
  412. \end{itemize}
  413. \subsubsection{Block Object}
  414. The \texttt{Block} object contains an addressable unit of data, and
  415. represents a file.
  416. IPFS Blocks are like Git blobs or filesystem data blocks. They store the
  417. users' data. (The name \textit{block} is preferred over \textit{blob}, as the
  418. Git-inspired view of a \textit{blob} as a \textit{file} breaks down in IPFS.
  419. IPFS files can be represented by both \texttt{lists} and \texttt{blocks}.)
  420. Format:
  421. \begin{verbatim}
block <size>
<block data bytes>
...
  425. \end{verbatim}
  426. \subsubsection{List Object}
  427. The \texttt{List} object represents a large or de-duplicated file made up of
  428. several IPFS \texttt{Blocks} concatenated together. \texttt{Lists} contain
  429. an ordered sequence of \texttt{block} or \texttt{list} objects.
  430. In a sense, the IPFS \texttt{List} functions like a filesystem file with
indirect blocks. Since \texttt{lists} can contain other \texttt{lists}, topologies including linked lists and balanced trees are possible. Directed graphs where the same node appears in multiple places allow in-file deduplication. Cycles are not possible (enforced by hash addressing).
  432. Format:
  433. \begin{verbatim}
list <num objects> <size varint>
<list or block> <checksum> <size varint>
<list or block> <checksum> <size varint>
...
  438. \end{verbatim}
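How content addressing yields deduplication and aggregate sizes in \texttt{list} objects can be illustrated with a small Go sketch. It is not the real serialization: sizes are written as decimal text rather than varints, SHA-256 is an assumed hash, and \texttt{link}, \texttt{blockObject}, and \texttt{listObject} are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// link is one entry of a list: the child's kind, checksum, and size.
type link struct {
	kind     string // "block" or "list"
	checksum string
	size     uint64
}

// checksumOf addresses an encoded object by the hash of its contents.
func checksumOf(encoded string) string {
	sum := sha256.Sum256([]byte(encoded))
	return hex.EncodeToString(sum[:])
}

// blockObject encodes data as "block <size>" followed by the bytes.
func blockObject(data string) (string, link) {
	encoded := fmt.Sprintf("block %d\n%s", len(data), data)
	return encoded, link{"block", checksumOf(encoded), uint64(len(data))}
}

// listObject encodes an ordered sequence of children and records the
// aggregate size, so size lookups need not fetch the children.
func listObject(children []link) (string, link) {
	var total uint64
	var body strings.Builder
	for _, c := range children {
		total += c.size
		fmt.Fprintf(&body, "%s %s %d\n", c.kind, c.checksum, c.size)
	}
	encoded := fmt.Sprintf("list %d %d\n%s", len(children), total, body.String())
	return encoded, link{"list", checksumOf(encoded), total}
}

func main() {
	_, a := blockObject("hello ")
	_, b := blockObject("world")
	// The same block appears twice but is stored once under one checksum.
	encoded, l := listObject([]link{a, b, a})
	fmt.Print(encoded)
	fmt.Println("list size:", l.size)
}
```

Because a list's checksum depends on the checksums of its children, a list can never contain itself, which is the sense in which hash addressing rules out cycles.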
  439. \subsubsection{Tree Object}
  440. The \texttt{tree} object in IPFS is similar to Git trees: it represents a
directory, a list of checksums and names. The checksums reference \texttt{block},
\texttt{list}, or other \texttt{tree} objects. Note that traditional path naming
  443. is implemented entirely by the \texttt{tree} objects. \texttt{Blocks} and
  444. \texttt{lists} are only addressed by their \texttt{checksums}.
  445. Format:
  446. \begin{verbatim}
tree <num objects> <size varint>
<tree or list or block> <checksum> <size varint> <name>
<tree or list or block> <checksum> <size varint> <name>
...
  451. \end{verbatim}
  452. \subsubsection{Commit Object}
  453. The \texttt{commit} object in IPFS is similar to Git's. It represents a
  454. snapshot in the version history of a \texttt{tree}. Note that user
  455. addresses are NodeIds (the hash of the public key).
  456. \begin{verbatim}
  457. commit <size varint>
  458. parent <commit checksum>
  459. tree <tree checksum>
  460. author <author public key> <ISO UTC date>
  461. committer <committer public key> <ISO UTC date>
  462. <commit message>
  463. \end{verbatim}
  464. \subsubsection{Version control}
  465. The \texttt{commit} object represents a particular snapshot in the version
history of a tree. Comparing the \texttt{trees} and child objects of two
different commits reveals the differences between two versions of the
filesystem. As long as a single \texttt{commit} and all the child objects
  469. it references are accessible, all preceding versions are retrievable and the
  470. full history of the filesystem changes can be accessed. This is a consequence
  471. of the \texttt{Git} object model and the graph it forms.
  472. The full power of the \texttt{Git} version control tools is available to IPFS
  473. users. The object model is compatible (though not the same). The standard
  474. \texttt{Git} tools can be used on the \texttt{IPFS} object graph after a
  475. conversion. Additionally, a fork of the tools is under development that will
  476. allow users to use them directly without conversion.
\subsubsection{Object-level Cryptography}
  478. IPFS is equipped to handle object-level cryptographic operations. Any additional
bytes are appended to the end of the object. This changes the object's hash
  480. (defining a different object, as it should). IPFS exposes an API that
  481. automatically verifies signatures or decrypts data.
  482. \begin{itemize}
  483. \item \texttt{Signing}. Signature appended.
  484. \item \texttt{Encryption}. Optional recipient's public key appended.
  485. \end{itemize}
  486. \subsubsection{Merkle Trees}
  487. The object model in IPFS forms a \textit{Merkle Tree}, which provides IPFS with
  488. useful properties:
  489. \begin{enumerate}
  490. \item \textbf{Content Addressing:} all content is uniquely identified by its
  491. \texttt{checksum}, \textbf{including child checksums}. This means a
  492. particular \texttt{tree} object points to \textit{specific} children.
  493. Committing changes to a \texttt{block} also commits changes to the
  494. containing \texttt{tree}.
\item \textbf{Tamper resistance:} all content is verified against its \texttt{checksum}.
If data is tampered with before being delivered, the client
detects and discards it.
\item \textbf{Deduplication:} all objects that hold the exact same content
  499. are the same, and only stored once. This is particularly useful with
  500. parent objects, such as lists, trees, and commits.
  501. \end{enumerate}
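The three properties can be illustrated with a small sketch. SHA-256, the newline-separated text encoding, and the omission of the size fields are simplifying assumptions; \texttt{put}, \texttt{verify}, and \texttt{tree\_bytes} are hypothetical helpers, not part of any IPFS API.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content address (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def put(store: dict, data: bytes) -> str:
    """Store an object under its checksum; identical content is kept once."""
    key = checksum(data)
    store[key] = data
    return key


def verify(key: str, data: bytes) -> bool:
    """Tamper check: received bytes must hash to the requested key."""
    return checksum(data) == key


def tree_bytes(entries) -> bytes:
    """Simplified tree encoding: one '<kind> <checksum> <name>' line per entry."""
    lines = ["tree %d" % len(entries)]
    lines += ["%s %s %s" % entry for entry in entries]
    return "\n".join(lines).encode()


store = {}
a = put(store, b"block 1\n1")
b = put(store, b"block 1\n1")          # deduplicated: same key as a
t1 = checksum(tree_bytes([("block", a, "f")]))
a2 = checksum(b"block 1\n2")           # the block's content changed
t2 = checksum(tree_bytes([("block", a2, "f")]))  # so the tree's checksum changes
```

Since a parent embeds its children's checksums, a change to any descendant propagates up to every ancestor's checksum.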
  502. \subsection{The Filesystem}
  503. \subsubsection{Filesystem Paths}
  504. IPFS exposes a slash-delimited path-based API. Paths work the same as in any
  505. traditional UNIX filesystem. Path subcomponents have different meanings per
  506. object:
  507. \begin{center}
  508. \begin{tabular}{ll}
  509. \texttt{object} & subcomponent meaning \\
  510. \hline
  511. \hline
  512. \texttt{block} & N/A (no children) \\
  513. \texttt{list} & integer index \\
  514. \texttt{tree} & string name \\
  515. \texttt{commit} & string name (in tree) \\
  516. \end{tabular}
  517. \end{center}
  518. \begin{figure}
  519. \centering
  520. \begin{tikzpicture}[->,>=stealth',auto,thick,
  521. minimum height=2em,minimum width=5em]
  522. \tikzstyle{ghost}=[rectangle,rounded corners=.8ex];
  523. \tikzstyle{block}=[rectangle,draw,fill=blue!20,rounded corners=.8ex];
  524. \tikzstyle{list}=[rectangle,draw,fill=cyan!20,rounded corners=.8ex];
  525. \tikzstyle{tree}=[rectangle,draw,fill=green!20,rounded corners=.8ex];
  526. \tikzstyle{commit}=[rectangle,draw,fill=magenta!20,rounded corners=.8ex];
  527. \tikzstyle{every path}=[draw]
  528. \node[commit] (ccc111) {ccc111};
  529. \node[tree] (ttt111) [below=3em of ccc111] {ttt111};
  530. \node[tree] (ttt222) [below left=3em and 3em of ttt111] {ttt222};
  531. \node[tree] (ttt333) [below=3em of ttt111] {ttt333};
  532. \node[ghost] (ghost1) [below right=3em and 3em of ttt111] {};
  533. \node[list] (lll111) [below=3em of ttt333] {lll111};
  534. \node[block] (bbb111) [below=3em of ttt222] {bbb111};
  535. \node[block] (bbb222) [below right=3em and 3em of ttt333] {bbb222};
  536. \node[block] (bbb333) [below left=3em and 3em of lll111] {bbb333};
  537. \node[block] (bbb444) [below=3em of lll111] {bbb444};
  538. \node[block] (bbb555) [below right=3em and 3em of lll111] {bbb555};
  539. \path[every node/.style={font=\sffamily\small}]
  540. (ccc111) edge[out=-90,in=90] (ttt111)
  541. (ttt111) edge[out=-90,in=90] (ttt222)
  542. edge[out=-90,in=90] (ttt333)
  543. to [out=-90,in=90] (ghost1)
  544. to [out=-90,in=90] (bbb222)
  545. (ttt222) edge[out=-90,in=90] (bbb111)
  546. (ttt333) edge[out=-90,in=90] (lll111)
  547. edge[out=-90,in=90] (bbb222)
  548. (lll111) edge[out=-90,in=90] (bbb333)
  549. edge[out=-90,in=90] (bbb444)
  550. edge[out=-90,in=90] (bbb555)
  551. ;
  552. \end{tikzpicture}
  553. \caption{Sample Object Graph} \label{fig:sample-object-graph}
  554. \begin{verbatim}
  555. # ccc111 contents
  556. commit 313
  557. tree ttt111
  558. author <author public key> <ISO UTC date>
  559. committer <committer public key> <ISO UTC date>
  560. # ttt111 contents
  561. tree 3 250
  562. tree ttt222 46 ttt222-name
  563. tree ttt333 166 ttt333-name
  564. block bbb222 11 bbb222-name
  565. # ttt222 contents
  566. tree 1 10
  567. block bbb111 10 bbb111-name
  568. # ttt333 contents
  569. tree 2 104
  570. list lll111 93 lll111-name
  571. block bbb222 11 bbb222-eman
  572. # lll111 contents
  573. list 3 39
  574. block bbb333 12
  575. block bbb444 13
  576. block bbb555 14
# bbb111 contents
block 1
1
# bbb222 contents
block 2
22
# bbb333 contents
block 3
333
# bbb444 contents
block 4
4444
# bbb555 contents
block 5
55555
  586. \end{verbatim}
  587. \caption{Sample Objects} \label{fig:sample-objects}
  588. \end{figure}
  589. For example, given the sample objects in Figures \ref{fig:sample-object-graph} and \ref{fig:sample-objects}:
  590. \begin{verbatim}
  591. # to access tree ttt333:
  592. ccc111/ttt333-name
  593. # to access block bbb222:
  594. ccc111/bbb222-name
  595. ccc111/ttt333-name/bbb222-eman
  596. # to access list lll111:
  597. ccc111/ttt333-name/lll111-name
  598. # to access block bbb555:
  599. ccc111/ttt333-name/lll111-name/2
  600. \end{verbatim}
  601. Note that:
  602. \begin{itemize}
  603. \item[(a)] blocks have no children \\
  604. \texttt{.../<block>/<child>} is impossible
  605. \item[(b)] commits implicitly access their trees \\
  606. \texttt{.../<commit>/name}
  607. looks up \texttt{"name"} in \texttt{<commit>}'s \texttt{<tree>}
  608. \item[(c)] \texttt{list} children are accessed by their index \\
  609. \texttt{.../<list>/4} looks up the fifth block.
  610. \end{itemize}
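The lookup rules above can be sketched as a small resolver over the sample objects. The dictionary layout is a hypothetical in-memory stand-in for fetching objects by checksum, and \texttt{resolve} is an illustrative name.

```python
# Hypothetical in-memory copy of the sample object graph.
OBJECTS = {
    "ccc111": ("commit", "ttt111"),
    "ttt111": ("tree", {"ttt222-name": "ttt222",
                        "ttt333-name": "ttt333",
                        "bbb222-name": "bbb222"}),
    "ttt222": ("tree", {"bbb111-name": "bbb111"}),
    "ttt333": ("tree", {"lll111-name": "lll111",
                        "bbb222-eman": "bbb222"}),
    "lll111": ("list", ["bbb333", "bbb444", "bbb555"]),
    "bbb111": ("block", b"1"),
    "bbb222": ("block", b"22"),
    "bbb333": ("block", b"333"),
    "bbb444": ("block", b"4444"),
    "bbb555": ("block", b"55555"),
}


def resolve(path: str) -> str:
    """Return the key of the object that a slash-delimited path names."""
    parts = [p for p in path.split("/") if p]
    key = parts[0]
    for part in parts[1:]:
        kind, payload = OBJECTS[key]
        if kind == "commit":
            # (b) commits implicitly access their trees
            kind, payload = OBJECTS[payload]
        if kind == "tree":
            key = payload[part]          # string name
        elif kind == "list":
            key = payload[int(part)]     # (c) zero-based integer index
        else:
            # (a) blocks have no children
            raise LookupError("block %s has no children" % key)
    return key
```

For instance, \texttt{resolve("ccc111/ttt333-name/lll111-name/2")} walks commit, tree, tree, and list in turn and yields \texttt{bbb555}.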
  611. \paragraph{Path Lookup Performance}
  612. Path-based access traverses the object graph. Retrieving
  613. each object requires potentially looking up its key in the DHT,
  614. connecting to peers, and retrieving its blocks. This is considerable
  615. overhead, particularly when looking up paths with many components.
  616. This is mitigated by:
  617. \begin{itemize}
  618. \item \textbf{tree caching}: since all objects are hash-addressed, they
  619. can be cached indefinitely. Additionally, \texttt{trees} tend to be
  620. small in size so IPFS prioritizes caching them over \texttt{blocks}.
  621. \item \textbf{flattened trees}: for any given \texttt{tree}, a special
  622. \texttt{flattened tree} can be constructed to list all objects
reachable from the \texttt{tree}. Figure \ref{fig:flattened-ttt111} shows
  624. an example of a flattened tree. While IPFS does not construct flattened
trees by default, it provides a function for users to build them.
  626. \end{itemize}
  627. \begin{figure}
  628. \begin{verbatim}
tree 6 250
  630. tree ttt222 46 ttt222-name
  631. block bbb111 10 ttt222-name/bbb111-name
  632. tree ttt333 166 ttt333-name
list lll111 93 ttt333-name/lll111-name
  634. block bbb222 11 ttt333-name/bbb222-eman
  635. block bbb222 11 bbb222-name
  636. \end{verbatim}
  637. \caption{Flattened Tree for \texttt{ttt111}} \label{fig:flattened-ttt111}
  638. \end{figure}
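One way such a flattening function could work is a pre-order walk that prefixes each entry name with the path of its parent \texttt{tree}. The sketch below uses a hypothetical in-memory copy of the sample trees and, like the figure, does not expand \texttt{list} children; \texttt{flatten} is an illustrative name, not the function IPFS provides.

```python
# Hypothetical in-memory trees: key -> ordered (kind, checksum, name) entries.
TREES = {
    "ttt111": [("tree", "ttt222", "ttt222-name"),
               ("tree", "ttt333", "ttt333-name"),
               ("block", "bbb222", "bbb222-name")],
    "ttt222": [("block", "bbb111", "bbb111-name")],
    "ttt333": [("list", "lll111", "lll111-name"),
               ("block", "bbb222", "bbb222-eman")],
}


def flatten(key: str, prefix: str = ""):
    """List (kind, checksum, path) for every object reachable through trees."""
    entries = []
    for kind, child, name in TREES[key]:
        path = prefix + name
        entries.append((kind, child, path))
        if kind == "tree":
            # Descend so nested entries carry their full relative path.
            entries.extend(flatten(child, path + "/"))
    return entries


flat = flatten("ttt111")
```

With a flattened tree cached, resolving a multi-component path under \texttt{ttt111} needs a single object lookup instead of one per path component.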
  639. \subsubsection{Publishing Objects}
  640. IPFS is globally distributed. It is designed to allow the files of millions
of users to coexist. The \textbf{DHT} with content-hash addressing
  642. allows publishing objects in a fair, secure, and entirely distributed way.
  643. Anyone can publish an object by simply adding its key to the DHT, adding
  644. themselves as a peer, and giving other users the object's hash.
  645. Additionally, the IPFS root directory supports special functionality to
  646. allow namespacing and naming objects in a fair, secure, and distributed
  647. manner.
  648. \begin{itemize}
  649. \item[(a)] All objects are accessible by their hash. Thus, users can
  650. always reference an object (and its children) using
  651. \texttt{/<object\_hash>}.
  652. \item[(b)] \texttt{/<node\_id>} provides a self-certifying filesystem
  653. for user \texttt{node\_id}. If it exists, the object returned is a
  654. special \texttt{tree} signed by \texttt{node\_id}'s private key. Thus,
  655. a user can publish a \texttt{tree} or \texttt{commit} under their
  656. name, and others can verify it by checking the signature matches.
  657. \item[(c)] If \texttt{/<domain>} is a valid domain name, IPFS
  658. looks up key \texttt{gfs} in its \texttt{DNS TXT} record. IPFS
  659. interprets the value as either an object hash or another IPFS path:
  660. \begin{verbatim}
  661. # this DNS TXT record
  662. fs.benet.ai. TXT "gfs=/aabbccddeeffgg ..."
  663. # behaves as symlink
  664. ln -s /aabbccddeeffgg /fs.benet.ai
  665. \end{verbatim}
  666. \end{itemize}
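Case (c) can be sketched as a small parser for the \texttt{TXT} value. The DNS query itself is left out, and the assumption that the record holds whitespace-separated \texttt{key=value} fields is made only for illustration; \texttt{parse\_gfs\_txt} is a hypothetical helper.

```python
from typing import Optional


def parse_gfs_txt(txt: str) -> Optional[str]:
    """Return the value of the 'gfs' key in a TXT record string, or None.

    Assumption: fields are whitespace-separated 'key=value' pairs.
    """
    for field in txt.split():
        if field.startswith("gfs="):
            return field[len("gfs="):]
    return None


target = parse_gfs_txt("gfs=/aabbccddeeffgg")
# target is the object hash or IPFS path that the domain name points to
```

The returned value would then be resolved like any other IPFS path, which is what makes the record behave as a symlink.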
  667. \subsection{Local Objects}
  668. IPFS clients require some \textit{local storage}, an external system
  669. on which to store and retrieve local raw data for the objects IPFS manages.
  670. The type of storage depends on the node's use case.
  671. In most cases, this is simply a portion of disk space (either managed by
the native filesystem, or directly by the IPFS client). In others
(non-persistent caches, for example), this storage is just a portion of RAM.
  674. Ultimately, all blocks available in IPFS are in some node's
\textit{local storage}. When nodes open files with IPFS, the objects are
  676. downloaded and stored locally, at least temporarily. This provides
  677. fast lookup for some configurable amount of time thereafter.
  678. \subsubsection{Object Pinning}
Nodes that wish to ensure the survival of particular objects can do so by
\texttt{pinning} the objects. This ensures the objects are kept in the node's
\textit{local storage}. Pinning can be done recursively, to pin down all
descendant objects as well. For example, recursively pinning a \texttt{tree}
  683. or \texttt{commit} ensures \textit{all} objects pointed to are stored locally
  684. too. This is particularly useful for nodes wishing to keep all their own files.
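
Recursive pinning amounts to a graph traversal that marks every reachable object. The sketch below assumes a hypothetical \texttt{children} map from each object key to the keys it references and a \texttt{pinned} set standing in for the node's pin list; neither name comes from an IPFS API.

```python
def pin(children: dict, pinned: set, key: str, recursive: bool = False) -> None:
    """Mark key as pinned; with recursive=True, also pin all descendants."""
    stack = [key]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared objects are visited once
        seen.add(current)
        pinned.add(current)
        if recursive:
            stack.extend(children.get(current, []))


# Hypothetical child map for a small object graph.
children = {
    "ccc111": ["ttt111"],
    "ttt111": ["ttt222", "ttt333", "bbb222"],
    "ttt222": ["bbb111"],
    "ttt333": ["lll111", "bbb222"],
    "lll111": ["bbb333", "bbb444", "bbb555"],
}

pinned = set()
pin(children, pinned, "ttt333", recursive=True)
```

After the call, \texttt{pinned} holds \texttt{ttt333} together with its list and every block below it, while objects outside that subtree remain eligible for eviction.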
  685. %\section{Acknowledgments}
  686. %\bibliographystyle{abbrv}
  687. %\bibliography{gfs}
  688. %\balancecolumns
  689. %\subsection{References}
  690. \end{document}