File Sharing

The term file sharing application usually describes software that ordinary Internet users run to exchange all kinds of files. The major file sharing protocols are presented below from a technical point of view, roughly in their order of appearance, which follows the evolution of the technology.

The first notable file sharing systems were Bulletin Board Systems (BBS). The classical configuration consisted of a central computer, typically a personal computer hosted at home by the system operator, casually called the sysop. Bulletin Board Systems emerged with the widespread adoption of personal computers starting in the eighties. Users called the phone number of the BBS over POTS (plain old telephone service) lines and could then chat, post and read messages on public forums (hence the name Bulletin Board System) and upload or download files. This system was however limited by the throughput of the modems of that time, as well as by the number of simultaneous connections that a server maintained by a small business or a hobbyist could accept.

With the emergence of the Internet, such Bulletin Boards took advantage of the new possibilities offered by this technology. Some client and server application suites, such as Hotline Connect, were specifically built to take advantage of the graphical user interface of modern operating systems. However, the limitation caused by the centralized structure of the system remained: such a server has a given throughput, and once it is reached, download requests have to be queued until a slot becomes available.
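The queuing behaviour of such a central server can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name `SlotServer`, the fixed slot count and the first-come, first-served policy are hypothetical and are not taken from Hotline Connect or any real BBS software.

```python
from collections import deque


class SlotServer:
    """Toy model of a central server with a fixed number of download slots."""

    def __init__(self, slots):
        self.slots = slots      # maximum simultaneous downloads (assumed value)
        self.active = set()     # users currently downloading
        self.waiting = deque()  # users queued in arrival order

    def request(self, user):
        """Start the download if a slot is free, otherwise queue the user."""
        if len(self.active) < self.slots:
            self.active.add(user)
            return "downloading"
        self.waiting.append(user)
        return "queued"

    def finish(self, user):
        """Release a slot and hand it to the longest-waiting user, if any."""
        self.active.discard(user)
        if self.waiting and len(self.active) < self.slots:
            self.active.add(self.waiting.popleft())


# Hypothetical users: the third request has to wait for a free slot.
server = SlotServer(slots=2)
results = [server.request(user) for user in ("alice", "bob", "carol")]
server.finish("alice")  # carol is promoted from the queue
```

However many users are interested in a file, the number of simultaneous transfers never exceeds the slot count of the single server, which is the bottleneck described above.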

The advent of Napster in 1999 dramatically improved this structure. The centralized server was used only to index files hosted in chosen shared folders on the users' hard disks, without the need to upload them all to the server itself. The computers of the users, called clients, queried the server to see who had the files they were looking for, and retrieved the file in question directly from the other user. The file exchange took place between two clients and not between a client and a server, hence this system has been called Peer-to-Peer or P2P. Because Napster focused only on MP3 files, it was sued by the music industry for alleged copyright infringement and went bankrupt. Before its demise, however, the protocol had been reverse engineered and released as open source, so that even today there are several servers using this protocol, called OpenNap, which is not limited to MP3 files. Another peer-to-peer file sharing technology relying on centralized servers is Direct Connect. These file sharing protocols, inspired by Bulletin Board Systems, also host chat rooms.
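The division of labour between an index server and its clients can be sketched as follows. This is a simplified illustration, not the Napster or OpenNap protocol: the class `IndexServer`, its method names, the substring matching and the peer names are all assumptions made for the example.

```python
from collections import defaultdict


class IndexServer:
    """Keeps only an index of who shares what; it never stores file data."""

    def __init__(self):
        self.index = defaultdict(set)  # lower-cased file name -> peers sharing it

    def register(self, peer, filenames):
        """Called when a client logs in and announces its shared folder."""
        for name in filenames:
            self.index[name.lower()].add(peer)

    def unregister(self, peer):
        """Called when a client goes offline."""
        for peers in self.index.values():
            peers.discard(peer)

    def search(self, term):
        """Return matching file names with the peers to download them from."""
        term = term.lower()
        return {
            name: sorted(peers)
            for name, peers in self.index.items()
            if term in name and peers
        }


# Hypothetical peers and file names.
server = IndexServer()
server.register("peer-a", ["Blue Song.mp3", "Red Song.mp3"])
server.register("peer-b", ["Blue Song.mp3"])
matches = server.search("blue")       # both peers can serve the file
server.unregister("peer-a")
after_logout = server.search("red")   # no remaining source
```

The search result only tells the client where the file is; the transfer itself would then be opened directly between the two peers, which is what relieves the server of the bandwidth burden.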

The legal problems faced by such centralized, server-based protocols, in which the server is vulnerable to legal or technical attacks, inspired the development of totally decentralized networks. The first such network, which relied only upon peers, was Gnutella, and as of 2008 it is still widely used. Although the network relied only on equal peers at the beginning, scalability problems quickly occurred. To fix this issue, some peers, usually those running on computers with enough computing resources and a good connection, were designated to be "more equal" than others. Depending on the network, they are called hubs, ultrapeers or nodes. The first network to implement this concept was Kazaa. In this case, a node handles the connections of a few dozen to a few hundred peers and indexes their files, acting as a mini server. The node itself is connected to other nodes, and when a peer wants to search for a file, it queries the nodes to which it is connected. These nodes in turn forward the query to their neighbour nodes until a reply is returned or until the query has been forwarded too many times. This method of search is called query flooding and introduces the concept of horizon: a given node typically cannot reach the entire network because, in order to protect the network from an excessive number of flooding queries, search requests have to be dropped at some point. This system, which is virtually impossible to shut down, is used by Gnutella (which has switched to the leaf-ultrapeer architecture since its original design), by Kazaa, by OpenFT (Open FastTrack, an open source network inspired by Kazaa, FastTrack being the name of the proprietary protocol used by Kazaa), by Gnutella 2 (an incompatible variant of Gnutella) and by Ares Galaxy, amongst others.
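Query flooding and the resulting horizon can be sketched with a hop counter, often called a time to live (TTL). The function below is an illustrative model, not the Gnutella or FastTrack wire protocol: the name `flood_search`, the dictionary-based topology and the file name are assumptions. As a simplification, the query keeps spreading until the hop limit is reached instead of stopping at the first reply.

```python
from collections import deque


def flood_search(neighbours, files, start, wanted, ttl):
    """Return the nodes indexing `wanted` that lie within `ttl` hops of `start`."""
    hits = set()
    seen = {start}                     # nodes that already received the query
    frontier = deque([(start, ttl)])
    while frontier:
        node, remaining = frontier.popleft()
        if wanted in files.get(node, ()):
            hits.add(node)
        if remaining == 0:
            continue                   # horizon reached: the query is dropped here
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:        # do not forward the same query twice
                seen.add(nxt)
                frontier.append((nxt, remaining - 1))
    return hits


# Hypothetical chain of nodes A - B - C - D; C and D both index the file.
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
files = {"C": {"song.ogg"}, "D": {"song.ogg"}}
near = flood_search(neighbours, files, "A", "song.ogg", ttl=2)
far = flood_search(neighbours, files, "A", "song.ogg", ttl=3)
```

With a limit of two hops, node D lies beyond the horizon of A even though it indexes the file; raising the limit finds it, at the cost of more traffic on every search.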

To address the problem of the horizon stated above, Distributed Hash Tables came into use. In this case, the supernodes do not directly index all the files of the peers connected to them; instead they rely on a hash table, and each supernode takes care of only a given range of files. Because such a network is based on peers which can go offline at any time, these nodes are duplicated for redundancy. Schematically, one node takes care of files whose names begin with A, another of files whose names begin with B, and so on. When a peer asks the node to which it is connected for a file whose name begins with E, the node in question contacts specifically the node taking care of this group of files instead of all its neighbour nodes. Of course, the classification is not based on file names but on hashes of these files, hence the name Distributed Hash Table (DHT). For instance, eMule uses both eDonkey servers and Kad, a decentralized network based on the Kademlia DHT.
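The way a hash decides which nodes are responsible for a file can be sketched as follows. The example assumes SHA-1 identifiers and uses the XOR distance known from Kademlia to pick the closest nodes; the helper names, the node labels and the replica count of two are hypothetical, and the sketch does not reproduce Kad's routing or message format.

```python
import hashlib


def key_of(data: bytes) -> int:
    """Map arbitrary bytes to a 160-bit integer identifier."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def responsible_nodes(node_ids, key, replicas=2):
    """Pick the `replicas` nodes whose identifiers are closest to `key`.

    Closeness is the XOR of the two identifiers, read as an integer.
    Several nodes are returned so that the entry survives if one goes offline.
    """
    return sorted(node_ids, key=lambda node_id: node_id ^ key)[:replicas]


# Hypothetical network of eight nodes, identified by the hash of a label.
node_ids = [key_of(f"node-{i}".encode()) for i in range(8)]
file_key = key_of(b"contents of some shared file")
holders = responsible_nodes(node_ids, file_key)
```

Every participant computes the same `holders` for the same file hash, so a search can be sent straight to those nodes instead of being flooded to all neighbours.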

The most widely used protocol today, BitTorrent, relies on a slightly different architecture: it is centralized, since peers must connect to a tracker to find other peers sharing the files they are looking for. But unlike the previously mentioned networks, a temporary virtual network is created for each file. A user does not share all the files contained in a specific shared folder but seeds a limited number of files. This leads to increased speeds for popular files, the downside being potentially less diversity compared to the other networks mentioned above.
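The idea of one temporary network per file can be sketched with a minimal tracker that keeps a separate peer list for every shared file. This is a toy model under stated assumptions: the class `Tracker`, its methods and the plain string identifiers are hypothetical, and a real tracker exchanges more information with its peers.

```python
from collections import defaultdict


class Tracker:
    """Keeps one peer list (a swarm) per shared file, identified by a hash."""

    def __init__(self):
        self.swarms = defaultdict(set)  # file identifier -> peers in that swarm

    def announce(self, info_hash, peer):
        """Add `peer` to the swarm and return the peers already in it."""
        others = sorted(self.swarms[info_hash] - {peer})
        self.swarms[info_hash].add(peer)
        return others

    def leave(self, info_hash, peer):
        """Remove `peer`; the swarm disappears once nobody is left."""
        self.swarms[info_hash].discard(peer)
        if not self.swarms[info_hash]:
            del self.swarms[info_hash]


# Hypothetical identifiers: two unrelated files give two unrelated swarms.
tracker = Tracker()
first = tracker.announce("hash-of-file-1", "peer-a")   # nobody there yet
second = tracker.announce("hash-of-file-1", "peer-b")  # learns about peer-a
other = tracker.announce("hash-of-file-2", "peer-b")   # separate swarm
tracker.leave("hash-of-file-2", "peer-b")
```

The tracker only introduces peers to each other; as in the earlier centralized designs, the file data itself travels between the peers.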

In all these peer-to-peer architectures, the speed at which a file can be exchanged grows exponentially: as soon as a peer retrieves a file, it becomes a source itself. This makes it possible to retrieve a given file from several peers at once, each peer uploading a different part or segment of the same file. This is called segmented download or swarming, and it dramatically decreases the time needed for a complete download. In most cases even a partial copy of a file can act as a source, without the need to wait for a complete copy, which further increases the ability to spread a given file.
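Segmented download can be sketched as a small planning problem: decide which source supplies which segment, then reassemble the pieces. The functions `split` and `plan_download`, the segment size and the peer names are assumptions made for this illustration; real clients use more elaborate strategies for choosing segments and sources.

```python
def split(data: bytes, size: int):
    """Cut `data` into consecutive segments of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def plan_download(num_segments, sources):
    """Assign each segment to the least-loaded peer that holds it.

    `sources` maps a peer name to the set of segment indices it can upload,
    so a peer with only a partial copy can still contribute.
    """
    plan = {}
    load = {peer: 0 for peer in sources}
    for seg in range(num_segments):
        holders = [peer for peer in sources if seg in sources[peer]]
        if not holders:
            raise LookupError(f"no source for segment {seg}")
        peer = min(holders, key=lambda p: (load[p], p))  # ties broken by name
        plan[seg] = peer
        load[peer] += 1
    return plan


# Hypothetical swarm: one complete source and one partial source.
original = b"abcdefghij"
segments = split(original, 4)
sources = {"seed": {0, 1, 2}, "partial": {0, 1}}
store = {peer: {i: segments[i] for i in held} for peer, held in sources.items()}

plan = plan_download(len(segments), sources)
rebuilt = b"".join(store[plan[i]][i] for i in range(len(segments)))
```

In this example the partial source supplies one segment while the complete source supplies the other two, so the uploads can run in parallel and no single peer carries the whole transfer.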

Finally, it should be noted that the vast majority of the file sharing protocols described can be accessed using free, open source clients, either because they have been reverse engineered or because their design has been released into the public domain. Therefore, nothing prevents an application from supporting several networks at once, and this is becoming more and more common so that users can easily benefit from the content of several networks. For instance, LimeWire has both Gnutella and BitTorrent support, Ares Galaxy connects to its own Ares network and to BitTorrent, KCeasy can connect to Ares, Gnutella, OpenFT and FastTrack, and Shareaza supports Gnutella, Gnutella 2, eDonkey and BitTorrent, all these clients being free and open source.

For a more detailed list of file sharing clients, the relevant category of the Open Directory Project may be a useful resource.