Storage@home: Petascale Distributed Storage

Storage@home is a distributed storage infrastructure developed to back up and share scientific results using a distributed model of volunteer-managed hosts. Data is maintained through a mixture of replication, monitoring, and repair.



The authors start with three assumptions: (1) around 100,000 computers would volunteer, (2) most users have up to approximately 400 Kbps of upload bandwidth, and (3) files are around 100 MB each.

Implementation: Each file is encrypted and split into four copies striped across ten hosts, giving each host an overlapping 40% of the file. Data is lost only if four or more adjacent nodes (those holding overlapping portions of the same data) fail.
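One way to read this striping scheme (my interpretation, not the authors' code): cut the file into ten segments and have host i store four consecutive segments starting at i, wrapping around. Each host then holds 40% of the file, every segment lives on four hosts, and a segment is lost only when four adjacent hosts all fail:

```python
# Sketch of wrap-around striping: 10 hosts, 4 copies of each segment.
# Host i stores segments i..i+3 (mod 10), i.e. 40% of the file.
NUM_HOSTS = 10
COPIES = 4

def segments_for_host(host: int) -> list[int]:
    """Segments stored by a given host under wrap-around striping."""
    return [(host + k) % NUM_HOSTS for k in range(COPIES)]

def hosts_for_segment(seg: int) -> list[int]:
    """The 4 hosts that hold a given segment."""
    return [(seg - k) % NUM_HOSTS for k in range(COPIES)]

def segment_lost(seg: int, failed: set[int]) -> bool:
    """A segment is lost only if all 4 of its hosts have failed."""
    return all(h in failed for h in hosts_for_segment(seg))
```

For example, segment 3 is held by hosts 0 through 3, so it survives any failure of up to three of them but is lost if all four fail at once.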

The policy engine is the master of the system: it does all the planning and coordination of storage. It pulls information on which hosts are available from the identity server, and information on what should be in the system from the metadata server. The engine's major challenge is deciding where to place replicas of files so that the chance of loss is minimized.
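A hypothetical sketch of such a placement decision (the scoring and function names are my assumptions; the paper does not give a formula): given per-host failure probabilities, pick a reliable stripe group and score it by the worst four-adjacent-host window, since that is the failure pattern that loses data:

```python
# Hypothetical replica-placement sketch for the policy engine.
# Inputs and scoring are illustrative assumptions, not from the paper.
from math import prod

def stripe_loss_risk(group: list[float]) -> float:
    """Worst-case probability that 4 adjacent hosts in a stripe all fail,
    assuming independent failures (a simplifying assumption)."""
    n = len(group)
    return max(prod(group[(i + k) % n] for k in range(4)) for i in range(n))

def choose_stripe(host_fail_probs: dict[str, float]) -> list[str]:
    """Greedy placement: take the 10 most reliable available hosts."""
    ranked = sorted(host_fail_probs, key=host_fail_probs.get)
    return ranked[:10]
```

A real engine would also have to balance load, geography, and correlated failures; this sketch only captures the "minimize chance of loss" objective in its simplest form.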

Nature of volunteer computing: the authors expect 500-1,000 hosts to disappear every day, taking 5-10 TB of space with them.
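A back-of-envelope check (my arithmetic, using the paper's stated numbers) shows what this churn implies: losing up to 10 TB of replicas per day means the system must continuously re-replicate roughly a gigabit per second in aggregate, which is still only a small fraction of the fleet's total upload capacity:

```python
# Back-of-envelope repair-bandwidth estimate from the stated churn.
lost_bytes_per_day = 10e12                 # 10 TB/day, upper end of the estimate
repair_bits_per_sec = lost_bytes_per_day * 8 / 86_400
print(f"{repair_bits_per_sec / 1e6:.0f} Mbps aggregate repair bandwidth")

# Spread over ~100,000 hosts at ~400 Kbps upload each:
hosts, upload_bps = 100_000, 400e3
fraction = repair_bits_per_sec / (hosts * upload_bps)
print(f"{fraction:.1%} of total upload capacity")
```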

Deployment: Storage@home follows a use model typical of a volunteer computing project, with an agent installed on the user's machine after they register to participate. Volunteers are rewarded with points, which are made publicly viewable on the project statistics site. Volunteers also form teams to compete with other teams and to make recruiting more fun. In the case of Storage@home, points are awarded for contributing space and deducted if a volunteer's hosts end up marked as dead without prior notice.
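The paper gives no scoring formula, so the numbers below are purely illustrative, but the incentive structure can be sketched as: earn points for contributed space, lose points when a host goes dead unannounced:

```python
# Hypothetical points scheme; the award and penalty values are
# illustrative assumptions, not taken from the paper.
def update_points(points: int, gb_contributed: float,
                  died_unannounced: bool) -> int:
    points += int(gb_contributed)      # e.g. 1 point per GB contributed
    if died_unannounced:
        points -= 100                  # illustrative penalty
    return max(points, 0)              # never go below zero
```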

Related Works: The Google File System[4] uses large racks of cheap drives, but relies on a high-speed LAN and makes a large set of operational assumptions that do not match Storage@home's usage. It is an example of a cluster-based system and is representative of an entire class of file systems built on a metadata-and-data-server model over the last 40 years.

Fully peer-to-peer systems research tends to focus on anonymity for large numbers of users, or on the large-scale distribution of copyrighted works. This results in a great deal of complexity while lacking the guarantees and reliability metrics Storage@home requires. OceanStore[5] and PAST[3] are good representatives of this class of research.

Erasure codes are currently used in many storage and peer-to-peer systems as an alternative to replication, since the local Internet connection is a serious bottleneck in getting data out.
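A short sketch (with assumed numbers; the (k, n) parameters are hypothetical, not from the paper) of why the upload bottleneck matters for this trade-off: with a classical erasure code, rebuilding one lost fragment requires fetching k surviving fragments, whereas replication only re-fetches the lost piece itself from a surviving copy:

```python
# Illustrative repair-traffic comparison; parameters are assumptions.
file_mb = 100                    # file size from the paper's assumptions
k, n = 10, 14                    # hypothetical erasure-code parameters
fragment_mb = file_mb / k        # 10 MB per fragment

erasure_repair_mb = k * fragment_mb   # read k fragments to decode/rebuild one
replica_repair_mb = fragment_mb       # copy the lost piece directly
```

Under these assumptions, classical erasure-code repair moves ten times the data of a straight replica copy, which is costly when every byte crosses a slow residential uplink.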

decentralized_storage_systems/storagehome.txt · Last modified: 2012/04/23 01:06 by julia