
Monday, January 25, 2010

A taxonomy for storage of data in the Cloud

TO DO: Add lots of practical examples. And graphics!

In a nutshell: provide constant feedback on the state of all data in distributed online services ("the Cloud"), with respect to connection (synchronized vs. unsynchronized) and access speed (slow vs. fast).


The problem of storage in a connected world

Information in computers has traditionally been stored in filesystems. This storage model worked well for mainframes and corporate networks, where an expert administrator took care of the filesystems, and it survived the transition to personal computers (somewhat transformed by the "desktop metaphor"), where the amount of available data was smaller and confined to a single device.

Most programmers rarely stop to think about it, but files are themselves an abstraction over a more complex technology: persistent storage media, which handle blocks of electrical signals in physical electronic devices.

The abstraction provided by filesystems is less than perfect for the typical data needs of people in the mainstream information sphere, where both corporate and personal users share information through the decentralized system that is the Internet. Not the least of the problems is that nobody is responsible for administering the data; and when someone is, that creates a single point of control and failure: two undesirable traits.


The Cloud - on sunny days

The ideal data model for personal user data is "always persistent, always accessible, anywhere". This puts users back in control of the information they own, through a simple, centralized abstraction.

The problem is, current technology still doesn't support this model directly. The "cloud" metaphor has been marketed to users as an approximation of it, but it still fails to deliver the promised experience because of abstraction leaks.

Abstraction leaks are situations where there is a mismatch between the expectations induced by the metaphor and the actual system behavior. Sometimes the limits of the underlying technology show through the details of the proposed model: the network connection fails, a local storage disk breaks... The resulting error condition, and the steps required to fix it, are extremely difficult to explain to people who are not experts in the underlying technology.


The Cloud Storage taxonomy

There are two axes that directly affect the user experience of data stored in the cloud, and that should therefore be included in the user model for cloud storage:
- Location: Connected vs Disconnected.
- Speed: Fast vs Slow.

Note that Location is not primarily about the physical position of the stored data (which is mostly independent of the storage problem, thanks to the Internet) but about logical access: devices are sometimes part of the Net and sometimes not. Connected vs. disconnected is the relevant fact to know about a device when thinking about retrieving the same data later.
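In other words, "connected" is a property the application itself can observe and surface to the user. As a minimal sketch, assuming a browser environment (navigator.onLine and the online/offline events are standard DOM APIs):

```typescript
// Minimal sketch: track the logical connection state of a browser app.
// navigator.onLine and the online/offline events are standard DOM APIs.
type Connectivity = "connected" | "disconnected";

let connectivity: Connectivity = navigator.onLine ? "connected" : "disconnected";

window.addEventListener("online", () => {
  connectivity = "connected";
  // A real app would start synchronizing pending local edits here.
});

window.addEventListener("offline", () => {
  connectivity = "disconnected";
  // From here on, new edits exist only on this device until reconnection.
});
```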

So the distinct, fundamental kinds of persistent data introduced by the Cloud, in addition to the abstractions of "document" and "media type" already supported by filesystems, are these four classifications:

* Connected data located in a fast medium.
* Disconnected data located in a fast medium.

* Connected data located in a slow medium.
* Disconnected data located in a slow medium.
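As a hypothetical sketch (the type and function names are mine, not any established API), the taxonomy can be encoded directly in a program, so that every piece of user data carries its state and the interface can give constant feedback about it:

```typescript
// Hypothetical encoding of the two axes and the four resulting states.
type Connectivity = "connected" | "disconnected";
type Speed = "fast" | "slow";

interface StoredItem {
  name: string;
  connectivity: Connectivity;
  speed: Speed;
}

// Feedback a UI could derive from the state, one message per classification.
function describe(item: StoredItem): string {
  if (item.connectivity === "connected") {
    return item.speed === "fast"
      ? "Safe: available everywhere, saved instantly."
      : "Uploading: a recent version is kept online.";
  }
  return item.speed === "fast"
    ? "Saved on this device; will sync when reconnected."
    : "At risk: not yet saved, and only on this device.";
}
```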

Why are these four states relevant, and why should they be treated as distinct from one another in the user model?

The first state (connected, fast) is the ideal situation, the one that best matches the promised Cloud metaphor of "always on, anywhere" data. But as soon as the technical limits appear, the experience degrades:

- Disconnected data is data created or modified on a device without a connection to the Internet. The problem with disconnected data is that it breaks the Cloud metaphor: if I move to a different device, my new data or recent edits will not be available there. This problem is what prompted users to abandon mail clients in favor of webmail, and it is the reason why web services are the trend.

- Slow data is data that takes time to stabilize and get "imported" into the cloud; it is data I must wait for after I have finished my work. Fast data is protected the instant it is created: I can be confident that closing the application, unplugging the USB key or turning off the computer will not be a problem, because I have enough feedback to be sure the data is safe. With slow data, by contrast, in all those scenarios I have to babysit the saving process before I can get on with my life (see the sketch after this list).

- Disconnected AND slow merits a separate category because, in addition to the problems listed above, it shows an important trait: user data is at the highest risk of being lost. Any system problem may destroy open data; a simple power glitch or an application crash will wipe out the user's current session, and all the ongoing work in it will be lost.
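An illustrative sketch of the fast/slow feedback difference (persist here is a placeholder for whatever the storage layer provides, not a real API): with a fast medium the "saved" state is reached almost instantly, while a slow medium forces the user to watch an intermediate "saving" state:

```typescript
// Illustrative sketch: report save progress so the user knows when it is
// safe to close the application or unplug the device.
type SaveStatus = "unsaved" | "saving" | "saved";

let saveStatus: SaveStatus = "unsaved";

// persist is a placeholder: fast media resolve it almost instantly,
// slow media may take a long time.
async function save(data: string, persist: (d: string) => Promise<void>) {
  saveStatus = "saving"; // while this is shown, the user must babysit the process
  await persist(data);   // resolves only when the data is truly durable
  saveStatus = "saved";  // now closing the app or cutting power loses nothing
}
```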

Connected slow data can protect against that worst case by saving data in the background; after a failure, the latest online version can be retrieved, even if it is a bit outdated.
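A sketch of that mitigation, assuming a hypothetical autosave endpoint (the URL is a placeholder): snapshots are pushed to the server in the background, so that after a crash the latest uploaded version can be recovered, even if slightly stale:

```typescript
// Sketch of background saving for connected-but-slow data.
// The endpoint URL is a placeholder, not a real service.
function startBackgroundSave(getSnapshot: () => string, intervalMs = 30_000) {
  setInterval(async () => {
    try {
      await fetch("https://example.com/autosave", {
        method: "PUT",
        body: getSnapshot(),
      });
      // Success: the server now holds a copy at most intervalMs old.
    } catch {
      // Upload failed; the previously uploaded snapshot is still recoverable.
    }
  }, intervalMs);
}
```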

Fast disconnected data is even safer against that scenario: a power failure or a crashed application will not destroy data, because fast data has already been saved to a persistent state. It doesn't protect against a lost or broken device, but that is less severe because, for cloud data, disconnected is supposed to be a transient state anyway; as soon as the device reconnects, it will synchronize to the cloud and the data will be safe again.
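A sketch of that behavior, assuming a browser environment (localStorage and the online event are standard DOM APIs; the sync URL is a placeholder): every edit is written synchronously to local persistent storage, and a reconnection handler pushes pending edits to the cloud:

```typescript
// Sketch: fast local persistence plus synchronization on reconnect.
// A crash cannot destroy an edit, because it is already in localStorage.
function saveEdit(key: string, value: string) {
  localStorage.setItem(key, value);          // durable immediately (fast)
  localStorage.setItem(key + ":dirty", "1"); // mark as not yet in the cloud
}

window.addEventListener("online", async () => {
  // Reconnection ends the transient disconnected state: push pending edits.
  const dirtyMarks = Object.keys(localStorage).filter((k) => k.endsWith(":dirty"));
  for (const mark of dirtyMarks) {
    const key = mark.slice(0, -":dirty".length);
    await fetch("https://example.com/sync", { // placeholder endpoint
      method: "PUT",
      body: localStorage.getItem(key) ?? "",
    });
    localStorage.removeItem(mark); // this item is now synchronized again
  }
});
```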


The sad thing is, disconnected slow data is the default storage model for desktop applications, so it is the most common kind. Online web applications could change this, but they are more likely to copy the dangerous model from desktop apps.


"Saved" state

There is a third dimension (saved vs. unsaved) that is relevant to system designers and programmers but should be transparent to users, following Jef Raskin's principle of "treat user data as sacred" (see Your data is sacred; Always safeguard users' data) and its corollary, "don't force users to input the same data twice". This means that all user-entered data should be instantly saved in some way, and it should not be lost except through catastrophic failure (and ideally not even then).
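A minimal sketch of that corollary, assuming a browser page with a textarea whose id is "editor" (the element id and storage key are mine, for illustration): every change is persisted almost immediately, so the saved/unsaved distinction never becomes the user's problem:

```typescript
// Minimal sketch: save every change as the user types, and restore it
// on startup, so data survives crashes and is never typed twice.
const editor = document.getElementById("editor") as HTMLTextAreaElement;

// Restore any previous draft before the user starts working.
editor.value = localStorage.getItem("draft") ?? editor.value;

let pending: number | undefined;
editor.addEventListener("input", () => {
  clearTimeout(pending);
  pending = window.setTimeout(() => {
    localStorage.setItem("draft", editor.value); // survives a crash
  }, 200); // light debounce: still saves within a fraction of a second
});
```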
