Some thoughts on “Database Filesystems”

With the recent (or not so recent, I am a very slow writer) interest in database file systems, I’ve been thinking about what a typical user really wants from such a system. What would they use it for? What would we need to do to help them get the most from it? Are there any precedents that show how useful a database file system could be? If not, could we invent one? This led me to some “gedanken solutions” (like gedanken experiments, just with software) that I thought I’d distract you with.

IMPORTANT NOTE: Most of what is discussed here has been implemented in BeOS since 1996. However, the author has never used BeOS and so was not familiar with its capabilities while writing this article.


As technical people, we can all think of a bunch
of cunning uses for a database filesystem. My personal dream use
would be a superlative code management system; when integrated with a
good editor/IDE, it could provide revision control, tagging,
searchable documentation, name completion, and probably any number of
other things. Imagine being able to search the Doxygen comments for
the function you can vaguely remember provides exactly the feature you
want. Imagine being able to find every place a method is called so
you can tweak its interface. Imagine being able to examine, shuffle
and package changesets, like BitKeeper.

But that is quite a lot to implement all in one go, even as an
imaginary system, and it doesn’t really
show how a general user would be able to take advantage of the tools
we want to provide.

Instead, I’d like to focus on the humble email client. Email clients
have a number of features that make them interesting here:

  • Everyone uses a mail client
  • Email messages have a bunch of attributes that can be easily extracted
  • Mail clients use a custom database

The last item is particularly intriguing. Email clients have a
custom database, but what do they do with it? They use it to
implement what at first sight appears to be a straightforward
filesystem with folders and files just like the OS-native version.
There are some deviations from this norm like the virtual folders in
Thunderbird and Evolution, and I believe Opera uses a more generic
database in its mail client, but predominantly we are still using a
hierarchical structure to organize our email.

This observation inspires the following questions:

  • If database filesystems are so good, are there any good reasons why no one has implemented one for email?
  • Can we explore the usefulness of a database filesystem by implementing one within a mail client?
  • What “killer features” would such a mail client provide, and would they convince users to switch?

The rest of this article tries to find some answers to these
questions by creating a specification for a database-backed mail
client.

The Mail Database

First of all, we should explore what features a database-backed mail
client would provide the user. In a pure email system, we would only
need to store two different types of objects: email and addressbook
entries. To simplify things, I’m ignoring all the other things, like
task items and diary dates, that some mail clients store.

We can divide the attributes for each object into three different categories:

  • Intrinsic attributes – These are defined by the objects themselves, e.g.,
    • The sender, date, recipients, subject etc. for an email.
    • The name, email address etc. for an addressbook entry.
  • Client attributes – These are invented by the mail client to manage the database objects, e.g.,
    • Object type
    • Unique identifier
    • Per-message flags: draft, sent, unread, deleted etc.
    • Received date
  • User attributes – These are attributes that the user maintains, e.g.,
    • Per-object flags, e.g., message has been replied-to, message has been forwarded, message needs response
    • Object category attributes, e.g., message is a personal/work message, addressbook entry is a friend/business associate
    • Custom attributes, e.g., Deal-with-by date

The above is obviously not an exhaustive list of attributes, but I
think they give a feel for the type of things we are talking about.
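
To make the three categories concrete, here is a minimal sketch of how one
database object might be represented. The class name, field names and sample
values are all my own inventions for illustration; nothing here is taken from
an existing mail client.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Set


@dataclass
class MailObject:
    """One database object: an email or an addressbook entry (hypothetical)."""
    # Client attributes: invented by the mail client to manage the object.
    object_type: str                  # "email" or "contact"
    object_id: str                    # unique identifier
    flags: Set[str] = field(default_factory=set)   # e.g. "draft", "unread"
    # Intrinsic attributes: defined by the object itself.
    intrinsic: Dict[str, Any] = field(default_factory=dict)
    # User attributes: free-form, so the user can add or drop them at will.
    user: Dict[str, Any] = field(default_factory=dict)


msg = MailObject(
    object_type="email",
    object_id="msg-0001",
    flags={"unread"},
    intrinsic={"sender": "alice@example.com", "subject": "Budget"},
)
# The user invents two attributes of their own.
msg.user["needs_reply"] = True
msg.user["deal_with_by"] = datetime(2005, 2, 14)
```

Keeping the user attributes in an open dictionary, rather than as fixed
fields, is one way of letting the schema change without touching the code.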

We want to use the message attributes to help a user organize their
email in ways that weren’t possible with the old folder paradigm.
For example, the user might want to

  • Set a “Needs reply” flag so that the user can see which messages still need a response.
  • Set a “Deal with by” date to record any deadline imposed by the message, plus a “Completed” flag to set when the task is done.
  • Set flags indicating that the message is work/personal/etc.

or any other attributes that the user might think of. The important
thing is the user should be able to modify the set of attributes
whenever he wants; it might be difficult to get a user to maintain a
set of attributes that we impose on him, but he is bound to be keen to
use attributes that he defines himself.

The user can use these new attributes to manage his email in lots of new and interesting ways, for example,

  • The user can find all messages that have been waiting for a reply for longer than a week
  • The user can find all messages with imminent deadlines
  • The user can find all work messages from a particular sender
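
These three searches are easy to sketch as filters over a list of messages.
The attribute names (needs_reply, deal_with_by, kind), the three-day meaning of
“imminent”, and the reading of the last search as “work messages sent by a
given correspondent” are all assumptions of mine.

```python
from datetime import datetime, timedelta

now = datetime(2005, 2, 21)   # fixed "today" so the example is reproducible

messages = [
    {"id": 1, "sender": "boss@example.com", "received": datetime(2005, 2, 1),
     "needs_reply": True, "kind": "work", "deal_with_by": datetime(2005, 2, 23)},
    {"id": 2, "sender": "mum@example.com", "received": datetime(2005, 2, 19),
     "needs_reply": True, "kind": "personal"},
    {"id": 3, "sender": "boss@example.com", "received": datetime(2005, 2, 20),
     "kind": "work", "deal_with_by": datetime(2005, 3, 30)},
]


def overdue_replies(msgs, today, max_age=timedelta(days=7)):
    """Messages flagged as needing a reply that arrived over max_age ago."""
    return [m["id"] for m in msgs
            if m.get("needs_reply") and today - m["received"] > max_age]


def imminent_deadlines(msgs, today, window=timedelta(days=3)):
    """Messages whose deadline falls within the next `window`."""
    return [m["id"] for m in msgs
            if "deal_with_by" in m and today <= m["deal_with_by"] <= today + window]


def work_from(msgs, sender):
    """Work messages sent by one particular correspondent."""
    return [m["id"] for m in msgs
            if m.get("kind") == "work" and m["sender"] == sender]
```

Note that every attribute is optional here, so each filter has to cope with
messages on which the user never set it.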

Creating a Message Hierarchy

One attribute type that I haven’t mentioned is an explicit message
folder. Instead we can produce a folder-like hierarchy using any set
of attributes. But will the user want to sort his email into a
hierarchy? Considering the precedents – current mail clients,
hierarchical databases and filesystems, DNS, taxonomy and any number
of other examples – I think we can safely assume that the need to
categorize objects into a hierarchy is hardwired into the human brain.

I can think of two approaches to producing a hierarchy from object
attributes. First of all, we can categorize objects using a subset of the
available attributes. At each level of the hierarchy, we choose an
attribute, and assign messages into subcategories using that
attribute.
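
As a rough sketch of this first approach, the function below nests messages
by one attribute per level. The function name, the “(unset)” bucket for
messages that lack the attribute, and the sample data are my own assumptions.

```python
from collections import defaultdict


def build_tree(messages, attributes):
    """Nest messages by each attribute in turn; leaves are lists of ids."""
    if not attributes:
        return [m["id"] for m in messages]
    first, rest = attributes[0], attributes[1:]
    groups = defaultdict(list)
    for m in messages:
        # Messages without the attribute go into a catch-all subcategory.
        groups[m.get(first, "(unset)")].append(m)
    return {value: build_tree(group, rest) for value, group in groups.items()}


messages = [
    {"id": 1, "kind": "work", "sender": "boss@example.com"},
    {"id": 2, "kind": "work", "sender": "peer@example.com"},
    {"id": 3, "kind": "personal", "sender": "mum@example.com"},
    {"id": 4, "sender": "boss@example.com"},
]

# Level one is the work/personal flag, level two is the originator.
tree = build_tree(messages, ["kind", "sender"])
```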

This hierarchy is very simple to achieve but its usefulness is
probably limited. Most attributes aren’t suitable. Who would want to
categorize their messages using the message ID? How would we use a
multi-valued attribute such as recipients? Even the originator will
only be useful under limited circumstances.

The second option is to use a specific user-defined category
attribute. The user enumerates all possible values of this attribute
and assigns messages to their appropriate categories as he sees fit.
To produce a hierarchy, we divide the category attribute into fields,
with each field used to categorize objects at a given level in the
hierarchy.

The most useful solution would probably be a combination of these two.
At the highest level, the user would want to see their messages
categorized using the message flags to produce categories like unread
and uncategorized messages, messages waiting to be sent, deleted
messages etc. Afterwards, it is probably sufficient to arrange
messages according to the single category attribute.

Note that with this scheme, we no longer guarantee that the message
categorization is disjoint – a given message can exist in more
than one category. In fact it might be useful to make the category
attribute multivalued. After all, not every message is easy to
pigeonhole.
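
One possible encoding of a multivalued category attribute with fields is a
list of tuples, one tuple per category and one element per level. This is
only a sketch; the representation and the names are assumptions of mine.

```python
def in_category(message, prefix):
    """True if any of the message's category paths starts with `prefix`."""
    return any(path[:len(prefix)] == prefix for path in message["categories"])


messages = [
    {"id": 1, "categories": [("work", "projects", "dbfs")]},
    {"id": 2, "categories": [("work", "admin"), ("personal", "soccer")]},
    {"id": 3, "categories": [("personal", "family")]},
]

work = [m["id"] for m in messages if in_category(m, ("work",))]
personal = [m["id"] for m in messages if in_category(m, ("personal",))]
```

Message 2 turns up under both “work” and “personal”, which is exactly the
loss of disjointness described above.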

Addressbook Entries

As mail messages and addressbook entries are stored in the same
database, the same tools are available for both types of objects
– addressbook entries can be flagged and categorized just like
email. Personally, I would identify my contacts as friends,
colleagues, managers, people I play soccer with, etc. These
categories are definitely not disjoint; surprisingly enough, your
colleagues and fellow footballers can be friends too.

Once we have assigned attributes to addressbook entries, we can use
them in lots of interesting ways. One very powerful use would be to
generate lists of recipients, for example to

  • Send a message to all people marked as soccer players to organize a match.
  • Send a message to all people who are both soccer players and colleagues to organize a game against another company.

More interesting possibilities arise when we allow database “joins”
between addressbook entries and email messages. This would allow us
to do things like

  • Find messages from your managers that need a response.
  • Find a list of friends that we haven’t emailed recently.
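
The first of these joins can be sketched with an in-memory SQLite database
from the Python standard library. The table layout and column names are
invented for the example; a real client would have to settle on its own
schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE contacts (email TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE contact_tags (email TEXT, tag TEXT);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT,
                           subject TEXT, needs_reply INTEGER);
    INSERT INTO contacts VALUES ('boss@example.com', 'Boss'),
                                ('pal@example.com', 'Pal');
    INSERT INTO contact_tags VALUES ('boss@example.com', 'manager'),
                                    ('pal@example.com', 'friend'),
                                    ('pal@example.com', 'soccer');
    INSERT INTO messages VALUES (1, 'boss@example.com', 'Budget', 1),
                                (2, 'boss@example.com', 'FYI', 0),
                                (3, 'pal@example.com', 'Match?', 1);
""")

# Messages from anyone tagged as a manager that still need a response.
rows = db.execute("""
    SELECT m.id, m.subject
    FROM messages AS m
    JOIN contact_tags AS t ON t.email = m.sender
    WHERE t.tag = 'manager' AND m.needs_reply = 1
    ORDER BY m.id
""").fetchall()
```

Keeping the contact tags in their own table lets one contact carry several
tags, which matches the non-disjoint categories described above.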

Issues and Omissions

There are some unfortunate consequences to storing messages in the
way described above. For example, as objects are not uniquely
categorized, a given object can exist in multiple places in our
hierarchy. If a user is used to the old folder paradigm, he will be
surprised when “copies” of an object are updated when he updates one
instance. This is not much of a problem for immutable objects like
email messages, but may be a problem in other cases.

There is a related problem that appears when we want to delete an
object. Does the user want to delete the object from the database, or
does he just want to remove it from the current categorization? In
the old paradigm, these two were synonymous. Perhaps we want to
provide both mechanisms to the user, but he may find this confusing.
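
The difference between the two operations is easy to show in a toy store.
The class and method names below are hypothetical, chosen only to make the
distinction visible.

```python
class Store:
    """Toy object store: each object id maps to a set of category names."""

    def __init__(self):
        self.objects = {}

    def add(self, obj_id, *categories):
        self.objects[obj_id] = set(categories)

    def remove_from_category(self, obj_id, category):
        """Drop one categorization; the object stays in the database."""
        self.objects[obj_id].discard(category)

    def delete(self, obj_id):
        """Remove the object itself, and so remove it from every category."""
        del self.objects[obj_id]

    def listing(self, category):
        return sorted(i for i, cats in self.objects.items() if category in cats)


store = Store()
store.add("m1", "work", "soccer")
store.remove_from_category("m1", "work")
still_in_soccer = store.listing("soccer")
store.delete("m1")
```

After the first call the message has vanished from “work” but is still
listed under “soccer”; only the second call removes it from the database.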

Taking these thoughts further, does this approach mean that the user is
less likely to delete messages? Will we get enormous mail databases
with items squirreled away in obscure categories? Do we need to give
the user tools to help him clean up the database?

Finally, there is one major piece missing from the above. What do we
do if the user wants to search for a word or phrase in the subject or
body of one of the messages? This is arguably the most useful
feature, yet it is not obviously provided by a database-based message
store; it can just as easily be provided by a traditional email
datastore. So is the whole complex edifice described above useful
enough to warrant working on an implementation? If we can’t convince
the user to maintain and use the user attributes, the answer is
probably no.

Mail Client UI

In my opinion, defining the database and inventing some examples of
how it could be used is the easy bit. The UI is another matter
entirely. After all, we can have the most advanced database in the
world, but it will not be used unless the interface to it is simple
and intuitive. User interface design is not my strong point, so I won’t belabor this
issue, but I’d like to briefly suggest some approaches that I think
might be fruitful.

First of all, we should define the problem: There are three aspects of the UI to a database message store:

  • Schema modifications – introducing new attributes and attribute values.
  • Attribute maintenance – viewing/modifying an object’s attributes.
  • Specifying queries – using the attributes to find messages in the database.

Schema modifications are relatively straightforward. For example,
when adding an attribute to the schema, the user just has to specify
an attribute name and type, and perhaps a set of allowed attribute
values for enumerated types. This is ideal material for a simple
wizard.

Attribute maintenance is slightly more tricky. Most attributes are
intrinsic to the email messages or maintained by the mail client itself,
so will never need to be updated by the user. Most of the rest can be
modified using straightforward UI widgets – message flag
attributes can be set via buttons or tickboxes, enumerated attributes
via selection boxes etc.

The message category is another matter. A typical user may want to
use many different categories, forming a deeply nested category tree.
In addition, he may want to assign two or three categories to a
message. This means that the usual solution – a drop-down
selection box – is probably not useful.

One option would be to allow the user to drag a message from a
“message list” panel to a “folder list” panel in analogy to the way he
would move a message into a folder in current mail clients. This is a
good solution, but does not make it obvious that multiple categories
can be assigned to the message. Also, many of the “folders” would be
like the virtual folders in current mail systems, so could not be used
as a target. So this interface may not be intuitive.

Another option would be a tree view of the available categories with
tick boxes adjacent to each category. Boxes would be ticked for each
category that the message was assigned to. This is the most complete
representation of the category information, but would be rather
tedious to use.

Another option would be a text entry box with autocompletion, like the
addressbar in a web browser. This would be good for typists, but
perhaps not so good for everyone else. I’m sure that there are many
other solutions too. This is one area where some experimentation may
be required to discover which one is best.

Finally, we come to the UI used to specify queries into the
database. Essentially, we want to create a UI component to replace
the “folder view” found in current email systems. This would be the
primary tool for finding messages in the database, so is the most
important component of the new mail client UI.

We can conceptually divide message location techniques into the following:

  • Find messages based on a few fundamental message properties, e.g.,
    • Recently arrived and uncategorized messages, i.e., an Inbox folder.
    • Messages queued up to be sent, i.e., an Outbox folder.
    • Messages currently being prepared i.e., a Drafts folder.
    • Messages marked as sent i.e., a Sent messages folder.
  • Find messages based on their category. As discussed above, this is the equivalent of the traditional folder hierarchy, and can be displayed as such.
  • Find messages based on message thread or other intrinsic or client attributes.

In addition we would want to include more DB-like interfaces for finding messages, e.g.,

  • Find messages using a query language.
  • Virtual folders that remember useful queries.

A fully featured query interface may be too difficult for a typical
user, but current DB query tools can be used as a guide.
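
A virtual folder need be nothing more than a named, remembered query. In the
sketch below each folder is a stored predicate; the folder definitions, the
flag names, and my reading of “Inbox” as “unread and uncategorized” are
assumptions made for the example.

```python
virtual_folders = {}


def save_folder(name, predicate):
    """Remember a query under a folder name."""
    virtual_folders[name] = predicate


def open_folder(name, messages):
    """Re-run the remembered query against the current messages."""
    return [m["id"] for m in messages if virtual_folders[name](m)]


save_folder("Inbox", lambda m: "unread" in m["flags"] and not m["categories"])
save_folder("Drafts", lambda m: "draft" in m["flags"])
save_folder("Work, needs reply",
            lambda m: "work" in m["categories"] and m.get("needs_reply", False))

messages = [
    {"id": 1, "flags": {"unread"}, "categories": set()},
    {"id": 2, "flags": {"draft"}, "categories": set()},
    {"id": 3, "flags": set(), "categories": {"work"}, "needs_reply": True},
]
```

Because the query is re-run each time the folder is opened, a message moves
between folders as soon as its attributes change; nothing is ever copied.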

One simple query interface hinted at above is the query tree; here
we build a hierarchy using simple attributes.
At each level of the hierarchy, the user selects which attribute to
use by, for example, clicking on an icon where the "[+]" and "[-]" icons
usually are in current folder browsers. Each time the icon is
clicked, the tree is expanded using a different attribute and the icon
is changed accordingly. Each entry in the expanded subtree
corresponds to one value of the chosen attribute.

Not all attributes are suitable for such a query tree, particularly
attributes with many distinct values, e.g., dates and the message
subject. Also the user would only be able to do very simple queries
in this way. But this might be a simple enough interface that it
might be used by everyday people. Again, more research is required.

Database Implications

You have probably gathered that the database described in this article
is not a typical relational database. I like to think of it as an "ad
hoc" database; we can use it to store any old junk.

In our favour we don’t have too much information to store. Even
the most sociable of users will not have millions of emails. In
addition, the attributes we use are pretty straightforward: flags,
enumerated values, strings, a category attribute with fields, perhaps
dates and numbers.

Having said that, this database may be difficult to implement
efficiently. The user can add new attributes to any object at will.
Most attributes are optional and some are multivalued. Together this
will make it difficult to find an efficient storage scheme, though
something like Reiser4 would suffice. Further, we will want to key
every attribute, as well as creating a “string search” key into the
message subject and body.
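
One well-known way of storing optional, multivalued, user-defined attributes
is a single table with one row per (object, attribute, value), indexed on the
attribute name and value. I am not claiming this is the efficient scheme we
are looking for; it is just a sketch, using SQLite from the Python standard
library and table names of my own choosing, of what “keying every attribute”
might look like.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per value, so an attribute can be absent or multivalued.
    CREATE TABLE attrs (object_id TEXT, name TEXT, value TEXT);
    -- A single index that keys every attribute by name and value.
    CREATE INDEX attrs_by_name_value ON attrs (name, value);
""")


def set_attr(object_id, name, *values):
    """Replace all values of one attribute on one object."""
    db.execute("DELETE FROM attrs WHERE object_id = ? AND name = ?",
               (object_id, name))
    db.executemany("INSERT INTO attrs VALUES (?, ?, ?)",
                   [(object_id, name, v) for v in values])


def find(name, value):
    """Ids of all objects carrying the given attribute value."""
    rows = db.execute(
        "SELECT DISTINCT object_id FROM attrs "
        "WHERE name = ? AND value = ? ORDER BY object_id", (name, value))
    return [r[0] for r in rows]


set_attr("msg-1", "category", "work", "soccer")
set_attr("msg-2", "category", "work")
set_attr("msg-2", "needs_reply", "yes")
```

New attributes need no schema change, but every value is stored as text and
multi-attribute queries turn into self-joins, which hints at where the
efficiency problems would come from.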

These things may make the database difficult to implement. These
difficulties will only get worse as we add more object types to our
database, and may be prohibitive in an unconstrained system like a
full database file system. Perhaps this is why Microsoft has failed
to implement one for the last 10 years.

Taking it Further

Even though this system is imaginary, it is useful to think about how it
might develop over time. I can think of a few directions that could be
explored:

Accessing databases remotely

Just as the MAPI and IMAP protocols provide remote access to traditional
mail databases, we would need to create a network protocol to provide
remote access to our database.

If we were using a DBFS, this could be the filesystem’s standard
network protocol, if one existed. Conversely, creating a network
protocol to access our database would serve as a good prototype for a
network DBFS.

Linking together multiple databases

Quite often, people have access to more than one mail database.
For example, a company employee typically has access to two: their own
database containing personal messages and a shared public database
containing messages of general interest.

It is interesting to consider how we could integrate multiple
databases in this framework. Will the user see a combined view or
will they see two distinct databases? How will access control be
implemented? Can the user add private attributes to public entries?

Setting attributes automatically

Some user attributes, e.g., attributes indicating that the message
has been forwarded or replied-to, can be automatically set by the mail
client. It would be useful to provide some JavaScript-like scripting
language to allow the user to automate the maintenance of such
attributes.
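
To give a flavour of what such scripts might look like, here is a sketch in
Python rather than in any real mail client’s scripting language. The event
names and the hook table are hypothetical.

```python
def on_reply_sent(message):
    """Mark the original as replied-to and clear any pending reply flag."""
    message["replied_to"] = True
    message.pop("needs_reply", None)


def on_forward_sent(message):
    message["forwarded"] = True


# Hypothetical table mapping client events to user-supplied scripts.
HOOKS = {"reply_sent": [on_reply_sent], "forward_sent": [on_forward_sent]}


def fire(event, message):
    """Run every script registered for an event against one message."""
    for hook in HOOKS.get(event, []):
        hook(message)


msg = {"id": 1, "needs_reply": True}
fire("reply_sent", msg)
```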

Taking this one step further, we could use the same techniques used
to identify spam – Bayesian filtering for
example – to place messages into other categories. Although I’m
not sure if users would be willing to allow their messages to be
categorised by a machine.
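
For the curious, a bare-bones naive Bayes classifier over message words fits
in a few lines. This is a toy with invented training text, not a description
of how any real spam filter is built; a usable version would need proper
tokenizing and far more training data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # category -> word -> count
        self.doc_counts = Counter()               # category -> messages seen

    def train(self, category, text):
        self.word_counts[category].update(text.lower().split())
        self.doc_counts[category] += 1

    def classify(self, text):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            # Log prior plus the log likelihood of each word.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best, best_score = category, score
        return best


nb = NaiveBayes()
nb.train("work", "quarterly budget meeting agenda")
nb.train("work", "project deadline status report")
nb.train("soccer", "match on saturday bring boots")
nb.train("soccer", "team training pitch saturday")
```

The classifier only ever suggests a category; whether the client applies it
silently or asks the user first is the policy question raised above.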

Adding attributes to outbound messages

A user might want to include attributes within messages sent to
other users of database-backed mail clients. For example, the sender
might like to set a reply-required flag or a reply-by date. The
recipient might like it if the message indicated whether it was a work
or a personal email.

The RFC 822 standard is flexible enough that it would be quite easy
to add these attributes to a message. The difficult bit would be
creating a shared schema for all users to use.
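
As an illustration, Python’s standard email package will happily carry extra
header fields. The X- header names below are made up for this example and are
not part of any agreed schema, which is precisely the difficulty.

```python
from email import message_from_string
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Budget figures"
# Hypothetical extension headers carrying the sender's attributes.
msg["X-Reply-Required"] = "yes"
msg["X-Reply-By"] = "2005-02-28"
msg["X-Category"] = "work"
msg.set_content("Please send me the figures.")

# Serialize, then parse again as a receiving client would.
wire = msg.as_string()
received = message_from_string(wire)
attrs = {k: v for k, v in received.items() if k.startswith("X-")}
```

A client that does not understand these headers simply ignores them, so the
attributes degrade gracefully for recipients using ordinary mail software.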

Where Do DB Filesystems Fit?

You may have noticed that database filesystems weren’t mentioned
much in the above. So now it is time to ask what would such a
filesystem give us?

It is not obvious that a database filesystem would implement what
we require. For example, would we be able to add attributes at will
to database objects? Would the database filesystem allow us to do
string searches into the body of objects? Perhaps the database
filesystem would just provide an efficient storage layer (e.g.,
Reiser4) and we would have to do all the indexing ourselves.

Assuming that we have access to a database filesystem that fulfills
our requirements, implementing a mail client on top might be little
more than implementing a UI. Unfortunately, as I attempted to convey
above, I think that this would be one of the harder problems to
solve.

In addition to easing the implementation of our mail client, the
database filesystem would allow users to manipulate messages using the
standard filesystem tools. For example, users could view and edit
message attachments with standard utilities – the attachments
would appear as if they were just another file in an ordinary file
hierarchy. Of course, we don’t need a database filesystem for this;
we could achieve the same result by exporting the contents of our
database as a userspace filesystem using tools like FUSE.

Perhaps most importantly of all, a database filesystem would allow
us to unify the way we handle all filesystem objects. For example, if
we extended the database to extract intrinsic attributes from word
documents or music files, then these attributes would automatically be
available in our mail client.

So it seems that a database filesystem does not buy us very much.
On the other hand, many of the ideas and issues outlined in the
previous sections apply to a database filesystem just as much as to
our database-backed mail client. In the former, we still need tools
for administering attributes and specifying queries. Solving these
issues in the simple email case should give us a good insight into
more general solutions for a full database filesystem.

Conclusion

The above was a fairly undirected ramble through some ideas, and I
must apologize for inflicting it upon you. I think the point I am
trying to illustrate is that a database filesystem is not a solution
in itself. Like all databases, it is only as useful as the
applications that use it. The converse is not true – we can start
implementing the applications straight away using a custom
database.

I think this suggests that it is worthwhile starting to implement
the applications now. We can start using the database filesystems
when they become available.

If I get some time, I plan on experimenting with some of the above
ideas in a prototype mail client. But considering it took me a month
to write this essay, don’t hold your breath.

Further Reading

DB filesystems

  • Reiser4 is an efficient
    database for storing lots of small objects.
  • Hans Reiser’s vision for a database filesystem
  • Real soon now, Microsoft will unleash WinFS
    onto the world and make all other database filesystems obsolete.
    Though details are still a little vague.

