Content Manager

One of the challenges in designing a content manager that uses a VCS backend instead of a database is how to store all the metadata about the content. This could include title, author, date, keywords, and potentially much more. Storing auxiliary information is what a database does best, but using VCS storage means that all your content and metadata have to be organized in terms of a filesystem. It is a more restricted environment that demands more careful design, but if we succeed in an elegant design, then the VCS brings many benefits: revision history, ability to manage and edit with common tools, offline editing, branching and merging, and ease of backups, to name a few.

I have had these things in mind recently as I have mulled over the questions surrounding the design of my blog, which to this point is the most complex section of my website. I had to decide what kind of hierarchy the blog entries would be organized in on disk, what the URLs that access these entries would look like, how the site administrator could easily configure any path translations from internal (actual) paths to external clean-URL paths, and how all the metadata about both the blog entries and the directories would be stored. I made solid progress on all of these questions in my content manager yesterday and today, and now it's time to use the blog to, er, blog about it.

First, I decided on internal pathnames that look like this:

blog/2011/08/20110801--content-manager.text

A couple of points bear mentioning. Under the 'blog' directory, I have subdirectories for year and month. This makes for a file hierarchy that will be easy to navigate with emacs and common shell tools, and shouldn't get too cluttered. The file names for the blog entries follow a convention I have used elsewhere: an eight-digit date, followed by a two-hyphen separator, followed by the short-title and a file extension. Having the date encoded right into the file name is a time-saver: directory listings will naturally be ordered by date, even if the file timestamps get lost. What's more, I can have my content manager extract this information from the file name to use as the date of the blog entry. Elegant, thinks I.
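To make the naming convention concrete, here is a minimal sketch of pulling the date and short-title out of such a file name. It is written in Python purely for illustration and is not the content manager's own Chicken Scheme code; the names ENTRY_NAME and parse_entry_name are made up for this example.

```python
import re
from datetime import date

# Assumed convention, taken from the text: YYYYMMDD--short-title.ext
ENTRY_NAME = re.compile(r"^(\d{4})(\d{2})(\d{2})--(.+)\.([^.]+)$")

def parse_entry_name(filename):
    """Return (date, short_title, extension), or None if the name does not fit."""
    m = ENTRY_NAME.match(filename)
    if m is None:
        return None
    year, month, day, short_title, ext = m.groups()
    try:
        entry_date = date(int(year), int(month), int(day))
    except ValueError:
        # Eight digits that are not a real calendar date, e.g. month 13.
        return None
    return entry_date, short_title, ext
```

Because the date leads the name in year-month-day order, a plain alphabetical sort of the file names is also a chronological sort, which is why directory listings come out ordered by date.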

The one important criterion that this design does not meet is that of clean URLs. There is repeated information in the path (year and month both occur twice), and the double hyphen would look a little odd in a blog URL. We will want to do something about that. One more conventionally sees blog URLs that look like this:

blog/2011/08/01/content-manager

We can achieve these clean, rich URLs by adding a path translation capability to the content manager. This is just what I accomplished yesterday and today, though as often happens in programming, I wrote it twice, in two slightly different designs. The second is the better of the two, and I will summarize it now:

I wanted the translation rules to be easily manageable as part of the metadata of the file hierarchy. Also, since I intend for this content manager to be able to manage lots of content divided into many independent sections, I wanted the translation rules to be stored locally to the part of the tree they affect, in this case 'blog'. This affords a kind of independence for the different sections: configuring one section of the site does not affect any other section.

I designated the special file name "_meta" within a directory to be the metadata file for that directory and all its contents. The file contains an alist, which in this case holds our single path translation rule, and looks like this:

((translate-paths . ([(Y / m / Y m d "--" short-title) .
                      (Y / m / d / short-title)])))

So many parens! But it's really simple: it defines the translation rule I want for my blog, with one half being a pattern that matches against the internal file names:

(Y / m / Y m d "--" short-title)

and the other half telling how to rewrite the path for the front end:

(Y / m / d / short-title)

In the part of the program that processes this, 'Y', 'm', 'd', and 'short-title' are defined as special stand-ins for regular expression fragments. 'Y' matches a sequence of four digits, 'm' matches a sequence of two digits, and so on. The program builds a regular expression dynamically from this rule definition, a delightfully simple thing to do in Chicken Scheme, and with it, extracts the desired information and builds the new path with simple substitution. We also now have the date of the blog entry, and can set it aside for whatever additional processing. (I have not yet written that part.)
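The general idea of turning a rule into a regular expression can be sketched as follows. This is a Python illustration under stated assumptions, not the actual Chicken Scheme implementation: the regex fragments for 'd' and 'short-title' are guesses, the file extension is assumed to be stripped before matching, paths are taken relative to the 'blog' directory, and all names (FRAGMENTS, compile_pattern, translate, BLOG_RULE) are hypothetical.

```python
import posixpath
import re

# Stand-in symbols and their regex fragments. Y and m follow the text;
# the fragments for d and short-title are assumptions.
FRAGMENTS = {
    "Y": r"\d{4}",
    "m": r"\d{2}",
    "d": r"\d{2}",
    "short-title": r"[^/.]+",
}

def compile_pattern(pattern):
    """Build a regex from a list of stand-in symbols and literal strings.

    The first occurrence of a symbol becomes a named group; later
    occurrences become backreferences, so both Y's must match the same year.
    """
    groups = {}
    parts = []
    for token in pattern:
        if token not in FRAGMENTS:
            parts.append(re.escape(token))       # literal such as "/" or "--"
        elif token in groups:
            parts.append("(?P=%s)" % groups[token])
        else:
            groups[token] = "g%d" % len(groups)  # "short-title" is not a valid group name
            parts.append("(?P<%s>%s)" % (groups[token], FRAGMENTS[token]))
    return re.compile("".join(parts)), groups

def translate(path, rule):
    """Rewrite an internal path to its clean form, or return None if no match."""
    pattern, replacement = rule
    regex, groups = compile_pattern(pattern)
    stem, _ext = posixpath.splitext(path)        # assumption: extension is dropped
    m = regex.fullmatch(stem)
    if m is None:
        return None
    return "".join(m.group(groups[t]) if t in FRAGMENTS else t for t in replacement)

# The blog rule from the _meta file, transcribed as Python lists.
BLOG_RULE = (
    ["Y", "/", "m", "/", "Y", "m", "d", "--", "short-title"],
    ["Y", "/", "m", "/", "d", "/", "short-title"],
)
```

With this sketch, translate("2011/08/20110801--content-manager.text", BLOG_RULE) gives "2011/08/01/content-manager". One simplification to note: Python has no separate symbol type here, so a literal string that happens to be spelled "Y" or "m" would be mistaken for a stand-in, a problem the Scheme rule avoids by distinguishing symbols from strings.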

So back to the question of how best to store metadata about content. Metadata about directories is stored in an optional file called "_meta" in each directory. That leaves the question of metadata for files. We saw that some metadata can be encoded directly into file names, like the date and the short-title. But we still need a place to store other metadata like the full title, tags or keywords, author, and who knows what else. Access and editing permissions, perhaps?

For these, I have two ideas, and may end up including both. The first is to use auxiliary files with the same name as the file but with an additional extension of ".meta". The second is to steal a trick from Moritz Heidkamp's Hyde and include the metadata in the file itself as a header. Each file would have to be read at init time to check for this header, and when the file was processed for display, the header would have to be stripped out.

The first approach is more general, and more easily manageable if the metadata needs to be machine-editable. The second has a certain attraction of convenience: the metadata is right there while you are editing the file in question, which seems so intuitive, and it doesn't clutter up the filesystem. The second approach, though, is not compatible with all file formats that I might like to use with the content manager. Therefore, I will almost without a doubt implement the first approach, and maybe later on, as a convenience, add in the second.
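Neither approach is written yet, so the following is only a rough Python sketch of what each might involve. The header format used here (a '---' line, 'key: value' lines, and a closing '---' line) is invented for the example and is not taken from Hyde; sidecar_path and split_header are likewise hypothetical names.

```python
HEADER_FENCE = "---"  # made-up delimiter for this sketch, not Hyde's format

def sidecar_path(content_path):
    """First approach: the metadata lives in '<file name>.meta' beside the file."""
    return content_path + ".meta"

def split_header(text):
    """Second approach: split text into (metadata dict, body without header).

    Text that does not start with a complete header is returned unchanged
    with empty metadata.
    """
    lines = text.split("\n")
    if lines[0].strip() != HEADER_FENCE:
        return {}, text
    try:
        end = next(i for i in range(1, len(lines))
                   if lines[i].strip() == HEADER_FENCE)
    except StopIteration:
        return {}, text  # opening fence with no closing fence: not a header
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])
```

The sketch also hints at the compatibility problem with the header approach: it only works for formats that can tolerate a few extra lines of text at the top, whereas a sidecar file can accompany an image or any other binary file just as well.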

I'm feeling pretty good about the direction this project is going, but like any creative endeavor, each development seems to open new questions, new problems, and new directions. I have written content managers before in languages like Perl and PHP, and the difference this time is that I'm having real fun. At every turn I find that Chicken Scheme is a wonderful and joyful language for web programming, and the Awful web framework that my content manager is based on is most definitely not awful at all, but awesome. So now it's time to pause, plan, and get some experience by adding content to the site and seeing where this river wants to flow.