by Jon Udell

Managing metadata

analysis
Oct 20, 2005 | 11 mins
Data Management, Data Mining, Databases

Data about data provides enormous opportunities to organize information in new and useful ways

When we talk and write about IT issues, we use certain words to mean many different things. “Platform,” “architecture,” and “integration” are among the worst offenders, but the most overloaded term in the IT lexicon may well be “metadata.”

Everyone knows the common definition: Metadata is data about data, a secondary thing that’s separate in some way from the primary thing to which it refers. But that definition raises a series of questions. Is metadata something we derive from data, or assign to it? Does it classify things, or enable us to search for things, or govern the behavior of things? If data that is described by metadata also, in turn, refers to other data, does it then qualify as both data and metadata?

These questions can verge on the philosophical, but by working through some examples, we can define various types of metadata, list the benefits that we expect from using it, and identify the challenges associated with maintaining it. Programs, documents, messages, files, Web resources, and Web services are some of the IT constructs often described by metadata. Let’s review the roles that metadata can play in these different scenarios.

Software metadata

Since the birth of software, programmers have embedded one kind of metadata, namely comments, in their source code. Making such comments more integral to software has been a long-standing quest. In the 1980s, the legendary computer scientist Donald Knuth began evangelizing a technique he called “literate programming.” Knuth is the creator of TeX, a typesetting system that’s still used for math-intensive documents. His idea was to use TeX in tandem with a programming language to compose a single document that blended both code and documentation.

Knuth’s approach never really caught on, but the idea of weaving comments more intimately into code continued to evolve. Java programmers, for example, write specially formatted comments in their source code and then use the Javadoc tool to translate those comments into HTML documentation.
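To make the Javadoc convention concrete, here is a minimal sketch. The class name CardChecker and the choice of the Luhn checksum are assumptions made for illustration; the only point is the shape of the comments, which the Javadoc tool reads to generate HTML.

```java
/**
 * Checks credit card numbers with the Luhn checksum.
 * Hypothetical example class, used here only to show Javadoc comments.
 */
class CardChecker {

    /**
     * Reports whether a card number passes the Luhn checksum.
     *
     * @param number the card number as digits only, with no spaces or dashes
     * @return {@code true} if the checksum is valid, {@code false} otherwise
     * @throws IllegalArgumentException if {@code number} is empty or contains a non-digit
     */
    static boolean isValid(String number) {
        if (number.isEmpty()) {
            throw new IllegalArgumentException("empty card number");
        }
        int sum = 0;
        boolean doubleIt = false; // every second digit, counting from the right, is doubled
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = Character.digit(number.charAt(i), 10);
            if (d < 0) {
                throw new IllegalArgumentException("not a digit: " + number.charAt(i));
            }
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("4111111111111111")); // a widely used test number
    }
}
```

Running the javadoc tool over a file like this produces an HTML page in which the @param, @return, and @throws tags appear as structured sections.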

Comments are an informal kind of metadata used to describe the design and operation of software for human readers. But they can also be used in more formal ways to declare properties of software components and relationships among them. A module that checks credit card numbers, for example, might be invoked directly or by way of a Web services framework. Specifying the invocation style in a comment, rather than in the code, is one way to separate configuration logic from business logic.

Because comments don’t survive compilation, though, such configuration metadata is only indirectly linked to the code to which it refers. Why not embed the metadata directly in the generated code? The .Net architecture enables just that. With J2SE 1.5, Java does, too. Thanks to a technique called reflection, available in both environments, it’s possible to query class files or assemblies at run time, discover these metadata annotations, and react dynamically to them. Metadata can be used to declare that a component must run in a transactional context, for example, or to specify the kind of authentication it must use.
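A short Java sketch shows the mechanism. The Transactional annotation, its isolation attribute, and the OrderService class are all hypothetical names invented for this example; they are not part of any standard library. The RUNTIME retention policy is what keeps the annotation in the class file so that reflection can find it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class AnnotationDemo {
    // Describes the transaction requirement declared on a public no-argument method, or "none".
    static String describe(Class<?> type, String methodName) {
        try {
            Method m = type.getMethod(methodName);
            Transactional tx = m.getAnnotation(Transactional.class);
            return tx == null ? "none" : "transactional/" + tx.isolation();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(OrderService.class, "placeOrder"));
        System.out.println(describe(OrderService.class, "ping"));
    }
}

// Hypothetical annotation: declares that a method must run in a transactional context.
@Retention(RetentionPolicy.RUNTIME) // retained in the class file and visible to reflection
@Target(ElementType.METHOD)
@interface Transactional {
    String isolation() default "READ_COMMITTED";
}

// Hypothetical component carrying the metadata.
class OrderService {
    @Transactional(isolation = "SERIALIZABLE")
    public void placeOrder() { }

    public void ping() { }
}
```

A container or framework could call something like describe at run time and decide whether to open a transaction before invoking the method. The business logic itself never mentions transactions.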

These custom annotations are assigned to software, not intrinsic to it. But Java and .Net programs also make available, through reflection, intrinsic metadata about the objects they contain, as well as the types and properties of those objects. As a result, these self-describing programs can collaborate with other programs in highly dynamic ways.

Statement completion in integrated development environments, which Microsoft calls IntelliSense, is one example. Interaction with controls at design time is another.
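As a rough illustration of how a tool might use this intrinsic metadata, the following sketch asks a class for its public methods and filters them by a typed prefix. It is a toy, not a description of how any real IDE implements completion, and the CompletionDemo and Account names are assumptions.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CompletionDemo {
    // Lists "name(paramTypes): returnType" for public methods declared on the type
    // whose names start with the given prefix, in alphabetical order.
    static List<String> complete(Class<?> type, String prefix) {
        List<String> out = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (m.isSynthetic() || !Modifier.isPublic(m.getModifiers())) continue;
            if (!m.getName().startsWith(prefix)) continue;
            StringBuilder sig = new StringBuilder(m.getName()).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) sig.append(", ");
                sig.append(params[i].getSimpleName());
            }
            sig.append("): ").append(m.getReturnType().getSimpleName());
            out.add(sig.toString());
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        for (String candidate : complete(Account.class, "get")) {
            System.out.println(candidate);
        }
    }
}

// Hypothetical class to inspect; nobody annotated it, the metadata is intrinsic.
class Account {
    public long getBalance() { return 0L; }
    public String getOwner() { return ""; }
    public void deposit(long cents) { }
    void audit() { } // not public, so it is not offered as a completion
}
```

Nothing in Account was written with the tool in mind. The names, parameter types, and return types are all recovered from the compiled class.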

These scenarios don’t strictly require that metadata be embedded in generated code, but it’s administratively convenient to do so. Objects that contain their own metadata descriptions are well adapted to an increasingly decentralized world. At the same time, though, there are countervailing reasons to prefer looser coupling between data and its metadata. In the Java and .Net environments, for example, embedded configuration metadata is handy for programmers but less convenient for system administrators, who cannot change it without rebuilding the code. Administrators can therefore override the hard-coded settings with alternatives they declare in separate XML configuration files maintained under their control.
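The override pattern can be sketched in a few lines of Java. Everything named here is an assumption for illustration: the Timeout annotation, the CardService class, and the key convention ClassName.timeoutSeconds. The XML is inlined as a string to keep the sketch self-contained, and it uses the format read by java.util.Properties.loadFromXML; real frameworks define their own configuration schemas.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class OverrideDemo {
    // Stand-in for an XML file that an administrator would maintain separately.
    static final String ADMIN_XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"CardService.timeoutSeconds\">5</entry>\n"
        + "</properties>\n";

    // Parses the administrator's XML into key/value overrides.
    static Properties parse(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the administrator's value if one is declared, else the value embedded in the code.
    static int effectiveTimeout(Class<?> type, Properties overrides) {
        String external = overrides.getProperty(type.getSimpleName() + ".timeoutSeconds");
        if (external != null) {
            return Integer.parseInt(external.trim());
        }
        Timeout embedded = type.getAnnotation(Timeout.class);
        if (embedded == null) {
            throw new IllegalArgumentException("no timeout declared for " + type.getName());
        }
        return embedded.seconds();
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeout(CardService.class, new Properties()));
        System.out.println(effectiveTimeout(CardService.class, parse(ADMIN_XML)));
    }
}

// Hypothetical annotation holding the programmer's embedded default.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Timeout {
    int seconds();
}

@Timeout(seconds = 30)
class CardService { }
```

The lookup order is the design decision: the external file wins when it speaks, and the embedded annotation is the fallback. That keeps the programmer's default close to the code while leaving the final say with the administrator.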

The tension between these two world views — the programmer’s and the system administrator’s — shows why there often isn’t one right way to manage metadata. Even in a single domain, using a single controlled vocabulary, the people who manage metadata can occupy different organizational roles and expect to use different sets of tools.

Document and message metadata

Documents and messages carry richer intrinsic metadata than the basic facts — owner, creation and modification date, size, permissions — that file systems maintain. A Word document has built-in properties; a Web server includes metadata in HTTP response headers; e-mail messages also transmit metadata in headers.

In all these cases, it’s possible to include extra items of metadata. Word documents can have custom properties; Web pages can use metatags; e-mail messages can carry custom header fields. We’ve long hoped that optional user-assigned metadata would help us relate our documents and messages to the activities they represent and embody. But assigning extra information requires extra effort, which is always a stumbling block. And when this optional metadata isn’t limited to controlled vocabularies, it’s hard to build reliable processes around it.
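Header metadata of this kind is easy to read programmatically. The sketch below pulls header fields out of a raw message. It is deliberately simplified: it ignores folded (multi-line) header values and repeated fields, and the X-Project field and the sample message are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class HeaderDemo {
    static final String SAMPLE =
          "From: paul@example.com\n"
        + "Subject: Q3 forecast\n"
        + "X-Project: Trinity\n"
        + "\n"
        + "Note: this line is body text, not a header.\n";

    // Collects "Name: value" fields up to the first blank line.
    // Field names are matched case-insensitively, as in e-mail and HTTP.
    static Map<String, String> headers(String raw) {
        Map<String, String> out = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String line : raw.split("\r?\n")) {
            if (line.isEmpty()) break;            // blank line separates headers from body
            int colon = line.indexOf(':');
            if (colon <= 0) continue;             // skip malformed lines
            out.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> h = headers(SAMPLE);
        System.out.println(h.get("Subject"));
        System.out.println(h.get("x-project")); // custom, user-assigned field
    }
}
```

The standard fields and the custom field are read the same way. What differs is that nothing guarantees anyone filled in the custom one, or used a consistent value when they did.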

New social tagging services, such as del.icio.us and Flickr, have brought a fresh approach to this ancient dilemma. When data is open to public inspection — such as the set of all Web resources, or just the subset of online photos — these tagging services have proved to be surprisingly effective ways to gather and exploit voluntarily contributed free-form metadata.

If you tag personal documents and messages stored on your own hard drive or in your own partition of an online service, the only person who benefits from that effort is you. And you, in turn, gain nothing from the effort that others invest in tagging their personal data. Without feedback and positive reinforcement, rigorous use of metadata requires a level of discipline verging on the neurotic.

When metadata refers to shared data, however, and when lots of people can interact with that shared data, the dynamics shift. In this scenario, the metadata produced by a few people — for selfish reasons — can also benefit many others. Enlightened self-interest is one key motivator, but peer pressure is another. Although nothing compels me to tag my bookmarks or photos according to group conventions, doing so means the items I tag will receive more attention from the group. That’s a powerful incentive to contribute, and it creates a tight feedback loop that helps metadata vocabularies converge.

Social tagging can be a great way to manage metadata about things that aren’t confidential, to align communities of interest around sets of resources, and to harness the collective ability of those communities to tease out the relationships among things. At InfoWorld.com, for example, this process has proved an effective way to answer the request: “Show me clusters of articles related to the current one.”
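One simple way to turn tags into "related articles" is to rank other items by how much their tag sets overlap with the current one. The sketch below uses Jaccard similarity (shared tags divided by all distinct tags). That measure, the article identifiers, and the tags are assumptions for illustration; this is not a description of how InfoWorld.com actually computes its clusters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TagDemo {
    // Shared tags divided by all distinct tags; 0.0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> both = new HashSet<>(a);
        both.retainAll(b);
        Set<String> either = new HashSet<>(a);
        either.addAll(b);
        return (double) both.size() / either.size();
    }

    // Other articles sharing at least one tag with the current one, most similar first.
    static List<String> related(String current, Map<String, Set<String>> tagsByArticle) {
        Set<String> mine = tagsByArticle.get(current);
        if (mine == null) throw new IllegalArgumentException("unknown article: " + current);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : tagsByArticle.entrySet()) {
            if (!e.getKey().equals(current) && jaccard(mine, e.getValue()) > 0.0) {
                out.add(e.getKey());
            }
        }
        out.sort(Comparator
            .comparingDouble((String id) -> -jaccard(mine, tagsByArticle.get(id)))
            .thenComparing(Comparator.<String>naturalOrder()));
        return out;
    }

    // Invented sample data.
    static Map<String, Set<String>> sample() {
        Map<String, Set<String>> m = new LinkedHashMap<>();
        m.put("metadata-overview", new HashSet<>(Arrays.asList("metadata", "xml", "rdf")));
        m.put("xml-tools", new HashSet<>(Arrays.asList("xml", "editors")));
        m.put("rdf-primer", new HashSet<>(Arrays.asList("rdf", "metadata", "semanticweb")));
        m.put("linux-kernel", new HashSet<>(Arrays.asList("linux")));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(related("metadata-overview", sample()));
    }
}
```

The approach only works when the vocabulary has converged. If one person tags with "rdf" and another with "semantic-web" for the same idea, the overlap is zero, which is why the feedback loop described above matters.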

When data can’t be freely shared, however, and when there aren’t lots of people interacting with it, social tagging lacks the critical mass it needs to thrive. At the scale of a workgroup, a department, or even a whole company, it’s unlikely this approach will yield accurate answers to requests such as: “Show me the discussions related to sales projections made by people working on the Trinity project.” To answer such a query, you’d need to know which items relate to the project, which are sales projections, and which are messages related to those projections.

Web and file system metadata

The Web’s inventor, Tim Berners-Lee, has long imagined a “semantic Web” that makes it possible to reason about interrelated things. To that end, the World Wide Web Consortium has proposed two initiatives: RDF (Resource Description Framework), a grammar for describing and exchanging metadata about resources; and OWL (Web Ontology Language), a set of languages for classifying resources.

RDF describes interrelated things in terms of subject-predicate-object “triples.” Examples might be “DOCUMENT IS-A SALES_PROJECTION,” or “DOCUMENT HAS-AUTHOR PAUL_SMITH.” If you had lots of resources described in this way, and if the metadata vocabularies were carefully controlled, and if you had a query engine that could efficiently process sets of these assertions, you could answer all kinds of very difficult but very interesting questions. Those are three huge ifs, of course, and given the scale and chaotic complexity of the Web, it’s not surprising that little progress has been made to date.
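The triple model itself is small enough to sketch in plain Java. The code below is not an RDF library, and real RDF identifies subjects, predicates, and resource-valued objects with URIs rather than bare strings. The TripleDemo class and the sample identifiers such as doc1 are assumptions; the predicates and objects echo the examples in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TripleDemo {
    // One subject-predicate-object assertion.
    static final class Triple {
        final String subject, predicate, object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    // Returns every triple that matches the pattern; null in any position is a wildcard.
    static List<Triple> match(List<Triple> store, String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : store) {
            if ((s == null || s.equals(t.subject))
                    && (p == null || p.equals(t.predicate))
                    && (o == null || o.equals(t.object))) {
                out.add(t);
            }
        }
        return out;
    }

    // Invented sample assertions.
    static List<Triple> sample() {
        return Arrays.asList(
            new Triple("doc1", "IS-A", "SALES_PROJECTION"),
            new Triple("doc1", "HAS-AUTHOR", "PAUL_SMITH"),
            new Triple("doc2", "HAS-AUTHOR", "PAUL_SMITH"),
            new Triple("doc2", "IS-A", "MEMO"));
    }

    public static void main(String[] args) {
        // Everything Paul Smith wrote:
        for (Triple t : match(sample(), null, "HAS-AUTHOR", "PAUL_SMITH")) {
            System.out.println(t.subject);
        }
    }
}
```

A linear scan like this is fine for four assertions and hopeless for a Web's worth of them, which is the third of the three ifs: the query engine has to index and join these assertions efficiently.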

Is there better traction to be gained in the more restricted domain of personal information management? That’s what Microsoft hopes to prove with WinFS, the next-generation file system that was originally planned for Windows Vista only, then appeared in beta for Windows XP this fall, and is now slated to appear on both platforms sometime after Vista ships.

A WinFS data store is a collection of strongly typed items. The list of types includes Document, Person, Message, and — most crucially — Relationship. The idea is that applications managing these items use relationships to weave what are, in effect, RDF triples. Although applications can assert that a document is a sales projection, or that its author is Paul Smith, or that Paul Smith is a project Trinity team member, these relationships are not held privately by any of the applications. Instead, they’re available systemwide to all WinFS-aware applications, any of which can query for messages written by team members that refer to sales projections.
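The query at the end of that paragraph is a join across several relationships. The sketch below shows that logic over an in-memory set of assertions. It does not use the WinFS API or its type system; the RelationshipDemo class, the relationship names, and the sample items are all invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class RelationshipDemo {
    // Each fact is a [subject, relationship, object] list shared by all "applications".
    private final Set<List<String>> facts = new HashSet<>();

    void add(String s, String rel, String o) {
        facts.add(Arrays.asList(s, rel, o));
    }

    boolean has(String s, String rel, String o) {
        return facts.contains(Arrays.asList(s, rel, o));
    }

    Set<String> objectsOf(String s, String rel) {
        Set<String> out = new TreeSet<>();
        for (List<String> f : facts) {
            if (f.get(0).equals(s) && f.get(1).equals(rel)) out.add(f.get(2));
        }
        return out;
    }

    Set<String> subjectsOf(String rel, String o) {
        Set<String> out = new TreeSet<>();
        for (List<String> f : facts) {
            if (f.get(1).equals(rel) && f.get(2).equals(o)) out.add(f.get(0));
        }
        return out;
    }

    // Messages written by a member of the team that refer to a sales projection.
    Set<String> teamMessagesAboutProjections(String team) {
        Set<String> out = new TreeSet<>();
        for (String msg : subjectsOf("IS-A", "MESSAGE")) {
            boolean byMember = false;
            for (String author : objectsOf(msg, "HAS-AUTHOR")) {
                if (has(author, "MEMBER-OF", team)) byMember = true;
            }
            boolean aboutProjection = false;
            for (String target : objectsOf(msg, "REFERS-TO")) {
                if (has(target, "IS-A", "SALES_PROJECTION")) aboutProjection = true;
            }
            if (byMember && aboutProjection) out.add(msg);
        }
        return out;
    }

    // Invented sample: facts that different applications might each contribute.
    static RelationshipDemo sample() {
        RelationshipDemo db = new RelationshipDemo();
        db.add("doc1", "IS-A", "SALES_PROJECTION");
        db.add("doc2", "IS-A", "MEMO");
        db.add("PAUL_SMITH", "MEMBER-OF", "TRINITY");
        db.add("msg1", "IS-A", "MESSAGE");
        db.add("msg1", "HAS-AUTHOR", "PAUL_SMITH");
        db.add("msg1", "REFERS-TO", "doc1");
        db.add("msg2", "IS-A", "MESSAGE");
        db.add("msg2", "HAS-AUTHOR", "ANN_LEE");   // not on the team
        db.add("msg2", "REFERS-TO", "doc1");
        db.add("msg3", "IS-A", "MESSAGE");
        db.add("msg3", "HAS-AUTHOR", "PAUL_SMITH");
        db.add("msg3", "REFERS-TO", "doc2");       // refers to a memo, not a projection
        return db;
    }

    public static void main(String[] args) {
        System.out.println(sample().teamMessagesAboutProjections("TRINITY"));
    }
}
```

The query only succeeds because the mail client, the document editor, and the project tool all wrote into one shared store with agreed relationship names. If each application kept its facts private, no single program could complete the join.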

Marrying a relational database to a file system is one of the daunting challenges faced by WinFS. Another will be convincing developers to exploit the built-in WinFS types and create new types that define customized axes of controlled metadata. A third challenge will be to build bridges between WinFS types, which are specialized .Net objects, and documents or messages represented using XML and described by XML schemas.

Over time, of course, all file systems have evolved in the direction of richer metadata. The star-crossed BeOS left one important signpost. The Be File System supported unlimited amounts and types of file metadata and could effectively index and query that metadata years before its spiritual descendant, Apple’s Spotlight, arrived on the scene.

Another signpost is ReiserFS, a journaling file system that’s used in a number of Linux distributions. In Version 4, which was recently released, speed, reliability, and extensibility were all priorities, but the long-term goal is to model information in ways that are friendlier to the associative style of human thought. “We very much share the BeFS vision of enhancing the file system namespace semantics,” architect Hans Reiser has said. To get there, it was first necessary to create a flexible, high-performance foundation for managing metadata. Now that’s done, he says, and the next phase — “adding search engine and database semantic features into the file system namespace” — can begin.

Bringing it all together

As we weave more and better metadata into software, documents, Web sites, and file systems, the information stored in these various containers will become more available, more cohesive, and therefore more useful. The next challenge is how — in this new era of interconnected systems, people, and business processes — to unite these separate realms.

The solution is a complex recipe, but we can find many of the ingredients at work in the emerging discipline of SOA (service-oriented architecture). We use metadata to describe the interfaces to services and to define the policies that govern them. The messages exchanged among services carry metadata that interacts with those policies to enable dynamic behavior and that defines the contexts in which business transactions occur. The documents that are contained in those messages and that represent those transactions will themselves also be described by metadata.

There’s no overarching schema for the metadata that flows through the service network, touching routers, registries, security gateways, databases, and end-user applications. And, in view of its many forms and uses, it’s not clear that convergence on a single standard is necessary or even desirable. What is necessary is that within each metadata domain we strike healthy balances between the constraints we apply to metadata vocabularies and the evolutionary freedom we allow them. Across domains, we’ll speak the lingua franca of data and metadata, namely XML.