Demystifying Metadata - Trip-M April 2000




	Marty Lucas was the Founding Editor of Mappa.Mundi. He began writing and producing media for the Internet in 1992 with the Internet Multicasting Service, based in Washington D.C., where he helped pioneer audio on the Internet. Presently, he is a partner at Becknell and Lucas Media.
	Links
	(1) "Metadata, Cataloguing by Any Other Name...", an excellent examination of the present state of the art for metadata generation, in Online Magazine.
	(2) A Beginner's Guide to URLs.
	(3) Naming and Addressing: URIs, URLs, from the W3C's Architecture Domain.
	(4) Knowing the Rules, a conversation with Hal Varian from the January 2000 edition of Mappa.Mundi, contains a discussion of Metcalfe's Law, a positive network feedback mechanism.
	(5) A Map of Yahoo!, Martin Dodge's Map of the Month from February 2000.
	(6) Tim Bray's Hyperlink Totems, Martin Dodge's Map of the Month from July 1999.
	(7) See Maps, Space, and Other Metadata Metaphors by Carl Malamud and Marshall T. Rose, also in the April edition of Mappa.Mundi Magazine.
	(8) Judy and Magda's List of Metadata Initiatives.
	(9) The LBNL EPA Scientific Metadata Standards Project.
	(10) The Metadata Information Clearinghouse (MICI).

By Marty Lucas, marty@mappa.mundi.net


	In This Trip-M
	In the past few months the idea of using metadata to help bring order to the Web and other large digital information repositories has been gaining momentum. What does 'metadata' mean from a content perspective? And what should a content producer or information manager be thinking about when they begin the hard work of gleaning the metadata out of their content store? We've collected links to some fundamental resources about how to catalogue metadata - study them and you'll be on your way to a more consistent approach to generating information about your information.

Demystifying Metadata

In the faddish dot-com world it's tempting to dismiss metadata as this nanosecond's buzzword, but metadata is really an age-old answer to an age-old problem. The problem is how to get the most out of a stored collection of information. Datastores are bigger than ever and so is the problem. A consensus is growing that metadata is the answer. Metadata is often described as "information about information" but I prefer to think of it as another layer of information - simplified, distilled, made orderly - created to help people use an information source.

As long as people have been collecting information together, be it in the form of a library, an institutional filing system, a collection of accounting records or whatever, they've needed to come up with ways to help them know how to properly file and retrieve documents. These systems needn't involve any high technology.

Mere Humans

When I was an abstractor for a land title company I spent many a stultifying day looking through long lists of names in hulking, burgundy-colored, leather-bound, handwritten ledgers that looked like they might have spent a few decades on Jacob Marley's desk. We searched documents maintained by the local county government to ensure that people buying real estate got a clean, marketable title.

Many kinds of documents can affect the title to a piece of property: deeds, mortgages, liens, divorce proceedings, and deceased persons' estates. Documents up to fifty years old could still cast a shadow over a title, rendering it unmerchantable. The collection of documents was much too large and confusing to allow reliable direct searching. The metadata used to make this datastore accessible? People's names, the property's location, and the dates of transactions, all arrayed in a system of reciprocal lists, with each document identified by its book and page number. Using this entirely handcrafted system of chronological and alphabetical indices, an experienced abstractor could search fifty years' worth of records for several different parcels of real estate each day.

This tale from the bad old days of goose quills and tree-ware illustrates two points: (1) using metadata to manage information is not a new idea; and (2) humans are capable of performing useful information retrieval functions even without computer assistance if the system is thoroughly catalogued and cross-indexed. Nowadays we're mostly interested in metadata in the context of computer systems, but that's because computers are where you put data these days. The concept relates more to the data, and not so much to the machine. It's about how we can extract key elements of information on a consistent and orderly basis.

Metadata Basics

"Creating metadata" is really just techspeak for cataloguing.[1] Museums and libraries have accumulated centuries of experience in cataloguing and indexing their collections. The US Library of Congress maintains one of the world's great libraries. They've published their Modes of Cataloging Employed in the Cataloging Directorate online. It describes their basic system for cataloguing their collections:

"In general, each bibliographic record contains a description of an item as a means of identifying it and distinguishing it from other items. The record contains the means of providing access through various avenues, including author and other entities associated with the work or bearing a relationship to other works in the catalog, title (including uniform title to collocate those works issued under varying titles), series, and subject. Subject access is through subject headings and classification. Classification groups works by topic and is combined with a system of numbering individual items uniquely for purposes of physical access and inventory control. Headings are under authority control, which insures that particular iterations for authors, etc., subjects, and series are distinct for entities/concepts that are separate but are consistent for multiple iterations of the same entity/concept."

Basically, metadata needs to fulfill these three functions:

render each item in the collection uniquely identifiable;

provide multiple pathways for finding each item; and,

place the information contained in each item into context with other documents, items, information and knowledge.

You may choose to do more with your metadata, but these tried and true cataloguing goals are a good place to start when you contemplate creating metadata. Resist the temptation to try to recreate your document - just provide information about your document from several points of view. If you're working on the Web (or an intranet) you or a colleague will most likely be responsible for cataloguing your own documents.
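
The three functions above can be sketched in a few lines of Python. This is only an illustration, not a recommended design: the Record and Catalogue classes and their field names are hypothetical, and a real system would need many more pathways than the two shown here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    """Metadata about one document, not the document itself (hypothetical fields)."""
    identifier: str                                      # function 1: unique identity
    title: str
    creator: str
    subjects: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)     # function 3: explicit context


class Catalogue:
    def __init__(self):
        self.records = {}
        # Function 2: one index per pathway into the collection.
        self.by_creator = defaultdict(set)
        self.by_subject = defaultdict(set)

    def add(self, record):
        # Refuse duplicates so every identifier stays unique.
        if record.identifier in self.records:
            raise ValueError(f"duplicate identifier: {record.identifier}")
        self.records[record.identifier] = record
        self.by_creator[record.creator.lower()].add(record.identifier)
        for subject in record.subjects:
            self.by_subject[subject.lower()].add(record.identifier)

    def find_by_creator(self, name):
        return sorted(self.by_creator.get(name.lower(), set()))

    def find_by_subject(self, subject):
        return sorted(self.by_subject.get(subject.lower(), set()))

    def context(self, identifier):
        """Identifiers of records that share a subject, plus explicit relations."""
        record = self.records[identifier]
        neighbours = set(record.related)
        for subject in record.subjects:
            neighbours |= self.by_subject[subject.lower()]
        neighbours.discard(identifier)
        return sorted(neighbours)
```

Note that the record holds only a handful of facts about the document. The indices do the work of the old ledgers: each one is a separate list that points back to the same identifiers.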

A Unique Identifier for Each Item. If you're working on the Web each of your documents already has its very own Uniform Resource Locator ('URL').[2] It's unique, and also serves as a pointer to the document - it's this latter quality that distinguishes a URL from other kinds of URI.[3] Combining identification and location would seem an ideal situation. But Web sites get reorganized and redesigned (perhaps more often than necessary) and things move, and when a document moves its old URL no longer identifies it.

A Uniform Resource Name (URN) differs from a URL in that its primary purpose is persistent labeling of a resource with an identifier. All three concepts are addressed in detail by Tim Berners-Lee in "Uniform Resource Identifiers (URI): Generic Syntax" (RFC 2396). It's worth reading if you want to understand the ongoing debate about identifying Web content and why this seemingly simple issue is getting complicated.
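
To make the distinction concrete, here is a minimal Python sketch of a persistent name that survives a site reorganization. It is an assumption-laden illustration: the REGISTRY table, the urn:example name, and the example.org addresses are all made up, and no real URN resolution service is implied.

```python
from urllib.parse import urlparse

# Hypothetical registry: persistent names on the left, current locations on the right.
REGISTRY = {
    "urn:example:trip-m:2000-04": "http://www.example.org/magazine/2000/04/trip-m.html",
}


def is_locator(identifier):
    """True if the identifier itself says where to fetch the resource."""
    return urlparse(identifier).scheme in {"http", "https", "ftp"}


def resolve(identifier, registry=REGISTRY):
    """Return a fetchable address for a URL or a registered name."""
    if is_locator(identifier):
        return identifier
    try:
        return registry[identifier]
    except KeyError:
        raise LookupError(f"no known location for {identifier}") from None


def relocate(name, new_url, registry=REGISTRY):
    """Record that a document has moved; the name itself never changes."""
    registry[name] = new_url
```

After a redesign, one call to relocate() updates the registry, and every catalogue entry that cites the name keeps working. Entries that cited the old URL directly would all have to be edited.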

Multiple Pathways for Finding an Item. Once the unique identity of a document has been established so that it cannot be confused with any other, the next step is providing multiple pathways for finding it. In 1937, H.G. Wells described the World Brain, an idea expanded by visionary information scientist Eugene Garfield as the Informatorium. We have a long way to go before we see Wells's dream of the World Brain "dissolving human conflict into unity" but responsible use of metadata can help the World Brain think better, and that might help.

Ask a Librarian

In 1995 a group of information experts met at the Online Computer Library Center in Dublin, Ohio to discuss creating a system of metadata to support resource discovery. The result was the Dublin Core Metadata Element Set composed of semantic descriptions of fifteen basic descriptive elements, which are set out in IETF RFC 2413, published September 1998.

There's nothing revolutionary about the Dublin Core. It's simply a set of guidelines that can help you be sure that you are cataloguing your content in a way that other people will be able to use and understand. What does that get you? Arguably not much for now: metatags in HTML documents on the Web have been so commonly abused to get a higher search engine profile that they are currently often ignored. But as XML based knowledge management architectures, like Invisible Worlds' Blocks, come to the forefront, having your info-ducks in a row is going to make your life easier. Keep in mind that it's not so much a question of adopting Dublin Core as a matter of getting organized and being consistent. Any logical and consistent system will be better than chaos, and Dublin Core seems to be gaining the most followers. If it can be made to work for you, Metcalfe's Law hints that you might be best off to 'go with the herd'.[4]

	It's a start...
	Invisible Worlds (parent company of this publication) has not adopted the Dublin Core as part of its Blocks architecture - it's agnostic with respect to Dublin Core, and other efforts at standardizing metadata content.[7] However, it's intuitively obvious that efforts at some fundamental consistency in metadata are bound to make it easier to extract basic commonly used cataloguing information (metadata) from independently produced (and otherwise inconsistent) HTML based document collections (not that Blocks is limited to any particular data format). This might prove to be a scheme other than the Dublin Core, but because the Dublin Core is restricted to very basic information it's difficult to imagine that any generic system won't include more or less the same elements.[8] Scientific and technical studies may, of course, benefit from discipline specific metadata schemes.[9]

For general information, the key is creating multiple paths for finding the document. That means allowing searching by author/creator, title, subject, keywords and date of publication. The Dublin Core provides for all of these, and provides useful guidance for ensuring that others will be able to understand and properly interpret the metadata you have input.
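
As a rough illustration of what this looks like on a Web page, the Python sketch below turns a small dictionary into HTML meta tags. The fifteen element names are those listed in RFC 2413. The "DC." prefix on the tag name is a common convention for embedding Dublin Core in HTML and is treated here as an assumption, as are the function name and the sample values.

```python
from html import escape

# The fifteen elements named in RFC 2413.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def dublin_core_meta(metadata):
    """Render a dict of Dublin Core elements as HTML <meta> tags (hypothetical helper)."""
    lines = []
    for element, values in metadata.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        # An element may repeat, so accept either one string or a list of strings.
        if isinstance(values, str):
            values = [values]
        for value in values:
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Sample values for illustration only.
print(dublin_core_meta({
    "Title": "Demystifying Metadata",
    "Creator": "Marty Lucas",
    "Subject": ["metadata", "cataloguing"],
    "Date": "2000-04",
}))
```

Rejecting unknown element names is a deliberate choice in this sketch: it enforces the consistency that makes the metadata usable by others.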

Feeding David Clark's 'Up Button'

Placing an Item into Context. If we've established the unique identity of a document, and are able to find (and presumably, retrieve) it, the final step is to place it into context with other documents and information, and ideally to place it into context ontologically, that is put it in its place in terms of human knowledge in general. When Invisible Worlds Protocol Advisory Board Member David Clark expressed the need to create the missing 'up' button, he was talking about the need to be able to find context, to see who's next door to us.[5] But where the up button takes us depends upon the questions we ask - which of the many metadata pathways we've chosen as relevant to our quest.

Libraries have traditionally indexed books by title, author and subject. With computerization, keywords have partially supplanted subject categorizations, but the continued popularity of the Yahoo! hierarchical subject system demonstrates that subject classifications remain a useful pathway for finding information. When you step up above the Web and look around, you are most likely to be looking for more information on the same topic, or at least a closely related one. Creating a classification system for all possible topics (typically in the form of a taxonomy) isn't the kind of task everyone needs to be doing - reference to an existing taxonomy provides some opportunity for pooling of knowledge. At Mappa.Mundi Magazine we've adopted the classification system at the DMOZ open directory project - by doing so we've avoided the need to create and maintain our own taxonomy. Moreover, we've made it possible for users to jump directly to the DMOZ categories that relate to an article's main topics, allowing our readers easy access to more related content on other Web sites.
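
With a hierarchical subject scheme, stepping 'up' and looking next door becomes simple string handling. The Python sketch below assumes categories are written as slash-separated paths. The function names and the sample category paths are hypothetical and are not taken from the DMOZ directory.

```python
def parent(category):
    """Return the category one level up, or None at the top of the tree."""
    parts = category.strip("/").split("/")
    if len(parts) <= 1:
        return None
    return "/".join(parts[:-1])


def siblings(category, known_categories):
    """Categories that share a parent with this one: who is 'next door'."""
    up = parent(category)
    return sorted(
        other for other in known_categories
        if other != category and parent(other) == up
    )


# Hypothetical category paths for illustration.
categories = [
    "Top/Reference/Libraries",
    "Top/Reference/Maps",
    "Top/Computers/Internet",
]
print(parent("Top/Reference/Maps"))
print(siblings("Top/Reference/Maps", categories))
```

Because the taxonomy is borrowed rather than home-grown, the same paths can point readers to related material on other sites that file their content under the same categories.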

Of course, context can mean many things - hyperlinks to and from a document [6] are an obvious example of a kind of context for Web pages. Similarly, journal articles and court decisions have a context in terms of citations. See for example The Astrophysical Journal Letters Citation analysis and The History Behind Shepard's Citators, a system of legal metadata that started out with stickers.
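
Citation-style context amounts to reading a list of links in reverse. As a minimal Python sketch, assuming each document's outgoing links or citations are already known, the hypothetical cited_by function below builds the 'who points to me' index that a citator provides.

```python
from collections import defaultdict


def cited_by(outgoing):
    """Invert a mapping of document -> documents it cites or links to."""
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            incoming[target].add(source)
    # Sort each list so the result is stable and easy to read.
    return {target: sorted(sources) for target, sources in incoming.items()}


# Hypothetical documents for illustration.
links = {
    "article-a": ["article-b", "article-c"],
    "article-b": ["article-c"],
    "article-c": [],
}
print(cited_by(links))
```

Documents that nobody cites simply do not appear in the result, which is itself a useful piece of context.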

Summary

Effective metadata systems help people manage information better. Almost all of these systems assign a unique identifier to each piece in the collection, allow multiple pathways to search and retrieve it, and help place each unique item into multiple kinds of contexts. It's highly unlikely that any one metadata scheme will be a complete answer for every datastore, but consistency and discipline in creating and maintaining metadata will help make that information more adaptable and hence more useful.
