
Taxonomies & topic maps: Categorization steps forward

By Trippe, Bill
Publication: EContent
Date: Wednesday, August 1 2001

In all the discussion of the Web, we tend to get lost in the buzz and the technical detail. There's plenty to distract: the sheer numbers and the rapid growth, the boom-and-bust economics of dot-com companies, the flashy new technologies. It's easy to forget at times that the Web on some level isn't about solving whole new problems of computing, but about solving some longstanding ones in an open and scalable fashion, where the sources of the data are readily available and not centralized.

Take the problem of search. There are so many technologies, and so many approaches to the problem, that we may forget at times that the ultimate goal of search technology is simple: providing users with quick access to meaningful information. In the recent history of content management, much has been made of XML and the related standards, streaming media and Content Distribution Networks (CDNs), and the steady increases in performance through broadband options such as DSL and cable modems. Yet one of the biggest challenges the Web faces is one of its oldest, and indeed is a problem that predates the Web itself. How do average users find the information they need amidst a flood of irrelevant matter? And how do they do this quickly, easily, and consistently?

According to some, the path to improved information retrieval on the Web lies in intelligently applied taxonomies. In this view, content needs to be more accurately identified by category in such a way that search engines and other navigational aids can be better tuned to help the user. As content moves increasingly to the Web, these data sources need to benefit from technologies and techniques that allow people to view, navigate, and search data by broadly understood categories.

Happily, categorization technologies seem to have matured to the point where they can be useful to more and more publishers. Increasingly, Web publishers are investing in both the technologies to categorize content and the labor associated with implementing the technology. And looming on the horizon are "topic maps," an intriguing approach to tagging data for categories, especially for collections of data as opposed to singular documents.

THIS DOES NOT HAVE TO BE EXPENSIVE

The process of categorizing data need not be either expensive or overly complex. In recent correspondence on the email list xml-dev, Carol Ellerbeck, a taxonomy expert with Harvard Business School's Baker Library and formerly of Lycos, made this very point. Responding to a writer who suggested that one needed to be "king of the world" and have "an unlimited budget" to create effective taxonomies, Ellerbeck wrote, "If you 'were king of the world'...you would not need 'an unlimited budget'...just a modest one, to have experts build your taxonomy/domain vocabularies. I say this as a taxonomist who has been in the vocabulary trenches with electronic information for years. Automation is wonderful (and I would say, even essential), but start with not just humans (albeit smart humans), start with humans who have some expertise, and you will accomplish your goal faster, with fewer people, more efficiently, and have a more solid foundation to build on."


This same point was made recently by none other than the father of the Web, Tim Berners-Lee. This past December, World Wide Web Consortium director Berners-Lee addressed this point as part of the Knowledge Management track at the XML 2000 conference in Washington, DC. In a far-ranging and fast-moving presentation, Berners-Lee outlined the current Web infrastructure, current standardization efforts at the W3C, and necessary efforts and improvements to arrive at a "Semantic Web." For Berners-Lee, something has semantics when it "can be processed and understood" by a computer, such as how a bill can be processed by a software package such as Quicken. Getting to that level of semantics, in a broad, open, and public infrastructure such as the Web, is easier said than done, of course. It involves, for Berners-Lee, the entire existing infrastructure, including XML, namespaces, XML Schemas, and a suite of new things. These new things include agreed-upon means of sharing and distributing application logic, and new layers that provide both proof (who you are, who the other party is) and trust. Together, these will provide a complete Semantic Web.

A great deal of Berners-Lee's discussion had to do with the theoretical difficulties of shared application logic and other esoteric detail. But at one point in the talk, Berners-Lee matter-of-factly stated that the question of taxonomies was a simple one and relatively easy and inexpensive to solve. And while he stopped short of endorsing topic maps or any other particular approach, he made clear that some such approach was necessary and should be used to unify disparate efforts now underway.

TIME TO DO SOMETHING


Around the time of Berners-Lee's presentation at XML 2000, Cambridge, MA-based Forrester Research published a report with the pithy title, "Must Search Stink?" Forrester more or less answered the question by saying, "Not if you begin to implement search technology better, as well as develop some best practices." As Forrester pointed out in a related report, "Managing Content Hypergrowth" (January 2001), "Mushrooming online assets force both contributors and end-users to wade through even deeper content haystacks for the 'needles' they want, dramatically increasing the likelihood of confusion and frustration." And this is quickly becoming everyone's problem, since the amount of online content continues to grow and the effective use of search technology continues to lag.

What should publishers do? Forrester's advice dovetails well with what experts like Carol Ellerbeck suggest: develop a taxonomy with the help of experts, apply it consistently, and create feedback mechanisms so that you can continuously improve the data. Content management vendor Eprise makes such tagging a central component of its recommended best practices. According to Hank Barnes, vice president of strategy for Eprise, "A key aspect of making content more effective is metatags for classification. These tags enable content users to more easily find relevant information and to get more in-depth information on specific subjects." Barnes notes that Eprise uses these types of tags to dynamically locate information in response to user actions, such as following a certain path through a Web site. Adds Barnes, "Often, this approach of content delivery based on classification is much more effective than full-text or general-purpose searching."

Categorization has advantages beyond the core systems of search and content management. Orlando-based DigitalOwl develops and markets solutions for content syndication and Digital Rights Management (DRM). According to DigitalOwl president and CEO Kirstie Chadwick, "Content that is tagged with relevant keywords is ideally suited to be marketed and distributed through a broad range of distribution channels."

According to Chadwick, DigitalOwl provides tools that simplify the classification and tagging of content for the purpose of distribution and marketing through digital distribution channels. Once content is tagged and classified, DigitalOwl's KineticEdge technology can automatically drive highly relevant content items directly to the desktops of corporate end-users who have expressed a need for specific topics or areas of interest.

Of course, the idea of well-tagged content is something that information professionals know well, and Web publishers rely on some widely used processes to apply categories. In HTML-tagged content, category is typically indicated in the content values of the page's metatags. Savvy Web publishers pay careful attention to how they populate the Description metatag and, more significantly, the Keywords metatag. These two metatags have much to do with how the HTML-tagged page is indexed by the various search engines.
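
To make the metatag convention concrete, here is a minimal sketch of how an indexing tool might read the description and keywords from a page. It uses only Python's standard library; the sample page, the class name MetaTagReader, and the function read_categories are illustrative assumptions, not any search engine's actual method.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect the content of <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Metatag names are case-insensitive, so normalize them.
            self.meta[name.lower()] = content


def read_categories(html):
    """Return (description, keyword list) found in the page's metatags."""
    reader = MetaTagReader()
    reader.feed(html)
    raw_keywords = reader.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return reader.meta.get("description", ""), keywords


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """<html><head>
<title>Topic maps primer</title>
<meta name="Description" content="An introduction to topic maps.">
<meta name="Keywords" content="topic maps, taxonomy, XML">
</head><body></body></html>"""

description, keywords = read_categories(SAMPLE_PAGE)
print(description)
print(keywords)
```

Note that the categories here travel inside the page itself, so changing a page's classification means editing the page.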

Yet, despite the understanding of how categorization can aid in retrieval, it has not been used to its full advantage. Within many companies and organizations, there has traditionally been resistance to categorizing large volumes of data, as "hand categorization" has been viewed as human-intensive and unscalable, and automated categorization techniques are viewed as less effective.

TECHNOLOGY FINALLY CATCHING UP?

Categorization technology seems finally to be overcoming the conventional wisdom. This is partly because the tools seem to be improving, and also because the tools allow for the kinds of user intervention that improve the technology's results. Ideally, the technology allows for the user to create the high-level categories and hierarchies, and then the tools are used to tag individual documents. As more and more content is tagged, the user can intervene to shape the categories and refine how documents are being tagged. This kind of iteration and continuous improvement leads to the best results in a cost-effective manner.

Two leading providers of categorization tools are Semio and Inxight. Semio and Inxight are both perhaps better known for their visual navigation tools, Semio Map and Inxight's Star Tree Studio. In fact, under the flashy visual tools, both companies rely on a core of sophisticated linguistic software that each has been developing for years. Contemporary search engines are supported by a variety of linguistic approaches that have become de rigueur: conjugation, including inflection and uninflection; at least rudimentary noun-phrase analysis; and spelling correction. Inxight and Semio add some of the newer approaches that are less widely available and perhaps in some cases less proven, including tagging for categories.


Underneath any of these tools are two things: databases of words, and software to help interpret them. Clearly, the better the software, the better the resulting tool, but the underlying database is perhaps just as important. When you begin to enter areas such as categorization and summarization, where the software is trying to divine the meaning of the text, the words begin to have many facets: multiple meanings, and varying meanings in different contexts. Linguists offer many examples: "leaves" as the plural for "leaf," as well as a form of the verb "leave"; the word "mole" as both a thing that burrows through the ground and a spy that goes underground, as well as that mark on your arm. The database of words needs to support the software in its efforts to interpret these words and their many facets.

Linguistic technology continues to advance for many reasons: computers are faster, and both disc and RAM costs continue to drop. But the databases also improve. As research has progressed, so has the availability of tagged corpus material to work with. And, on a more practical level, once a group of words has been captured and codified, adding to that database becomes easier. This is one business, and one process, that benefits from critical mass.

To better understand this idea of critical mass, consider Inxight Categorizer, which employs a process of "categorization by example." In this model, the publisher hand-tags a set of "training" documents for different categories. The publisher then begins the more automated process of comparing a new document with this collection of manually coded documents. Using Inxight's linguistic analysis technologies, the Categorizer selects similar documents from the training set and infers the probable coding for the new document from these examples. Over time, the savvy user can refine the training sets to get increasingly accurate results.
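
The general idea of categorization by example can be sketched in a few lines. The version below is a simple nearest-neighbor vote over word counts, offered only as an illustration of the concept; it is not Inxight's linguistic analysis, and the training documents, category names, and function names are all invented for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Reduce a document to a bag of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Score two bags of words from 0.0 (no overlap) to 1.0 (identical)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def categorize(new_text, training_set, k=3):
    """Infer a category from the k most similar hand-tagged documents."""
    new_vector = word_counts(new_text)
    scored = sorted(
        ((cosine_similarity(new_vector, word_counts(text)), category)
         for text, category in training_set),
        reverse=True,
    )
    votes = Counter()
    for score, category in scored[:k]:
        votes[category] += score  # closer examples carry more weight
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical hand-tagged training documents.
TRAINING_SET = [
    ("The court ruled on the patent lawsuit appeal", "law"),
    ("Attorneys filed a lawsuit in federal court", "law"),
    ("The judge dismissed the appeal", "law"),
    ("Broadband modems increase network speed", "technology"),
    ("XML schemas describe document structure", "technology"),
    ("The new software release improves search speed", "technology"),
]

print(categorize("A federal judge heard the lawsuit", TRAINING_SET))
```

Refining the training set, as the savvy user would, amounts to adding or correcting entries in TRAINING_SET; a production tool would also handle inflection, noun phrases, and common words far more carefully than this sketch does.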

Autonomy's similarly named Categorizer also employs a method of categorization by example, and includes an XML tagging function that can automatically add XML tagging to the data. For a vendor like Autonomy, the value-add comes in the automation. It positions its tool as a key component in eliminating "costly and time-consuming" manual tagging of individual documents, allowing Web publishers to concentrate on the user-facing hierarchy and high-level topics. Autonomy reasons that it is the high-level topics that ultimately must resonate with the end-users. The technology should help Web publishers assign documents automatically to the topics. If a Web publisher decides to reorganize or realign the user-facing topics, the underlying data should also change easily.
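
One way to picture the separation between user-facing topics and underlying tags is the sketch below. It assumes each document already carries a fine-grained category, and it keeps the topic hierarchy in a separate mapping so that reorganizing topics does not require re-tagging documents. The element names, the hierarchy, and the functions are hypothetical and do not reflect Autonomy's actual XML output.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-facing topics mapped to underlying categories.
TOPIC_HIERARCHY = {
    "Legal": ["law", "patents"],
    "Technology": ["xml", "search"],
}


def topic_for(category, hierarchy):
    """Find the user-facing topic that currently owns a category."""
    for topic, categories in hierarchy.items():
        if category in categories:
            return topic
    return "Uncategorized"


def tag_document(doc_id, text, category, hierarchy):
    """Wrap a document in XML that records its topic and category."""
    doc = ET.Element("document", {"id": doc_id})
    ET.SubElement(doc, "topic").text = topic_for(category, hierarchy)
    ET.SubElement(doc, "category").text = category
    ET.SubElement(doc, "body").text = text
    return ET.tostring(doc, encoding="unicode")


print(tag_document("doc-1", "Patent filings rose.", "patents", TOPIC_HIERARCHY))

# Realigning the user-facing topics only means editing the mapping.
REORGANIZED = {
    "Intellectual Property": ["patents"],
    "Legal": ["law"],
    "Technology": ["xml", "search"],
}
print(tag_document("doc-1", "Patent filings rose.", "patents", REORGANIZED))
```

The same document lands under "Legal" with the first hierarchy and under "Intellectual Property" with the second, without its category tag changing.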

Semio's offering in the categorization space, Tagger, provides support for a wide variety of data sources, including data stored in relational databases. Its graphical user interface gives the user powerful tools for working on the categories and hierarchies, which are then easy to populate automatically. Significantly, LexisNexis has licensed the Semio technology to include in LexisNexis Portal, its core offering to law firms. The LexisNexis Portal is an integrated, customized desktop solution that allows law firms flexible access to the rich LexisNexis database. Semio Tagger will be available as an optional, add-on component to the LexisNexis Portal.

According to Michele Vivona, vice president of Large Law Market Planning for LexisNexis, "Semio Tagger creates customized, browsable category structures for Web portals, giving users better and faster access to the information they need. It also allows LexisNexis Portal customers the ability to create and implement customized, automated text categorization and browsing capabilities as part of a complete Web portal solution."

WILL TOPIC MAPS GUIDE THE FUTURE?


Besides Tim Berners-Lee's talk, the other big buzz at XML 2000 was topic maps, which were the focus of a number of presentations during the technical sessions. Moreover, an impressive number of companies on the exhibition floor demonstrated support for topic maps. Part of the buzz was the announcement of XML Topic Maps (XTM) 1.0. XTM 1.0 is the product of Topicmaps.Org, which was formed specifically to bring the topic maps paradigm to the Web.

What are topic maps? Topic maps are a formal way to declare a set of topics and then to provide links to documents or subdocument nodes that address the topics. In other words, they are a way to declare a set of labels for topics, and then to point to places where those topics are discussed and addressed. For proponents, topic maps are the ideal solution for helping users find information about a topic across a variety of documents. Whereas an HTML metatag is bound to the very document it is describing, topic maps exist apart from the individual documents, allowing applications, and users, to understand the topical relationships between documents.

A simple example of a topic map is illustrated in a "gentle" introduction to the subject on Topicmaps.org. A topic map discussing Shakespeare might include some explicit links to URLs for some of the plays, connecting the URLs to "topics" (one topic called Hamlet, another topic called Tempest). Each of these URLs is considered an "occurrence" of its topic. The two topics can then be tied together with an "association"; this seemingly simple connection is what begins to give topic maps their power. Taken together, these topics, occurrences, and associations begin to form a topic map.
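
The three building blocks of the Shakespeare example can be modeled in a short sketch. This is a plain in-memory illustration, not XTM syntax and not any vendor's engine; the class names, the association label, and the example.org URLs are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    occurrences: list = field(default_factory=list)  # URLs that address the topic


@dataclass
class Association:
    kind: str       # what the relationship means
    members: tuple  # names of the topics it ties together


class TopicMap:
    """Holds topics and associations apart from the documents themselves."""

    def __init__(self):
        self.topics = {}
        self.associations = []

    def add_topic(self, name):
        return self.topics.setdefault(name, Topic(name))

    def add_occurrence(self, topic_name, url):
        self.add_topic(topic_name).occurrences.append(url)

    def associate(self, kind, *topic_names):
        for name in topic_names:
            self.add_topic(name)
        self.associations.append(Association(kind, tuple(topic_names)))

    def related_occurrences(self, topic_name):
        """Follow associations to find documents about related topics."""
        urls = []
        for association in self.associations:
            if topic_name in association.members:
                for other in association.members:
                    if other != topic_name:
                        urls.extend(self.topics[other].occurrences)
        return urls


shakespeare = TopicMap()
shakespeare.add_occurrence("Hamlet", "http://example.org/plays/hamlet.html")
shakespeare.add_occurrence("Tempest", "http://example.org/plays/tempest.html")
shakespeare.associate("plays-by-same-author", "Hamlet", "Tempest")

print(shakespeare.related_occurrences("Hamlet"))
```

Because the map only points at URLs, the play pages themselves are never modified, and a second TopicMap could organize the same pages in an entirely different way.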

Since topic maps are intended to exist in separate documents, and since they don't require the source documents to be changed, they allow the information designer to create many views of the same data. Indeed, though this approach is not supported in a particular product yet, a topic map-supported Web site could allow the end-user to design his or her own topic maps of the content.

Bear in mind that while the thinking behind topic maps is mature, the proposed standard itself is relatively new (published in December of 2000), and the commercial technology supporting the standard is still in its early stages. Having said that, though, there are at least two vendors worth looking at:

Ontopia's core product is the Ontopia Topic Map Engine. It is a generic implementation of the topic map standard, available as a software development kit. The SDK allows developers to access and manipulate the constructs found in topic maps (topics, associations, and occurrences).

Empolis, a technology subsidiary of Bertelsmann, produces a product called K42. The K42 knowledge server is a software application developed on the basis of topic map technology. Based on open standards such as Java, XML, and XSL, K42 features a query language for accessing the topic maps.

Look for more implementations of topic maps in search and categorization technology. Empolis, for example, has shown an application of its K42 knowledge server working with the Inxight categorization software. The key for topic maps may well be whether a larger standards organization, such as the W3C, ends up hosting the effort. Given the W3C's emphasis on a semantic Web, it seems only logical it would get behind such an approach, or one much like it. But, indeed, whether topic maps become the preferred means of categorizing data for the Web remains to be seen. In the meantime, there is much work to do: technology to explore, data to be tagged, and, most importantly, users' needs to be met.


BILL TRIPPE (btrippe@nmpub.com) is the founder of Boston, MA-based consulting practice New Millennium Publishing.

Comments? Email letters to the Editor to ecletter@onlineinc.com.
