posted 14 Jan 2005 in Volume 8 Issue 4
Zen and the art of taxonomy maintenance; Part 1: What you need to think about before you start
A masterclass covering the creation, implementation and maintenance of taxonomies in a corporate context. By Jan Wyllie
The handful of sand looks uniform at first, but the longer we look at it the more diverse we find it to be. Each grain of sand is different. Some are similar in some way, some are similar in another way, and we can form the sand into separate piles on the basis of this similarity and dissimilarity. Shades of colour in different piles; sizes in different piles; grain shapes in different piles; subtypes of grain shapes in different piles; grades of opacity in different piles – and so on, and on and on…
Classical understanding is concerned with the piles and the basis for sorting and interrelating them. Romantic understanding is directed toward the handful of sand before the sorting begins…
What has become an urgent necessity is a way of looking at the world that does violence to neither of these two kinds of understanding and unites them into one… To reject that part of the Buddha that attends to the analysis of motorcycles is to miss the Buddha entirely.
Pirsig, R.M., Zen and the Art of Motorcycle Maintenance (1974)
The classic motorcycle
Before embarking on the tasks of constructing and applying taxonomies to different types of information, it is important to understand some of the philosophical issues involved when humans and computers work with taxonomies.
For Pirsig, the only way the essence of a motorcycle can be invented and described is through the application of a classical multifaceted taxonomy structuring different aspects of the assembly of parts and processes. Despite their image of messy and grimy physicality to the romantic mind, motorcycles are, for Pirsig, a more or less perfect rendition of an abstract human construct that is quite literally created in steel out of the human mind using the principles of rational analysis, as well as techniques of organising concepts utilising multifaceted hierarchical taxonomies.
Indeed, there wouldn’t be much science or industry without a large degree of taxonomy working, which is necessary to enable groups of people to work together. Engineering and chemistry use multifaceted process taxonomies to determine the design and manufacture of an entity that exists in physical reality. In these cases, the taxonomy and its physical applications are bound together as different aspects of the same entity.
The world beyond the machine
When dealing with other subjects – ecology, for example – that encompass the world outside the manufacture of physical artefacts, the relationship between taxonomies as aids to understanding and the world as it is becomes much more tenuous.
If information is deemed the raw material of knowledge, then perspective – how the information is seen by its classification (conscious or not) into level of importance and subject (formal or informal) – is a major influence on what is perceived as true or false, hopeful or threatening.
This world of cultural, scientific and formal taxonomies, which pervades virtually all purposeful human action, is actually nothing like the classicist taxonomy of the motorcycle. In this non-physical reality of concepts, words and thoughts, taxonomies are actually tools of interpretation that give information the meaning that turns it into knowledge.
So it is important to distinguish between the rational-process taxonomies that determine the complete nature of all the motorcycles ever produced, and the sort of taxonomy used to assist in information retrieval or intelligence monitoring that is subject to conflicting interests, political bias and marketing manipulation.
Simply being aware that these kinds of taxonomies do not reveal truth, but are social constructs that form a vital part of human discourse, gives the taxonomist a real advantage when it comes to understanding the significance of what is being said and reported.
If most taxonomies are social constructions, what is happening when a computer program offers its own taxonomy of search results, or when an automatic-classification algorithm ‘judges’ how to classify an item? This is an important question, as heretofore classification has been the attribution of human meaning to an item – a meaning, remember, that can more often than not be interpreted in different ways, according to personal perspectives and purposes.
It is also important to bear in mind that there is no application of human faculties such as intuition, anticipation and creativity when automating the natural, human process of taxonomy working.
The conundrums of automation
Automated classification systems all make the assumption that information can be managed using the classical, rational principles apparent in, for example, motorcycle engineering. They all use different techniques and algorithms to produce what they deem to be accurate results.
Four of the most common approaches used in competing software tools that support taxonomy development and application are: training by example; business-rules-based semantic analysis; statistical methods; and natural-language processing, also known as computational linguistics.
It is profoundly worrying that there is so little agreement among software vendors about how automated taxonomy working should be done. In contrast, one of the great things about motorcycles is that there is virtually universal agreement about the basic design and operation taxonomies, while the results can easily be tested according to an agreed performance taxonomy.
Is it possible that all the different techniques being used effectively produce the same accurate results? Unfortunately, it is impossible to know because, unlike with many other types of software, there is no independent benchmarking process comparing different vendors’ products according to a set of agreed criteria, and judged by independent groups of indexers and users. Until they prove otherwise, the assumption must be that vendors are reluctant to take the opportunity of benchmarking in case their particular technology does not measure up.
While considering the philosophical questions involved with the purchase and use of taxonomy software, those working on taxonomies should also be aware that some aspects of the information world are already defined in ways analogous to the parts of the motorcycle. There is no problem with the automatic classification of items by country or city, for instance. Companies can be automatically classified under industries as expressed by Standard Industrial Classification codes. The relationships involved are straightforward and algorithmic (either, or, else…). As the concepts to be classified become more abstract, and often more useful – concepts such as success, failure, agreement and disagreement – the role of interpretation becomes more important, as do the questions of what the software is actually doing and whether it is fit for purpose.
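As a rough illustration of how mechanical this kind of classification is, here is a minimal Python sketch. The lookup tables, the company names and the industry labels are all invented for the example; a real system would draw on an authoritative gazetteer and a published industry-code list.

```python
# Hypothetical lookup tables: every entry below is illustrative only.
CITY_TO_COUNTRY = {
    "london": "United Kingdom",
    "paris": "France",
    "tokyo": "Japan",
}

# Invented companies mapped to invented industry labels (not real codes).
COMPANY_TO_INDUSTRY = {
    "acme motors": "Motorcycle manufacturing",
    "brightleaf foods": "Food processing",
}


def classify_place(text):
    """Return the sorted countries whose known cities are mentioned in the text."""
    words = {word.strip(".,;:!?()").lower() for word in text.split()}
    return sorted(
        {country for city, country in CITY_TO_COUNTRY.items() if city in words}
    )


def classify_company(name):
    """Return the industry label for a known company, or None if it is unknown."""
    return COMPANY_TO_INDUSTRY.get(name.strip().lower())


print(classify_place("The firm opened offices in London and Paris."))
print(classify_company("Acme Motors"))
```

No interpretation takes place here: an item either matches an entry or it does not. The same approach has nothing to offer for a concept such as success or disagreement, where no such table exists.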
Of course, given the huge volume of data that organisations have to organise, and the expense (and inconsistency) of human indexers and classifiers, most organisations must seriously consider some kind of automation software. The short answer on how to proceed is to test, test and test again, using a sample of your own predefined taxonomy and a selection of the kind of items you will want to classify.
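One simple way to run such a test, sketched here in Python under stated assumptions, is to have human indexers classify a sample of items first and then compare the software's output against their judgements. The function name, the category labels and the sample data are all hypothetical, and raw agreement is only one of several measures that could be used.

```python
from collections import Counter


def evaluate(human_labels, machine_labels):
    """Compare machine-assigned categories with human ones for the same items.

    Returns the agreement rate and a count of each (human, machine)
    disagreement, which shows where the software's judgement diverges.
    """
    if not human_labels or len(human_labels) != len(machine_labels):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human_labels, machine_labels))
    matches = sum(1 for human, machine in pairs if human == machine)
    disagreements = Counter(pair for pair in pairs if pair[0] != pair[1])
    return matches / len(pairs), disagreements


# Hypothetical sample: four items classified by an indexer and by software.
human = ["urgent", "routine", "urgent", "routine"]
machine = ["urgent", "urgent", "urgent", "routine"]

rate, disagreements = evaluate(human, machine)
print(f"agreement: {rate:.0%}")
for (human_label, machine_label), count in disagreements.items():
    print(f"indexer said {human_label}, software said {machine_label}: {count}")
```

The list of disagreements is often more informative than the headline figure, because it shows which categories the software confuses and whether those confusions matter for the purpose in hand.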
The benefits of human intelligence
Another trick that organisations can easily miss with the application of automated taxonomy working is the community-building and time-saving effects of groups of people evolving and applying common classification systems. A well designed, human-managed taxonomy for e-mail, news or research, for example, not only enables participants to add value to information in just a couple of seconds, it also gives them a positive feeling of making an active contribution.
In this form, a simple taxonomy becomes a tool to improve the coherence and productivity of human communication. It would be tragic if, by using software tools to automate taxonomy working, organisations missed out on the cheap and simple benefits of groups of people practising the human use of taxonomies as tools for enhancing communication and innovation.
Never forget that a core aspect of human intelligence is the ability to discern and classify patterns in information flows. Taxonomies can be used as tools to enhance and share that kind of intelligence. Conversely, automated taxonomising can, if applied unwisely, stifle a process that is necessary for human understanding.
The answer is to design a taxonomy and a process in which software automation can do what it is good at – speed and a certain consistency according to its own operating rules – while humans do what they are good at – judging the significance and meaning of information.
Elements of information architecture
Before embarking on any taxonomy design and implementation process, it is necessary to understand where taxonomies fit into their wider context, now known as information architecture. Taxonomies are used in a series of other applications, from thesauri and topic maps to the semantic web.
In topic maps and the semantic web, programmers use programming languages to combine taxonomies with rules and relationships that do things like design, sell and build customer products and services. There may come a time when these kinds of applications are widespread, but for the moment they are still in the very early stages of development and progress appears to be slower than expected. It is unlikely that anyone outside the pioneer programmers will need to make information-architecture decisions about topic maps and the semantic web (although a close watch should be kept on developments in this area, especially standards).
A decision most users will have to make, however, is whether to use a thesaurus or a taxonomy, or both. Once again, the answer depends on the purpose of the exercise. If the aim is to enable information users to search unstructured data more effectively, then a thesaurus may be the most appropriate tool. In this context, a thesaurus assists users in searching by relating words that occur naturally in text to broader terms (for instance, mammal is a broader term for dog and cat), narrower terms (dog and cat are narrower terms of mammal) and preferred terms (use dog for mutt and cat for kitty), along with synonyms and any other semantic relationships a user might want to define.
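The relationships just described can be sketched as a small data structure. The following Python example is illustrative only: the dictionary layout, the field names ("broader", "narrower", "use_for") and the function names are assumptions made for the sketch, not the format of any particular thesaurus product or standard.

```python
# A toy thesaurus built from the examples in the text.
# "use_for" lists the non-preferred words that the entry's term replaces.
THESAURUS = {
    "mammal": {"narrower": ["dog", "cat"]},
    "dog": {"broader": ["mammal"], "use_for": ["mutt"]},
    "cat": {"broader": ["mammal"], "use_for": ["kitty"]},
}


def preferred_term(word, thesaurus=THESAURUS):
    """Map a word to its preferred term; words without an entry map to themselves."""
    word = word.lower()
    for term, entry in thesaurus.items():
        if word in entry.get("use_for", []):
            return term
    return word


def expand_query(word, thesaurus=THESAURUS):
    """Return the preferred term plus all narrower terms and their variants."""
    expanded = set()
    pending = [preferred_term(word, thesaurus)]
    while pending:
        term = pending.pop()
        if term in expanded:
            continue
        expanded.add(term)
        entry = thesaurus.get(term, {})
        expanded.update(entry.get("use_for", []))
        pending.extend(entry.get("narrower", []))
    return expanded


print(sorted(expand_query("mutt")))
print(sorted(expand_query("mammal")))
```

A search for mutt is redirected to dog and matches either word, while a search for mammal is widened to include every narrower term and its variants. Nothing in the structure says which of these terms matters most, which is the gap a taxonomy is meant to fill.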
The thesaurus is essentially a tool to make messy, unstructured, text-based information collections more searchable. It is a tool designed to assist people who know what they are looking for and who have a primary understanding of the key issues and search terms of the domain in question. Another difference between a thesaurus-based approach and the use of taxonomy is that the former acts as a powerful system for back-of-the-book indexing. No effort is made to organise or mould the information being indexed, only to provide an alphabetical form of access to different instances of similar information in any source base. Thesauri are fairly useless for the vast majority of users, who need to know what’s important and what’s where in order to discover what they should be looking for.
If the purpose is to structure information in order to add value, then taxonomies are the right tool. Good multifaceted taxonomies work both as search tools and as browsing tools. The taxonomist’s approach is focused on organising information, and is more akin to the table of contents at the front of a book. Taxonomies quite rightly add a layer of meaning and interpretation that did not exist before, and as such are creative and purposeful acts of research.
For those with the resources, a user-browsable taxonomy combined with thesaurus-enhanced free-text searching would be the most powerful option. The point must be made that writing and applying a thesaurus is a significant undertaking, best done with specialist advice. It may be possible to get a head start by building on ready-made thesauri in a relevant subject area, for instance from Taxonomy Warehouse (www.taxonomywarehouse.com).
The next three articles
This course, though, is about taxonomies, not thesauri. Taxonomies by their nature shape the worlds to which they are applied, which is why they must be both created and applied with great care. A little bit of shaping using even the simplest of taxonomies, such as agree/disagree, urgent/routine, can add a lot of value to the flow of information.
The next article in this four-part series will be about understanding and establishing a purpose for the taxonomy to be developed. Is it for information retrieval, knowledge sharing or for intelligence purposes? Who benefits?
The third part will focus on designing taxonomies, and will deal with the role of technology, type and dimension, as well as the management and human issues involved in taxonomy working.
The fourth, concluding part will deal with implementation and maintenance. It will cover the technical and motivational issues surrounding taxonomies, as well as ways of evaluating the performance of a taxonomy project. Each article will be illustrated by short case examples.