posted 15 Nov 2000 in Volume 4 Issue 3
Cutting through data smog
Effective knowledge management relies on people having access to the right information at the right time, but the threat of information overload is undermining knowledge retrieval processes. Thomas Gerick examines a new tool for collaborative KR that promises to dominate knowledge transfer technologies over the next few years.
Business success used to mean having information in sufficient quantity. Today, it is the quality of information that counts. The exponential growth of available information is making fast filtering and context-specific delivery of truly relevant information a critical factor of efficient knowledge management. Knowledge retrieval (KR) deals with the computer-aided process of knowledge transfer. Various highly intelligent search and information weighting technologies are now available. In this article, we present the most important methods – and in the case of topic maps a new ISO standard – that are likely to bring about lasting changes in the IT world.
Some facts and figures
Approximately 80 per cent of all workers lose some 40 minutes a day searching for information they need. This was one of the findings of a comprehensive study carried out among the 1,000 top UK companies by Deloitte & Touche and Sqribe Technologies in 1999.
To take a fictional example based on these figures, let us assume an enhanced navigation tool could cut search time by 25 per cent, down to half an hour a day. A company with 7,900 workers – the average in the survey – of whom 6,200 (about 80 per cent) were given the new tool would save over 1,000 hours a day. Assuming a moderate hourly pay rate of US $25, the savings easily come to US $25,000 a day or, with 200 working days, some US $5 million a year.
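The arithmetic behind these figures can be reproduced in a few lines (the numbers come from the example above; the script itself is only an illustration):

```python
# Back-of-the-envelope check of the savings example above.
workers = 6200             # about 80 per cent of the 7,900-strong workforce
minutes_saved = 40 * 0.25  # a 25 per cent cut from 40 minutes = 10 minutes/day
hourly_rate = 25           # US dollars
working_days = 200

hours_per_day = workers * minutes_saved / 60
dollars_per_day = hours_per_day * hourly_rate
dollars_per_year = dollars_per_day * working_days

print(f"{hours_per_day:,.0f} hours saved per day")   # just over 1,000
print(f"US ${dollars_per_day:,.0f} per day")         # comfortably over US $25,000
print(f"US ${dollars_per_year:,.0f} per year")       # some US $5 million
```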
So far, many companies have taken a more or less fatalistic stance regarding the internal cost of obtaining information. The focus of investment activity is on knowledge supply in the form of intranets and databases, with little spending on knowledge demand to identify what knowledge is mission-critical and to implement more efficient means of accessing it.
Several parallel developments make managing information a key challenge for every business. One is exponential growth in the volume of information; an IDC study on global intranets, for example, predicts a ten-fold increase over the next five years to the gigantic figure of 1,159 terabytes. The situation is aggravated by the fact that, for the most part, strategic information, such as expert reports, patents and product and project descriptions, is hidden away in large file systems or databases, has little or no structure, and is heavily context-dependent (again, findings of the Deloitte & Touche study).
The problems of searching
A common problem that arises when searching document-based knowledge banks is that it becomes impossible to keep track when the size of the repository approaches 1,000 stored documents. For example, project and quality documents often take the form of unstructured text files stored on a distributed basis in Lotus Notes databases, file systems, and so on.
One relatively quick way to make information more accessible is to categorise, order, and assign keywords to the documents that contain it. In practice, however, this method usually proves ill-suited to its purpose. Categorising and assigning keywords creates extra effort for users when filing documents away, an activity in which they already face motivation problems. This often means that users simply cease to file documents properly. A further problem is the restricted flexibility of a category-based system. A well-structured document repository entails a lot of administration. Categories, subgroups and keywords are constantly changing. The underlying documents must be continuously brought into line with these changes – a task that becomes well-nigh impossible as the number of documents grows (see figure 1).
An alternative way to provide efficient access to information in documents is a full text search engine. By searching the full text, it is possible to find occurrences of specific words. Queries can be refined with numeric or Boolean operators (such as AND, OR, NOT, >, < and =) and proximity operators (such as SENTENCE and NEAR); this is still by far the most widespread approach. Finding the right documents on a specific topic requires well-honed search skills and a knowledge of ambiguities, homonyms and synonyms. Boolean retrieval alone cannot convincingly meet the need to locate specific information in documents, however, because it treats words as character strings stripped of syntax and semantics.
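Boolean retrieval of this kind is typically implemented over an inverted index, where the operators map directly onto set operations. A minimal sketch (the documents and queries are invented for illustration):

```python
from collections import defaultdict

docs = {
    1: "airbag deployment in frontal crash tests",
    2: "child seats and airbag safety warnings",
    3: "crash test results for MPVs",
}

# Build an inverted index: word -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Boolean operators become set operations on the postings.
print(index["airbag"] & index["safety"])   # AND -> {2}
print(index["airbag"] | index["crash"])    # OR  -> {1, 2, 3}
print(index["airbag"] - index["child"])    # NOT -> {1}
```

The weakness described above is visible immediately: ‘airbag’ and ‘airbags’ are different character strings, so a query for one silently misses documents containing only the other.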
Requirements of an ideal search engine
The usefulness of a search system in the field critically depends on how far it meets user needs. The technology-centric search applications in common use today are not designed to cope with the huge diversity of search situations and search needs they encounter in the real world. For example, while one worker at a large automobile corporation might initially want a general overview of what the document repository contains on ‘airbags’, another may have very specific questions and be looking for ‘EU Safety Directive NCAP crash tests with head airbags’.
It also transpires that some workers with knowledge of the appropriate technical terms and search engine syntax are very good at searching document repositories. Yet most are incapable of putting together a complex query or of searching in different databases.
What is needed, then, is a flexible and intuitive navigation system that quickly and interactively leads users to the information they want. Fuzzy filters and synonym features make allowance even for typing errors and foreign words, and users can search all available data sources simultaneously or select them individually.
Submitting a query for ‘airbag’ opens a topic map on the subject with higher and lower-order relations (see figure 2).
The visualised search map reveals lateral relationships at a glance. Aspects are shown that the user may not even have thought of at first, such as safety, crash tests or child seats; these decisively narrow down the scope of the search to the user’s desired context. These query topics are based on successful past search strings that users can activate, add to and combine at a click. Another important feature is automatically generated, query-specific summary information on documents that match the search criteria. The system even generates English summaries of documents in other languages. Predefined interest profiles allow users to be supplied each day with all new and modified documents on the topics they want. A ‘push’ service acts as a personal information agent without users having to search for matching documents themselves.
A good knowledge retrieval system – or, to use the GartnerGroup definition, good content access management – must help companies tap into implicit employee knowledge. This only works if there is easy access not only to structured information but also to documents available only in unstructured form, so that the work of cataloguing and indexing the repository can be reduced to a minimum. Such a system leverages users’ motivation by ‘noting down’ the searches they conduct and offering successful past queries to other users with the same or similar interests. Anything that any user has ever sought and found is thus made available to all other users. Users’ search skills become reusable, yielding a collaboratively compiled query structure independent of the document repository. User benefits and self-teaching mechanisms ensure general acceptance of a system that, by integrating knowledge into business processes, proactively delivers that knowledge according to the workflow or subprocess at hand.

However, the usefulness of any KR tool still largely depends on the quality of the content: the smartest search method is useless if there is little or no information to feed it. Monitoring in the form of search and access statistics is consequently an important management tool and critical to the quality of a document repository. Knowledge monitoring mechanisms ensure that the information value of documents continuously rises through constant adaptation to user needs.
Mature, easy-to-use systems that have been tested in practice are now on the market. Currently available knowledge retrieval products are based on three main technology principles, all aiming to add value beyond a simple full text search engine.
Linguistic, statistical and semantic retrieval methods
Boolean retrieval alone cannot convincingly meet the need to locate information in the user’s desired context, because it treats words as character strings stripped of syntax and semantics. So far, attempts to boost retrieval performance with linguistic methods have likewise met with limited success: manually compiling subject-specific thesauri is very laborious. Then there are methods based on statistical analysis of document contents. These self-teaching models use methods drawn from probability theory, analysing statistical relationships between words to automatically generate a set of concepts. The main advantage of statistical methods is that they need relatively little administration. Semantic methods, finally, can be traced back to models of human memory, but also integrate perceptions and research findings from the AI field. When retrieving information from documents, they take into account the meaning of the context in which words occur. To a certain extent they can replicate our associative thinking, and context-sensitive components that ‘understand’ text allow them to deliver very high-quality search results. As with statistical methods, the use of self-teaching mechanisms means that relatively little administration is necessary.
Topic maps: The new knowledge representation standard
In a recent research note, GartnerGroup defines the concept of topic maps in the context of knowledge management applications and points out that software producers have so far taken little notice of the idea. Going by the GartnerGroup forecasts, however, this technology has vast potential: “Because the paradigm is powerful, flexible and extensible, topic maps will become a mainstream technology by 2003.” (D. Logan, Research Note 27, June 2000).
But what are topic maps and what are their uses? They can be described as knowledge structures that help provide efficient access to large, unstructured bodies of data (see figure 3).
Whereas a full text search engine does no more than list the content of information sources, this new approach evaluates meta-structures based on that content. Topic maps can thus be used as the basis of efficient knowledge retrieval techniques and visual navigation systems. They can describe large pools of documents by means of a knowledge structure.
When supplementing a knowledge base on the subject of airbags with information on the performance of different vehicles in a crash test, for example, we would most likely include a category named ‘MPVs’ or ‘People Carriers’ with vehicles like the Galaxy, Sharan and Voyager. The problem: the Sharan and Galaxy are identical in construction. How can this very important piece of information be made visible to other users? Only by using a new way of structuring information that, unlike a system based on hierarchical categories, also shows lateral relationships.
In autumn 1999, the International Organization for Standardization (ISO) formulated this principle as the basis of a new standard, ISO/IEC 13250 (Topic Maps). Topic maps – formerly ‘topic navigation maps’ – specify a model and an architecture for a structured network of hyperlinks to information objects, and may be regarded as a further development of semantic networks. The key conceptual components of the standard are ‘topics’ (nodes), ‘topic occurrences’ (stating where information on a topic is to be found) and ‘associations’ (links), which reveal relationships between topics. In the example above, ‘automobile safety’, ‘crash tests’ and ‘MPVs’ would be topics. The corresponding topic occurrences would include ‘test reports’ and ‘FMEAs’. Finally, associations represent relationships such as ‘Sharan and Galaxy are identical in construction’ or ‘airbags improve safety’.
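Using the airbag example, the three components can be sketched as plain data structures. This is a simplified model of the concepts, not the standard’s actual interchange syntax, and the file paths are invented for illustration:

```python
# Topics: the subjects the map talks about.
topics = {"automobile_safety", "airbags", "crash_tests", "mpvs",
          "sharan", "galaxy"}

# Topic occurrences: where information on each topic is to be found.
occurrences = {
    "crash_tests": ["reports/crash_test_1999.pdf"],   # illustrative paths
    "airbags": ["fmea/airbag_module.doc"],
}

# Associations: typed, meaning-carrying links between topics.
associations = [
    ("sharan", "identical_in_construction_to", "galaxy"),
    ("airbags", "improves", "automobile_safety"),
    ("crash_tests", "measures", "automobile_safety"),
]

# Lateral relationships fall out directly: everything associated
# with a given topic, regardless of any category hierarchy.
def related(topic):
    return {(role, other)
            for a, role, b in associations
            for other in ((b,) if a == topic else (a,) if b == topic else ())}

print(sorted(related("automobile_safety")))
# [('improves', 'airbags'), ('measures', 'crash_tests')]
```

A query for ‘sharan’ would likewise surface the ‘identical in construction’ association to the Galaxy – exactly the lateral relationship a purely hierarchical category system hides.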
The standard provides for a means of deriving search strings from topics and of using these search strings to identify matching documents. By separating the structure from the documents, it becomes possible to maintain and use the structure independently.
An important practical aspect for real-world organisations is the ability to extend predefined topic maps on a collaborative basis: Once an initial topic map has been created, users can add their own topics and associations. Topic maps on issues relating to an organisation grow with each valid suggestion and thus faithfully reflect the organisation’s conceptual space. Users’ search actions can be used with purpose-developed heuristics to generate statistics and hence to refine topic maps semi-automatically. New topics, synonyms and relations can be extracted from the statistics and proposed to an editor or knowledge manager.
The data interchange format for topic maps and their concepts is Standard Generalised Markup Language (SGML) or its derivative, Extensible Markup Language (XML); that is, a topic map is an SGML or XML document. These languages allow text to be explicitly labelled with appropriate semantic markup tags so that it can be interpreted in context – unlike a simple full text search in HTML documents.
In this way, knowledge structures can be made available in the form of topic maps to all other users who have been equipped with tools that support them.
The subject of automobile safety is associated with the topics ‘airbags’ and ‘child seats’. The ‘airbags’ topic is associated in turn with the ECE 44/03 testing standard. If a user asks for information on dangers of airbags, the meta knowledge in the form of the topic map will show that airbags are a potential hazard specifically in conjunction with child seats.
The great potential of this approach resides in the ability of associations – relations between topics – to carry meaning. The ability to freely associate among topics that comes so easily to humans is endlessly difficult for a computer. Topic maps provide a way of modelling the chains of associations between topics.
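One way such chains might be followed is a shortest-path search over the association network. The topics below are taken from the airbag example above; the breadth-first traversal is one obvious implementation choice, not something the standard prescribes:

```python
from collections import deque

# Undirected associations between topics, as (topic, topic) pairs.
links = [
    ("automobile_safety", "airbags"),
    ("automobile_safety", "child_seats"),
    ("airbags", "ece_44_03"),        # the testing standard from the example
    ("airbags", "child_seats"),      # the hazard relationship
]

def association_chain(start, goal):
    """Breadth-first search for the shortest chain of associations."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(association_chain("airbags", "child_seats"))
# ['airbags', 'child_seats'] - the direct hazard association
```

Given a query about the dangers of airbags, it is this traversal of the meta knowledge – rather than any string match – that surfaces the link to child seats.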
USU has been deploying technologies based on topic maps for some time. The company uses on-going statistics on user behaviour together with user surveys to evaluate the effectiveness of these products. A doubling in the number of search actions per user was observed in the space of twelve months. The number of users who said they often or always found the information they were looking for rose over the same period from 38 per cent to 52 per cent. This is all the more remarkable for the fact that the number of documents simultaneously doubled to 20,000.
The experts agree: this new approach is a milestone on the road to the successful practice of knowledge retrieval. Companies that face major challenges in organising and channelling the flood of information should follow the development of this technology. GartnerGroup, for one, predicts that half of all portals and search engines will use topic maps by 2003.
Thomas Gerick is director of corporate communications at USU AG. He can be contacted at: firstname.lastname@example.org