posted 1 Jun 1998 in Volume 1 Issue 6
Six techniques for better matching,
filtering and profiling stored knowledge
One of the negative side effects of the knowledge society is information overload – giving users too much information for the human mind to understand, process and act upon. Chris Knowles and Dr. Innes Ferguson examine six ways in which computer processing tools can use artificial intelligence to identify and retrieve relevant information.
The argument has been won. Everyone now agrees that knowledge management is critical to business success. Senior managers and directors recognise that the stock market value of their companies depends more on intangible than on physical assets. Future revenue, profits and growth depend on the company’s ability to develop new products and get them to market quickly, to understand its customers better, to cut down wasteful administration, and to increase the productive time staff spend in front of customers. All these issues depend on what the company knows, not what it owns, and the key business issue is how to improve the management and utilisation of the great fund of knowledge within the organisation.
For those companies who haven’t yet got the message, all the major management consultancies are building up their knowledge management practices, running workshops, and advising their clients on knowledge management strategies.
But why, then, is there so much uncertainty among practitioners as to what knowledge management really means, and what are the practical business benefits? A glance through the articles in past issues of this magazine shows many different interpretations and approaches. Areas of concern include how to change the culture of the organisation, how to influence people’s behaviour, and how to understand, define and classify the different facets of knowledge management.
I believe the debate is now about to shift from “why” we need knowledge management to “how” to implement practical solutions that offer clear business benefit. As organisations move through the phases of understanding, strategy formulation, and allocation of resources, the next step will be the specification, design, and implementation of specific projects to meet defined business goals.
This article provides a brief description of six techniques that are now available to assist in the implementation of one aspect of Knowledge Management: the provision and use of large volumes of relevant, accurate and up-to-date information, available from multiple sources of stored knowledge, located both within and outside the organisation.
A large number of products claim to offer Knowledge Management solutions. In some cases this has involved simply taking an existing database, publishing, or document management package and giving it a new name. Rather than look at individual products, this article examines some of the underlying technologies and attempts to answer the question “what are the key technologies which can bring about a quantum leap in the quality of sharing stored knowledge?”
Aspects of Knowledge Management
The following classification structure for aspects of Knowledge Management is adapted from a scheme proposed in a previous issue of this magazine by Terry Finerty of Arthur Andersen.1
People who think knowledge is about skill and experience put people in touch with other people: “I know a man who can.”
People who think knowledge is about creating something new build communities: “Let’s get together and brainstorm the answer.”
People who think knowledge is about being able to do it themselves teach others, or go on training courses: “Let me show you, and then you can do it yourself.”
People who think knowledge is a thing that can be captured and stored build databases: “I know where I can find the answer.”
The first three aspects are essentially management, leadership and training issues.
Connecting people with other people who have different skills and experience is part of good management practice. A good manager finds the right people with the right skills and experience, and gives them the necessary resources to solve a particular problem. Creating a community of people to develop something new is a classic team leader role, for example in an R&D or a new business development team. And there is nothing new about investing in training and staff development.
Good companies have been managing, leading and training their staff for decades. It is a sign of the adverse effects of the last ten years of Business Process Reengineering, downsizing and flat management hierarchies that these skills have fallen by the wayside, and that it has taken knowledge management to remind people of their importance.
On the other hand the fourth aspect - the ability to capture and store large volumes of information and data in a form that can be accessed quickly and easily, regardless of time and place - is very new. It depends entirely on the massive growth in the capability and power of electronic systems for the storage and processing of digital information.
Knowledge management practitioners have been very critical of naïve attempts to introduce inappropriate technical solutions before the cultural and organisational issues have been addressed. Knowledge is not simple, and depends on the application, context and local environment. However, at some point in every programme there will be a need to capture and store the information and learning gained, ideally in a form in which it can be searched and retrieved at a later date, and so shared with other people within the organisation. This is not a trivial task. Fortunately the quality and sophistication of technical tools to assist in the profiling, matching and filtering of information has improved considerably. An essential part of any knowledge management programme is now to identify the best tools for the job and use them effectively.
Six techniques for better matching, filtering and profiling stored knowledge
It can be easy to get knowledge into a database, but very hard to get it out again. The overriding problem is that there is simply too much information available for the human mind to absorb, digest and understand, let alone make decisions and act on them.
This means that, for the first time, there is a real, generic, business need for computer processing tools that use artificial intelligence and other similar techniques to identify and retrieve relevant information. Some of the techniques are well established and have been used for years within the field of information retrieval: for example classification and indexing schemes, Boolean logic and full text retrieval.
Other newer techniques, based on profiling and pattern matching, can understand complex requirements, and represent them in a way that can be processed electronically, and matched against other multiple sets of requirements. Examples include data mining, probabilistic matching techniques, vector space modelling, neural networks, collaborative filtering, fuzzy logic and genetic algorithms. Many of these techniques have been researched within academia over the past 20 years, but have only recently begun to find commercial application; for example, to match a user’s search for information to a set of references, news stories, or Web pages, to match a set of customer requirements to a new product, or to find the optimum location for a retail outlet based on the profile of people living nearby.
It is often said that knowledge is Information in Context. Profiling, matching and filtering are the core technologies that can create a context in which a task such as information retrieval can take place. When used well, this can create personalised one to one services, avoid needless duplication and waste, and ensure we are not bombarded with irrelevant rubbish.
Six of the most significant techniques are described briefly below, with particular reference to their use in providing and searching information stored in databases and in resources accessible on the web via the Internet and corporate intranets.
1. Classification and indexing schemes
Creating an index is the traditional way of searching for and retrieving information - for example the index of a book; the alphabetical classification of words found in a dictionary or encyclopædia; a subject classification such as the Dewey decimal system still used in most libraries; and Yellow Pages and other business classification schemes (e.g. “Monumental masons; see also Funeral Directors or Stonemasons and Drystone Wallers”).
Classification and indexing systems are still of enormous value. It is surprising how many Web pages provide an alphabet as an aid to searching, for example in many sites offering company information. The letters are set up as hyperlinks on the Web page, and you click on the letter with the mouse instead of turning the pages of a book.
Other simple classification schemes include date and time order - still the most useful system for real-time news - where people simply want to see the most recent items first. The Web is notoriously bad at retrieving information by date, something that would be very easy to correct by standardising the methods used for recording date and time, and stamping all pages whenever they are created or amended.
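By way of illustration, the fix is almost trivial once a standard timestamp format is agreed. The minimal Python sketch below (the page titles and dates are invented) sorts pages most-recent-first using ISO 8601 timestamps:

```python
# A tiny sketch of date ordering over stamped pages. ISO 8601 is the
# obvious candidate for a standard timestamp; the pages here are invented.
from datetime import datetime

pages = [
    ("Company results roundup", "1998-05-28T09:30:00"),
    ("Market report",           "1998-06-01T17:05:00"),
    ("Analyst briefing",        "1998-04-11T11:00:00"),
]

# Most recent first - the ordering real-time news readers want.
pages.sort(key=lambda p: datetime.fromisoformat(p[1]), reverse=True)
for title, stamp in pages:
    print(stamp, title)
```

A pleasant property of ISO 8601 is that the strings sort correctly even as plain text, so the same ordering falls out of a simple string sort.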
Indexing is especially effective as a means of distinguishing items that are substantially about a particular subject, from others which make only passing reference to it. Suitable classification schemes can also resolve ambiguities where the same word may have several different meanings. This is especially useful in the field of company financial information where articles that are substantially about a particular company can be indexed and therefore easily retrieved. For example, news stories about companies with names like Shell, or Iceland, can be distinguished from other articles that contain the same words but may be about holidays on the beach or in the cool country near the North Pole with hot geysers.
In the web environment, such classified or structured indexing information is often referred to as Metadata, or data about data, which can be held as tagged text in the source of HTML pages, or on more sophisticated sites as part of a database that generates dynamic web pages on the fly.
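To illustrate, classification metadata of this kind can be read back out of a page’s source with a few lines of code. The following Python sketch, using only the standard library, extracts the keywords from a meta tag; the page text and keyword scheme are invented:

```python
# A minimal sketch of reading classification metadata ("tagged text")
# from the source of an HTML page. The page and keywords are invented.
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = [k.strip() for k in
                             attrs.get("content", "").split(",")]

page = """<html><head>
<meta name="keywords" content="Shell, oil, company results">
<title>Shell interim results</title>
</head><body>...</body></html>"""

reader = MetaReader()
reader.feed(page)
print(reader.keywords)  # ['Shell', 'oil', 'company results']
```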
The disadvantage of classification systems is that they can involve a very high level of work in manually indexing items as they are entered. This is an expensive process. In addition, many different and overlapping schemes are available, often not compatible with each other, and this creates problems searching across multiple sources. Some suppliers such as the Dialog Corporation are starting to offer standard classification systems, and automated sorting of documents against the standard classification. One issue to consider is: would you allow a third party to own and control the classification and sorting scheme for your own company’s knowledge base, or do you want to keep control of this yourself?
2. Vector Space Modelling
Vector Space Modelling (VSM) is based on pioneering information retrieval work done over many years by Gerard Salton at Cornell University.
It was designed to overcome the limitations of Boolean inquiries and free text searching, which can match inquiries against well-structured data very precisely, but are far less effective in matching inquiries against a large number of documents or articles where there are many possible matches, all of which correspond to the inquiry to a greater or lesser extent. In this case the user is typically not looking for an exact match, but is trying to find as many good matches as possible, whilst rejecting poor and irrelevant ones.
The quality of the result depends significantly on how well the inquiry has been formulated in the first place. To give a simple example, nearly all inquiries made by users of web search engines consist of no more than two words. Given the vast number of items searched, it is not surprising that two words alone are unlikely to give a very precise indication of what the user is really looking for. Most search engines provide significantly better results if more words are entered, but it is not always easy for the user to think of the right words.
VSM is a probabilistic profiling and matching technique, which allows any document or body of text, in any language, to be represented by a weighted vector based on the frequency of occurrence of words and phrases. This vector acts as a mathematical representation of the conceptual meaning of the document and can then be used to identify and match similar documents. In many cases between 20 and 30 terms are needed to accurately represent a typical document, although this depends on the type of content, document length and variability within any particular resource or set of documents.
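A minimal sketch of the core idea in Python is shown below. It uses raw term frequencies and cosine similarity; a real VSM system would add stop-word removal, stemming and more sophisticated weighting, and the documents and query here are invented:

```python
# A minimal sketch of vector space matching: each document is reduced
# to a weighted term vector, and similarity is the cosine of the angle
# between vectors. All documents and the query are illustrative.
import math
from collections import Counter

def term_vector(text):
    """Represent a body of text as a term-frequency vector."""
    words = [w.lower().strip(".,;:!?\"'()") for w in text.split()]
    return Counter(w for w in words if w)

def cosine_similarity(a, b):
    """Score two term vectors: 1.0 = same direction, 0.0 = no overlap."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Shell reports higher profits on rising oil prices",
    "Holidays in Iceland: geysers, glaciers and beaches",
]
query = term_vector("oil company profits")
ranked = sorted(documents,
                key=lambda d: cosine_similarity(query, term_vector(d)),
                reverse=True)
print(ranked[0])  # the oil story scores highest against the query
```

The cosine score rewards documents whose overall pattern of term use points in the same “direction” as the inquiry, which is what makes ranked, partial matching possible.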
The technique was refined and developed over many years by Salton and his colleagues and, in comparative testing, consistently provided better results than structured Boolean searches.
VSM techniques are now used in many web search engines. Together with other techniques including differential weighting based on location, relevance feedback, and automated metadata extraction, it also forms the underlying technology for Z-Cast, Zuno’s new intelligent search tool for information access and knowledge sharing across the organisation.
3. Neural Networks
Neural network techniques are another way of applying probabilistic methods to the problems of matching inquiries against a large number of unstructured textual articles or documents. They are based on computer systems that mimic the operation of the human brain. Similar actions or events provide feedback and reinforce each other, in a way analogous to the operation of neurons firing in the human brain. The system learns as it goes along, and is refined and fine-tuned as more inquiries are made.
In information retrieval and profiling applications, such as Autonomy’s Agentware, the system will first examine a piece of text, which can be a user’s inquiry, or a particular document or set of documents, and look for patterns. These patterns are represented by the system in a mathematical form and compared with other patterns taken from other documents or bodies of text. Where similarities are found, the weighting given to the pattern can be reinforced, and where no match is found the weighting can be decreased.
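The reinforcement idea can be sketched very simply. The Python fragment below is illustrative only (it is not Autonomy’s algorithm): it strengthens or weakens the weight attached to each term “neuron” in a profile according to user feedback, so that the profile’s scores improve as more inquiries are made:

```python
# A deliberately simplified sketch of reinforcement-style matching:
# term weights in a profile are strengthened when the user accepts a
# matched document and weakened when it is rejected. Illustrative only.
from collections import defaultdict

class Profile:
    def __init__(self, learning_rate=0.1):
        self.weights = defaultdict(float)   # one weight per term "neuron"
        self.rate = learning_rate

    def score(self, terms):
        """How strongly a document's terms activate the profile."""
        return sum(self.weights[t] for t in terms)

    def train(self, terms, relevant):
        """Reinforce or decay weights according to user feedback."""
        delta = self.rate if relevant else -self.rate
        for t in terms:
            self.weights[t] += delta

profile = Profile()
profile.train(["merger", "takeover", "bid"], relevant=True)
profile.train(["football", "transfer", "bid"], relevant=False)
print(profile.score(["takeover", "bid"]))  # rises as training continues
```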
Neural network systems can be very effective, but require an initial period of training, to build up useful patterns against which new information can be compared.
4. Genetic algorithms
The use of genetic algorithms is another technique for resolving the problem of identifying similarities, rather than an exact match that passes or fails particular search criteria. It can be used for complex problems that involve many different variables and can be addressed in many different ways. One example is choosing the optimum locations for a new chain of retail outlets, from a number of potential sites, in relation to the profiles of, say, all the potential customers who live in a broad area within 30 minutes’ drive of each of the potential sites. The problem is made more difficult because choosing any one location has an impact on the others: if sites are located very close to each other it can be assumed people will not go to both, and one criterion for the search may be that the sites should be a certain distance apart. The sheer number of possible solutions, which can run into many billions, makes this a very difficult problem to solve.
Genetic algorithms have been used commercially to identify the best locations for a new chain of pubs/bars, using a system developed by a new UK software company, Searchspace Ltd.2
In this case the system started with a selection of possible sites, chosen either at random or through an initial set of relatively simple rules. The sites were scored against a set of criteria, and each combination of five sites received a total score. Different combinations were then allowed to “swap sites”, and an element of random variation was introduced through “mutations”: i.e. new sites introduced at random into some combinations.
Combinations that scored highly were kept, those that scored poorly were discarded, and after many iterations the process tended to find the best combinations of sites.
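A toy version of this evolutionary loop can be written in a few dozen lines. In the Python sketch below all the site coordinates, values and penalty figures are invented; the point is simply to show selection, “site swapping” (crossover) and random mutation acting on combinations of five sites:

```python
# A toy genetic algorithm for choosing five retail sites. All data are
# invented; a real system would score sites against customer profiles.
import random

random.seed(1998)
# candidate site -> (x, y, standalone value of the site)
CANDIDATES = {i: (random.uniform(0, 100), random.uniform(0, 100),
                  random.uniform(1, 10)) for i in range(40)}
MIN_DISTANCE = 15
COMBO_SIZE = 5

def fitness(combo):
    """Total value of a combination, penalised when sites sit too close."""
    total = sum(CANDIDATES[s][2] for s in combo)
    for a in combo:
        for b in combo:
            if a < b:
                ax, ay, _ = CANDIDATES[a]
                bx, by, _ = CANDIDATES[b]
                if ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < MIN_DISTANCE:
                    total -= 5.0  # overlapping catchment areas
    return total

def crossover(p1, p2):
    """Let two combinations 'swap sites' to form a new combination."""
    return random.sample(list(set(p1) | set(p2)), COMBO_SIZE)

population = [random.sample(list(CANDIDATES), COMBO_SIZE) for _ in range(30)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                  # keep the high scorers
    children = [crossover(random.choice(survivors), random.choice(survivors))
                for _ in range(20)]
    for child in children:                       # occasional random mutation
        if random.random() < 0.2:
            unused = [c for c in CANDIDATES if c not in child]
            child[random.randrange(COMBO_SIZE)] = random.choice(unused)
    population = survivors + children

best = max(population, key=fitness)
print(best, round(fitness(best), 1))
```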
5. Data Mining and Knowledge Discovery
The proliferation of various electronic indexing and scanning technologies (e.g. bar coding and point-of-sale systems), coupled with the growth in both the volume and variety of data being recorded about people and their business transactions (e.g. employee work performance, consumer buying habits), has led to the existence of vast amounts of valuable information - information with many hidden patterns, correlations, and trends, ready to be exploited by the right kind of technology.
The technology used to extract such information from large databases or data warehouses is often referred to as Data Mining or Knowledge Discovery in Databases (KDD). In effect, KDD is concerned with the creation and application of new tools and techniques for the intelligent analysis of databases, using appropriate techniques from such fields as statistics, machine learning and artificial intelligence to classify, cluster, partition, and generally identify the underlying patterns, deviations, and correlations that exist among seemingly unrelated information elements. The extracted patterns can be regarded as descriptive or predictive models for understanding or revealing the underlying knowledge contained within the stored data. Such knowledge can then be used to solve further business problems, such as improving the effectiveness of a particular process, increasing return on investment or market share, or maximising the quality of service offered.
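One of the simplest patterns to extract is the co-occurrence of items in point-of-sale data. The Python sketch below, with an invented handful of shopping baskets, counts how often pairs of products are bought together; production KDD systems apply the same idea, via algorithms such as Apriori, to millions of records:

```python
# A small sketch of one common knowledge-discovery step: finding pairs
# of products bought together unusually often. The baskets are invented.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "butter", "beer"},
    {"beer", "crisps", "peanuts"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.most_common():
    support = count / len(transactions)
    if support >= 0.4:  # pattern occurs in at least 40% of baskets
        print(f"{a} and {b}: bought together in {support:.0%} of baskets")
```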
A number of successful applications of KDD technology can be found in such areas as fraud detection, equity portfolio management, satellite image recognition, marketing and predicting medical effectiveness, among others.
6. Collaborative filtering
Most information retrieval techniques suffer from two weaknesses. Firstly, it is very difficult for a user to represent an inquiry accurately in words. Secondly, having identified one good result, there is no guarantee that other results which the user may consider equally good matches will contain the same words, phrases or patterns, and so be identifiable using Boolean, VSM, neural network or other data-matching techniques.
One way round this is not to match the inquiry against a resource containing a large number of items, such as documents or articles, but to compare the inquiries and actions of one user with those of other users, and look for similarities between users. For example, my teenage daughter likes Leonardo di Caprio and the Spice Girls (!). If lots of other people who like Leonardo di Caprio and the Spice Girls also like the Back Street Boys, there is a fairly high probability that my daughter will also like them. This technique is known as collaborative filtering and, not surprisingly, it is used increasingly to provide personal recommendations for books or records you can buy using electronic commerce services on the web.
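The mechanics are easy to sketch. The Python fragment below (user names and likes are invented) recommends to one user the items liked by the users whose tastes overlap most with theirs:

```python
# A minimal sketch of collaborative filtering: recommend to one user the
# items liked by other users with overlapping tastes. Data are invented.
from collections import Counter

likes = {
    "daughter": {"Leonardo di Caprio", "Spice Girls"},
    "user_a":   {"Leonardo di Caprio", "Spice Girls", "Back Street Boys"},
    "user_b":   {"Spice Girls", "Back Street Boys", "All Saints"},
    "user_c":   {"Iron Maiden", "Metallica"},
}

def recommend(target, profiles):
    """Weight each unseen item by how many similar users like it."""
    mine = profiles[target]
    scores = Counter()
    for user, theirs in profiles.items():
        if user == target:
            continue
        overlap = len(mine & theirs)        # similarity between users
        for item in theirs - mine:
            scores[item] += overlap
    return scores.most_common()

print(recommend("daughter", likes))
# [('Back Street Boys', 3), ('All Saints', 1)]
```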
No one technique is best. All those described, and others, have advantages and disadvantages. In most cases the greatest benefit comes from how well the technique is applied, rather than from the inherent benefits of any one approach.
New technology means that collecting and accessing data and information is becoming easier and easier. What all these techniques have in common is that they apply intelligent profiling, matching and filtering to the problem of obtaining the best possible result from an inquiry that involves analysing and searching across large volumes of complex data. A solution that does this well can provide a quantum leap in the quality of analysing, sharing and applying the store of knowledge available within any organisation.
Chris Knowles is Business Manager, Financial Services for Zuno Ltd, a division of Mitsubishi. He can be contacted at:
Dr. Innes Ferguson can be contacted at:
1 Terry Finerty, Knowledge - The Global Currency of the 21st Century, Knowledge Management, Aug/Sep 1997.
2 As reported in New Scientist, no. 2129, 11 April 1998.