Feature
posted 1 Jun 1998 in Volume 1 Issue 6
Six techniques for better matching,
filtering and profiling stored knowledge
One of the negative side effects of the knowledge society
is information overload – giving users too much information for the human mind
to understand, process and act upon. Chris Knowles and
Dr. Innes Ferguson
examine six ways in which computer processing
tools can use artificial intelligence to identify and retrieve relevant
information.
The argument has been won. Everyone now agrees that knowledge management
is critical to business success. Senior managers and directors recognise that
the stock market value of their companies depends more on intangible than on
physical assets. Future revenue, profits and growth depend on the company’s
ability to develop new products and get them to market quickly, to understand
its customers better, to cut down wasteful administration, and to increase the
productive time staff spend in front of customers. All these issues depend on
what the company knows, not what it owns, and the key business issue is how to
improve the management and utilisation of the great fund of knowledge within the
organisation.
For those companies that haven’t yet got the message, all the major management
consultancies are building up their knowledge management practices, running
workshops, and advising their clients on knowledge management strategies.
But why, then, is there
so much uncertainty among practitioners as to what knowledge management really
means, and what the practical business benefits are? A glance through the
articles in past issues of this magazine shows many different interpretations
and approaches. Areas of concern include how to change the culture of the
organisation, how to influence people’s behaviour, and how to understand, define
and classify the different facets of knowledge management.
I believe the debate is now about to
shift from the “why” we need knowledge management, to the “how” to implement
practical solutions that offer clear business benefit. As organisations move
through the phases of understanding, strategy formulation, and allocation of
resources, the next step will be the specification, design, and implementation
of specific projects to meet defined business goals.
This article provides a brief
description of six techniques that are now available to assist in the
implementation of one aspect of Knowledge Management: the provision and use of
large volumes of relevant, accurate and up to date information, available from
multiple sources of stored knowledge, located both within and outside the
organisation.
A
large number of products claim to offer Knowledge Management solutions. In some
cases this has involved simply taking an existing database, publishing, or
document management package and giving it a new name. Rather than look at
individual products, this article examines some of the underlying technologies
and attempts to answer the question “what are the key technologies which can
bring about a quantum leap in the quality of sharing stored knowledge?”
Aspects of Knowledge Management
The following classification structure
for aspects of Knowledge Management is adapted from a scheme proposed in a
previous issue of this magazine by Terry Finerty of Arthur Andersen.1
People who think knowledge is about skill and experience put people in touch with other people: “I know a man who can”
People who think knowledge is about creating something new, build communities: “Let’s get together and brainstorm the answer”
People who think knowledge is about being able to do it themselves, teach others, or go on training courses: “Let me show you, and then you can do it yourself”
People who think knowledge is a thing, which can be captured and stored, build databases: “I know where I can find the answer”
The
first three aspects are essentially management, leadership and training
issues.
Connecting people with other people who have different skills and
experience is part of good management practice. A good manager finds the right
people with the right skills and experience, and gives them the necessary
resources to solve a particular problem. Creating a community of people to
develop something new is a classic team leader role, for example in an R&D
or a new business development team. And there is nothing new about investing in
training and staff development.
Good companies have been managing,
leading and training their staff for decades. It is a sign of the adverse
effects of the last 10 years of Business Process Reengineering, downsizing and
flat management hierarchies, that these skills have fallen by the wayside, and
it has taken knowledge management to remind people that they are
important.
On the
other hand the fourth aspect - the ability to capture and store large volumes of
information and data in a form that can be accessed quickly and easily,
regardless of time and place - is very new. It depends entirely on the massive
growth in the capability and power of electronic systems for the storage and
processing of digital information.
Knowledge management practitioners
have been very critical of naïve attempts to introduce inappropriate technical
solutions, before the cultural and organisational issues have been addressed.
Knowledge is not simple, and depends on the application, context and local
environment. However, at some point in every program, there will be a need to
capture and store the information and learning gained. Ideally this needs to be
in a form in which it can be searched and retrieved at a later date, and so
shared with other people within the organisation. This is not a trivial task.
Fortunately the quality and sophistication of technical tools to assist in the
profiling, matching and filtering of information has improved considerably. An
essential part of any knowledge management program is now how to identify the
best tools for the job, and use them effectively.
Six techniques for better matching,
filtering and profiling stored knowledge
It can be easy to get knowledge into
a database, but very hard to get it out again. The over-riding problem is that
there is simply too much information available for the human mind to absorb,
digest and understand, let alone make decisions and act on them.
This means that, for the
first time, there is a real, generic, business need for computer processing
tools that use artificial intelligence and other similar techniques to identify
and retrieve relevant information. Some of the techniques are well established
and have been used for years within the field of information retrieval: for
example classification and indexing schemes, Boolean logic and full text
retrieval.
Other
newer techniques, based on profiling and pattern matching, can understand
complex requirements, and represent them in a way that can be processed
electronically, and matched against other multiple sets of requirements.
Examples include data mining, probabilistic matching techniques, vector space
modelling, neural networks, collaborative filtering, fuzzy logic and genetic
algorithms. Many of these techniques have been researched within academia over
the past 20 years, but have only recently begun to find commercial application;
for example, to match a user’s search for information to a set of references,
news stories, or Web pages, to match a set of customer requirements to a new
product, or to find the optimum location for a retail outlet based on the
profile of people living nearby.
It is often said that knowledge is
Information in Context. Profiling, matching and filtering are the core
technologies that can create a context in which a task such as information
retrieval can take place. When used well, this can create personalised one to
one services, avoid needless duplication and waste, and ensure we are not
bombarded with irrelevant rubbish.
Six of the most significant techniques
are described briefly below, with particular reference to their use in providing
and searching information stored in databases and resources accessible on the
web via the Internet and corporate intranets.
1. Classification and indexing schemes
Creating an index is the traditional way of searching for and retrieving
information - for example the index of a book; the alphabetical classification
of words found in a dictionary or encyclopædia; a subject classification such as
the Dewey decimal system still used in most libraries; and Yellow Pages and
other business classification schemes (e.g. “Monumental masons; see also
Funeral Directors or Stonemasons and Drystone Wallers”).
Classification and indexing systems
are still of enormous value. It is surprising how many Web pages provide an
alphabet as an aid to searching, for example in many sites offering company
information. The letters are set up as hyperlinks on the Web page, and you click
on the letter with the mouse instead of turning the pages of a book.
Other simple
classification schemes include date and time order - still the most useful
system for real-time news - where people want simply to see the most recent
items first. The Web is notoriously bad at retrieving any information by date,
something that would be very easy to correct, through standardising the methods
used for date and time, and stamping all pages whenever they are created or
amended.
Indexing
is especially effective as a means of distinguishing items that are
substantially about a particular subject, from others which make only passing
reference to it. Suitable classification schemes can also resolve ambiguities
where the same word may have several different meanings. This is especially
useful in the field of company financial information where articles that are
substantially about a particular company can be indexed and therefore easily
retrieved. For example, news stories about companies with names like Shell, or
Iceland, can be distinguished from other articles that contain the same words
but may be about holidays on the beach or in the cool country near the North
Pole with hot geysers.
In the web environment, such classified or structured indexing
information is often referred to as Metadata, or data about data, which can be
held as tagged text in the source of HTML pages, or on more sophisticated sites
as part of a database that generates dynamic web pages on the fly.
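As a minimal sketch of how metadata held as tagged text can be read back out of a page, the following Python fragment collects the name and content pairs from the meta tags in an HTML source. The tag names ("subject", "company", "date") and the sample page are invented for illustration; real sites use whatever scheme their publishers have chosen.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


def extract_metadata(html_source):
    """Return a dictionary of the metadata found in the page source."""
    parser = MetaTagParser()
    parser.feed(html_source)
    return parser.metadata


# Hypothetical page: the tag names and values are illustrative only.
PAGE = """
<html><head>
<title>Shell announces results</title>
<meta name="subject" content="company-news">
<meta name="company" content="Shell">
<meta name="date" content="1998-06-01">
</head><body>Profits rose in the first quarter.</body></html>
"""

metadata = extract_metadata(PAGE)
print(metadata["company"], metadata["date"])
```

With tags like these, a story that is substantially about the company Shell can be retrieved by its "company" field, and sorted by its "date" field, without relying on the words in the body text.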
The disadvantage of
classification systems is that they can involve a very high level of work in
manually indexing items as they are entered. This is an expensive process. In
addition, many different and overlapping schemes are available, often not
compatible with each other, and this creates problems searching across multiple
sources. Some suppliers such as the Dialog Corporation are starting to offer
standard classification systems, and automated sorting of documents against the
standard classification. One issue to consider is: would you allow a third party
to own and control the classification and sorting scheme for your own company’s
knowledge base, or do you want to keep control of this yourself?
2. Vector Space Modelling
Vector Space Modelling (VSM) is based on pioneering information
retrieval work done over many years by Gerard Salton at Cornell
University.
It
was designed to overcome the limitations of Boolean inquiries and free text
searching, which are able to match inquiries against well structured data very
precisely. They are far less effective in matching inquiries against a large
number of documents or articles where there are many possible matches, which all
correspond to the inquiry to a greater or lesser extent. In this case, the user
is typically not looking for an exact match, but is trying to find as many good
matches as possible, whilst rejecting poor and irrelevant matches.
The quality of
the result depends significantly on how well the inquiry has been formulated in
the first place. To give a simple example, nearly all inquiries made by users of
web search engines consist of no more than two words. Given the vast number of
items searched, it is not surprising that two words alone are unlikely to give a
very precise indication of what the user is really looking for. Most search
engines provide significantly better results if more words are entered, but it
is not always easy for the user to think of the right words.
VSM is a
probabilistic profiling and matching technique, which allows any document or
body of text, in any language, to be represented by a weighted vector based on
the frequency of occurrence of words and phrases. This vector acts as a
mathematical representation of the conceptual meaning of the document and can
then be used to identify and match similar documents. In many cases between 20
and 30 terms are needed to accurately represent a typical document, although
this depends on the type of content, document length and variability within any
particular resource or set of documents.
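The idea can be sketched in a few lines of Python. This is a simplified illustration, not the scheme used by any particular product: it assumes raw word counts as the weights (practical systems usually weight rare terms more heavily) and measures similarity as the cosine of the angle between two vectors. The sample documents and function names are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a mapping from each word to its frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 means no shared terms)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, title) pairs for every document, best match first."""
    query_vector = term_vector(query)
    scored = [
        (cosine_similarity(query_vector, term_vector(text)), title)
        for title, text in documents.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical documents, invented for illustration.
DOCUMENTS = {
    "oil": "Shell reports higher oil profits as oil prices rise",
    "beach": "Children collect every shell they find on the beach",
    "retail": "Iceland opens new frozen food stores",
}

for score, title in rank("shell oil profits", DOCUMENTS):
    print(f"{title}: {score:.2f}")
```

Every document receives a score rather than a pass or fail, so the article about the oil company ranks above the one about the beach even though both contain the word "shell".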
The technique was refined and
developed over many years by Salton and his colleagues and, in comparative
testing, consistently provided better results than structured Boolean
searches.
VSM
techniques are now used in many web search engines. Together with other
techniques including differential weighting based on location, relevance
feedback, and automated metadata extraction, VSM also forms the underlying
technology for Z-Cast, Zuno’s new intelligent search tool for information access
and knowledge sharing across the organisation.
3. Neural Networks
Neural network techniques are another way of applying probabilistic
methods to the problems of matching inquiries against a large number of
unstructured textual articles or documents. They are based on computer systems
that mimic the operation of the human brain. Similar actions or events provide
feedback and reinforce each other, in a way analogous to the operation of
neurons firing in the human brain. The system learns as it goes along, and is
refined and fine-tuned as more inquiries are made.
In information retrieval and profiling
applications, such as Autonomy’s Agentware, the system will first examine a
piece of text, which can be a user’s inquiry, or a particular document or set of
documents, and look for patterns. These patterns are represented by the system
in a mathematical form and compared with other patterns taken from other
documents or bodies of text. Where similarities are found, the weighting given
to the pattern can be reinforced, and where no match is found the weighting can
be decreased.
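The reinforcement of weightings can be illustrated with the simplest possible network: a single artificial neuron with one weight per word. This sketch is an assumption made for illustration and does not describe how Agentware or any other product works internally. The class name, training examples and learning rate are all hypothetical.

```python
import re


def words(text):
    """Return the set of distinct lower-case words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


class RelevanceUnit:
    """A single artificial neuron holding one weight per word."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, text):
        """Judge a text relevant when its summed word weights are positive."""
        total = self.bias + sum(self.weights.get(w, 0.0) for w in words(text))
        return total > 0

    def train(self, examples, epochs=10):
        """Adjust weights only when the current judgement is wrong."""
        for _ in range(epochs):
            for text, relevant in examples:
                error = int(relevant) - int(self.predict(text))
                if error == 0:
                    continue
                # error = +1 reinforces the words of a missed relevant text;
                # error = -1 weakens the words of a wrongly accepted text.
                self.bias += self.learning_rate * error
                for w in words(text):
                    self.weights[w] = self.weights.get(w, 0.0) + self.learning_rate * error


# Hypothetical feedback: True marks texts the user found relevant.
EXAMPLES = [
    ("shell reports record oil profits", True),
    ("oil company shares rise", True),
    ("collecting a shell on the beach", False),
    ("beach holiday offers", False),
]

unit = RelevanceUnit()
unit.train(EXAMPLES)
print(unit.predict("oil profits fall"), unit.predict("a day on the beach"))
```

After training, the ambiguous word "shell" carries no weight of its own, while words such as "oil" and "beach" pull the judgement in opposite directions.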
Neural network systems can be very effective, but require an initial
period of training, to build up useful patterns against which new information
can be compared.
4. Genetic algorithms
The use of genetic algorithms is
another technique for resolving the problem of identifying similarities, rather
than an exact match that passes or fails to meet particular search criteria. It
can be used for complex problems, which involve many different variables and
which are capable of being addressed in many different ways. One example is
choosing the optimum locations for a new chain of retail outlets, from a number
of potential sites, in relation to the profiles of, say, all the potential
customers who live in a broad area within 30 minutes’ drive of each of the
potential sites. The problem is made more difficult because choosing any one
location has an impact on the other locations. If sites are located very close
to each other it can be assumed people will not go to both, and one criterion for
the search may be that the sites should be a certain distance apart from each
other. The sheer number of possible solutions, which can run into many billions,
makes this a very difficult problem to solve.
Genetic algorithms have been used
commercially to identify the best locations for a new chain of pubs/bars, using
a system developed by a new UK software company, Searchspace Ltd.2
In this case the system started with a
selection of possible sites, chosen either at random, or through an initial set
of relatively simple rules. The sites were scored against a set of criteria, and
combinations of five sites received a total score. Different combinations were
then allowed to “swap sites”, and an element of random variation, borrowed from
natural evolution, was introduced through “random mutations”, i.e. new sites
introduced at random into some combinations.
Combinations that scored high were kept, those that scored low were
discarded, and after many iterations, the process tended to find the best
combinations of sites.
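The process described above can be sketched as follows. This is an illustrative toy, not the Searchspace system: the site positions and scores are randomly generated, and the minimum spacing, penalty, mutation rate and population size are all assumed values.

```python
import random
from itertools import combinations

# Hypothetical data: each candidate site has a map position and a stand-alone score.
_data_rng = random.Random(42)
SITES = [
    ((_data_rng.uniform(0, 100), _data_rng.uniform(0, 100)), _data_rng.uniform(0, 10))
    for _ in range(40)
]
GROUP_SIZE = 5
MIN_DISTANCE = 15.0  # assumed minimum spacing between outlets


def score(group):
    """Sum of the site scores, minus a penalty for each pair that is too close."""
    total = sum(SITES[i][1] for i in group)
    for a, b in combinations(group, 2):
        (xa, ya), (xb, yb) = SITES[a][0], SITES[b][0]
        if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 < MIN_DISTANCE:
            total -= 5.0
    return total


def swap_sites(parent_a, parent_b, rng):
    """Build a child combination by drawing sites from both parents."""
    pool = sorted(set(parent_a) | set(parent_b))
    return tuple(sorted(rng.sample(pool, GROUP_SIZE)))


def mutate(group, rng, rate=0.2):
    """Occasionally replace one site with a randomly chosen unused one."""
    if rng.random() >= rate:
        return group
    unused = [i for i in range(len(SITES)) if i not in group]
    replaced = list(group)
    replaced[rng.randrange(GROUP_SIZE)] = rng.choice(unused)
    return tuple(sorted(replaced))


def evolve(generations=200, population_size=30, seed=1):
    """Return the best combination found after the given number of generations."""
    rng = random.Random(seed)
    population = [
        tuple(sorted(rng.sample(range(len(SITES)), GROUP_SIZE)))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: population_size // 2]  # keep the high scorers
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(swap_sites(parent_a, parent_b, rng), rng))
        population = survivors + children
    return max(population, key=score)


best = evolve()
print(best, round(score(best), 2))
```

Because the highest-scoring combinations always survive into the next generation, the best score can never get worse from one iteration to the next. There is no guarantee that the result is the true optimum, only that it tends to improve.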
5. Data Mining and Knowledge Discovery
The proliferation of
various electronic indexing and scanning technologies (e.g. bar coding and point
of sale systems), coupled with the growth in both the volume and variety of data
that is being recorded about people and their business transactions (e.g.
employee work performance, consumer buying habits) has led to the existence of
vast amounts of valuable information - information with many hidden patterns,
correlations, and trends, ready to be exploited by the right kind of
technology.
The technology used to extract such information from large databases or data
warehouses is often referred to as Data Mining or Knowledge Discovery in
Databases (KDD) technology. In effect, KDD is concerned with the creation and
application of new
tools and techniques for intelligent analysis of databases, using appropriate
techniques from such fields as statistics, machine learning, or artificial
intelligence, to classify, cluster, partition, and generally identify the
underlying patterns, deviations, and correlations that exist among seemingly
unrelated information elements. The extracted patterns can be regarded as
descriptive or predictive models for understanding or revealing the underlying
knowledge contained within the stored data. Such knowledge can then be used to
solve additional business problems such as improving the effectiveness of a
particular process, increasing return on investment or market share, or
maximising the quality of service offered.
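One simple kind of hidden pattern is the association between items bought together. The sketch below counts how often pairs of items share a basket and reports rules of the form "customers who buy A also buy B". The baskets, thresholds and function name are hypothetical, and commercial tools use far more scalable algorithms on far larger data sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets, invented for illustration.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
    {"milk", "cereal", "bread"},
]


def pair_rules(baskets, min_support=0.3, min_confidence=0.7):
    """Find rules 'if A then B' that occur both often and reliably.

    support    = share of all baskets containing both A and B
    confidence = share of baskets containing A that also contain B
    """
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / len(baskets)
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, round(support, 2), round(confidence, 2)))
    return sorted(rules)


for antecedent, consequent, support, confidence in pair_rules(BASKETS):
    print(f"{antecedent} -> {consequent} (support {support}, confidence {confidence})")
```

In this toy data every basket containing butter also contains bread, but the reverse does not hold, which is the kind of directional pattern a retailer might act on.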
A number of successful applications of
KDD technology can be found in such areas as fraud detection, equity portfolio
management, satellite image recognition, marketing and predicting medical
effectiveness, among others.
6. Collaborative filtering
Most information retrieval techniques suffer from two weaknesses.
Firstly, it is very difficult for a user to represent an inquiry accurately in
words. Secondly, once one good result has been identified, there is no guarantee
that other results the user would consider equally good contain the same words,
phrases or patterns, and so can be found using Boolean, VSM, neural network, or
other data matching techniques.
One way round this is
not to match the inquiry against a resource containing a large number of items,
such as documents or articles, but to compare the inquiries and actions of one
user, with the inquiries and actions of other users, and look for similarities
between users. For example, my teenage daughter likes Leonardo DiCaprio and the
Spice Girls (!). If lots of other people who like Leonardo DiCaprio and the
Spice Girls also like the Backstreet Boys, there is a fairly high probability
that my daughter will also like them. This technique is known as collaborative
filtering and, not surprisingly, it is used increasingly for providing personal
recommendations for books or records you can buy using electronic commerce
services on the web.
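A minimal sketch of the idea, using the example above: each user is compared with every other user, and items liked by similar users are suggested, weighted by how similar those users are. The preference data is invented, and the similarity measure (the share of liked items two users have in common) is one assumed choice among many.

```python
from collections import Counter

# Hypothetical preference data, echoing the example in the text.
LIKES = {
    "daughter": {"Leonardo DiCaprio", "Spice Girls"},
    "user_a": {"Leonardo DiCaprio", "Spice Girls", "Backstreet Boys"},
    "user_b": {"Leonardo DiCaprio", "Spice Girls", "Backstreet Boys"},
    "user_c": {"Spice Girls", "Backstreet Boys", "Titanic"},
    "user_d": {"Miles Davis", "John Coltrane"},
}


def recommend(user, likes):
    """Suggest items liked by users whose tastes overlap with this user's."""
    own = likes[user]
    votes = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        # Similarity: shared likes as a fraction of all likes of the two users.
        similarity = len(own & items) / len(own | items)
        if similarity == 0:
            continue
        for item in items - own:
            votes[item] += similarity
    return [item for item, _ in votes.most_common()]


print(recommend("daughter", LIKES))
```

Note that nothing in the calculation looks at what the items are. The recommendation depends only on the behaviour of other users, which is why the technique works equally well for books, records or news stories.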
Summary
No one technique is best. All those
described, and others, have advantages and disadvantages. In most cases the
greatest benefit comes from how well the technique is applied, rather than from
the inherent benefits of any one approach.
New technology means that collecting
and accessing data and information is becoming easier and easier. The thing all
techniques have in common is that they are applying intelligent profiling,
matching, and filtering techniques to solve the problem of obtaining the best
possible result from an inquiry, which involves analysing and searching across
large volumes of complex data. A solution which does this well can provide a
quantum leap in the quality of analysing, sharing and applying the store of
knowledge available within any organisation.
Chris Knowles is Business Manager,
Financial Services for Zuno Ltd, a division of Mitsubishi. He can be contacted
at:
chris@cksoftware.co.uk
Dr. Innes Ferguson
can be contacted at:
innes@active-online.net
1 Terry Finerty, Knowledge - The Global Currency of the 21st Century, Knowledge Management, Aug/Sep 1997.
2 As reported in the New Scientist, no 2129, 11th April 1998.