posted 15 Oct 2003 in Volume 7 Issue 2
Case study: Engineering a thesaural taxonomy
From the structural engineering design of an opera house to the development-planning consultancy for a new high-speed railway line, projects are what Arup do. But, writes Julian Diamond, it is the firm’s ability to access project data in order to apply previous experience to new projects that is crucial to the success of the practice’s business development.
Arup has had a variety of formal project-based information systems in operation for over 30 years. While the systems themselves have changed, the basic project data has not. So perhaps the single most critical challenge when upgrading an information system is to ensure data is not lost. This has had the effect of forcing our hand when choosing a new system. Paradoxically, the technology prevalent in the 1960s and 1970s has put us in a position where the system we implemented in the early 1990s is perhaps only now considered state of the art.
Arup’s project records were once stored on a card-index filing system. The details were relatively sketchy: a title, description, a few key dates, names of clients and collaborators, and some cost data. At that time, the key to unlocking the contents of the system was a good memory. With the advent of the first computers, it was a natural progression to store these basic data electronically, so a punched-card system was the next tool to be adopted. This system was then enhanced to allow on-screen access (via a dumb terminal) to the data, but free-text searching was still very much a thing of the future.
Instead, an elaborate system of keywords was introduced. By searching on the code for each term the information could be accessed without having to rely so heavily on a thorough knowledge of the information itself. Codes were assigned to all data that needed to be searchable – these were subject-related terms like ‘hospitals’ or ‘power stations’ or ‘industrialised building systems’ or ‘structural engineering’ – but finding the right code for the right search term relied on looking through a paper-based index contained in an A4 file. Only then was it possible to search for client names, start and finish dates, countries, counties, project directors and project managers (think of having to turn a date – which is, after all, just a code signifying a time period – back into a code).
The business and marketing environments were changing and it was clear that the system we had could not provide the information we needed quickly enough – if at all, or in an appropriate, usable form. Also, the system on which the job-records data was stored (DEC10) was due to be decommissioned. As with many such developments, the coincidence of these factors, plus the support of a member of the board, suggested the time was right to secure funding for an upgrade.
The technology available at the time was developing rapidly and free-text searching, string matching, date searching and number searching were all options to consider. The prospect emerged of being able to carry out one single search that would, for example, find all our airport projects started in the last five years with a construction value greater than £10m. The key to the next step lay in the existing subject terms, and though the reasons for them appearing in a project record had long been lost, it was essential they were reused. The hunt for a new system was really quite straightforward given the nature of this legacy data and the arrival of free-text searching. That is, we knew what the right solution would be when we found it.
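The single search described above (subject, start date and construction value in one pass) can be sketched in a few lines. This is an illustration only, not the system we used: the `JobRecord` fields, the sample records and the fixed reference date are all hypothetical, chosen so the example is reproducible.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class JobRecord:
    """Hypothetical job record; field names are assumptions for illustration."""
    title: str
    start: date
    construction_value_gbp: float
    keywords: set = field(default_factory=set)


def search(records, keyword, started_on_or_after, min_value_gbp):
    """Return records matching a subject keyword, a start-date floor and a value floor."""
    return [
        r for r in records
        if keyword in r.keywords
        and r.start >= started_on_or_after
        and r.construction_value_gbp > min_value_gbp
    ]


# Invented sample data.
records = [
    JobRecord("Terminal extension", date(1990, 3, 1), 45_000_000, {"airports"}),
    JobRecord("Runway resurfacing", date(1983, 6, 1), 12_000_000, {"airports"}),
    JobRecord("Control tower study", date(1991, 1, 15), 250_000, {"airports"}),
    JobRecord("Hospital wing", date(1990, 9, 1), 30_000_000, {"hospitals"}),
]

# Fixed "today" (an assumption) so that "the last five years" is deterministic.
today = date(1992, 1, 1)
five_years_ago = today.replace(year=today.year - 5)

hits = search(records, "airports", five_years_ago, 10_000_000)
```

Here only "Terminal extension" satisfies all three criteria: the runway job is too old, the tower study is too small and the hospital wing has the wrong subject term.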
What we found was a thesaurus-based search engine called BRS. Software consultants Kinesis had combined it with their own free-text search engine, which was marketed as a product somewhat grandiosely called Total Recall. At the time, we believed the thesaural element would be of great benefit for the subject terms vocabulary. However, it soon became apparent that it had huge potential for other key areas of search.
As well as demonstrating which projects we have worked on and how they were done, we also needed to show where we have worked, with whom and which part of Arup was responsible. Therefore, four natural areas for a verified closed vocabulary became apparent and the following are what we ended up with:
- Locations – Countries/regions/continents/cities/boroughs;
- External organisations – Clients/architects/contractors/quantity surveyors;
- Arup operating groups – Offices/businesses/countries/divisions;
- Subject keywords – Project types/sectors/special skills/disciplines/contract type.
These lists, which we call thesauri, work in an identical manner. They are poly-hierarchical taxonomies and enable the use of synonyms, abbreviations, preferred and related terms and keyword descriptors. Their poly-hierarchical nature means that a given lead term (LT) may have more than one broader term (BT). For example, in the locations thesaurus, the term ‘England’ is a narrower term (NT) for both ‘United Kingdom’ and ‘countries’.
Viewing the thesaurus from the perspective of the ‘United Kingdom’ as LT looks like this:
BT - Commonwealth
BT - Opec
BT - OECD
BT - Western Europe
BT - Nato
LT - United Kingdom
NT - Channel Islands
NT - England
NT - Isle of Man
NT - Northern Ireland
NT - Scotland
NT - UK cross-border
NT - Wales
This means that every job to which the country location ‘England’ is added will be found by searching ‘United Kingdom’ or any one of the BTs of ‘United Kingdom’. This is a highly significant fact as it means a huge number of terms will automatically populate the database and data editors will not have to think about assigning them. This automatic population is carried out when a job record is saved. If the thesaurus is amended at any time, the new relationships will be added during what is known as ‘synchronisation’ – a major indexing process, which takes place every weekend. It also means the search will find matches for search terms of which the user is unaware and thus increase the power of the search.
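The automatic population of broader terms can be sketched as follows. This is a minimal illustration, assuming each term simply stores a set of its BTs; the `Thesaurus` class and its method names are hypothetical, and only a subset of the relationships listed above is loaded.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical poly-hierarchical thesaurus: a term may have several BTs."""

    def __init__(self):
        self._broader = defaultdict(set)  # narrower term -> its broader terms

    def add_link(self, narrower, broader):
        self._broader[narrower].add(broader)

    def ancestors(self, term):
        """Return every broader term reachable from `term`, at any level."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for bt in self._broader.get(current, ()):
                if bt not in seen:
                    seen.add(bt)
                    stack.append(bt)
        return seen

    def index_terms(self, assigned):
        """Terms to index when a record is saved: assigned terms plus all their BTs."""
        expanded = set(assigned)
        for term in assigned:
            expanded |= self.ancestors(term)
        return expanded


locations = Thesaurus()
locations.add_link("England", "United Kingdom")
locations.add_link("England", "countries")
locations.add_link("United Kingdom", "Commonwealth")
locations.add_link("United Kingdom", "OECD")
locations.add_link("United Kingdom", "Western Europe")

# The data editor assigns one term; the rest are added automatically.
indexed = locations.index_terms({"England"})
```

A search on 'Commonwealth' or 'OECD' would then match this record even though the editor only ever assigned 'England'. Re-running `index_terms` over every record after the thesaurus changes is, in this sketch, the equivalent of the weekend synchronisation.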
The key to ensuring the robustness and integrity of any of the thesauri is to make certain the NT always belongs to its BT. This is quite straightforward when thinking about countries and continents, for example, as these are easily understood, relatively permanent and semantically undemanding. Problems and difficulties arise when one has to deal with technical terms that have a specific meaning within their context. A particular example is the word ‘compartmentation’ which, to a fire engineer, means a particular method for preventing the spread of fire but, taken out of context, it simply means splitting things up. However, the fact that it exists within the keyword thesaurus as an NT of ‘fire engineering’ means it has context but, and it is a significant but, it relies on the user knowing that he or she must look at a given term’s relationships. In other words, users have to understand how a thesaurus works. And there, in a nutshell, is one of the fundamental problems with thesaural taxonomies – the user’s understanding.
Users do not feel comfortable with looking up a term to search on before they search. In an early version of the software, the system forced users to pick a term from within one of the four thesauri, which they commented was rather boring when they knew the term was there. In response to user needs, we amended the system so that they could type terms directly into the search boxes. It will come as no great surprise to hear that users then complained when they got no hits after trying to search for a term that was not in the thesaurus.
There are occasions when a particular word has more than one meaning and it is useful to state explicitly that word’s context within the term itself. An example would be the word ‘station’, which can mean many things and often the meaning will be in the mind’s eye of the user as they are using the word. We have ‘railway stations’, ‘bus stations’, ‘police stations’, ‘fire stations’, ‘petrol stations’ and others. A slight refinement to this would be a term such as ‘drainage’, which, in engineering terms, can have either a civil or building context. As such we have the terms ‘drainage’ – CE and ‘drainage’ – PH. The qualifiers CE (civil engineering) and PH (public health engineering) provide the context, which allows data editors to assign and searchers to look using the appropriate keyword.
The system is quite complex and helps us to find projects that meet many and various criteria. But we also have a potential user base upwards of 6,000 people, the vast majority of whom don’t have the time or inclination to learn how to use it appropriately. So, I ask myself from time to time, why did we invest in such a beast? The answer takes us back to the prevailing technology.
In the very early 1990s, intranets and networks were still not a considered reality so the need to design a user-friendly, highly intuitive system with occasional, untrained end-users in mind was not very high on the list of priorities, and anyway, we had found this fantastic new technology that was really smart. Having said that, we also had, and still have, a small team of custodians who look after the data and they do understand how it works. These people can make the system sing. In terms of deciding what kind of system to adopt, one has to consider the two extremes – do you have a system that is very easy to use and therefore potentially simplistic to the point of limited value, or do you have a system that is subtle, demanding, feature-rich, perhaps even esoteric, and yet yields excellent results to the initiated? The correct answer is somewhere in the middle, and in my opinion, as a trained and experienced user, it lies closer to the latter. But I would say that, wouldn’t I?
One of the major advantages of our taxonomy is that it exists within what is actually a closed environment. All of the data in the project-records system, which is universally known as Ovabase, relates to our projects – those we have done, those we are doing and those for which we are bidding. This means, notwithstanding the difficulties noted previously in relation to technical language, all the thesaural terms relate to projects. This makes life significantly easier than using a taxonomy for, say, a corporate intranet environment that has heterogeneous content and often lacks overall coherence or unity of subject matter.
A particular benefit we have gained from the use of a thesaural approach to the taxonomy is in the area of our clients and collaborators. The creation of a list of organisations that exist in a hierarchical structure means we can link organisations together, and provided we search on the BT, we will also find all the projects we have carried out with a company’s subsidiaries or previous incarnations. This shows how a thesaurally based taxonomy, which uses the concept of inheritance, can be implemented in ways other than the more common subject application. Furthermore, the ability to link organisations can be applied beyond just ownership structure. Companies may also be sorted into areas such as industry sector. Thus, the thesaurus may also be used as a reference tool. Searching on the term ‘architectural practice’ without other qualification is an essentially pointless exercise since we work with architects on well over half the projects we do. We also tag companies by their location so it is possible to find, for example, all the projects outside the US that we have worked on with American companies.
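A minimal sketch of this kind of organisation search follows. The company names, project titles and the dictionary layout are all invented for illustration and do not reflect Ovabase's actual data or implementation. The sketch expands the search term downwards at query time, so that searching on a parent company or a nationality tag also picks up everything filed beneath it.

```python
# Hypothetical organisations thesaurus: narrower term -> its broader terms.
BROADER = {
    "Acme Construction": {"Acme Holdings", "contractors", "American companies"},
    "Bolt Rail": {"Acme Holdings", "contractors", "American companies"},
    "Dupont Architectes": {"architectural practice", "French companies"},
}

# Invented project records.
PROJECTS = [
    {"title": "Harbour bridge", "organisations": {"Acme Construction"}, "country": "Australia"},
    {"title": "Metro depot", "organisations": {"Bolt Rail"}, "country": "United States"},
    {"title": "City library", "organisations": {"Dupont Architectes"}, "country": "France"},
]


def narrower_terms(term, broader=BROADER):
    """Return `term` plus every term filed beneath it, at any depth."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for nt, bts in broader.items():
            if nt not in found and bts & found:
                found.add(nt)
                changed = True
    return found


def projects_with(term, exclude_country=None):
    """Titles of projects involving `term` or any of its narrower terms."""
    orgs = narrower_terms(term)
    return [
        p["title"] for p in PROJECTS
        if p["organisations"] & orgs and p["country"] != exclude_country
    ]


group_projects = projects_with("Acme Holdings")
american_abroad = projects_with("American companies", exclude_country="United States")
```

Searching on the hypothetical parent 'Acme Holdings' returns both subsidiaries' projects, while the second query returns only 'Harbour bridge', the one project with an American company outside the US.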
One of the most significant issues we had to resolve was the creation of the subject-based thesaurus, which was established on an existing, unrelated set of keywords. At that time, there were around 3,000 terms in the keyword list. To make the most of the technology, we had to create the thesaural links, remove duplicates, create synonyms and add appropriate new nodal terms – terms which you would not want to assign to a record, but on which you may well wish to search. An example of a nodal term is ‘land transport structures’, which, among its NTs, includes ‘roads’, ‘railways’ and ‘tram systems’. To help us do this, we employed a professional to get us started with the overall structure of the keywords thesaurus. This helped us understand what we needed to do to make the thesaurus work in terms of adding our own technical language and maintaining it. Maintaining the thesaurus is an extremely time-consuming and often quite complex process. We tend to add new terms only when we are sure they are necessary; in fact, it is often a case of actively trying to prove the need for a proposed new term. We usually do this by ensuring there are at least ten projects to which we can add the newly created term. Clearly, adding new terms without assigning them to any project records is not useful as it merely serves to increase the size of the keyword thesaurus without actually enhancing the ability to search.
At this stage, we also carried out a review of the existing keyword list and established that we needed to add and update particular subject-area sections. This meant involving specialists with considerable knowledge about their subject but who, unsurprisingly, knew little or nothing about thesaural, hierarchical taxonomies, or indeed, the basics of search. In hindsight this was a recipe for confusion and we ended up with some lists containing far too many terms, and others containing virtually none. When dealing with technical terms, if there is a choice between too many and too few, select the latter, unless you are certain that technical people will be searching the system and assigning the terms themselves.
Assigning keywords to a project record is also something of an art; the custodian needs to have a reasonable idea of both the terms that exist within the thesaurus and the sorts of things for which it is useful to search. Finding the right terms in the keyword thesaurus is a skill in itself. Consideration should also be given to the fact that the system uses free-text searching, and as our appreciation of the system has grown, we have come to understand that that mere fact can often obviate the need to create a keyword at all. ‘Revolving restaurants’ is a good example of a term that does not actually need to be in the thesaurus because it is so specific and unambiguous – and it contains its own context. Having said that, a word like ‘bridge’ could be someone’s name or contained in an address – New Bridge Street, perhaps. ‘Jeremy Revolving Restaurant’ is unlikely to exist, however. The problem is, of course, how one informs the uninitiated of the idea that some words need to be selected and others don’t. Also, what makes the system easier to manage doesn’t necessarily make it any simpler for occasional users.
Ovabase currently averages about 500 unique users per month, who collectively log in over 2,000 times per month. Five hundred users represent just under ten per cent of the total staff at Arup. A significant proportion of the user population have made a conscious decision not to use the system themselves, instead asking the Ovabase team to do their research for them. We also carry out training courses for interested individuals, although the majority of users have not had formal training and use the system for simple searches and one-hit answers, such as who was the project manager on the new Tate Gallery of Modern Art? Searching on ‘Tate Gallery’ and ‘modern art’ in free text is a far more appropriate search method than, say, ‘art galleries’ (in keywords) and ‘London’ (in locations).
Unlike Google or intranet searches, the answers required of Ovabase will often be lists of relevant projects, for example, ‘Airport terminals in South-East Asia in the last ten years’ and this is, in my view, a fundamental reason why a thesaurus is so useful in this kind of application. The ‘Tate Gallery’ search in the previous paragraph is also an example of a search formulation that will get a sensible set of results. The system allows other qualifications to be added such as ‘not unsuccessful bid job’ or ‘library reference exists’. The database is fully linked to our journals catalogue in which we maintain a record of articles about any of our projects and the keyword signifying an article’s existence is automatically generated.
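A list-style query with qualifiers of this kind reduces to Boolean set operations over indexed terms. The sketch below is an assumption-laden illustration rather than a description of Ovabase: the `INDEX` contents and job numbers are invented, and the date restriction is left out to keep it short.

```python
# Hypothetical inverted index: indexed term -> set of job numbers.
INDEX = {
    "airport terminals": {101, 102, 103, 104},
    "South-East Asia": {102, 103, 104, 205},
    "unsuccessful bid job": {103},
    "library reference exists": {102, 205},
}


def jobs(term):
    """Job numbers indexed under `term`; an unknown term matches nothing."""
    return INDEX.get(term, set())


# AND is set intersection; NOT is set difference.
shortlist = (jobs("airport terminals") & jobs("South-East Asia")) - jobs("unsuccessful bid job")

# Narrow further to jobs that have an article in the journals catalogue.
with_articles = shortlist & jobs("library reference exists")
```

With this invented data, `shortlist` holds jobs 102 and 104, and `with_articles` holds job 102 alone.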
Since we started using the thesauri, the system has been through three different interfaces: Oracle Forms, Visual Basic (which was my personal favourite, but which was very slow over the wide area network) and now a browser-based intranet version. The current version no longer uses the BRS search engine and the thesaural elements are totally bespoke; the free-text search engine is Oracle Intermedia.
If implementing a taxonomy in a similarly closed environment, what might you have learnt from our experience?
- If you are starting from scratch, keep the taxonomy simple. Do not over-complicate and do not create terms for the sake of it (as a result of the legacy we inherited, we had little choice);
- Do not create terms that are certain to be in the text of a document or database record and that are, in any way, ambiguous;
- Balance usability and sophistication;
- Consider developing a very simple front-end for occasional users, which effectively hides the thesaurus;
- Use taxonomies for areas other than subject searches, such as countries, client bodies and external organisations, and internal organisation structures;
- Find skilled and committed individuals to manage the data. People who are asked to find information – and who take satisfaction from finding the right answers – will inevitably enter good data;
- Try to avoid splitting the data input and output roles at all costs. Make them two sides of the same coin;
- Use thesaural terms as qualifiers for free-text searches;
- Users are not comfortable using thesaural taxonomies;
- Users don’t like being forced to choose from pre-defined lists unless the lists are very short;
- Users expect to find everything very easily and get discouraged when they don’t succeed;
- The thesaural approach enables us to search on more keywords than we assign;
- Work together with subject area specialists to develop mini-hierarchies for their discipline and discourage them from adding every possible keyword. Information professionals should create the hierarchy;
- Creating and maintaining such thesauri is incredibly time consuming.
The fact that about 500 users per month use the system, and not all of them have had training, strongly suggests that the system can still be used successfully in a relatively simple manner. A skilled user combining free text with the four thesauri and using the Boolean operators that exist can perform extremely precise searching with very few mis-hits or missed hits.
Julian Diamond is associate director and business information manager at Arup. He can be contacted at email@example.com.