posted 7 Feb 2002 in Volume 5 Issue 5
Taxonomies in the corporate marketplace
The Factiva experience
While many large corporations are struggling with the problems associated with implementing an effective enterprise-wide taxonomy using commercially available software, Factiva has been categorising vast amounts of data for years. Simon Alterman outlines Factiva’s experience in introducing commercial categorisation software and discusses the lessons the company has learned as it faces the challenge of applying a complex taxonomy to a constantly growing repository of information.
During 2000, ‘taxonomy’ became a buzzword for many major organisations. The rush towards intranet development meant that internal content was of critical importance; zooming straight in on the content a user actually wanted at a given time proved trickier. Applying some form of content categorisation looked to many like the logical next step.
By the end of 2001, things had moved on a little. A number of organisations, including some of the largest, had yet to dip their toes in the taxonomy waters. Others had already made their first attempt to implement a taxonomy and categorisation solution. Of those who had taken the plunge, by no means all emerged feeling healthy and invigorated.
The situation was reviewed in a Factiva-sponsored report by international knowledge and information management consultancy, TFPL (www.tfpl.com/consultancy/taxonomies/Taxonomy_research_2001/taxonomy_research_2001.html). The report outlines technological approaches to categorisation, reviews main current vendor offerings in the field and provides a series of profiles of implementations at bodies including pharmaceuticals company AstraZeneca, consultancy KPMG and the United States Postal Service. It sums up the current state of the market thus: “The users have all been through one or more iterations of an architecture, the first phase of which was a reaction to a perceived problem, either information overload or lack of clarity in navigation. This perception was dealt with through acquisition of a software package to tackle the problem, often on a departmental or group basis. In most cases this solution was not successful.”
At a TFPL conference in November 2001, called to discuss the report, users shared their views in more detail. A few remained confident advocates of fully automated categorisation solutions; a clear majority were using or expecting to use systems that assume a heavier degree of human input. Of those who had already embarked on a project, all agreed that the level of human effort required to get the most out of the systems was much higher than they had initially anticipated. One participant described the forum as “a chance to share the pain”.
To Factiva, these outpourings came as no surprise. Years of experience had led to the conclusion that effective categorisation is indeed extremely difficult.
The Factiva taxonomy
Factiva was established in mid-1999 as a joint venture between publishing and financial information giants, Dow Jones and Reuters. It provides a combined global news and business information service through websites and content integration solutions, and has more than 800 employees in 58 offices across 34 countries in Europe, Asia and the Americas.
Factiva’s global content comprises nearly 8,000 news and business information sources, including newspapers, business magazines, trade publications, business newswires, press releases and media transcripts. Content is loaded in 22 languages and more than 700 non-English sources exist. Approximately 110,000 new documents are added every day.
Factiva’s mission is to be “the indispensable provider of business information and customised solutions that inspire our customers’ best business decisions, globally and locally” and its objective is to embed Factiva news and information in the mission-critical applications (intranets, EIPs, CRM systems, etc) of its clients.
Use of content classification metadata is a fundamental part of this strategy. Correct application of the metadata allows users of Factiva’s research services to locate the news and information they are looking for. Correct exposure of the metadata within Factiva’s XML document format then allows those same users to integrate the data into their EIP and CRM systems.
The Factiva taxonomy has its origins in the Finsbury Data Service classification system, used in the former ‘Textline’ service. Finsbury Data Service was taken over by Reuters in 1986, which then developed and enhanced the original model within the framework of the Reuters Business Briefing product line. When that business was merged with the equivalent business from Dow Jones in 1999, many innovations from the Dow Jones Interactive product line were integrated with those developed by Reuters to form the current system, which is branded as Factiva Intelligent Indexing.
The taxonomy currently consists of four main vocabularies:
- Company: codes for over 300,000 entities, giving exceptionally wide coverage of newsworthy quoted and unquoted companies around the world;
- Region: contains over 370 terms, including all countries, sub-national terms for the US and Canada, and terms for groups of nations;
- Industry: contains over 740 terms for areas of industrial and business activity;
- News subject: contains over 430 terms; the main groups are Corporate, Economic, Market, General, Political, International Political-Economic Organisations and Content Types (covering frequently requested options, such as in-depth analysis, polls, personality profiles and transcripts).
The list of region, industry and news subject terms is kept under regular review and updated every three months, while the list of companies is continuously updated as new firms come into the news or as existing ones change their names or ownership structures.
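The four vocabularies can be pictured as parallel controlled lists, with each document carrying codes drawn from any of them. A minimal sketch in Python, in which all code values are invented for illustration (the article does not give Factiva's actual code identifiers):

```python
# Sketch of a multi-vocabulary taxonomy: each document carries codes
# drawn from several independent controlled vocabularies.
# All code values below are hypothetical illustrations.

taxonomy = {
    "company":      {"ACME", "GLOBEX"},        # ~300,000 entities in practice
    "region":       {"UK", "US", "EUROPE"},    # over 370 terms in practice
    "industry":     {"I_AUTOS", "I_PHARMA"},   # over 740 terms in practice
    "news_subject": {"N_MERGERS", "N_POLLS"},  # over 430 terms in practice
}

def validate_codes(doc_codes: dict) -> list:
    """Return any codes not present in the controlled vocabularies."""
    bad = []
    for vocab, codes in doc_codes.items():
        allowed = taxonomy.get(vocab, set())
        bad.extend(c for c in codes if c not in allowed)
    return bad

article = {"company": ["ACME"], "region": ["UK"], "news_subject": ["N_MERGERS"]}
print(validate_codes(article))   # [] : every code is in its vocabulary
```

Keeping the vocabularies independent in this way is what lets region, industry and subject lists be revised on their own quarterly cycle while the company list changes continuously.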
The taxonomy application challenge
In the words of Anthony Capon, Factiva’s director of content management and the person responsible for ensuring that the content on the company’s information services is correctly categorised: “We have tried just about everything.”
Historically, information was indexed manually by human editors using relevant codes selected from the taxonomy. However, as the volume of data to be processed continued to grow, it ceased to be feasible in terms of either time or cost to process all data this way, and automatic application methods were thus progressively introduced alongside manual application. In cases where the original publishers of news data had already applied coding, mapping tables were developed from publisher codes to the Factiva coding system. Simple tools were also built internally to spot regularly occurring text strings in highly formulaic source material, such as standard headline formats in Reuters news wires, and automatically apply the appropriate Factiva codes. Commercial software was also deployed, which uses a complex rule-based approach where human editors construct queries consisting of text strings with wildcards, Boolean operators and relevancy weightings in order to identify documents that match the concepts in the taxonomy.
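The two simpler automatic methods just described, publisher-code mapping tables and pattern matching on formulaic text, can be sketched as follows; the publisher codes, Factiva codes and headline pattern are all invented for illustration:

```python
import re

# Hypothetical mapping table from one publisher's own codes to Factiva codes.
PUBLISHER_MAP = {"PUBX:MRG": "N_MERGERS", "PUBX:POL": "N_POLITICS"}

# Hypothetical string rule: formulaic newswire headlines such as
# "ACME CORP - Q3 RESULTS" reliably signal a results story.
RESULTS_HEADLINE = re.compile(r"-\s*Q[1-4] RESULTS\b", re.IGNORECASE)

def auto_code(publisher_codes, headline):
    codes = set()
    for pc in publisher_codes:                 # mapping-table pass
        if pc in PUBLISHER_MAP:
            codes.add(PUBLISHER_MAP[pc])
    if RESULTS_HEADLINE.search(headline):      # string-rule pass
        codes.add("N_RESULTS")
    return codes

print(auto_code(["PUBX:MRG"], "ACME CORP - Q3 RESULTS"))
# a set containing N_MERGERS and N_RESULTS (display order may vary)
```

The fragility the next paragraphs describe is visible even here: every change to the taxonomy means revisiting each publisher-specific mapping table and each hand-built pattern.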
The strengths and weaknesses of each system are clear. Manual coding can be of very high quality, though considerable effort needs to be expended in setting guidelines, monitoring application and supplying feedback to ensure that all coders have a sufficiently common view of when to apply a given code and that inter-coder variability is kept to a minimum. However, scalability is extremely difficult to achieve in the face of mounting volumes of material.
Publisher code matching and simple text string rules can be highly effective if publishers are consistent in what they publish and if the taxonomy to be applied is relatively static. But if you have built several hundred publisher-specific files to match your taxonomy and you then change that taxonomy once every three months, maintenance becomes a serious problem.
Complex rule-based approaches can be highly effective in generating accurate results. However, they are labour-intensive to create. Experienced users will tell you that creation, testing and refinement could easily take six to eight hours per topic, a significant cost when spread over a multi-term taxonomy, and they also tend to be much better suited to precision than recall. And again, they have a tendency to decay over time if not properly looked after; in many taxonomies, particularly news-focused ones, the language used to describe a particular subject will change over time and, without regular maintenance, the performance of rule-based topics will decline.
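A rule-based topic of the kind described, combining text strings, wildcards and relevancy weightings, might look something like this in highly simplified form; the terms, weights and threshold are invented for illustration:

```python
import re

# Hypothetical weighted rule for a mergers-and-acquisitions topic.
# Each pattern contributes its weight when it matches the text;
# the topic fires if the total score reaches the threshold.
MNA_RULE = {
    "patterns": [
        (re.compile(r"\bmerg\w*", re.I), 3),    # merger, merges, merging...
        (re.compile(r"\bacqui\w*", re.I), 3),   # acquire, acquisition...
        (re.compile(r"\btakeover\b", re.I), 2),
        (re.compile(r"\bbid\b", re.I), 1),
    ],
    "threshold": 4,
}

def rule_matches(rule, text):
    score = sum(w for pat, w in rule["patterns"] if pat.search(text))
    return score >= rule["threshold"]

print(rule_matches(MNA_RULE, "Acme launches takeover bid to acquire Globex"))  # True
print(rule_matches(MNA_RULE, "Acme reports quarterly bid prices"))             # False
```

Multiply a handful of patterns like these by the testing and refinement needed per topic, and by several hundred topics, and the six-to-eight-hour figure quoted above becomes easy to believe.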
Trying a new approach: selecting a vendor
By the end of the last decade it was clear that the systems in use were feeling the strain and that a new approach was needed. Sophisticated techniques for computational linguistic analysis and for statistical text processing were available on the market. These skills were not available in-house, so it made sense to seek outside help. Reuters, by then one of Factiva’s two parents, kicked off a process to find a technology partner capable of fitting the bill.
While finding a technology that could be proven to work was a prerequisite, the selection criteria were far wider than that. The chosen vendor also had to demonstrate that it had both the commitment and the financial backing to stay in the market; it had to show it was committed to developing and enhancing its technology over time and in an ever-changing technical landscape; it had to prove it was sufficiently plugged into current academic research in natural-language processing; and it had to have the clear capability to handle material in multiple languages, including non-European languages. It also had to show that it would view its relationship with a major business information supplier as a long-term partnership and collaboration, rather than simply taking a vendor-customer approach.
California-based Inxight came out at the top of the vendor shortlist. “It didn’t have a product at the time, so it wasn’t the obvious choice,” says Jo Rabin, who is now an independent consultant advising companies on information dissemination strategy and who led Reuters’s vendor selection process. “But when we looked at all the selection criteria, it was the clear leader. For example, it had proven expertise at handling multiple languages. Other vendors just didn’t have this dimension. It had ongoing access to research into linguistics and statistical methods and it had access to Xerox PARC – the birthplace of the mouse and Windows. So clearly Inxight was going to remain among the leaders in the field for a reasonable period.”
Inxight’s Categorizer technology adopts the approach of coding by example. The software uses its linguistic analysis capabilities to detect similarities between the document it wants to code and the evidence in a ‘training set’, an archive of stories that have already been correctly coded. It then codes by example, using its statistical algorithms to infer the probable coding for the new document from the evidence in the training set.
The perceived advantage of the system is that Inxight effectively creates its own set of rules for deciding how category codes should be applied, eliminating the need for human editors to spend time constructing and maintaining complex string-based queries. It then uses these rules to apply codes to new articles as they are submitted to it.
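Inxight's actual algorithms are proprietary, but the coding-by-example idea can be illustrated with a simple centroid-style classifier: each code's training documents are reduced to a word-frequency profile, and a new document is scored against each profile by cosine similarity. A sketch, with an invented two-code training set:

```python
from collections import Counter
import math

def profile(docs):
    """Build a word-frequency profile from a list of documents."""
    c = Counter()
    for d in docs:
        c.update(d.lower().split())
    return c

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency profiles."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical training set: two codes, two sample stories each.
training = {
    "N_MERGERS": ["acme agrees merger with globex",
                  "board approves acquisition offer"],
    "N_SPORTS":  ["united win league title",
                  "striker scores twice in final"],
}
profiles = {code: profile(docs) for code, docs in training.items()}

def classify(text):
    """Return (best_code, confidence) for a new document."""
    scores = {c: cosine(profile([text]), p) for c, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

code, conf = classify("globex shareholders back merger")
print(code)   # N_MERGERS
```

Even this toy version shows why the quality of the training set matters so much: a miscoded or under-represented example distorts the profile that every future document is measured against.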
Inxight was asked to develop a prototype automated categoriser and put it through an initial proof of concept, in which experienced human coders scored its output. First signs were good, contracts were signed and a full implementation was planned.
Testing, testing, testing
This was only the start of a long story. Given its extensive knowledge of the difficulties of applying accurate and consistent coding, Factiva never expected to unpack a piece of software, fire it up and get going. The expectation was that several months would be spent getting the system ready for operational use.
The first major task was to establish the training set of well-coded sample data, with a minimum of 30 examples per code. The initial hope was that Factiva could take a corpus of news items it had already coded and input it directly to the system, but tests on that body of data were disappointing. Analysis showed that the distribution of codes on the initial data set was too narrow to give adequate representation for every code in the taxonomy. Also the software was highly sensitive to any inconsistencies in the coding of the sample data. Closer analysis revealed that not all of the codes on the pre-coded data were perfectly applied.
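Problems of this kind, in particular codes with too few examples, are exactly what a simple audit of the candidate training corpus can surface. A sketch, assuming the minimum of 30 examples per code mentioned above:

```python
from collections import Counter

MIN_EXAMPLES = 30  # minimum training examples per code, per the project requirement

def audit_training_set(coded_docs):
    """coded_docs: list of (doc_id, [codes]) pairs.
    Return the codes that fall below the minimum example count."""
    counts = Counter()
    for _doc_id, codes in coded_docs:
        counts.update(codes)
    return {code: n for code, n in counts.items() if n < MIN_EXAMPLES}

# Hypothetical corpus: one code well represented, one badly under-represented.
corpus = [(i, ["N_MERGERS"]) for i in range(40)] + \
         [(i, ["N_POLLS"]) for i in range(40, 45)]
print(audit_training_set(corpus))   # {'N_POLLS': 5}
```

What a count cannot catch is the second problem Factiva found, inconsistent coding within the examples themselves, which is why the re-coding had to be done by hand.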
After this first setback, it was decided that the only way forward was to put together a new body of sample data and manually re-code every single article to ensure that it was of the highest possible quality. Given the range and complexity of the taxonomy, this in itself proved a major exercise. The good news, though, was that the work effort required to establish a set of sample data for a straightforward topic was in the range of two to three hours, significantly less than the six to eight hours required to set up a query using a previous-generation, rule-based categorisation system.
A parallel step was to unpack and learn to use the various analysis tools associated with the Inxight Categorizer. This, too, took time. The initial Categorizer package was delivered as an API toolkit and the editorial users had first to understand what the tools could do, and then learn to articulate to Factiva’s own technical staff how they wanted the tools to look. The editorial users were former manual coders and had not previously had to work this way, so the learning process was steep.
The results of this experience were fed back to Inxight, which responded by developing an out-of-the-box graphical user interface to help users more easily build up an initial corpus of training data, identify suspect codes, test stories, and add and remove training evidence in order to improve performance.
The third part of the task was to test and optimise the performance of the system in order to meet Factiva’s business requirements. Many man-weeks were spent testing the Categorizer’s performance against different scenarios. Could it handle all types of code in the Factiva code set? Could it cope with stories written in different styles: newswire stories, more discursive feature-type articles, stories in publications where the writer’s first language was demonstrably not English? Could it handle the introduction of new topics to the taxonomy? Could it code consistently over time and cope with changes in news flow and vocabulary? Most of all, could it make the right judgements as to which articles could be automatically loaded and which could not?
It was always the expectation that the Inxight system would be able to do some things perfectly, fully automatically, but that there would be other concepts that it would always struggle with. As Inxight’s algorithms assign a confidence level for each code applied, the intended approach was to ensure that any article that had low-confidence codes would be sent for manual editorial review before being posted to Factiva’s services. Expected levels of confidence can be set for each individual code.
For instance, with subjects of high business interest to Factiva’s customers and with wide-ranging vocabularies, such as mergers and acquisitions or government regulation, very high levels of expected confidence would be set so that any borderline cases would be manually checked for accuracy. More predictable and less business-oriented topics, such as sports results stories, might require lower levels.
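The routing logic described here, per-code expected confidence levels with borderline documents diverted to editors, can be sketched as follows; the codes and threshold values are invented for illustration:

```python
# Hypothetical per-code confidence thresholds: high-stakes topics such as
# mergers demand near-certainty before auto-loading; routine topics less so.
THRESHOLDS = {"N_MERGERS": 0.95, "N_REGULATION": 0.95, "N_SPORTS": 0.70}
DEFAULT_THRESHOLD = 0.85

def route(assigned):
    """assigned: {code: confidence} as applied by the categoriser.
    Auto-load only if every code meets its expected confidence level;
    otherwise send the whole article for manual editorial review."""
    for code, conf in assigned.items():
        if conf < THRESHOLDS.get(code, DEFAULT_THRESHOLD):
            return "manual_review"
    return "auto_load"

print(route({"N_SPORTS": 0.80}))                     # auto_load
print(route({"N_MERGERS": 0.90, "N_SPORTS": 0.99}))  # manual_review
```

Tuning amounts to moving these thresholds per code: raise them and more articles are kicked out for review; lower them and more load automatically, at some risk to accuracy.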
Intensive testing was done on typical sample publications to tweak the parameters and ensure that the right articles were kicked out for review, and that articles were only automatically loaded if all the right codes were applied. The initial contractual target was that Inxight’s Categorizer should be able to code 45 per cent of stories automatically to perfect quality standards, where an expert human coder would neither add nor remove any codes. Initial performance during testing was well below that. Yet by using the Inxight analysis tools over several days, grooming and refining the training data and tweaking the expected levels of precision and recall on each code, Inxight’s ‘judgement’ was brought into line with that of a human editor. Accuracy soared from initial levels and final performance comfortably exceeded the target.
It was long, it was slow, it was hard, but seven months after implementation began the service was approved for operational use and passed into production. Even after that the development process continues. New codes are introduced, language shifts, news focus changes; but now, unlike with the rule-based approaches, the system itself provides an alert to these shifts. When it encounters material it is less familiar with it will apply codes at lower confidence levels, send them for review and the editorial staff will be immediately alerted to the need to make further adjustments.
The experience confirmed the Factiva belief that technology by itself is not the answer, but that the intelligent combination of technology and human expertise is. The Categorizer works for the company not simply because it is smart technology, but because it is deployed drawing on the skills and experience of the former elite manual coders who first set it up and who now monitor its performance, testing, tweaking and fine-tuning.
The TFPL information architecture study concludes: “It is clear... that there are no overall solutions provided by software alone. All of the experience in the case studies would point to the need to augment, significantly in some cases, the capability of the software with human intellectual processing. At best software provides the starting point.”
Factiva would wholeheartedly agree. The process is much like flying an airliner. The cockpit is stuffed with sophisticated computers, but you still need a pilot to fly the plane, and it is only because of the accumulated human skill and knowledge about how to fly that the computers could be built in the first place.
What next? Expect the process to be a tough one for a while longer. Expect several areas of cost: defining your initial taxonomy; buying some software tools to do categorisation; and using human effort to get the tools to work in the way you want them to, and to tidy up what the tools still cannot do. Expect to have to invest in search and navigation tools to leverage the metadata you have applied (the range of offerings that allow an end-user to construct an easy metadata-enabled search for ‘this and this concept, but not that concept’, rather than drilling straight up and down a tree structure, is still astonishingly small) and expect to keep on monitoring, improving, enhancing and extending. Expect technology to improve, vendors to consolidate, customers to become more realistic in what they hope to achieve and more willing to seek outside expertise – editorial as well as technical – to help them achieve results.
But expect it to be worth it. The information you need in order to be able to do your job is probably already there in your organisation somewhere and it is worth finding. Just as the news article that answers your business question is there in the 200-million-plus documents that Factiva holds in its products. Yet having it is not the key. Finding it is.
1. Mahon, B., Hourican, R. & Gilchrist, A., Research into Information Architecture: The Roles of Software, Taxonomies and People (TFPL Ltd, November 2001)
Simon Alterman is vice president and director, editorial and content management, at Factiva, a Dow Jones & Reuters company. He can be contacted at: email@example.com