posted 14 Jan 2005 in Volume 8 Issue 4
HP: Broaden your horizons
Developing and applying effective search strategies in a multilingual environment. By Daniel Amor
As many organisations are now becoming global, operating in numerous diverse regions, there is a growing need to develop business and internet strategies that reflect this. Over the past few years, it has become clear that global businesses are only successful if they act locally, which requires the adoption of local customs and languages. While these factors improve worldwide business initiatives, they have created a more complex environment in which finding the right information is a significant challenge.
To manage vast quantities of information, an effective strategy involving high-quality information architecture and information-management practices is required. A good starting point for evaluating what needs to be done is to implement a formal assessment of the existing business environment. Such an assessment often shows that systems, solutions and content can be consolidated, and that search-engine technology can be improved. Likewise, new taxonomies can be created to manage documents and data with increased efficacy.
Caring for your environments
To decide what steps need to be taken, it is crucial to understand the types of content environments that exist and where your own company sits in relation to these. Each environment requires a distinct strategy. The four most common environments are: single language with dedicated content (a standard environment, not covered in this article); multiple languages with identical content; multiple languages with localised content; and single language with localised content. Typically, multilingual content sites are a mix of the latter three environments and are therefore extremely complex.
Multiple languages with identical content
This type of environment is typically used for content categories such as manuals, global news items, global marketing information and global job descriptions, and is the standard for websites in countries like
Multiple languages with localised content
This environment is usually used for content categories such as product information, local news, local marketing and local job descriptions, and is typical for pan-European or global-brand websites. In this environment, information is shared across languages, target groups and countries, but adapted for local needs. For instance, the top-level navigation of a global brand will be the same for all countries, but the actual product offerings, news items and legal pages may differ on a country-to-country basis.
Single language with localised content
This category encompasses international websites that contain product information. Websites covering countries like the
The next stage, in order to analyse existing content within your organisation, is to establish content categories and identify which category fits into which environment. In global corporations, content is typically hidden in many different systems. Ideally, content should be stored in a content-management system, but the reality is usually a hotchpotch of financial systems, training systems, production and human-resources systems. What’s more, there is usually more than one version of each of these systems, which can vary greatly depending on the country or business unit in question.
For example, a few years ago there were several HR systems in place at Hewlett-Packard, making directory searches and the retrieval of HR-related information particularly complex. To resolve this situation, we had two options available to us: we could either connect the existing systems, creating a complex search solution to span them all, or we could consolidate the HR systems into a single system and create a simple search solution. While the first approach would have been quicker to implement, the complexity and cost would have made it difficult when we needed to implement changes further down the line. Hewlett-Packard decided on the second approach, and now has only one HR system in place, which operates globally and across all business units. Information from legacy back-end systems is stored in various formats, and format translation needs to take place before this information can be presented over the web.
Regardless of the approach you choose, however, once you have consolidated content from a set of systems, content categories can be created. In Hewlett-Packard’s case, we were able to create HR content categories such as personal information, financial information, HR internal information and so on. These categories were designed to be independent of the information source, which means that the company portal does not need to understand where different content comes from and how it is produced. Rather, the portal is able to access content through a single mechanism, which is crucial because search engines will use the same mechanism to index information and present the search results to the end user.
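As a rough illustration of source-independent categories behind a single access mechanism, the following Python sketch may help. It is not Hewlett-Packard's implementation; the names `ContentItem`, `LegacyHrSource` and `ContentAccess`, and the record fields `cat`, `lang` and `text`, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class ContentItem:
    category: str  # e.g. "personal information"
    language: str  # e.g. "en"
    body: str


class ContentSource(Protocol):
    """Any back-end system that can deliver items in the common shape."""

    def items(self) -> Iterable[ContentItem]: ...


class LegacyHrSource:
    """Hypothetical adapter that translates legacy records into ContentItem."""

    def __init__(self, records: List[Dict[str, str]]):
        self._records = records

    def items(self) -> Iterable[ContentItem]:
        for rec in self._records:
            yield ContentItem(rec["cat"], rec["lang"], rec["text"])


class ContentAccess:
    """Single mechanism used by both the portal and the search indexer."""

    def __init__(self, sources: List[ContentSource]):
        self._sources = sources

    def by_category(self, category: str) -> List[ContentItem]:
        # Callers never learn which back-end system an item came from.
        return [
            item
            for source in self._sources
            for item in source.items()
            if item.category == category
        ]


access = ContentAccess([
    LegacyHrSource([
        {"cat": "personal information", "lang": "en", "text": "Address change form"},
        {"cat": "financial information", "lang": "de", "text": "Gehaltsabrechnung"},
    ])
])
print(access.by_category("personal information"))
```

Because the portal and the indexer both call `by_category`, adding or retiring a back-end system only means adding or removing an adapter.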
Content categories represent concepts that are the same irrespective of the language used. Due to local differences, not all content categories will be used in all languages, but having these content categories available at a global level allows a clearer view of what content is available across the entire organisation. My own website, www.ebusinessrevolution.com, is available in 12 languages, for instance. The major categorisation is always the same in the navigation: ‘Homepage’, ‘Books’ and ‘About the author’. The subcategories, though, vary from language to language, based on the books available in that language and the accessible content.
Multilingual content management
Before you begin working on a taxonomy, it is important to make sure that enough content is already developed to enable the taxonomy to be tested against some real data. If content is not readily available, the taxonomy may not fit your requirements further down the line. Unfortunately, of course, this creates a vicious circle: if content is developed, it needs to be categorised, but for categorisation to take place, content is required. In my experience, there is no simple solution to this problem. However, if an existing website is available, companies can use the taxonomy that is already in place for content-development purposes, and start to work on the taxonomy itself once a few content items per category are in place. While you add content, check that the taxonomy is still valid and adapt as you go. If a taxonomy is being created from scratch, start with a high-level taxonomy, develop the content and fine-tune the taxonomy at regular intervals.
While this approach is independent of the languages a global company may need to support, there are additional factors that need to be taken into consideration when creating content. Now that each content category can be attributed to one of the three environments described above, the content workflow also needs to be adapted for these environments.
If you are working with identical content in multiple languages, it makes sense to identify a master language. The translations can be made internally or externally and included in an automated workflow. Once a content item is ready in the master language, it will be sent out for translation; when the translations come back, the content item can be published. In this case, the creation of a taxonomy is the same for all languages, as the concepts are identical.
If the content category falls into the second group (where each language has differing content), it is still possible to use a master language, but during the publishing process those who represent the other languages/countries involved can decide if they want to translate, adopt or refuse to use the content developed in the master language. In addition, they may add content that is specific to that language or country. In both instances, a taxonomy needs to be developed for each of these segments.
Third, in cases where there is only one language but different localised content, content should be revised each time it appears on the web and the taxonomies must be kept separate.
Besides the formal categorisation, additional metadata should be added to content items. For all items, a language needs to be defined. This can be done relatively easily by using XHTML, which is very similar to HTML and is understood by all newer browsers. If a whole web page is in a single language, the language can be defined in the header. In the following example, the XHTML defines the web page as being in English:
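A page-level declaration of this kind, and a way to check it, could look like the Python sketch below. The embedded XHTML is an illustrative reconstruction, not the article's original listing; it assumes the language is declared with `xml:lang` and `lang` attributes on the root `html` element, and `page_language` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The "xml" prefix is predefined and always maps to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative XHTML page declared as English on the root element.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Example page</title></head>
  <body><p>Dot.coms may be dead, but the Internet is alive and kicking.</p></body>
</html>"""


def page_language(xhtml: str) -> Optional[str]:
    """Return the language declared on the root element, if any."""
    root = ET.fromstring(xhtml)
    # Prefer xml:lang, fall back to the plain lang attribute.
    return root.get(XML_LANG) or root.get("lang")


print(page_language(PAGE))
```

A check like this could run as part of the publishing workflow so that no page goes live without a declared language.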
If pages are multilingual, it is possible to define the language of individual content items. Below is a snippet of the code for the screenshot shown on this page:
Le dot.com forse hanno segnato il passo, ma Internet è viva e vegeta!
Dot.coms may be dead, but the Internet is alive and kicking.
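To show how per-item language markup can be consumed, the Python sketch below wraps the two sentences above in an assumed XHTML fragment and groups the text by language. The markup is an illustrative guess at how such a page might be tagged, not the original snippet, and `text_by_language` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed markup: the container is English, one paragraph overrides it.
FRAGMENT = """<div xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <p xml:lang="it">Le dot.com forse hanno segnato il passo, ma Internet è viva e vegeta!</p>
  <p>Dot.coms may be dead, but the Internet is alive and kicking.</p>
</div>"""


def text_by_language(elem, inherited=None, out=None):
    """Group leading element text by the xml:lang in effect for that element.

    An element without its own xml:lang inherits the language of its parent.
    Tail text (text following a child element) is ignored in this sketch.
    """
    out = {} if out is None else out
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        out.setdefault(lang, []).append(elem.text.strip())
    for child in elem:
        text_by_language(child, lang, out)
    return out


grouped = text_by_language(ET.fromstring(FRAGMENT))
for lang, texts in grouped.items():
    print(lang, texts)
```

An indexer could use the same inheritance rule to route each content item to the correct language-specific index.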
Search engines are able to use this type of information to gain a better understanding of which language the content is written in. Most search engines are able to make ‘educated guesses’, but the more information you provide, the better the search results will be. Don’t forget to add keywords and a description to the header as you would do in single-language environments, and make sure that you denote the language there as well:
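Header entries of this kind could be generated as in the Python sketch below. It assumes the language is denoted with `xml:lang` and `lang` attributes on each `meta` element; the function name `meta_tags` and the sample values are hypothetical.

```python
from typing import List
from xml.sax.saxutils import quoteattr


def meta_tags(lang: str, description: str, keywords: List[str]) -> str:
    """Build description and keywords meta elements tagged with a language."""
    # quoteattr escapes the value and adds the surrounding quotes.
    lang_attrs = f"xml:lang={quoteattr(lang)} lang={quoteattr(lang)}"
    joined = ", ".join(keywords)
    return "\n".join([
        f'<meta name="description" {lang_attrs} content={quoteattr(description)} />',
        f'<meta name="keywords" {lang_attrs} content={quoteattr(joined)} />',
    ])


print(meta_tags("it", "Libri su Internet e l'e-business", ["internet", "e-business"]))
```

Generating these entries from the same metadata that drives the content workflow keeps the declared language and the actual language from drifting apart.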
Encoding issues relating to the different languages used also need to be addressed. Typically, the character set employed is ISO-8859-1, which covers most western European languages. If you plan to develop a Russian, Arabic or Chinese version of your website, however, this character set cannot represent the characters required. In my experience, it is not advisable to use single-language code pages, as every additional language will require a new code page. Try using UTF-8, an encoding of Unicode that covers the characters of almost all languages, making it relatively simple to add another language to the website.
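The difference between the two encodings can be checked directly. The short Python sketch below tries to encode sample greetings in ISO-8859-1 and in UTF-8; the sample strings and the helper name `can_encode` are chosen only for illustration.

```python
SAMPLES = {
    "German": "Grüße",
    "Russian": "Привет",
    "Arabic": "مرحبا",
    "Chinese": "你好",
}


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for language, text in SAMPLES.items():
    print(language, can_encode(text, "iso-8859-1"), can_encode(text, "utf-8"))
```

Only the German sample fits into ISO-8859-1, while all four samples can be encoded as UTF-8.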
Legal, cultural and social issues
In multilanguage, global environments, it is not possible to simply translate content and taxonomies. Local needs, such as legal, cultural and social issues, must be taken into account during content and taxonomy creation as well as during search. The way content is displayed, for example, may be influenced by legal, cultural or social issues, as some countries restrict the display of personal information.
Search tools will also be used in different ways according to the social context. Key words, concepts and even images will be used in different ways to locate information. There may also be information on a site that could be offensive or confusing in certain cultures. It is worth involving local users and native speakers to ensure their experiences and knowledge are incorporated into the site and the search functionality.
Creating a meaningful taxonomy
In order to ensure search results are meaningful for the end user, it is necessary to create a taxonomy that takes into account and supports the users themselves. Users can be segmented into different groups and content items mapped by user group to ensure the search results are relevant. For example, ‘I live in
As such, a taxonomy needs to be created for each segment of the target audience, as defined by region or country, language, role and so on. The effort involved here should not be underestimated; the number of variables and the number of people that need to be involved can be enormous.
To make it easier to find documents, you may also decide to limit the vocabulary employed in each segment. Ensure that the keywords used are understood by the target audience. Decipher those abbreviations used by the organisation and, where possible, refer to synonymous terms. Every sub-taxonomy must fit the needs of the relevant user group, but companies should also be pragmatic in their approach so as not to place too much pressure on company resources.
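One way to picture a limited vocabulary per segment is a small thesaurus that maps abbreviations and synonyms to the preferred term of each segment before a query is run. The Python sketch below is only an illustration; the segment codes, the vocabulary entries and the function name `normalise_query` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical per-segment thesaurus: variant term -> preferred term.
THESAURUS: Dict[str, Dict[str, str]] = {
    "en-GB": {"hr": "human resources", "cv": "curriculum vitae", "resume": "curriculum vitae"},
    "en-US": {"hr": "human resources", "cv": "resume"},
}


def normalise_query(query: str, segment: str) -> List[str]:
    """Replace each query word with the preferred term of the segment.

    Words without an entry, and segments without a thesaurus, pass through
    unchanged (apart from lower-casing).
    """
    mapping = THESAURUS.get(segment, {})
    return [mapping.get(word, word) for word in query.lower().split()]


print(normalise_query("HR CV", "en-US"))
print(normalise_query("resume tips", "en-GB"))
```

Keeping the thesaurus separate for each segment lets the same query word resolve to different preferred terms, which matches the idea that every sub-taxonomy must fit its own user group.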
Again, the taxonomy-creation process represents something of a vicious circle. But don’t waste too much time trying to decide whether to focus on content and user categories or the formal taxonomy first. In the long run, it won’t actually matter. Start with one and, once both are in place, go back and forth to refine the definitions. Only by doing both will you be able to see what is possible on a pragmatic level.
The ongoing challenge
Inevitably, a taxonomy must evolve as the organisation does. Over time, new categories will need to be added and the hierarchy brought up to date. Relationships between terms should also be closely monitored as content grows and new terms for existing concepts are introduced. A taxonomy is not static; rather, it is constantly changing, just as your website and search requirements will change. But if you plan for change from the start, your business will benefit accordingly.
Taxonomy maintenance should be initiated through regular reviews. These reviews should in turn be based on two specific quantitative measures: noise, which counts the retrieved documents that should not have been retrieved for a given search enquiry; and silence, which counts the documents that should have been retrieved but weren’t. A more qualitative evaluation can be obtained by assessing the experiences of end users, perhaps through a formal satisfaction survey.
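Both measures can be computed from two sets: the documents a query actually returned and the documents a reviewer judged relevant. The Python sketch below assumes such relevance judgements exist; the function name `noise_and_silence`, the added ratio fields and the document identifiers are illustrative only.

```python
from typing import Dict, Set


def noise_and_silence(retrieved: Set[str], relevant: Set[str]) -> Dict[str, float]:
    """Count noise and silence for one query, plus the corresponding ratios."""
    noise = retrieved - relevant    # returned, but should not have been
    silence = relevant - retrieved  # should have been returned, but was not
    return {
        "noise": len(noise),
        "silence": len(silence),
        # Share of the result list that is noise (1 - precision).
        "noise_ratio": len(noise) / len(retrieved) if retrieved else 0.0,
        # Share of the relevant documents that were missed (1 - recall).
        "silence_ratio": len(silence) / len(relevant) if relevant else 0.0,
    }


result = noise_and_silence(
    retrieved={"doc-a", "doc-b", "doc-c", "doc-d"},
    relevant={"doc-b", "doc-c", "doc-e"},
)
print(result)
```

In this example two of the four retrieved documents are noise and one of the three relevant documents is missed. Tracking these figures for a fixed set of test queries at each review makes it easier to see whether a taxonomy change helped.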
Above all, of course, and as with any project of this nature, it is essential that you fully understand your business requirements before you begin. This in turn requires you to establish what type of content you have, how you can segment it, who the target audience is and what information they would expect to find. Only once you have answered these questions should you consider the technical architecture and how you can map your business requirements against that framework. By working in this iterative manner, you will eventually be ready to implement, test and roll out a successful multilingual site with first-class search capabilities.
Daniel Amor is chief technologist for Hewlett-Packard
SIDEBAR: The process for creating a meaningful taxonomy
Assess the current situation. A survey should be conducted in order to reveal the current state of affairs within your organisation. How is information represented and managed, and what needs to be changed in order to make these processes more effective?
Use a top-down approach. Establish a high-level taxonomy by relying on existing internal and external vocabularies (each term must be accurate and have a unique meaning for each segment). If you start from the bottom and work up, you are unlikely to reach a point where you can incorporate all sub-taxonomies into a single hierarchy.
Test. Develop a ‘trial’ taxonomy for each segment and test it with members of the target user group.
Validate your approach. Validate the taxonomy with the stakeholders involved and establish how you will measure its success from a qualitative and quantitative perspective.
Keep a record of your activities. Log any major decisions relating to the taxonomy. This will enable you to roll back to an earlier version should a problem arise.
Work in stages. Add the details required to establish the relationship between terms and develop the application rules surrounding the terminology using an iterative approach. Don’t try to include too much detail at the beginning, otherwise it will be impossible to track the effects of modifications.
Stop before it gets too complicated. Know when to leave the taxonomy alone. An overly detailed taxonomy can cause confusion among users and increase maintenance costs. The key, as ever, is to strike the right balance.
Think about technology last. Consider how technology could help in the building and maintenance processes, but only once you have defined these processes. Otherwise, the danger is that you immediately restrict yourself to working with a particular technical solution. Put your business first.
SIDEBAR: Developing a multilingual search function
Ensure you use multilingual tagging and localisation, and apply thesauri. Create taxonomies based on the environments discussed at the beginning of this article and prepare for continual change.
Make sure that you segment your content and understand that every segment may require a different strategy. Creating a powerful multi-country/language site is not a simple or a cheap task, although it will be more economical in the longer term than maintaining several standalone infrastructures.
Involve specialists and your target audience right from the beginning.
Technology is not the solution; it is only a tool to be used to address a given business problem.
Power users should be given more control to play with search parameters, potentially generating more ‘noise’ and ‘silence’ (see main article). Novice users should be afforded only limited control, which will result in very little noise and silence. The less knowledgeable users are, the more your solution should be able to help.
Provide means to add context to a search term, such as navigation paths or user profiles.