posted 30 May 2007 in Volume 10 Issue 8
EI Cover feature: Storage
Fixed in time
Fixed-content storage systems are a relatively new but increasingly popular weapon in the battle to ensure that unstructured data archives remain highly accessible.
By Jessica Twentyman
As obesity levels continue to rise in the UK, so does the prevalence of diabetes within the population. With a diagnosis of diabetes, meanwhile, comes the risk of diabetic retinopathy – a non-inflammatory condition of the retina that can cause blindness, but which can also be prevented if detected early enough using digital photography.
That creates a huge digital content challenge for any healthcare organisation and the Heart of England NHS Foundation Trust is no exception. As part of the National Health Service (NHS) screening programme for diabetic retinopathy, diabetic patients within the Trust’s catchment area have digital photographs of the back of their eyes taken on an annual basis by local high-street optometrists. Those images are then sent via a virtual private network (VPN) to the Trust’s IT environment, where they are analysed by specialists and subsequently stored.
The data storage demands that this system creates are two-fold, according to Andrew Mills, head of IT at the Trust. First, these electronic images take up a lot of drive space, he says. “We currently screen about 100,000 patients per year and that number is expected to grow at around five per cent per year. For each patient, at least four initial images are taken each year – two of each eye – and each image takes up 500 kilobytes of space.” In any one year, a significant number of patients will go on to have additional images taken, he adds, if signs of retinopathy are detected and capacity must be available to store these additional images, too.
Second, images need to be retained for up to eight years to meet UK regulations, but must be easily accessible for use by doctors in follow-up consultations with patients, says Mills. “With diabetic retinopathy, tracking the progression of the disease is vital, so screenings from the last three to four years in particular can be very pertinent to case history. In some cases, it’s necessary to delve even deeper into the past – but either way, images for each patient are regularly retrieved for comparison with images from earlier screenings,” he says.
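The figures Mills quotes can be turned into a rough capacity estimate. The sketch below is illustrative only: it counts just the four baseline images per patient, ignores the follow-up images he mentions, assumes decimal units (1 GB = 1,000,000 KB) and assumes every image is kept for the full eight years. The function names are hypothetical.

```python
# Rough capacity estimate from the figures quoted by the Trust.
PATIENTS_FIRST_YEAR = 100_000  # patients screened per year (quoted)
ANNUAL_GROWTH = 0.05           # five per cent growth per year (quoted)
IMAGES_PER_PATIENT = 4         # at least four initial images (quoted)
IMAGE_SIZE_KB = 500            # kilobytes per image (quoted)
RETENTION_YEARS = 8            # upper bound on retention (quoted)


def yearly_volume_gb(year: int) -> float:
    """Baseline image volume for one year (0 = first year), in decimal GB."""
    patients = PATIENTS_FIRST_YEAR * (1 + ANNUAL_GROWTH) ** year
    kilobytes = patients * IMAGES_PER_PATIENT * IMAGE_SIZE_KB
    return kilobytes / 1_000_000  # assumption: 1 GB = 1,000,000 KB


def retained_volume_gb(years: int = RETENTION_YEARS) -> float:
    """Total volume held if every image from the last `years` years is kept."""
    return sum(yearly_volume_gb(y) for y in range(years))


if __name__ == "__main__":
    print(f"First year: {yearly_volume_gb(0):.0f} GB")
    print(f"Held after {RETENTION_YEARS} years: {retained_volume_gb():.0f} GB")
```

Under these assumptions the baseline images come to about 200 GB in the first year and roughly 1.9 TB once eight years of screenings are held, before any additional images are counted.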
These pressures convinced managers at the Trust that a major investment in storage hardware was necessary. That hardware, however, had to fulfil some exacting requirements: it had to be designed to deal with rich digital content (in this case, retinal images); it had to store that content as records that could not be altered or tampered with once stored; and it had to be able to locate and deliver archived records rapidly on demand.
This list of requirements pointed to a need for some form of a disk-based (rather than tape-based) system for fixed-content archiving. The Heart of England NHS Foundation Trust ultimately settled on the Centera system from storage giant EMC, which competes primarily against Hewlett-Packard’s Reference Information Storage System (RISS), Network Appliance’s NearStore and IBM’s DR550. Other vendors in this market, meanwhile, include Hitachi Data Systems and Sun Microsystems (with the acquired StorageTek product range). These kinds of system are referred to using various terms: fixed-content storage; nearline content storage; or, content-addressable storage.
Whatever the nomenclature applied, at heart, these systems have a simple aim: to improve the management of the colossal amounts of unchanging data that organisations are generating. The growing volumes of such ‘fixed’ content that they need to store – especially driven by compliance requirements and the rampant ‘digitisation’ of information, such as e-mail, invoices, cheque images, movies, X-rays and dental records – are challenging current storage architectures, which struggle to scale upwards cost-effectively and meet growing demand.
Fixed-content storage products, then, effectively sit in the middle tier of a storage hierarchy, filling a gap that exists between online disk and tape. “For many companies, the need to store data inexpensively rules out high-performance disk as a suitable archiving target for fixed content,” explains Stephen Watson, HP’s StorageWorks programme manager for the UK. But at the same time, he adds, the need to retrieve records frequently and at speed means that, for certain kinds of content, tape is not a suitable medium, either.
“If you just want to archive content and won’t need to retrieve it again, then tape is probably your best bet. But there are plenty of companies that do know that they may be called upon to retrieve data to satisfy regulatory or legislative demands and are looking for a nearline alternative based on less costly forms of disk storage,” he says.
As a result, organisations are looking to bring this fixed information online so that it can be used by enterprise content management (ECM) systems and other enterprise applications. Fixed-content storage systems are emerging to meet that need, and they differ markedly from high-performance online disk storage in the capabilities they offer, according to Stephanie Balaouras, an analyst with IT market research company Forrester Research.
“The storage requirements for a large reference store differ significantly from those of an online transaction processing (OLTP) application,” she says. “Storage must scale to potentially hundreds of terabytes inexpensively, require little manual storage administration, store relevant metadata about the content itself, provide search and indexing capabilities, ensure data integrity and integrate with management software from third-party software vendors.”
With such systems, data is stored and accessed using a unique logical address derived from the content of the information itself, rather than from its physical location. With EMC’s Centera machine, for example, each data object is given a unique, location-independent digital identifier.
Metadata containing this address, the location and other identifiers (for example, the retention period) is stored in an index, which is accessed when users need the object. “In effect, this provides a storage search engine, which makes it easier for users to locate documents,” says Paul Gilmartin, an account manager at EMC.
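The idea of addressing an object by its content and finding it through a metadata index can be sketched in a few lines. This is a minimal illustration, not a description of how Centera or any other product works internally: SHA-256 is an assumed choice of fingerprint, and the names ContentIndex, IndexEntry and the location string are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Metadata kept in the index for one stored object."""
    location: str         # hypothetical physical location, e.g. a node and disk
    retention_years: int  # how long the object must be kept


def content_address(data: bytes) -> str:
    """Derive a location-independent identifier from the content itself.

    Assumption: SHA-256 stands in for whatever fingerprint a real system uses.
    """
    return hashlib.sha256(data).hexdigest()


class ContentIndex:
    """Toy index mapping content addresses to metadata and object bytes."""

    def __init__(self):
        self._entries = {}  # address -> IndexEntry
        self._objects = {}  # address -> bytes

    def store(self, data: bytes, location: str, retention_years: int) -> str:
        address = content_address(data)
        # setdefault keeps the first copy: an address never changes meaning.
        self._entries.setdefault(address, IndexEntry(location, retention_years))
        self._objects.setdefault(address, data)
        return address

    def lookup(self, address: str) -> IndexEntry:
        return self._entries[address]

    def retrieve(self, address: str) -> bytes:
        return self._objects[address]
```

An application would hold on to the address returned by store() and later pass it to lookup() or retrieve(), without ever needing to know where the object physically sits.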
In addition, low-cost storage – typically based on Serial ATA [Advanced Technology Attachment] disks – is combined with write once, read many times capabilities to guarantee data’s authenticity and immutability for compliance requirements.
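Write once, read many (WORM) behaviour is easiest to see in a small model. The sketch below is an assumption-laden illustration rather than any vendor's implementation: the WormStore class, the WriteOnceError exception and the rule that a record can only be deleted once its retention date has passed are all hypothetical.

```python
from datetime import date


class WriteOnceError(Exception):
    """Raised when an operation would alter or prematurely remove a record."""


class WormStore:
    """Toy write-once, read-many store with a per-record retention date."""

    def __init__(self):
        self._records = {}  # key -> (data, retain_until)

    def write(self, key: str, data: bytes, retain_until: date) -> None:
        if key in self._records:
            # A second write to the same key would be a modification.
            raise WriteOnceError(f"{key!r} has already been written")
        self._records[key] = (bytes(data), retain_until)

    def read(self, key: str) -> bytes:
        # Reads are unrestricted and never change the stored record.
        return self._records[key][0]

    def delete(self, key: str, today: date) -> None:
        _, retain_until = self._records[key]
        if today < retain_until:
            raise WriteOnceError(f"{key!r} is under retention until {retain_until}")
        del self._records[key]
```

Here a record can be read any number of times, but an attempt to write it again, or to delete it before its retention date, is refused.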
Another key advantage of fixed-content storage is its ability to assist organisations in avoiding the persistent problem of file duplication. With traditional storage, if there are 100 different copies of an e-mail file attachment, all 100 copies are saved in a backup. For long-term archival storage, this kind of inefficiency can quickly exhaust available storage space.
Fixed-content storage systems get around this problem by using data de-duplication capabilities, sometimes referred to as ‘commonality factoring’, which eliminate duplicated blocks of data. Only one instance of an individual file is saved and subsequent copies of the file are simply referenced back to the one saved copy.
So if each of those 100 attachments is two megabytes (MB) in size, archiving to a fixed-content storage system would only take 2MB to save all 100 copies, instead of 200MB with an ordinary disk system.
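The arithmetic of the attachment example can be reproduced with a toy store. This is a sketch under stated assumptions: it de-duplicates whole files rather than blocks, uses SHA-256 as an assumed fingerprint, treats 2MB as 2,000,000 bytes and ignores the small overhead of the references themselves. The DedupStore class and the file names are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy whole-file de-duplicating store."""

    def __init__(self):
        self._blobs = {}  # fingerprint -> file content, stored once
        self._refs = {}   # file name -> fingerprint

    def save(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        # Only the first copy of identical content is physically kept.
        self._blobs.setdefault(digest, data)
        self._refs[name] = digest

    def load(self, name: str) -> bytes:
        return self._blobs[self._refs[name]]

    def logical_bytes(self) -> int:
        """Space the files would need if every copy were stored in full."""
        return sum(len(self._blobs[d]) for d in self._refs.values())

    def physical_bytes(self) -> int:
        """Space actually used by the unique content."""
        return sum(len(b) for b in self._blobs.values())


if __name__ == "__main__":
    store = DedupStore()
    attachment = bytes(2_000_000)  # a 2MB attachment (decimal megabytes)
    for i in range(100):
        store.save(f"mail-{i}.pdf", attachment)
    print(store.logical_bytes() // 1_000_000, "MB without de-duplication")
    print(store.physical_bytes() // 1_000_000, "MB with de-duplication")
```

Every one of the 100 names still resolves to the full attachment through load(), even though the content is held only once.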
Not only is storage space saved, but backups and restorations to tape or optical media can also be accomplished much more quickly. For off-site backup, the data volumes that need to travel across wide-area network (WAN) links to remote locations are much lower as well, which speeds up the process and reduces bandwidth requirements. Lost or damaged files can also be restored directly from disk without the time or trouble of locating those files on other media.
Fixed-content storage systems do have some drawbacks, however. Because they incorporate archiving-related data management intelligence, they tend to be more ‘appliance-like’ than other storage systems. “And, as intelligent devices they may appear more expensive, because they are often priced against an equivalent solution of separate hardware and software pieces,” says Gartner analyst Stanley Zaffos.
Another major problem is the overwhelming lack of standards that exists between different fixed-content systems from different vendors – a problem that the Storage Networking Industry Association (SNIA) is hoping to solve with a proposal known as the Extensible Access Method, or XAM. The aim of this initiative is to provide a standard application programming interface (API) that supports all content-addressable storage systems, irrespective of the technology or vendor.
The XAM specification defines a standard interface (access) method between ‘consumers’ (application and management software) and ‘providers’ (storage systems) to manage information. XAM annotates objects with metadata that provides for the management of information at a semantic level. This coupling enables external information lifecycle management-based policy services to make intelligent decisions about the management of objects without referring back to the application and without impacting the application.
“The alternative to life without XAM is not pretty. The ever growing chaos and complexity of owning and managing trillions of data objects is insurmountable with conventional approaches based on discrete, and often competing, management processes. Even the promise of ILM is incomplete without a robust metadata standard. When complete, XAM will have a positive impact on all types of information systems,” says Christina Casten of the XAM committee within SNIA.
This should provide a major boost to the popularity and practicality of using a content-addressable storage (CAS) system for archival storage, XAM proponents claim. And XAM might also allow CAS systems to be used in ways that currently are difficult to implement.
However, these systems have already been met with some enthusiasm from certain highly regulated industries, particularly the healthcare and financial services sectors.
That has been great news for EMC, the pioneer in this area, which has made more than $1 billion in revenue from Centera since its 2002 launch.
In fact, on 27th March 2007, executives at EMC celebrated Centera’s fifth birthday, with an announcement pointing out that, to date, more than 3,500 organisations worldwide had purchased the system. Those customers account for more than 150 petabytes of storage capacity – enough, say EMC executives, to hold 32.2bn 5MB digital photos or 2.1 trillion 80KB e-mail messages.
That has, in itself, created a market that numerous other storage vendors are eager to be part of, says Steve Duplessie, an analyst at Enterprise Strategy Group: “Five years ago, no one was talking about the impending tidal wave of fixed digital content, and certainly not about using intelligent disk for compliance when it came to e-mail or file data. Now you can’t have a conversation about storage without those topics coming up.”
BOX: In practice: Coda Group
Accountancy software company Coda Group sells a range of products that help its 2,500 customers worldwide tackle the web of tricky compliance issues that surround financial reporting and, as such, the company’s executive team know it is important that Coda is seen to practise what it preaches.
But meeting the high standards expected by corporate stakeholders can be a challenge when sensitive information relating to customer contracts and software licensing is sent and received in thousands of e-mails each day, according to Coda Group IT manager, Richard Hall.
Until recently, he says, e-mails were often stored in distributed, insecure personal folders. There was no way of preventing them from being tampered with or modified, he says, and this approach did not support the high standards that Coda wished to adopt.
“Growing volumes of e-mail were stretching our computer systems to the limit, with the result that it was becoming increasingly costly to manage storage, backup and disaster recovery,” says Hall. “Information retrieval was also a problem: it was taking more and more time to search through past e-mails to recover a particular, business-critical communication.”
Initially, Coda believed that its difficulties could be solved by implementing a simple e-mail archiving product. However, it soon became apparent that the company’s requirements were rather more complex. Hall was also disappointed to realise that many so-called compliance solutions were nothing of the sort.
“We looked at several vendors who claimed to provide compliance solutions. We started to realise that these piecemeal solutions didn’t really offer the complete solution we were looking for. Being able to store information in a secure, non-tamperable manner was key to implementing a compliance solution we felt happy with,” says Hall.
Then Coda looked at the HP StorageWorks Reference Information Storage System (RISS), an integrated system for archiving e-mails and other forms of fixed content. With features such as write-once read-many storage and electronic fingerprinting, RISS was ideal for Coda’s needs, because it is designed specifically to support compliance with regulations governing data management. Furthermore, as an active-archiving platform, it was capable of transforming unstructured data into accessible, exploitable information.
“With RISS, we can prove the origin of our business information and we can demonstrate that it hasn’t been altered in any way. This would be of central importance should we ever have to defend our archiving practices in court,” says Hall. Looking ahead, he says, Coda plans to use the solution to archive other documents previously stored in paper form.
“The ability to search through information and retrieve it quickly is key for an organisation such as Coda. With RISS, the corporate knowledge associated with several years’ worth of e-mails and other digitised documents is now accessible,” he says.