Data as a Gas

A turn-of-the-century paper on managing the ever-increasing data storage issues facing small to medium enterprises. This paper focuses largely on the storage of electronic mail data, but its arguments can be extended to documents and general file stores.

Data is a gas – given the opportunity, it will expand to fill all available storage space.

As time goes by, users will inevitably store increasing quantities of data. As they reach, and subsequently exceed, the quota levels assigned to them, it is clear that something will need to be done to resolve the situation.

In an ideal world, users do not want to have their concerns made the first priority. They want their concerns to be the only priority. It is only natural, after all. It has repeatedly been demonstrated that nature is selfish. Altruistic behaviour has been shown to be either deferred selfishness, or selfishness on behalf of one’s community – Charles Darwin [The Origin of Species], Richard Dawkins [The Blind Watchmaker], and George C. Williams [Plan and Purpose in Nature] to name but a few.

This document primarily deals with data as accessible through Microsoft Outlook – whether it be in the user’s mailbox, archives, personal folders or shared public folders, but is equally applicable to data stored in the user home directories.

Accordingly, if there is a great deal of server storage available then, given the chance, every user will in time fill it entirely.

Each and every user, either consciously or subconsciously, has an importance threshold. Documents, messages and data have their importance checked against this threshold. Any documents with importance higher than the threshold are kept; those lower than the threshold are archived or deleted.

Generally speaking, users’ importance thresholds lower as the amount of freely available space increases. The more space allocated per user, the more information a user will find significant enough to retain. Increase network space sufficiently, and everything becomes important enough to keep.
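
The importance threshold can be sketched as a toy model. The linear relationship between free space and threshold, the `saturation_mb` figure and the 0-to-1 importance scale are all illustrative assumptions, not measurements of real user behaviour.

```python
def importance_threshold(free_space_mb: float, saturation_mb: float = 1000.0) -> float:
    """Hypothetical model: the threshold falls linearly from 1.0 to 0.0 as free space grows."""
    if saturation_mb <= 0:
        raise ValueError("saturation_mb must be positive")
    fraction_free = min(max(free_space_mb / saturation_mb, 0.0), 1.0)
    return 1.0 - fraction_free


def decide(importance: float, free_space_mb: float) -> str:
    """Keep anything above the current threshold; the rest is archived or deleted."""
    if importance > importance_threshold(free_space_mb):
        return "keep"
    return "archive or delete"
```

Under these assumptions a document of modest importance is discarded when space is tight and kept when space is plentiful, which is the behaviour described above.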

It is self-evident that network space can be classed as one of Frederick Herzberg’s hygiene factors [Motivational Theory] – rather than acting as a form of incentive, its presence is a given; its absence is a crisis.

The user sees increasing their network storage quota in any form as a right and an obvious requirement, rather than a special case or an exceptional situation. Lowering a user’s network storage space is equally obviously an outrageous imposition.

One user perception is that disk space is inexpensive, which is, on the face of it, correct. Disk space for a single user is inexpensive. Disk space for 600 users is not quite so inexpensive. No matter what quantity of money is available for storage space, there comes a time when the cost to the business of increasing the storage capacity, performing backups, managing the system, and so on outweighs the benefits of increasing the capacity.

Storage capacity must be managed.


Storage Management Strategies

Business practices, user habits, administration processes and the like all have their part to play in determining the best storage management strategy at any given moment – or for any given user community.

Depending on the given situation, any one of a number of policy decisions and process implementation models may be appropriate:

The Draconian Enforcer
The Liberal Expansionist
The Flexible Negotiator
The Diligent Archivist

The policy model chosen in any given situation will bring with it advantages and disadvantages, regardless. There is no perfect model that covers all users in all circumstances. As in all such cases the various needs and desires of the users, the business, and the system administrators must be taken into account.


The Draconian Enforcer
or Finite Totalitarianism

The Draconian Enforcer Model is simply to specify quotas for all users, and ensure that users are aware that these quota levels are set in stone, and are non-negotiable. This has the advantage that users are forced to carry out housekeeping tasks and assess the importance and relevance of all data they wish to store.

One disadvantage is that it requires determined enforcement and committed agreement by senior management that this is The Way It Shall Be. With a rigid policy it is imperative that the policy is literally that: rigid.

Not only must it have 100% coverage, it must be seen to have 100%, and be unwavering. Once a single exception is made – or more importantly, is seen to be made – the policy inevitably cracks and, in time, is pushed aside.

This also requires a senior management point of contact with whom the final responsibility for saying no to the inevitable user requests will rest. It should be down to this person to be the final escalation point if necessary for all such requests or complaints, as the case may be.

Accompanying the viewpoint and strictures of the Draconian Enforcer model is the need for recognition that situations change, and a willingness to review the currently enforced quota levels at regular intervals with a view to raising the storage limits across the board for the user population. In this way the disk storage allocated to the user population – and accordingly the data that must be managed, maintained and backed up – expands in a series of substantial step changes rather than growing gradually over time.

The Draconian approach forces users to assess each and every document they wish to keep – when they have limited space, a conscious decision needs to be made as to whether the document is important enough to them, and to the business, to justify keeping. In some cases, where a user is constantly close to their limit, they may need to decide whether the new document is more important than one or more of the existing documents which must be deleted or archived to make room for it. The Draconian Enforcer standpoint raises the Importance Threshold significantly.

From a System Administration point of view, the Draconian Enforcer standpoint requires a great deal of willpower, determination and, naturally, committed backing from someone very senior in the hierarchy.


The Liberal Expansionist
or The Exponential Mass

The Liberal Expansionist Model applied to storage issues is simply to allow users free rein to store what they wish to store, in whatever manner they wish to store it. When storage capacity is reached, more storage is purchased and brought on-line. Whether this is done in a global fashion or purchased and commissioned on a department-by-department basis is a further decision that needs to be made.

The advantages from a user point of view are self-evident: unlimited storage capacity. The administrator advantage is essentially that there are no policing and enforcement issues. Users will never go over their quota, and never need to be warned about disk space.

The disadvantages include the fact that both network storage and backup capacity will increase enormously, while for the users, as there is no incentive towards housekeeping, files and documents will become increasingly difficult to find. In addition, the hand-over process when an employee leaves becomes increasingly difficult, as their successor inherits a potentially vast mass of data.

The expansion in storage capacity must keep pace with the expansion in user data – until some equilibrium point is reached at which users find that their mass of data would become unmanageable if it grew any further. At this point, theoretically, expansion would slow and storage would only need to be added at a slower rate.

The optimistic viewpoint is that equilibrium will be reached fairly quickly, and users will manage their own data, deleting any data, messages or documents that are no longer important enough to keep. The fallacy with this view, however, is that given seemingly boundless storage capacity, no document becomes unimportant. The importance threshold lowers to nothing.


The Flexible Negotiator

The Flexible Negotiator Model falls between the previous two models. Policies and quotas are implemented and enforced, with the caveat that users can formally apply to their line manager for a disk space increase. This needs to be accompanied by some form of examination or inspection of their existing quota usage to ascertain why they need more space. In some cases this will be because they simply have a large amount of important data. In others it will be because they have a large amount of unclassified, unsorted data which could perhaps be tidied and reduced and the situation reassessed.

It is this analysis stage that is the key to a workable flexible system; without it, this model will become the Liberal Expansionist model with a greater degree of automatically rubber-stamped documentation.

The advantages of this model, if well-enforced, are that users who genuinely need it get more space allocated, and this requirement is traceable at a later date, and that data volumes should theoretically not escalate to unmanageable proportions.

All this is provided the system is operated and adhered to correctly and that actual data requirements are assessed when more space is requested. There is an inevitable tendency for all space increase requests to be approved without consideration, and for perceived storage requirements to be taken as being the actual storage requirements.

This occurs simply because it is easier for an immediate manager to accept that an employee does indeed require the additional space they are requesting, rather than encourage that employee to justify the additional space they feel they require.


The Diligent Archivist
or The Pedant’s Utopia

The final and perhaps preferable model from an overall perspective is that of archival and housekeeping. This relies on users being educated and encouraged towards good working practices and diligent housekeeping. Effectively, archiving and storage take place as and when new documents are created or received. In this manner each user is responsible for creating their own filing system and structure, and keeping track of their own requirements. Good housekeeping would theoretically separate valid, important business documents from personal, temporary or social documents that do not need permanent storage space set aside.

Three levels of data could be identified: the vital and current, the important and archive-ready, and junk. Vital and current data are those documents and messages which require dynamic, on-line, immediate storage and access. Important and archive-ready documents and messages can be stored on read-only media, or saved off-line by some other means, e.g. archived to tape for later retrieval as and when they are – temporarily – required. Junk covers those documents which are either transitory with only a short life-span in terms of relevance, or are social in nature, with no business functionality or importance.
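
The three levels map naturally onto storage tiers. The sketch below simply records that mapping; the names `DataLevel` and `storage_for` are hypothetical, and deciding which level a given document belongs to is left to the user, as the model intends.

```python
from enum import Enum


class DataLevel(Enum):
    VITAL_AND_CURRENT = "vital and current"
    IMPORTANT_ARCHIVE_READY = "important and archive-ready"
    JUNK = "junk"


# Storage tier suggested for each level, paraphrased from the descriptions above.
STORAGE_FOR_LEVEL = {
    DataLevel.VITAL_AND_CURRENT: "on-line storage with immediate access",
    DataLevel.IMPORTANT_ARCHIVE_READY: "read-only media or off-line archive (e.g. tape)",
    DataLevel.JUNK: "no permanent storage",
}


def storage_for(level: DataLevel) -> str:
    """Return the storage tier suggested for a data level."""
    return STORAGE_FOR_LEVEL[level]
```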

With good housekeeping and working practices documents should theoretically be duplicated as few times as possible, as they will be correctly stored and archived, or made available to the relevant people as and when applicable. When more space is required it is easy to ascertain whether this is because the business needs of the individual dictate it or whether it is down to bad hoarding habits.

The natural disadvantage of this model is that it relies on users being diligent and keeping on top of their filing and archiving. It is a difficult system to put into place retrospectively as users may have a mass of unfiled, unorganised documents to archive and organise. It relies ultimately on user co-operation, which is far from guaranteed.

The natural tendency is for users to perceive that they only ever create, receive, store or require documents that are current and vital. Only a few users perceive that they have documents that are important and archive-ready, and the user who admits to creating or receiving documents that fall into the junk category is a rare thing indeed.


Filing Strategies

In any finite, bounded storage environment it is inevitable that some form of filing and archival strategy will be required. The goal of a successful archival strategy in such an environment is to produce coherent logical blocks of data that can, at some point in time, be labelled as closed. A closed block is a collection of data – documents, messages, etc – which can be deemed to be static. It may still be required as a reference but will not change or increase. The other aspect of a successful archival strategy is that documents be readily locatable, which is achievable provided the strategy employed is meaningful for the document or group of documents in question.

All filing strategies have their strengths. All strategies have their weaknesses. Some are closely linked with a user’s personal working practices. Some are more closely linked with the back-end management systems. The strengths and weaknesses of a particular strategy are clearly tied very closely to viewpoint.

All the possible filing strategies can be employed in combination, applying to different sets of messages, authors or date ranges depending on the nature of the working environment or an individual’s working practices.

Originator Filing

Messages are stored in subfolders depending on the message originator. This is most suitable when particular correspondents deal with particular defined topics, and stay closely aligned with those topics.

From a user point of view this is good when there are specific issues or projects being dealt with in relation to individual correspondents or groups of correspondents. It is increasingly difficult to operate when there are a number of projects or issues being dealt with by a number of different people: a many-to-one relationship in either direction makes this cumbersome – one sender, many projects or one project, many senders.

From a back-end administration viewpoint, it is difficult to separate the data out into useful sections. The question is how to determine that a block is closed unless it is the case that correspondence with a given person falls under a specific, time-bound project. Unless there is to be no further correspondence with the same person, this is unlikely to produce closed blocks with any frequency.

Topic Filing

Messages are stored according to overarching topic or theme; message content determines placing. This is most effective in a clearly defined project-based or task-based environment.

From a user point of view, this is most suited to project-based or task-based working practices so that all messages – regardless of date or sender – can be filed together. All items relating to a given task are neatly filed and easily co-located.

From a back-end administration viewpoint it is easier to delineate the blocks of data. Projects or tasks can be taken individually and handled in a group. If projects have a life span it is possible to determine that the block of data relating to that project is finished with, or at least closed.


Date Filing – The Continuum

Messages are stored primarily by date, regardless of sender or content. This is most suitable for those environments where projects or correspondents are of less significance to the classification of a message than date tracking, or when tasks are short-term or cyclical in nature.

From a user point of view – and a software point of view – this is the simplest. Automatic filing is reliable and easy to implement based on last access date auto-archive. This is carried out in a continuous process with messages automatically being archived when they have aged beyond a defined point. This can lead to it being difficult to determine at first glance where a message may be – what the system believes constitutes an old message may not match the user definition.

From a back-end point of view, this is very difficult to manage as the archive files and current files form a seamless, if not overlapping, set that needs to be dealt with together. This continuity means that defining blocks or chunks is near impossible, and a closed block is liable to be very rare indeed.
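
The continuous auto-archive process can be sketched as follows. This is a minimal illustration, not Outlook's actual mechanism: the `Message` type, the `split_by_age` function and the 90-day figure in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    subject: str
    last_accessed: datetime


def split_by_age(messages, now: datetime, max_age: timedelta):
    """Return (current, archived); messages last accessed before the cutoff are archived."""
    cutoff = now - max_age
    current = [m for m in messages if m.last_accessed >= cutoff]
    archived = [m for m in messages if m.last_accessed < cutoff]
    return current, archived
```

Calling `split_by_age(messages, datetime.now(), timedelta(days=90))` on every run moves the cutoff forward with the clock, which is why the archive never settles into a block that can be called closed.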

Date Filing – Time Blocks

The alternative to the continuum is to clearly file messages in blocks according to their date – only really suitable when the importance of a message has a short life span or messages relate to short-term repeated tasks. In this case all items can be filed by, for instance, the month in which they were received. From a user point of view it is possible to easily find all messages dealt with in a specific time period, but this is of prime value when messages fit a limited number of categories or types.

From a back-end viewpoint, this can clearly be broken into manageable chunks with a working practice of considering that anything older than, for instance, 6 time periods (6 weeks, 6 months, etc.) is now closed.
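
A minimal sketch of time-block filing, assuming monthly blocks and the six-period rule mentioned above. The function names and the `YYYY-MM` label format are illustrative choices, not part of any particular mail system.

```python
from collections import defaultdict
from datetime import date


def month_block(received: date) -> str:
    """Label of the block a message is filed under, e.g. '2000-03'."""
    return f"{received.year:04d}-{received.month:02d}"


def file_by_month(received_dates):
    """Group received dates into month blocks."""
    blocks = defaultdict(list)
    for received in received_dates:
        blocks[month_block(received)].append(received)
    return dict(blocks)


def is_closed(block: str, today: date, periods: int = 6) -> bool:
    """A block is treated as closed once it is more than `periods` months old."""
    year, month = (int(part) for part in block.split("-"))
    age_in_months = (today.year - year) * 12 + (today.month - month)
    return age_in_months > periods
```

Because each block has a fixed boundary, a closed block never changes again and can be moved to off-line storage as a unit.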


Combinations

Using a Hierarchical Filing schema, messages are classified in a set hierarchical manner, such as by originator, then topic, then date, or vice versa. This can lead to long trees of folders and subfolders, and the potential for conflicting classifications, depending on the importance of a classification to a message. The topic may have more significance than the date for a given group of messages, but the folders may be configured to make the date the most significant classifying factor.
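
As an illustration of how the chosen ordering fixes which classification is most significant, the sketch below builds a folder path from the three attributes. The `folder_path` function, the `/` separator and the sample names in the usage note are hypothetical.

```python
from datetime import date


def folder_path(originator: str, topic: str, received: date,
                order=("originator", "topic", "date")) -> str:
    """Build a folder path whose levels follow `order`, most significant first."""
    parts = {
        "originator": originator,
        "topic": topic,
        "date": f"{received.year:04d}-{received.month:02d}",
    }
    return "/".join(parts[level] for level in order)
```

The same message lands at `A. Smith/Budget/2000-03` under the default ordering and at `2000-03/Budget/A. Smith` when `order=("date", "topic", "originator")` is passed, so a user looking for it by topic has a different search in each case.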

The alternative, and perhaps most common, schema is a Hybrid Filing system. Different sections of the data store may employ different filing strategies, depending on the context of the messages. In this way some groups of folders may be subdivided by date, others by author, and others by topic. Independent hierarchies can be built up, leading to unclear classifications for new messages or documents.


Conclusions – Filing Strategies

Importantly, there is no right or wrong filing system. Most filing systems look right from the inside and wrong from the outside, and there is only one good system: the one you use yourself. Inevitably the varied pressures and requirements from a user perspective, a business perspective and an administration perspective mean that there will have to be compromises on all sides.

No matter what measures are in place for archiving or filing messages, unless there is some way of determining that a data block is closed, the data volume issue will be merely postponed. Data may be moved from the mail server to one or more personal folders but these must be stored somewhere. If such folders can be marked as closed, it is conceivable that they could be archived to off-line storage, whether backed up to tape or written to CD-ROM.


Key concepts

Mailboxes, Archives and Personal Folders

Users typically have a set quota of server storage within a Microsoft Exchange environment. This can be configured on a global or an individual basis. Increasingly users are hitting the boundaries of their allocated quota. One strategy for decreasing the server storage for electronic mail is that of archiving older messages as required.

Archiving within Outlook can be carried out as an automated, semi-automated or manual process. In the automated or semi-automated form of the process, an Archive folder is automatically created and messages are transferred from the live Mailbox to the Archive Folders.

In the manual process users create their own personal folders and transfer messages as and when they feel they need archiving. There is no difference between the automatic Archive Folders and the manually created Personal Folders. Archive Folders are simply Personal Folders that happen to have had their display label changed from the default value.

One issue with using Personal Folders is that it merely moves the issue of large storage from the Exchange environment to a general file environment, where data will then build up. Using Archive Folders in preference to Personal Folders is therefore not a solution as this is merely a different way of expressing exactly the same storage strategy.

It should be noted, however, that although the transfer of data storage from the Exchange Server to other file servers, and thence to read-only media is just a passing of the problematical parcel, the nature of the problem and its implications are different in each environment.

Adding storage to the Exchange Server, for instance, whilst possible, brings with it greater difficulties than adding storage to file servers. Expanding the storage used by Exchange will inevitably increase the size of the internal databases, adding to the processing that must be carried out by operations within the server. This will increase response time, reducing performance and usability for the users. Adding storage to the file servers does not have this same side-effect, nor does this affect off-line read-only media such as CD-ROMs.

Similarly, one problem with archiving to CD-ROMs – that the media can be lost or mislaid – is not likely ever to affect network file servers or Exchange-based storage.


Working Practices

Duplication should be minimised wherever possible.

It is greatly preferable to share a single copy of a document than to produce duplicates, particularly in the case where the document is substantial, or transitory in nature, such as working drafts.

Similarly, if a document already exists on the network – either in a home directory or in a departmental shared directory – it is far more appropriate to provide references and links wherever possible, rather than to distribute copies of the document electronically to all interested parties. The reasons behind this are manifold.

Documents distributed to the readership are potentially out of date the moment they are copied into an e-mail. When the document is updated, it is necessary to send the messages out again, and again, and again.

The duplication created in this process swells the quantity of information stored. If a document is created and sent to three people, there is the original, either on a server or local hard drive (copy 1), the versions in each of three mailboxes (copies 2 to 4 inclusive), and the copy in the sent items folder of the original sender (copy 5).

Similarly, if it is the working practice for a manager to forward copies of all incoming mail to a given assistant, the duplication mounts up. A document is created and sent internally (the original, copy 1), leaving a copy in the sent items folder (copy 2). It is received by the intended recipient (copy 3), automatically forwarded to the assistant (copy 4) and copied to the recipient’s sent items folder in the process (copy 5).
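
The two counts above follow a simple pattern, sketched here as arithmetic. The function names are hypothetical, and the model assumes every mailbox keeps its copy and every sender keeps a sent items copy, as in the two scenarios described.

```python
def copies_when_emailed(recipients: int) -> int:
    """Count stored copies when a document is e-mailed to `recipients` people."""
    if recipients < 0:
        raise ValueError("recipients must not be negative")
    original = 1    # on a server or local hard drive
    sent_items = 1  # sender's sent items folder
    return original + recipients + sent_items


def copies_when_auto_forwarded(recipients: int = 1) -> int:
    """Count stored copies when each recipient auto-forwards to an assistant."""
    # Each forward adds the assistant's copy and the forwarder's sent items copy.
    return copies_when_emailed(recipients) + 2 * recipients
```

Here `copies_when_emailed(3)` and `copies_when_auto_forwarded(1)` both return 5, matching the counts above, whereas sending a link to a shared file leaves the single original.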


Conclusions

The key to the data management and storage problem lies with users’ working habits. No matter what policies are put in place, the guidelines and system management practices will be constantly stretched unless working practices fit in with them and vice versa.

System management policies and restrictions must be in place to facilitate and channel the working practices of the user population but cannot be expected to be the total solution to the problem.

Users must be educated and informed as to the reasons behind the policies and restrictions, which, in turn, must be matched to the needs and requirements of the business and, to a lesser extent, to those of the users.

As with so many aspects of IT support, users must be educated to distinguish between the requirements of themselves as individuals, and the requirements of the business. The two are not necessarily the same.

Written by: jon m wilson

As half of the team behind 101projects101days.com, I am a serial starter of things, beginner of projects. I work in bits and in bytes, in words and paragraphs; I work in wood, metal, and paper, in fabric and in leather; I work in fits and in starts. Most of all I work intermittently and inconsistently.
