Document Retention
Policies, Law and
Issues
Impacts and issues in the software development process
Michael Corsello
10/18/2008
Abstract
Document retention has become an area of increasing importance, marked by a dramatic increase in
regulation of the organizational policies that standardize retention practices for documents and
content in general.
This paper will discuss and describe some relevant regulation covering document retention overall and
specifically detail impacts and issues in the specialized area of software development. The development
of software includes generation of source code and documents which detail the design and process by
which the software is developed, raising the question of what a document is and which content
needs to be retained for regulatory purposes. Furthermore, the software being developed will itself be
subject to content retention regulation, in the form of hidden requirements that may dramatically
increase the overall cost of developing software applications.
In the practice of software development, there is little distinction between documents and content as
one is simply a semantically constrained subset of the other. By altering the definition of document, the
concept of content becomes largely indistinguishable from a document. It is for this very reason that
this paper covers the concepts as being interchangeable.
CSci 175 Information Policy Michael Corsello
Background
Document retention is a subset of the larger concept of content retention. Content retention is the
collection of policy and practices surrounding the standardization of practices involving the collection,
storage, tracking, security, retention and disposal of any content. Any data produced in the course of
conducting business that is pertinent to the business is content. Content retention is a portion of the
larger concept of content management, which consists of the portions of management involving the
disposition of the content from creation to destruction.
Regulations on retention
In recent years the requirements imposed on organizations by the governments of the world have
dramatically increased with respect to content retention. Public failures of organizational practices,
such as the Enron scandal and the Veterans Affairs data loss, are partially responsible for the new
regulations. At the federal level, laws that impact the content retention policies of organizations
include:
Clinger – Cohen (National Defense Authorization Act for Fiscal Year 1996)
Sarbanes – Oxley (Sarbanes-Oxley Act of 2002)
HIPAA (Health Insurance Portability and Accountability Act)
DoD 5015.02-STD (Electronic Records Management Software Applications Design Criteria
Standard)
In addition to these regulations defining explicit requirements on content practices, there are many
other regulations that directly or indirectly require standardization of content management practices.
Retention practices
Content retention overall involves the practice of “holding” or retaining content from the time it is
captured or created to the time it is “released” or destroyed. This concept introduces two sides of the
paradigm of retention: to retain and to destroy. These will continually play against one another in this
paper.
The standardization and documentation of the retention duration, and of the practices governing the
transition of content between these states, is the primary goal of retention.
Disposal processes
Once a document has outlived its standard period of benefit it will be disposed of. The disposal of
content must also be standardized and documented. Disposal involves the process of
discovering expired content among the full corpus of content and the process of removing expired
content from the organization's storage repositories. The disposal process should also document the
results of the destruction to provide confidence that the content is unrecoverable once disposed of.
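The discovery, removal and documentation steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the `ContentItem` model, the per-item `retention_days` field and the in-memory `repository` dictionary are assumptions made for the example, not part of any regulation or product. Deleting a dictionary entry also does not by itself make content unrecoverable; a real process would need verified destruction on every storage medium.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ContentItem:
    """One item in a storage repository (hypothetical model)."""
    item_id: str
    created: date
    retention_days: int  # assumed per-item retention period


def find_expired(items, today):
    """Discover items whose retention period has elapsed as of `today`."""
    return [
        item for item in items
        if item.created + timedelta(days=item.retention_days) < today
    ]


def dispose(repository, today):
    """Remove expired items and return a log documenting each disposal."""
    expired = find_expired(list(repository.values()), today)
    log = []
    for item in expired:
        del repository[item.item_id]
        log.append({"item_id": item.item_id, "disposed_on": today.isoformat()})
    return log


repository = {
    "memo-1": ContentItem("memo-1", date(2001, 1, 15), retention_days=365),
    "spec-7": ContentItem("spec-7", date(2008, 6, 1), retention_days=3650),
}
disposal_log = dispose(repository, today=date(2008, 10, 18))
```

The returned log is the piece that supports accountability: it records what was disposed of and when, independently of the content itself.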
Content Retention
To appreciate the complexity of content retention the individual concepts of content and retention must
be understood.
What is a document
Prior to a discussion on content and document retention, it is critical to understand what each of these
concepts truly represents. Content can be any information of any type, structured or unstructured. This
concept of content can include something as simple as a single word. When content is placed in a
context such as an order form, that content becomes a record. A record that is stored to some
persistent media is a document. In that manner, an order submitted in a web form is a document once
saved to a database or printed out. This makes the structure of the persistence mechanism the actual
form of the saved content and therefore a critical issue to the developers of software persisting such
content.
In a software application such as an online shopping site, which will persist the orders as records in a
database, the persisted structure of the data representing the “document” will in no way resemble the
format of the “document” presented to the user. This presents a number of considerations to a
developer:
For the example of an order, several pieces of information comprise the document:
Customer information
Billing information
Shipping information
Order items
Metadata
o Date of transaction
o Date of shipment
o Date of arrival
o Disposition of order (returns)
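As a rough sketch, the pieces of the order listed above could be modeled as one nested structure. All class and field names here are hypothetical, chosen only to mirror the list; a real shopping site would likely spread this data across several database tables.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List, Optional


@dataclass
class OrderItem:
    sku: str
    quantity: int


@dataclass
class OrderMetadata:
    transaction_date: date
    shipment_date: Optional[date] = None
    arrival_date: Optional[date] = None
    disposition: Optional[str] = None  # for example "returned"


@dataclass
class OrderDocument:
    customer: dict
    billing: dict
    shipping: dict
    items: List[OrderItem]
    metadata: OrderMetadata


order = OrderDocument(
    customer={"name": "A. Customer"},
    billing={"method": "card"},
    shipping={"city": "Springfield"},
    items=[OrderItem(sku="BOOK-001", quantity=2)],
    metadata=OrderMetadata(transaction_date=date(2008, 10, 18)),
)

# asdict() converts the nested dataclasses into plain dictionaries and lists,
# which is closer to how the record might be persisted than to how the
# "document" is presented to the user.
persisted = asdict(order)
```

The point of the sketch is that the "document" a regulation refers to is the whole aggregate, even though no single stored row or field looks like the order form the customer saw.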
What is retention
Retention is the entire lifecycle of content on persistent media. The concept of retention must include
the eventual destruction of the content from the media and potentially the destruction of the media
itself when no longer viable for re-use. For the media itself, retention must also cover applicable reuse
of the media once the content it contained is disposed of. For paper, this must include scenarios such as
secondary use of paper for fax machines. It is obviously critical to ensure that sensitive documents
printed on paper are not re-used for fax paper once out of date.
All of the practices regarding keeping, re-using and disposal of content and media are within the scope
of content retention. Since retention is so tightly coupled with the practices of content management,
the two areas are largely interchangeable, though management also involves other practices.
All evaluation of all content must be on an even and level playing field to ensure proper handling and
disposition. Overall, any information can illustrate both good and bad points depending upon who has
the information and how they attained that information. Therefore, it is quite important that all
information that can be disposed of be disposed of as soon as possible to minimize the potential liability.
This includes the destruction of information on backup and continuity of operations (COOP) media in
addition to all production media. Backup usage and planning should also consider this and forbid the
use of backups as a standard mechanism for restoring content lost through user error. That practice
would count as an accepted form of content retrieval and would thereby make backups production media
as well. As part of the retention plan, backup and COOP content must be restricted to use during
disasters resulting in hardware failure only.
The primary concern in configuration management (CM) is the tendency to want to retain content. In CM practices, content is generally
versioned over time to illustrate the history of content. From a retention perspective, this must be
balanced with the need to purge content as its use is diminished over time.
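One way to picture this balance is a pruning rule that always keeps the newest version but purges older versions once they fall outside a retention window. The sketch below is an assumption-laden illustration: the function name, the `(saved_on, label)` tuple layout and the 365-day window are all hypothetical.

```python
from datetime import date, timedelta


def prune_versions(versions, today, keep_days):
    """Keep the newest version plus any version saved on or after the cutoff.

    `versions` is a list of (saved_on, label) tuples in any order.
    The result is returned sorted from oldest to newest.
    """
    if not versions:
        return []
    ordered = sorted(versions)
    cutoff = today - timedelta(days=keep_days)
    newest = ordered[-1]
    return [v for v in ordered if v == newest or v[0] >= cutoff]


versions = [
    (date(2006, 3, 1), "draft 1"),
    (date(2008, 9, 30), "draft 3"),
    (date(2007, 1, 10), "draft 2"),
]
kept = prune_versions(versions, today=date(2008, 10, 18), keep_days=365)
```

With a one-year window only the most recent draft survives; a longer window would preserve more of the history at the cost of retaining more content.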
High-level introduction
The process of software development involves several phases during which a specific portion of the total
system becomes defined. The basic development phases include inception, elaboration, construction
and transition. Prior to starting a development project, the customer and software provider commit to
a contract. This “pre-inception” phase involves the aggregation of business processes to automate, the
scoping of the effort, identification of the key stakeholders, baselining of an anticipated timeline for
completion and a rough cost estimate.
Inception
The inception phase of the development effort involves the creation of a set of requirements that depict
the business processes to automate, all applicable regulations and policies, any performance constraints
and the general constraints on the overall construction. Once completed, this will result in several
formal documents including meeting notes, possibly audio or video of meetings, rough sketches and
business documentation from the client. All of this content is managed through the CM processes.
Elaboration
During the elaboration phase the content produced in the inception phase is analyzed to produce a
workable design for the system. The design may also include prototype code for demonstrating design
concepts. Again, several formal documents are produced and all content is managed through the CM
processes.
Construction
The construction phase is where construction, testing and validation of actual production quality
software is performed against the documents produced in the earlier phases to ensure compliance with
the stated requirements. Again, there are formal documents produced as well as the source code for
the system itself. The CM processes are used to manage all of this content. Technically, at this point,
the construction is complete and all deliverables are provided to the client. Therefore, it could be argued
that there is minimal value in the retention of any content produced under this contract at this time.
Transition
The transition phase involves the integration of the new software into the client business and the
continual maintenance and upgrading of the software over time. If the same company has the
maintenance contract, the entire body of content may be useful to evolve the software. The
management of all content over time during the ongoing transition phase is still performed under the
CM processes.
The general theme of CM in the software development lifecycle (SDLC) is to retain content including all
revisions to all source code throughout the life of the project and beyond. In many cases, content is
applicable to multiple contracts and as such is desirable as a source of content to expedite content
creation. Unfortunately, the contracts themselves often do not discuss the legality of content reuse at
all, and unrealistic time expectations alone drive the reuse of such content. The trade-off between
the value of retained content and the liability of perpetual or non-standardized retention is not
generally recognized.
Process documents
Over the course of the SDLC there are several documents produced to support the construction of the
software system.
Business processes
Generally business process documentation is produced by the client and delivered as-is to the software
development team. These processes are entered into the CM library (CML) for retention as
documentation of the processes to be automated. This serves as accountability for the development
team back to the client to ensure compliance of software. These process documents are generally
accompanied in bulk by a signed inventory sheet depicting the versions and delivery of these documents
to the development team.
If any changes occur to the formal processes followed by the client during development, any resulting
changes to the software being developed can be “at cost” to the client by referencing changes to these
documents.
Meeting notes
Notes are stored in the CML for each meeting throughout the development process. Audio or video are
often captured for meetings such as requirements elicitation meetings. The size of audio and video
content is an issue for CML storage, but when captured it is also stored.
Requirements
The requirements process generates several documents. The primary
requirements documents are the Software Requirement Specification (SRS) and the Requirements
Traceability Matrix (RTM). These two documents form the basis for all work performed during the SDLC
and are the most critical to retain.
Design documents
Based upon the content of the SRS the software design will depict the expected structure and function
of the application to build. The design will consist of one or more documents collectively known as the
Software Design Document (SDD). The SDD is mapped to the SRS in the RTM where each design artifact
in the SDD is mapped to the requirements in the SRS that design artifact will partially or completely
realize.
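The mapping kept in the RTM can be illustrated with a small lookup structure. The identifiers below (`REQ-*` for SRS requirements, `DES-*` for SDD design artifacts) are hypothetical and stand in for whatever numbering scheme a project actually uses.

```python
# Hypothetical identifiers: REQ-* are SRS requirements, DES-* are SDD artifacts.
srs_requirements = {"REQ-1", "REQ-2", "REQ-3"}

# The RTM maps each design artifact to the requirements it helps realize.
rtm = {
    "DES-A": {"REQ-1"},
    "DES-B": {"REQ-1", "REQ-2"},
}


def untraced_requirements(requirements, matrix):
    """Return requirements that no design artifact claims to realize."""
    covered = set().union(*matrix.values()) if matrix else set()
    return requirements - covered


def artifacts_for(requirement, matrix):
    """Return the design artifacts mapped to a given requirement, sorted."""
    return sorted(a for a, reqs in matrix.items() if requirement in reqs)


missing = untraced_requirements(srs_requirements, rtm)
realizing_req1 = artifacts_for("REQ-1", rtm)
```

A check like `untraced_requirements` is one reason the SRS and RTM are treated as the most critical documents to retain: without them there is no way to show which requirement a given piece of design was meant to satisfy.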
Design as a process takes a significant amount of time and is often argued to be of little practical
value in an “Agile” development methodology. Skipping a detailed design, however, risks reducing the
accountability and tracking of requirements if the design is not otherwise properly documented.
Testing documents
Each portion of the software application must be tested to ensure it works to design specification and to
requirement. The testing process, the tests performed and the results of each round of testing are
documented and stored in the CML. Once all tests pass and the system is delivered, the results of the
incremental tests leading up to a passing score are of little business value.
The library
All configuration content is stored in a repository known collectively as the configuration management
library or CML. The CML includes all content across the entire lifetime of the project. The CML is
responsible for the maintenance of proper naming standards (and their enforcement), versioning of
content and accountability for access and dissemination of content. The only official source of content
in a development project is from the CML.
Responsible parties
The configuration manager and their team manage the CML. A client representative generally will have
visibility into the content within the CML. The Information Assurance Officer (IAO) will also have
visibility into the CML and oversight to ensure the management of the CML follows the defined content
management policies. Finally, all contributing personnel are responsible for submitting content to the
CM team for inclusion in the CML.
In general, the source is the only thing that is content managed over time. However, compiled files are
tracked via the testing process to ensure only tested files make it to use by other developers and to
production. Changes are tracked automatically in the source code management (SCM) system as “deltas”,
that is, by saving what has changed with each edit. These changes must still be managed to determine
which are significant and make it through the testing process. A single change to a source code file is
not significant to track alone. Instead, changes are defined more by progress over time than by
individual edits. Likewise, daily changes to code do not represent changes in this sense, but instead
provide a means of sharing code between developers to aid in productivity.
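The idea of storing a “delta” rather than a full copy can be shown with Python's standard `difflib` module. This is only an illustration of the concept; actual SCM tools use their own storage formats, and the file name and contents below are invented for the example.

```python
import difflib

# Two successive versions of a hypothetical source file, as lists of lines.
old_version = [
    "def total(items):",
    "    return sum(items)",
]
new_version = [
    "def total(items):",
    "    # ignore missing prices",
    "    return sum(i for i in items if i is not None)",
]

# unified_diff yields only the header lines, context and changed lines.
delta = list(difflib.unified_diff(
    old_version, new_version,
    fromfile="total.py@1", tofile="total.py@2", lineterm="",
))

# Separate the added and removed lines, skipping the "+++" / "---" headers.
added = [l[1:] for l in delta if l.startswith("+") and not l.startswith("+++")]
removed = [l[1:] for l in delta if l.startswith("-") and not l.startswith("---")]
```

Every such delta is itself retained content, which is why an SCM accumulates a record of every edit ever made unless it is deliberately purged.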
Overall, due to the nature of the SCM it should NOT be part of the CML, but instead be governed by the
CM Team to ensure proper management of the source in the SCM. The only source code that should be
tracked in the CML is baselines, or releases that have meaning to the schedule or otherwise to the
client. These should be stored outside of the SCM to ensure distinction from the code in the SCM. In
practice, this is rarely done and the SCM is considered a key part of the CML. This is largely because of
how an SCM works.
To view the entire application at once (often many thousands of individual files), a time is selected from the
calendar to represent the view of the system to acquire. This view will show the state of all files in the
system at that point in time. This is used practically as a means of “rolling back” when a change is made
that is later found to be less than beneficial.
As development proceeds, a label may be placed in the SCM on the current state of all files in the SCM
at that point in time. This label indicates a version for the code base and is often a release milestone.
Given the SCM has this power “built-in” it is often simply adopted as the de facto means of managing
source code content.
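The point-in-time view and the label mechanism described above can be sketched as follows. The `history` structure, the function names and the label name are assumptions for illustration and do not correspond to any particular SCM product.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical revision history: file path -> list of (timestamp, content),
# kept sorted by timestamp.
history = {
    "main.py": [
        (datetime(2008, 9, 1), "v1 of main"),
        (datetime(2008, 10, 1), "v2 of main"),
    ],
    "util.py": [
        (datetime(2008, 9, 15), "v1 of util"),
    ],
}

labels = {}  # label name -> timestamp the label was placed at


def view_at(moment):
    """Return the latest content of each file as of `moment`."""
    snapshot = {}
    for path, revisions in history.items():
        timestamps = [ts for ts, _ in revisions]
        index = bisect_right(timestamps, moment)
        if index:  # the file already existed at that time
            snapshot[path] = revisions[index - 1][1]
    return snapshot


def apply_label(name, moment):
    """Mark the state of all files at `moment` with a version label."""
    labels[name] = moment


def view_label(name):
    """Return the view of the system that a label refers to."""
    return view_at(labels[name])


apply_label("release-1.0", datetime(2008, 9, 20))
release_view = view_label("release-1.0")
early_view = view_at(datetime(2008, 9, 5))
```

Because any past state can be reconstructed this way, the SCM effectively retains every intermediate version of the system, not just the labeled releases.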
Retention or disposal
Since source code is the application and it evolves over time based upon changes made to the source
there is a high value placed upon the code itself. The source code in the SCM is considered to be the
primary source of value in all development projects and may often be reused in part or in whole across
projects. While there are issues of intellectual property rights at stake with source code, the time
demands for completing a development project often outweigh any considerations for replicating effort
for similar work product.
Due to this high perceived value and the inexpensive nature of storing this content, it
is rarely ever disposed of until it is entirely out of date. This often results in the retention of source
code for years beyond the conclusion of a project including all edits ever made during the development
process.
Since the cost of developing software is so high and the demands upon development teams are
generally quite unrealistic, many sloppy processes exist which are poorly followed. The proper disposal
of software development content, including source code, is of high importance and is rarely done
properly. A disappointed client has a tremendous opportunity to investigate how the software was
actually developed, as nearly all projects leave circumstantial evidence around to be
retrieved many years beyond its practical usefulness.
The entire process of software development has never been required to address the issue of content
disposal practices, as IT professionals are primarily concerned with retaining information. Overall, a
guidance package is required to illustrate the liabilities of not actively planning, scheduling and following
a standard process for content retention and disposal.
New software applications are created to solve specific practical problems in business. These solutions
generally are not planned based upon legal or liability implications of how the applications are used. It
will become increasingly important to ensure that software applications are developed based upon a
dynamic set of uses that can be modified to adapt to unplanned purposes.
For emerging regulatory requirements, few applications can currently support those demands. That
results in a requirement to modify existing applications to support the requirements or to fulfill those
requirements outside of the system (often manually). Defining what content in an application must be
regulated and purged is a challenge when the client will often not understand the legal implications of
the application data storage.
A major area of development in software applications is the use of data for new purposes such as data
mining and analysis. This will have growing legal implications as advanced analytics become easier to
produce. As applications are developed that increasingly centralize data into consolidated databases,
these databases may implicitly violate regulation via data aggregation, due to poor alignment between
regulators' understanding of technology and technology implementers' understanding of regulation. The
centralization of data and security is a major area driving application architecture and enabling
enterprise analytics. This convergence is driven by the need to maintain high levels of performance
given increased data volumes, at the cost of separation of concerns.
Politically and socially the free sharing of information is at the forefront of progress in the area of
information technology. However, the issues of security, privacy and piracy are most likely of higher
importance. There is an increasing number of sophisticated attackers attempting to compromise
systems and information for profit. Regulation to protect this information and privacy must be in step
with technologies to ensure both can be realistically implemented and enforced given the workforce
and tools available. Regulation that is too difficult, technical or costly to implement will not be
implemented, and skilled workers will not become available as education systems are already being
streamlined to increase the rate of production.
Software is arguably the most complex undertaking of mankind, with technology being implemented in
dependent layers. Each layer of technology relies on the one below it, with each lower layer being
older than the one above it. Older technologies tend to have less emphasis on security and multi-user
synchronization. Therefore, we should have no expectation of “fixing” our problems any time soon
without unrealistic costs. Over time the best solution will be to replace technologies so that the
required capabilities are implemented before they become regulatory requirements.
Software development is impacted by content retention issues in a much more significant way than is
currently addressed, in that application developers will be required to construct applications that
enforce compliance with retention policies and regulations.
Document retention legislation has a significant impact on the software industry and the personnel
responsible for the construction of applications overall. The skills of developers in the industry are
already stretched, with much higher demand than supply of skilled workers. Clients do not understand
the implications of implementing compliant systems, and allowances for the added time and cost will
likely not be acceptable to them. The practical reality of the need for document retention practices
will remain overshadowed by the practical costs of implementing them for some time.
Information sharing is of increasing importance to the businesses using information technologies. The
developing “need to share” culture will further expand the issues of content retention and the
technological implementations needed to address the social implications of this content sharing. Overall,
technology is focused on opening up information and capabilities for widespread use, while little
attention is paid to illegitimate or illegal use of this information.
In summary, the concepts of content management including retention and disposal are in need of
immediate attention from technologists and policy makers alike. The emerging trends around sharing,
privacy, security and discovery must be addressed to ensure a sustainable approach is defined and
followed by technology implementers and users alike.