
Trends in
Information Technology Infrastructure
in the Ocean Sciences
OITI Working Group Members
Tom Powell, University of California, Berkeley (chair)
Mark Abbott, Oregon State University
Scott Doney, Woods Hole Oceanographic Institution
Marjy Friedrichs, Old Dominion University
Dale Haidvogel, Rutgers University
Carl Kesselman, Information Sciences Institute
Larry Mayer, University of New Hampshire
Reagan Moore, San Diego Supercomputer Center
Ray Najjar, Pennsylvania State University
John Orcutt, Scripps Institution of Oceanography
Rob Pennington, National Center for Supercomputing Applications
Nick Pisias, Oregon State University

Additional Workshop Attendees


Larry Atkinson, Old Dominion University
Peter Cornillon, University of Rhode Island
John Delaney, University of Washington
Eric Itsweire, National Science Foundation
Marlon Lewis, Dalhousie University
Andrew Maffei, Woods Hole Oceanographic Institution
Stephen Meacham, National Science Foundation
Frank Rack, Joint Oceanographic Institutions
Jim Yoder, National Science Foundation

Workshop Speakers
Danny Cohen, Sun Microsystems
Kevin Kalajan, Sun Microsystems
Jason Leigh, University of Illinois, Chicago
Paul Morin, University of Minnesota
Atul Prakash, University of Michigan
Bill Pulleyblank, IBM
Michael Speer, Sun Microsystems
Frank Vernon, Scripps Institution of Oceanography

Preferred Citation
Ocean ITI Working Group. 2004. Trends in Information
Technology Infrastructure in the Ocean Sciences, 24 pp.,
www.geo-prose.com/oceans_iti_trends
Trends in
Information Technology Infrastructure
in the Ocean Sciences

Report of the National Science Foundation Ocean Sciences
Workshop on Information Technology Infrastructure

Held May 21-23, 2003
at the Skamania Lodge, Stevenson, WA
Contents
I. Context ............................................................................................................ 1
A. The First Ocean Information Technology Infrastructure (OITI) Report ........ 1
B. Emerging Components of ITI ......................................................................... 1
C. The Emergence of Cyberinfrastructure at NSF ............................................... 4

II. The Information Technology Infrastructure Trends Workshop .............. 5


A. Key Trends ....................................................................................................... 5
B. Networks and ITI ............................................................................................ 7

III. Next Steps ................................................................................................... 19


A. Challenges ..................................................................................................... 19
B. Pilot Projects and Ocean Information Technology Infrastructure ................. 20
C. Some Possible Pilot Project Areas ................................................................. 21
D. Challenges in Developing an OITI Management Structure ......................... 23

Acronyms .......................................................................................................... 24
I. Context
A. The First Ocean Information Technology
Infrastructure (OITI) Report
To assess the ocean sciences community's current and future information technology infrastructure (ITI) needs, the Ocean Information Technology Infrastructure (OITI) Steering Committee, with support from the Office of Naval Research (ONR) and the National Science Foundation (NSF), gathered input from the ocean sciences community, supercomputer centers within the United States, and from ITI experts (OITI Steering Committee, 2002). Particular emphasis was given to the needs of the modeling and data assimilation projects that were being developed through the National Oceanographic Partnership Program (NOPP). The OITI Steering Committee determined that the ocean sciences community was experiencing severe bottlenecks in the availability of high-performance computing resources (CPU cycles, memory, storage, and network bandwidth). The need for re-coding models for new hardware architectures, data visualization, and community models was also noted to be acute. Lastly, the OITI Steering Committee also noted the extreme shortage of trained technical staff accessible to the ocean sciences community.

The OITI Steering Committee developed four recommendations based on its study and analysis:
1. Improve access to high-performance computational resources across the ocean sciences,
2. Provide technical support for maintenance and upgrade of local ITI resources,
3. Provide model, data, and software curatorship,
4. Facilitate advanced applications programming.

B. Emerging Components of ITI


Although An Information Technology Infrastructure Plan to Advance Ocean Sciences (OITI Steering Committee, 2002) focused primarily on the ocean sciences community's needs for access to high-performance computing (HPC) resources, the report commented that access to local ITI at ocean research institutions is a long-term issue. Given the increasing and continuing trend towards commodity pricing for basic ITI services including computation, storage, and network bandwidth, it has become extremely challenging for research institutions to maintain their ITI capabilities in the face of rapid change. The forces driving this change are primarily the needs of the entertainment and business sectors. Moreover, frequent shifts in hardware and software architectures impose a significant human cost, especially for smaller institutions and research groups. While the need for HPC will continue, it is also apparent that many HPC services will be provided through local, distributed sources in addition to centralized systems such as HPC centers.

The National Science Foundation convened a workshop on "Trends in Information Technology Infrastructure in the Ocean Sciences" to examine trends in networking and their impacts on the ocean research community. The focus of the workshop, held May 21-23, 2003, was on both the services that would be enabled through ubiquitous networks and the technical and human infrastructure that would be needed to develop and support these services and capabilities.


The ocean sciences community is planning many ambitious long-term observing programs such as the Ocean Research Interactive Observatory Networks (ORION) and the Integrated Ocean Observing System (IOOS), as well as large field and modeling programs including Climate Variability and Predictability (CLIVAR), the Global Ocean Data Assimilation Experiment (GODAE), and Ocean Carbon and Climate Change (OCCC). In terms of ITI, there are hardware and software requirements in, for example, computational power, data storage and access, and visualization, such as, "We recommend...initial purchase...[of ] capacity equivalent to roughly one-third of a leading terascale machine" (OITI Steering Committee, 2002). These items can be discussed in terms of design or implementation requirements such as, "the system needs to provide 1 terabyte (TB) of on-line storage." There is a higher level of requirements known as "functional requirements." These requirements describe the intended behavior of the system in terms of the services, tasks, and functions that it will perform. For these emerging programs, there is a new set of functional requirements that is increasingly important beyond those that we usually describe in terms of hardware and software implementation. These functional requirements include network and data security (in terms of data integrity, access, and intellectual property), real-time, high-availability (24 hours/day, 7 days/week) operation, and intelligent methods of sorting and seeking relationships within data and information. As ocean scientists tackle more complex and interdisciplinary problems, the requirement for remote collaboration becomes critical. Moreover, the need to accommodate rapid technological change should be viewed as a functional requirement.

Meeting these new functional requirements imposes serious challenges in a widely distributed ITI architecture. With technical evolution occurring more rapidly at the fringes of the network rather than at centralized HPC facilities, the acquisition of new ITI may follow different paths than what we have relied upon in the past. Nearly 20 years ago, the first UNIX workstations from Sun Microsystems were beginning to appear on the desks of the research community. IBM personal computers and Apple Macintoshes were becoming more prevalent. The rapid changes in price and performance since then are well known. What is perhaps less understood are the implications of this technology for research organizations and for individual scientists. We have moved from a centralized, mainframe computer world where advances in technology were largely driven by the needs of the scientific and technical communities to a distributed computer world that focuses on commercial needs and is driven by mass-market forces.


The pace of change has quickened with significantly shorter product life cycles. The scientific community was used to communicating directly with industry representatives who understood both the corporate product line and the needs of the customer. With the decline of the mainframe as a scientific computing tool, obtaining and using high-performance ITI has become much more complex for the scientific user. The mainframe vendor provided a complete, end-to-end solution. This is now a task for the user, although one now has the freedom to assemble a cost-effective solution from various hardware and software components. Moreover, planning for frequent infusions of ITI is daunting.

ITI has expanded beyond the traditional computer/disk drive/output device to include laboratory instrumentation, environmental sensors, and even building-control systems. TCP/IP protocols are now used for many devices essential for ocean research, both directly and indirectly. Ocean instrumentation and observing systems are moving towards more widespread, adaptable, always-connected architectures using satellite, cellular, Wi-Fi, and other communication methods. Ensuring links among these sensor/instrumentation networks and the storage/computation networks will be an important task in the coming decade.

The net result of these changes is that while researchers may be developing and maintaining their own information technology environments, they need more-sophisticated technical support than in the past. For example, a scientist can assemble an ITI system with capabilities that once were the exclusive domain of supercomputer centers. Although these changes to a distributed IT environment have made it easier to customize systems for a particular analysis or observation problem, there are hidden costs. More time is spent researching, developing, and maintaining these systems, and increased technical skills require increased salaries. The centralized computer center recouped these costs by charging user fees; the distributed system hides these costs in researchers' salaries and potentially lower productivity on the part of programmers and other technical support staff who are now engaged in system design and maintenance rather than data analysis and modeling. Moreover, the "centralized" services that support a distributed ITI, such as network fabric, switches, and disk arrays, must still be funded and maintained, which at most academic research institutions is primarily through individual research grants. The consequence is that these essential services are often severely underfunded.

Attendees at the Workshop on Information Technology Trends in the Ocean Sciences recommend moving towards the next stage of ITI, emphasizing local, distributed computing needs, and integrating these with the earlier OITI Steering Committee recommendations. Moreover, and more importantly, the time is ripe to begin implementing integrated plans for ITI in support of ocean research. This workshop report focuses on the needs of individual researchers and research institutions, which are distinct from needs best addressed at larger HPC centers. In addition, the report focuses upon imaginative solutions that take into account the link between resources at local facilities and those at remote sites.

Adequate ITI capabilities at each local research institution are essential for designing and implementing new scientific programs and new observing systems that are needed to address complex, interdisciplinary problems in ocean research. New ITI capabilities will support more-effective data management and dissemination, advanced analyses of complex data sets, and communication with the public and policymakers.


C. The Emergence of Cyberinfrastructure at NSF
In 2003, the NSF released its report on "cyberinfrastructure," which "refers to an infrastructure based upon computer, information and communication technology (increasingly) required for discovery, dissemination, and preservation of knowledge" (NSF Blue Ribbon Advisory Panel on Cyberinfrastructure, 2003). This NSF report identified cyberinfrastructure (CI) as a "middle layer" that would act as a first-class tool to enable a new level of science. Using global connectivity, CI will support next-generation HPC, instruments for observation and characterization (including organization, activation, and manipulation, such as laboratory instrumentation), knowledge management, collaboration services, and visualization services. The CI report recommended that NSF invest one billion dollars per year in new funds in research and development, provisioning of operational services, support for domain-specific CI activities, and education and engagement of the broader community. Thus, CI is not merely a research and development activity for computer and information science; it also must move forward as a partnership between the domain science communities and the computer/information science communities.

In the wake of the NSF CI report, the three divisions of NSF's Geosciences Directorate (Ocean Sciences, Atmospheric Sciences, and Earth Sciences) have each established CI working groups, and each has pursued independent, but complementary, paths to meet the CI needs of their communities. Integrating the recommendations of these three working groups is a continuous process. One of the first steps toward integration was sharing ideas at a workshop on Cyberinfrastructure for Environmental Research and Education (NCAR, 2003). This workshop identified several challenges and opportunities, many of which were similar to those identified in the report of the OITI Steering Committee (2002): (1) providing integrated and balanced solutions, with attention to "last mile" needs, remains challenging in a diverse, distributed scientific environment, (2) encouraging more-effective collaboration across disciplines and especially between computer scientists and environmental scientists, and (3) balancing innovation with the more mundane requirements of providing IT services, which continues to keep computer scientists and environmental scientists from working together effectively. There are also programmatic challenges as different federal agencies pursue different components of an overall CI without adequate coordination. Long-term, sustained support for CI does not fall within the traditional research-grant focus nor the one-time infusions of infrastructure associated with the Major Research Equipment Facilities Construction (MREFC) program at NSF.

Specific, near-term opportunities were identified for Environmental CI, including:
• Development and deployment of more-effective and economical collaboration tools,
• Improvement in the accessibility and utility of computational tools and interactive capabilities of data systems, data archives, and repositories,
• Model simulation,
• Continued improvement in scalable computing capabilities and access to such capabilities, and
• CI community building.

The Environmental CI workshop participants made two recommendations: (1) NSF should establish a cross-cutting program to support the development and deployment of CI for environmental research and education, and (2) NSF should ensure that this new program is coordinated with relevant activities at other federal agencies.


II. The Information Technology
Infrastructure Trends Workshop
Building on the Environmental CI workshop report (NCAR, 2003) and the earlier work by the OITI Steering Committee (OITI Steering Committee, 2002), the Ocean ITI working group was charged by NSF to identify technological trends in the areas of HPC, data grids, sensor webs, collaboration, and visualization. These topics were identified by the Environmental CI report as the primary capabilities that would be provided by the CI "middle layer." The ultimate goal of the May 2003 workshop was to develop recommendations to NSF's Division of Ocean Sciences that could be used to develop specific CI initiatives, with specific emphasis on distributed ITI capabilities.

A. Key Trends
1. Changes in Workflow. The extensive availability of broadband networking is fundamentally changing how the ocean sciences community interacts with its observing systems, and analyzes, visualizes, and manages its data sets. Workflow structure is changing, focusing more on near-real-time analyses and highly distributed collaboration and computation, rather than cruises that may be analyzed over a period of years by a small group of single-discipline oceanographers who may be located at one institution. As noted earlier, the IT industry is now focusing more on "commodity" technologies that are highly distributed and highly connected. With TB storage systems costing less than $1300 each, the need for highly centralized systems with TB storage capacities diminishes. Similar advances in desktop computation capabilities and other IT components are occurring at the edges of the Internet in the office, laboratory, and home. With high-speed, broadband connections there is the potential to provide a centralized view of this highly distributed system. Moreover, networks are now pervasive, and components of the ITI that were formerly isolated (especially observing systems) are now becoming globally connected to the Internet. Thus, we move data and knowledge through a highly distributed and rapidly changing network, rather than from static data storage systems and centralized computation systems. Although everyone is aware of Moore's Law, in which the density of transistors on a chip doubles approximately every two years, the doubling times of other critical parameters are even shorter. For example, processing speeds double every 1.6 years, storage densities every 12 months, and network speeds every 9 months. Four years from now, chip densities will have increased by a factor of 4, processing speeds by 5.6, storage by 16, and network speed by 40. The 1 TB disk that costs $1300 today will cost only $81, and today's bleeding-edge network at 10 Gbps will be operating at 400 Gbps (a short worked calculation follows this item).
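The growth factors quoted above follow directly from the doubling times. The short calculation below is only a worked restatement of the report's round numbers (2 years, 1.6 years, 12 months, and 9 months) over a four-year horizon; small differences from the figures in the text are rounding.

# Worked restatement of the doubling-time arithmetic quoted in item 1.
doubling_years = {
    "chip density": 2.0,
    "processing speed": 1.6,
    "storage density": 1.0,   # 12 months
    "network speed": 0.75,    # 9 months
}
horizon = 4.0  # years
for quantity, t_double in doubling_years.items():
    factor = 2.0 ** (horizon / t_double)
    print(f"{quantity}: x{factor:.1f} in {horizon:.0f} years")

# Consequences cited in the text (approximately $81 and 400 Gbps):
print(f"$1300 of disk capacity today costs about ${1300 / 2 ** (horizon / 1.0):.0f}")
print(f"a 10 Gbps network scales to about {10 * 2 ** (horizon / 0.75):.0f} Gbps")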


2. Distributed Computation. Regional networks have traditionally provided access to distant HPC resources, but the fundamental model is identical to the job submission/computer center model of 30 years ago, where a single computing center provided all the tools and jobs were queued and executed according to a local priority system. However, new approaches such as the computational grid rely on the integration of distributed HPC assets. For example, one center may provide a particular type of computational engine, another a massive data store, and another a data-rendering and visualization engine. Grid technologies provide a unified system view of these resources that is transparent to the user. In contrast, some distributed computation technologies rely primarily on distributed and underutilized computational resources to partition one task across many machines. SETI@Home is one example of a distributed computation system, but most HPC applications require a uniform operating environment in both hardware and software. However, for most compute-intensive ocean applications, simply using "unused cycles" in the manner of SETI@Home is not realistic.

3. Data Storage and Access. The notion of central data repositories has shifted in response to changes in network capacity and local storage technologies. Central archives such as the National Oceanographic Data Center (NODC) have struggled with the increasing volume and variety of oceanographic data. Moreover, the National Research Council (NRC) Committee on Data Management and Computation (CODMAC) reports from the early 1980s emphasized the importance of maintaining close links between active scientific investigations and data archives to ensure that data remained relevant to contemporary science. These forces have resulted in a more distributed and ad hoc approach to data management. Investigators are making data sets available over the Internet through Web-based interfaces to local data holdings. However, it is often difficult to locate such data holdings without personally knowing the investigator. Technologies such as the Distributed Ocean Data System (DODS); its successor, OpenDAP (Open Data Access Protocol); and the Storage Resource Broker (SRB) have provided standardized methods to publish and access such local data holdings and data stored in centralized archives. New methods, such as XML (Extensible Markup Language) and data ontologies, will provide more capabilities for researchers to organize and publish their data holdings and for locating and accessing other researchers' data. However, long-term storage and data curation will continue to require a more centralized approach as they exceed the capability of a single investigator. An emerging approach has been the development of intermediate-size, special-focus data centers, such as the Joint Global Ocean Flux Study (JGOFS) data system, that focus on the needs of a specific science program. This approach maintains the connection between the data sets and science knowledge in an effective manner. No matter what approach is followed, the costs of effective management are substantial.


4. Data Analysis and Visualization. Data visualization and analysis continue to be challenging as the volume and complexity of observations and model output increase. Interactive dynamical analysis, data mining, and semantic-based analysis tools must be developed for ocean sciences applications. Solving scientific problems requires an interdisciplinary approach and, therefore, multiple investigators must be able to share data and insights. Analysis and visualization tools must preserve the provenance and lineage of data sets and analysis products. Simply producing compelling graphics and animations is not sufficient. As with HPC and data storage, many of the advanced capabilities have required specialized tools and technical staff, but increasingly such IT is available at the local level. In part because of digital games, advanced computer graphics and rendering tools have now moved into commodity machines, many of which have graphics processors with capabilities that once were the exclusive domain of HPC machines. Moreover, higher-bandwidth networks now make it possible to share and create complex visualizations within a real-time, collaborative environment. Thus far, such capabilities are restricted to networked games, but it is easy to imagine scientific applications being developed in the near future.

5. Smart Sensors. Continued miniaturization of IT technology has now led to a wide range of embedded devices used as environmental sensors. Lightweight operating systems (such as TinyOS and Linux) and hardware-based software stacks, such as HTTP and TCP/IP, have enabled the development of "smart" sensors that, with sufficient network connectivity, can now form "sensor webs." These webs can be fully integrated with the data analysis and delivery system, thus bridging the existing gap where data are collected in isolation for later analysis and archiving. Although the ocean presents a significant challenge for communication and networking technologies, some components of future ocean observatories (e.g., Internet links to research vessels [HiSeasNet], deep-sea cables, buoys, shore-based sensors) will be able to be connected to the Internet.

6. Collaboration Tools. The ocean sciences are inherently interdisciplinary. The increasing complexity of scientific questions and data sets requires more-sophisticated collaboration tools. Moreover, the scientific expertise required for addressing these scientific problems is distributed among many institutions. Effective, network-based collaboration tools will be an important component of the scientific process. However, present tools are relatively limited in their capabilities, and most network-based collaboration relies simply on email and ftp.

To address these six issues, the Oceans ITI Trends workshop organized a series of presentations by experts from both industry and academia. These presentations were used as the basis for a set of technical reports that are available on the Web (www.geo-prose.com/oceans_iti_trends/).

B. Networks and ITI


The OITI report (OITI Steering Committee, 2002) recommended several actions regarding access and use of information technology. Most of these recommendations focused on the need to provide high-end resources to address computationally intensive research problems. These went beyond the need for flops to include issues such as technical support, training, and visualization. However, advances in network capability, as well as their increasing pervasiveness, imply that focusing on local institutional needs is as important as the need for centralized resources. It is not an either/or situation: a balance must be struck between IT capabilities at central nodes and those at the fringes of the network.


Networked capabilities for HPC, data management, visualization, environmental sensing, and collaboration require significant investment at local institutions, all the way down to the desktops, laptops, and other personal devices. The workshop hosted presentations on: (1) networks, (2) massively distributed high-performance computing, (3) data grids and related issues, (4) wireless sensor grids, (5) distributed visualization, and (6) distributed collaboration. Each workshop topic demonstrated a form of distributed data manipulation.

1. Grid Computing

Computation is becoming widely recognized as a third component of scientific research, complementing experimentation and theory. As computational models become increasingly sophisticated, they require significantly more powerful platforms. Two factors that continue to drive this need are the desire to deal with multiscale phenomena (such as mesoscale weather models) and the requirement to manage huge amounts of data. Parallelism may provide the needed level of computation, relying on vector architectures, symmetric multiprocessors (SMP), or message-passing architectures. In the context of networked applications, "grid" architectures are being used where the problem can be parsed among several compute nodes distributed over high-speed networks, or where various compute functions (e.g., storage, computation, and visualization) are distributed to different locations. Grid computing uses sophisticated interconnection technology to apply distributed processors to such a problem. It makes use of a relatively modest number of large computer systems, linked by TCP/IP, to assemble sufficient compute resources. Note that for many emerging networks such as the National Lambda Rail (NLR), TCP/IP is far too slow to manage high-speed data transfer.

Grid computing provides a user with an integrated view of many computing systems. A grid consists of a number of separate computing systems, called nodes, together with an interconnected network that allows communication among these systems. A user on the grid will typically access the grid at some node and then run an application on a node, or set of nodes, on the grid. The choice of node(s) on which to run applications will depend in part on the characteristics of the nodes themselves (processor capability, memory, available capacity), in part on the data available at the node, in part on the communication capacity among nodes, and in part on access controls that form part of a grid infrastructure.

One advantage of a grid is that it permits high-level integration of a broad variety of systems. In principle, a user can select a node appropriate for the application as well as arrange for distributing load across a system. The grid also has the potential to make more-efficient use of resources by allowing them to be used where needed, independent of low-level demand on the node containing the resources. Indeed, one of the early motivators for development of grid computing was the desire to "scavenge" the unused compute cycles on the thousands of PCs and workstations that spend most of their time sitting idle.

A more important use of the grid model may be in sharing data across multiple nodes. For example, a user may have a real-time satellite feed of weather data at one grid node, sensor data at other nodes, and a need to run a model for ocean/atmosphere prediction at another node that has high computation capability.
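The placement decision sketched in the preceding paragraphs can be illustrated with a small, hypothetical example. The node attributes, weights, and pick_node helper below are not drawn from any particular grid middleware; they simply show how processor capability, available memory, data locality, and network capacity might be weighed against one another when choosing where to run the prediction model.

# Illustrative only: a toy ranking of candidate grid nodes for one job.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_gflops: float       # processor capability
    free_memory_gb: float   # available memory
    has_input_data: bool    # is the input data already stored at this node?
    link_mbps: float        # network capacity for staging remote data

def score(node: Node, job_memory_gb: float, input_gb: float) -> float:
    """Higher is better; nodes that cannot hold the job are ruled out."""
    if node.free_memory_gb < job_memory_gb:
        return float("-inf")
    # Hours spent staging the input if it is not already local to the node.
    staging_hours = 0.0 if node.has_input_data else (input_gb * 8000.0 / node.link_mbps) / 3600.0
    return node.cpu_gflops - 100.0 * staging_hours  # penalize data movement

def pick_node(nodes, job_memory_gb, input_gb):
    return max(nodes, key=lambda n: score(n, job_memory_gb, input_gb))

candidates = [
    Node("remote-hpc-center", cpu_gflops=500.0, free_memory_gb=64.0,
         has_input_data=False, link_mbps=45.0),
    Node("local-cluster", cpu_gflops=80.0, free_memory_gb=16.0,
         has_input_data=True, link_mbps=1000.0),
]
print(pick_node(candidates, job_memory_gb=8.0, input_gb=50.0).name)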


Because of their distributed nature, as well as the high latency experienced by communication among nodes, grids are normally more suitable for very coarse-grained, parallel applications that do not require extensive communication among nodes during execution of application components. However, they do provide good support for on-demand computing and for autonomic computing (this refers to the ability of a machine to self-manage much of its operation with minimal human intervention)—enabling applications to deal with capacity disruptions and problems at grid nodes.

Grid computing represents a fourth step in the evolution of the growth of Internet-based computer systems. In the first stage, the Internet was intended to provide efficient connectivity between a terminal and a variety of applications running on a (possibly remote) node. In the second stage, it provided the foundation for e-mail—point-to-point communication between two people. In the third stage, we saw the evolution of the World Wide Web, permitting rapid exchange of information among members of a global community. Grid computing extends this communications system to allow applications to be run across a variety of nodes.

2. Data Grids

The ocean sciences community requires a software infrastructure that is able to manage the heterogeneity inherent within data sources, storage repositories, and access mechanisms. Example data management systems, currently in use within NSF, NASA, DOE, and NIH projects, individually address many of the challenges of the ocean sciences environment. By integrating appropriate capabilities from the extant systems, a viable infrastructure can be assembled to support data sharing (data grids), data publication (digital libraries), and data preservation (persistent archives).

The ocean sciences community works with real-time data from sensors, simulation output, observational data collections, and historical data archives. Data are transmitted over wireless networks at low to moderate bandwidths, over telephone lines and the Internet, and over optical networks at 10 gigabit per second speeds. The sizes of data sets, and the rates at which data are moved, vary widely. Collection types range from lists of point data, measured in megabytes, to sets of image data measured in tens to hundreds of terabytes. The access mechanisms range from portals for Web-based interaction, to application programmer interfaces (APIs) that directly manipulate data-set structures, to digital libraries that support sophisticated query and discovery. Data are stored on small caches in remote sensors, on board ships with satellite communication, on land-based disk caches, and in deep archives. The challenge of managing ocean sciences data lies in the remarkably broad range of types of data sources, storage repositories, and access mechanisms.

Data grids provide both the storage repository abstractions (an abstraction is a general concept formed by extracting common features from specific examples or instances) needed to manage interactions with heterogeneous storage systems, and the access abstractions needed to support the desired access mechanisms. Data grids form the core data management technology used to support scientific disciplines, providing the ability to share data across administration domains. Data grids may be used as part of digital libraries, through the addition of services to manipulate information. Data grids may also be used to support persistent archives through the addition of services to manage technology evolution.


Within the ocean sciences community, there are multiple existing efforts for the management of data, upon which appropriate ocean sciences cyberinfrastructure can be assembled. The projects cover the range from data collection to data organization and storage, to data access. Example technologies include:

• Data acquisition – Object Ring Buffer (ORB). ORBs manage access to real-time sensor data, supporting queries on recent observations. They interface to a variety of sensor types, managing access latencies, network losses, and disparate communication bandwidths.
• Collection building – Storage Resource Broker (SRB) metadata catalog. Descriptive metadata assigned to each sensor observation or simulation output are managed in metadata catalogs. The metadata are organized in collection hierarchies that can be queried and browsed.
• Collection federation – SRB data grid. The integration of ORBs and the SRB creates a Virtual Object Ring Buffer, or VORB, enabling access to sensor data stored in multiple ORBs. The SRB data grid is also used to integrate disk caches with archives, enabling access to the most recent observations and archived data through the same access mechanisms.
• Data access – OpenDAP. Access mechanisms for data retrieval vary from bit-based access through UNIX file system semantics supported by the SRB, to structure-based access through the OpenDAP protocol. OpenDAP supports direct extraction of variables from files, making it possible to request a named data variable. OpenDAP provides a way to describe the structure of data sets, and a way to apply semantic labels to the structures (a minimal access sketch follows this list).
• Data analysis – Grid Portals. Management of application codes that support data analysis and manipulation is typically controlled through portals that link simulation output and input files with access to the remote collections. Portals can also be integrated with grid services for the remote execution of jobs on systems such as the NSF Teragrid.
• Data analysis – visualization systems. The most challenging data sets manipulated within the ocean sciences community are multi-dimensional satellite observations and simulation results of time-evolving 3-D fields. It is possible for simulation output file sets to be larger than local disk. The 3-D rendering of large data sets is then done through the paging of data from a remote archive, for example, using the 3-D visualization toolkits developed at the San Diego Supercomputer Center (SDSC), the Electronic Visualization Laboratory at the University of Illinois at Chicago, and the Visualization Center at Scripps Institution of Oceanography.
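The sketch below illustrates the structure-based access described in the OpenDAP item above: a single named variable, and then a small subset of it, is requested from a remote server without transferring the whole file. It is illustrative only; the server URL and variable name are hypothetical placeholders, and it assumes a netCDF4-python installation built with OPeNDAP support.

# Minimal sketch of OpenDAP-style access to one named variable.
from netCDF4 import Dataset  # assumes a build with OPeNDAP (DODS) support

# Hypothetical OpenDAP endpoint and variable name, for illustration only.
url = "http://example.edu/opendap/ocean/sst_monthly.nc"
ds = Dataset(url)

sst = ds.variables["sst"]         # named variable described by the server
print(sst.dimensions, sst.shape)  # structure is known before any data move

# Only this subset (first time step, small lat/lon window) crosses the network.
window = sst[0, 100:120, 200:240]
print("mean SST in window:", window.mean())

ds.close()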


Examples of the piecewise integration of the above systems exist. A VORB is being used within the NSF RoadNet project at Scripps Institution of Oceanography to support real-time sensor data. The metadata catalog is used to manage both descriptive metadata about the sensors, and administrative metadata for data location and access control. The OpenDAP protocol has been integrated with the SRB as part of a NASA data grid for the Data Assimilation Office at NASA Goddard. The system provides access to individual variables through the DODS interface to data stored within the SRB data grid. Grid Portals are used at SDSC to support interactive access to jobs running on the NSF Teragrid. Grid Portals access data on local file systems, as well as data stored within SRB collections. Simulation output files are stored in the SRB along with descriptive metadata. Finally, the 3-D visualization toolkit at SDSC is integrated with the SRB to optimize access to remote data through a data paging system. The 3-D visualization toolkit has been used to generate fly-throughs of multi-terabyte simulations that are currently being displayed in the Rose Planetarium at the American Museum of Natural History in New York.

The ocean sciences community needs a data management system that integrates all of the capabilities described above. This is not without substantial computer-science challenges. Development of generic data manipulation systems requires input from the computational science community on the following issues:

• Data formats – generic characterization of structures within data sets. Multiple mechanisms are currently being used to describe and manipulate file structures, including Hierarchical Data Format, OpenDAP, and DataCutter data subsetting filters. A common representation is needed that can be used within any of these approaches to data subsetting.
• Data models – generic description of semantic labels applied to data structures. OpenDAP and the SRB, through XML, provide a way to apply semantic tags.
• Data dictionary – list of semantic labels used to describe oceanographic quantities. The ocean sciences community needs a set of commonly used terms to describe all variables recorded by sensors and generated by simulations (a small illustrative sketch follows this list).
• Digital ontology – characterization of the relationships that are applied to data set structures. Examples include mapping of data arrays to coordinate systems, mapping of time stamps to time coordinates, and mapping of coordinate systems to geometries.
• Concept spaces – characterization of the relationships among semantic tags defined in data dictionaries.
• Knowledge management – organization, storage, and manipulation of relationships. An example is support for querying across federated collections, including electronic journals now commonplace in ocean and Earth sciences. Each collection may have a unique set of attributes that have been assigned to the data. The attributes are listed in the data dictionary, and relationships among the attributes are described in the concept space. A knowledge-based query uses the relationships present within the concept space to identify equivalence of terms within the federated collections, to provide a relevant response.
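The data dictionary and semantic-labeling ideas in the list above can be made concrete with a small sketch. The element names and the two-entry data dictionary below are invented for illustration and do not correspond to any existing ocean sciences schema; the point is simply that each recorded value carries a label that a catalog or concept space can later relate to labels in other collections.

# Illustrative only: a toy XML metadata record for one sensor observation.
import xml.etree.ElementTree as ET

# A toy data dictionary: agreed labels for oceanographic quantities and their units.
DATA_DICTIONARY = {
    "sea_water_temperature": {"units": "degC"},
    "sea_water_salinity": {"units": "psu"},
}

# One sensor observation, described with semantic labels from the dictionary.
obs = ET.Element("observation", attrib={"sensor": "ctd-042"})
ET.SubElement(obs, "time").text = "2003-05-21T12:00:00Z"
ET.SubElement(obs, "latitude").text = "45.7"
ET.SubElement(obs, "longitude").text = "-124.3"
for label, value in [("sea_water_temperature", "11.2"), ("sea_water_salinity", "33.9")]:
    el = ET.SubElement(obs, "value",
                       attrib={"label": label, "units": DATA_DICTIONARY[label]["units"]})
    el.text = value

print(ET.tostring(obs, encoding="unicode"))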


These computer-science issues are highly dependent upon the data models and semantics used by a discipline. Progress in building generic infrastructure for creating and manipulating knowledge requires the active involvement of both ocean and computer scientists. While many of the issues are being explored within the Global Grid Forum and through Digital Library Initiatives, the ocean sciences community must also be involved to ensure that infrastructure relevant to their discipline is created. The true test will be the use of data, information, and knowledge management technology in direct support of ocean sciences community data collections.

3. Grid Visualization

The chief requirement for obtaining useful visual displays of data is the collaborative and adaptive real-time fusion, querying, and display of data from a wide range of data sources, such as distributed data stores and real-time remote sensors. This requirement will have the greatest impact in the ocean sciences in the 21st Century. This capability must interoperate on a variety of display systems, including Web browsers, immersive stereoscopic displays, and scalable, high-resolution displays.

Visual displays of data are most powerful when they show multiple data sets simultaneously. For example, to study ocean circulation, data from a large number of sources must be integrated. In an ideal world, this would be on a single display. Bathymetry from databases and shipboard sonar, wind and temperature data from buoys, and sea surface topography and wave height all must be available and integrated on-demand to provide as complete a picture of ocean state as possible. Visual data fusion also allows us to leverage data sources intended for other disciplines (e.g., geology, atmospheric science). Earth and ocean sciences are unique because they are integrative and multi-scale; there are environmental and natural hazards applications that can significantly benefit from these enhanced visual displays.

As visual data fusion requires rapid access to data from a variety of data sources and sensors around the world, high-performance international networking is required. Although Internet2 already provides a significant amount of bandwidth to universities in the United States, it is already saturated. Researchers trying to download data have to compete against students who are downloading anything from music to cookie recipes. This will continue to be the case even when more bandwidth is appropriated by Internet2. A different strategy is needed. Oceanographers need to partner with NSF programs that are developing extremely high bandwidth, dedicated, and perhaps even router-less, all-optical scientific networks to ensure that there is uninterrupted bandwidth available. This is especially important if the measurements are used in real-time emergency response applications such as in the prediction of tsunamis.

Currently, it is difficult for scientists to use the same piece of visualization software across the wide variety of platforms they may want to access, for example, laptops, immersive displays (GeoWalls), and Scalable High Resolution Displays (SHRDs). Every available visualization tool is targeted at a specific computing architecture, and either performs poorly on other platforms or is not reusable at all. Future-generation visualization systems need to be smarter and more flexible. They need to be aware of the underlying networking, data, compute, and rendering services that are available on the Grid to create an optimum visualization package for the viewer. One example of multiscale visualization is the Visualization Center at Scripps' visual objects, built on Interactive Visualization Systems' Fledermaus, which can be displayed on all of these systems effectively.

SHRDs are needed because Earth's resolution is not 1280 x 1024. Data sets are being acquired at higher and higher resolution and they need to be displayed in their full glory. We know from moving from typewriters to word processors that everything does not need to be printed. We can now collect bathymetry data at a resolution of a few centimeters. There is no way to view all the data from a survey on a screen and there is no way that we can print it all. We need to develop scalable resolution displays that can fit in between these extremes.


Currently there is a significant lack of good visualization tools for SHRDs. Tools are available for computer scientists to develop visualization applications, but there are few, if any, visualization applications that truly harness the capabilities of SHRDs for specific application domains such as oceanography. Furthermore, commercial tools are few, and the few that can drive SHRDs are not able to manage very large data sets. Research should be conducted to take advantage of trends in the commodity market to develop lower-cost SHRDs for oceanography. Furthermore, there should be research into developing stereoscopic versions of scalable displays because much data in the ocean sciences are inherently 3-D and time-dependent. Stereopsis has already been shown to be invaluable in the geoscience community with the adoption of 200 GeoWalls in a period of two years. The same is expected to be true in the ocean sciences. Current, leading-edge research in visualization uses clusters with commercial, commodity graphics cards to drive display walls (GeoWall2) with resolutions of 50 megapixels and more. Limitations remain in the reliance on 32-bit processors with limited memory addressing. Shared-memory systems with TB memory are becoming available to approach scaling consistent with Earth observations and models.

There needs to be aggressive introduction of modern visualization and computing systems into the ocean sciences curriculum, especially at introductory levels. Today's undergraduates are already comfortable using computers to access data from the Web. Both students and researchers are likely to spend more time analyzing their data on computers than going on ocean research cruises. Both communities, therefore, need to be shown the additional benefits that are afforded by the use of grid visualization, data, and computing services. However, introduction at the undergraduate level is crucial to ensure adoption at higher levels.

4. Sensor Grids

Data grids, today's cutting-edge technologies for information management and data manipulation, are distributed networks that integrate data and computational resources. Existing and evolving grid projects deal with static data files, and until the genesis of the ROADNet program, no one had attempted, from a "grid" perspective, to tackle the unique challenges associated with the management and manipulation of continuous data streaming from myriad sensors. The goal of the ROADNet program is to develop an integrated, seamless, and transparent environmental information network that will deliver geophysical, oceanographic, hydrological, ecological, and physical data to a variety of end users in real-time.

Considering that a vast majority of the environmental data collected today are not being captured in a way that makes them available to multiple users, it is a challenge to create a real-time (or near-real-time) data management system. ROADNet was designed to meet the following requirements:
• Capture, process, control for quality, and integrate real-time data streaming from different sources—collected for different purposes, on different temporal and spatial scales, and measured by different methods,
• Make heterogeneities in platforms, physical location and naming of resources, data formats and data models, supported programming interfaces, and query languages transparent to the user,
• Adapt to new and changing user requirements for data and data products,
• Dynamically reconfigure with the addition or removal of observational equipment and/or scientific instrumentation,
• Provide Internet access to integrated data collections along with visualization, data mining, data discovery, analysis, and modeling capabilities, and
• Build on or extend current data collection initiatives (e.g., digital libraries), promoting broad-based user access and long-term data stewardship.

The foundation of the ROADNet data telemetry and distribution system comprises middleware that enables users to access real-time data. The middleware, designed for data management computers, has the following attributes:
• Server-client approach for managing the ring buffer (a minimal sketch of the ring-buffer idea follows this list),
• Real-time data processing,
• Multiple simultaneous read and/or write clients,
• Clients can be anywhere that is network accessible,
• Data packets may be of any size, format, or information content,
• Highly robust and error-free,
• Adaptive to data bandwidth availability,
• Receive and process data from all types of instruments,
• Exchange data with other data servers,
• Reliably transfer near-real-time data over wired, wireless, and satellite TCP/IP connections,
• Prioritize data sets to be telemetered,
• Optimize the use of available telemetry bandwidth,
• Locally archive data,
• Recover data from local archives after telemetry outages,
• Provide network security shielding instruments from unauthorized manipulation, unauthorized addition of sensors, and access control, where needed, to data,
• Update metadata in real-time for automatic processing and for data archives:
  o All site characteristics
    - Latitude, longitude, elevation, etc.
  o All instrument characteristics
    - Detailed responses
    - Orientation angles and precise location relative to reference point
  o Ancillary information
    - Data format descriptions
    - Communication characteristics
    - State-of-health parameters
• Adopt a relational database formalism for generalizing the archiving of time series data plus metadata plus all processing results,
• All formatting at the application level,
• System should be symmetric with respect to read/write/local/remote access,
• System should be error-free, robust, and generally of "commercial grade,"
• Impervious to communications failures,
• Impervious to computer shutdown-startup (generally means non-volatile buffers, auto-reconnects, auto-restarts, etc.),
• Accurate timing provided by the data logger.
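The ring-buffer idea named in the first attribute above can be sketched in a few lines. The class below is purely illustrative and is not ROADNet code: it shows a fixed-capacity buffer that accepts packets of arbitrary content from writers and lets any number of readers catch up independently by sequence number, while networking, non-volatile storage, access control, and timing are all omitted.

# Illustrative only: a toy in-memory ring buffer for real-time sensor packets.
from collections import deque

class RingBuffer:
    """Toy ring buffer: many writers, many independent readers."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._packets = deque()   # (sequence_number, payload) pairs
        self._next_seq = 0

    def write(self, payload):
        """Append a packet of any size or format; the oldest packets are dropped."""
        seq = self._next_seq
        self._next_seq += 1
        self._packets.append((seq, payload))
        while len(self._packets) > self.capacity:
            self._packets.popleft()
        return seq

    def read_since(self, last_seen_seq):
        """Each client remembers its own position, so many can read concurrently."""
        return [(seq, p) for seq, p in self._packets if seq > last_seen_seq]

orb = RingBuffer(capacity=3)
for sample in ({"t": 0, "temp": 11.2}, {"t": 1, "temp": 11.3}, {"t": 2, "temp": 11.1}):
    orb.write(sample)
print(orb.read_since(0))   # a client that last saw packet 0 catches up on 1 and 2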


The data management system that provides these functions creates a real-time data grid that is reliable, flexible, and enables users to access real-time data streams. The ROADNet data grid will address system and data interoperability issues, but it does not yet address semantic interoperability and information integration issues (i.e., techniques that move beyond simple distributed access to data files). In addition, ROADNet does not include features for persistent archives or for user tools and interfaces. Commercial software packages can provide many of these capabilities, and they should be leveraged to minimize development time and costs. A fundamental principle is that instruments and sensors being sampled adhere to TCP/IP standards. This is, today, generally not the case in sensor networks, but is essential for the suites of sensors whose numbers will likely grow geometrically in the coming years.

In practice, sensor network data collections:
• Are designed to meet the specific needs of an experiment or agency,
• Are created by unique experimental platforms and methods,
• Make measurements at different temporal and spatial scales,
• Employ different data (and metadata) formats and management systems,
• Are highly specialized and customized for particular applications, and
• Lack well-documented and generalized abstractions for representing data and processing results.

Until recently, these inherent characteristics have had a negative impact on the broader utility, and subsequent longevity, of many scientific data collections. Thankfully, rapid developments in information management technologies are, for the first time, enabling us to move beyond these inherent constraints and to discard old communications, networking, and data management paradigms. The need for an interoperable infrastructure that enables new science based on distributed computing, data sharing, and information integration is driving many national-scale projects in several disciplines. Furthermore, the exponential growth in storage density and network speed, with doubling times of significantly less than a year, has eliminated the need for centralized data archives.

Several key factors are driving new technologies for managing scientific data collections:
• The development of an appropriate information model (XML) for describing scientific data,
• The ability to organize collections dynamically using XML Document Type Definitions (DTDs),
• The use of Extensible Stylesheet Language (XSL) style sheets to create interfaces to collections that can be tailored to the requirements of separate user communities (or domains), and
• The development of interoperable systems that support the federation of data collections.

With these new technologies, access and management of scientific data collections becomes a matter of manipulating an associated information model. This is an active area of research in computer science. The ocean sciences community's data management toolkit is being expanded by community-led efforts to develop digital library services, parallel compute platforms, distributed computing environments, and persistent archives. There is a strong synergy among these efforts; each requires an ability to manage knowledge, information, and data objects. This synergy has had a significant but unforeseen consequence: although the requirements driving infrastructure development within specific communities are different, development efforts have converged on a uniform architecture. This architecture provides a suite of functionalities seamlessly assembled to form a "grid."

End-to-end systems should provide support for:
• Information Discovery – ability to query across multiple information repositories to identify data of interest,
• Data Handling – ability to read data from a remote site for use within an application; in fact, the specific location should be immaterial and transparent,
• Remote Processing – ability to filter or subset data before transmission over the network,
• Data Ingestion or Publication – ability to add data to collections for use by other researchers, and
• Analysis – ability to use data in scientific simulations, for data mining, or for creation of new data collections.

Technology research and development should focus on:
• Network design flexibility that accommodates new data sources, ever-increasing data rates and volumes, changes in Internet performance, and variations in user demands and display devices,
• Real-time data delivery and quality control to facilitate personalized, seamless, and transparent access to sensor data streaming from the field and in archival data collections,
• System design that maximizes network reliability and configurability, enabling system components to be reconfigured as determined by shifting priorities for data capture based on real-time environmental events and triggers (e.g., earthquakes or oil spills),
• Integration and dissemination of information for time-critical analysis facilitated by XML-mediated query processing,
• Continuous archives of raw, processed, and analyzed data,
• Access control systems that rank user priority at the network (bandwidth consumption) level and authorize data delivery at the Internet interface, and
• Exploitation of the bidirectional communications in the Internet to allow interactive instrument and network control.

The proposed architecture for ocean sciences is based on a grid-type infrastructure design adapted for real-time data collection and integration. The core of the system comprises:
• A data-handling system that will enable access to data repositories and fast data caches across a distributed network connecting heterogeneous storage systems. The data caches will be used for staging real-time data that will be moved after preliminary processing to near-line storage. Managing real-time data accumulation and management across distributed caches and archival storage systems is a key IT research goal.
• An information discovery system that will integrate multiple metadata repositories and enable users to discover data based on data characteristics instead of location. The information discovery system of the knowledge network will manage multiple levels of metadata, from IT-centric metadata to discipline-standardized metadata to sensor/application-level metadata. Metadata integration at multiple levels and across disciplines, from ocean sciences metadata to geodetic metadata to seismic metadata, will be a primary research goal.
• An integrated execution system, or scientific workflow, that will provide operations on data and data streams at multiple locations in the data management corridor: near-sensor operations such as data validation; operations at data storage sites such as subsetting, metadata extraction, and reformatting; and near-application operations including data layouts and presentation. Extraction of metadata from real-time data flow, as well as metadata fusion across multiple sensor data, is an essential research goal (a small illustrative sketch follows this list).
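One way to picture the integrated execution system described in the last item above is as a short chain of operations applied to each record as it moves from sensor to storage to application. The stages below (validation, subsetting, metadata extraction) are illustrative stand-ins rather than an actual ocean sciences workflow engine.

# Illustrative only: a toy workflow applying staged operations to a data stream.
from typing import Callable, Iterable, Iterator, Optional

Record = dict
Stage = Callable[[Record], Optional[Record]]   # a stage returns None to drop a record

def validate(rec: Record) -> Optional[Record]:
    # Near-sensor operation: discard physically implausible values.
    return rec if -2.0 <= rec["temp_degC"] <= 40.0 else None

def subset(rec: Record) -> Optional[Record]:
    # Storage-site operation: keep only the region of interest.
    return rec if 30.0 <= rec["lat"] <= 50.0 else None

def extract_metadata(rec: Record) -> Record:
    # Near-application operation: attach metadata used later for discovery.
    rec["provenance"] = {"stages": ["validate", "subset"]}
    return rec

def run(stream: Iterable[Record], stages) -> Iterator[Record]:
    for rec in stream:
        for stage in stages:
            rec = stage(rec)
            if rec is None:
                break
        if rec is not None:
            yield rec

raw = [
    {"lat": 44.6, "temp_degC": 12.1},
    {"lat": 44.7, "temp_degC": 99.0},   # rejected by validation
    {"lat": 10.0, "temp_degC": 15.0},   # outside the subset region
]
for out in run(raw, [validate, subset, extract_metadata]):
    print(out)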


The key feature is the development and integration of these components to generate a higher-order, multidisciplinary, combined functionality that can be used for large-scale data processing and sharing.

5. Networked Collaboration

The construction of group collaboration systems has been pursued in both the commercial and academic sectors through the use of both proprietary and open-source tools. For example, the University of Michigan has developed such systems to support scientific research, including NSF-funded collaboratory projects in the space sciences. Partly as a result of experiences from the few successes and many failures of these collaboratory projects, several successful spin-off projects have resulted, including the University of Michigan Coursetools (http://coursetools.ummu.umich.edu), which is now used by a significant fraction of University of Michigan instructors to put their course materials online and support class discussions, and UM.Worktools (http://worktools.si.umich.edu), which is used to support distribution and archiving of material related to various committees and groups at the university.

Collaboration is natural for oceanographers. During an oceanographic cruise, it is common for a wide variety of scientists to be involved in all the different aspects of data collection and interpretation. For example, in core interpretation, a single core will be analyzed for chemistry, stratigraphy, sedimentology, microfossil species, and other core attributes. Interpretation can progress much more smoothly if each scientist's interpretation, smear slides, well logs, and handwritten notes can be made and stored in such a way that they can be called up on demand. All of this can be extended to remote, Internet-based collaboration when ashore.

Collaboration systems for desktop document sharing are mature, but systems for scientific collaboration employing large data sets and high-resolution, real-time graphics are not. There is currently no widespread market for tools of this kind—consequently, companies have little incentive to develop such technologies for the scientific community. AccessGrid is the only currently widely adopted infrastructure for scientific collaboration, and the development of future science-oriented collaboration tools should attempt to adopt the AG2.0 framework. Tool development should follow the guidelines being formalized in the Advanced Collaborative Environments Research Group in the Global Grid Forum.

Below is a list of lessons learned from our past experience with collaboration tools and future research challenges.
1. Research on creating user-level network overlays over available topology is an important issue in building scalable and high-performance collaboration systems. Firewalls and NATs, for example, often hinder use of collaboration tools. Simplified ways must be found for developing high-performance collaboration tools and making them available for use over such networks.
2. New collaboration paradigms need to be explored. Examples of popular paradigms are email, instant messaging, and peer-to-peer infrastructures. But each has weaknesses and strengths. Examples of potential extensions of these systems to make them more powerful from the collaboration perspective are email services that include file sharing and "presence awareness" (that is, knowing who is online), instant messaging services that can include attachments and support for "disconnected" users, and peer-to-peer infrastructure for conferencing and better security.
3. Security policy management needs to be simplified in group collaboration systems.
4. Security will need to be implemented at the end points and made largely transparent. Collaboration systems make it difficult to manage trust and security at the servers alone. Infrastructure will need to ensure privacy during collaborations.
5. Distinguishing good data from bad data is an important research challenge in open collaboration systems. Collaboration systems generally make it easy for people to contribute data, but not to delete data. Even if deleting data were simple, few incentives exist for people to delete any data, and often there is no reliable indicator of ownership. People will need more control over their service subscriptions (e.g., research in content-based, publish-subscribe systems becomes relevant; see the sketch following this list).
6. User education and administrative support can be crucial.
7. For success, think big and support many communities. If collaboration tools are developed only for a specific community (e.g., ocean sciences) and no attempt is made to extend them to other communities for wider adoption, the effort is less likely to be successful.
8. Understand user requirements and show clear benefits to BOTH individuals and groups. Unless individuals see benefits from the tools, such as improved personal productivity, they are less likely to use group tools, or contribute data to them, just for the benefit of others. For example, CourseTools benefited instructors by speeding up the creation of a protected course Web site; class lists were automatically updated with information from the registrar's database. This tool thus became widely adopted.
9. If adoption is the goal, it is often best to stick to stable, widely available technology. This is particularly true of collaboration software because if even a fraction of people do not participate, the group may give up on the collaboration tools.
10. The technology adoption challenge is often underestimated. This is particularly true of collaboration technology because it can cause a cultural shift, may change the way people do things, and often requires mass adoption to succeed. Experiences from the Upper Atmosphere Research Collaboratory (UARC) and Space Physics and Aeronomy Research Collaboratory (SPARC) projects, as well as from the CourseTools and WorkTools projects, highlight the adoption challenge: some of these tools became niche tools, while others continue to be heavily used at the university. The differences in where these tools are now used are probably not a result of technology differences—"marketing" and the nature of the audience targeted by the tools probably played a significant role as well.
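Lesson 5's pointer to content-based, publish-subscribe systems may be easier to grasp with a small example. The sketch below is a toy, standard-library-only illustration with invented event fields; it is not drawn from any particular collaboration toolkit, and a production system would add persistence, security, and distribution.

```python
# Toy content-based publish-subscribe sketch (hypothetical event fields);
# subscribers receive only the events whose content matches their predicate.


class ContentBasedBus:
    def __init__(self):
        self._subscriptions = []   # list of (predicate, handler) pairs

    def subscribe(self, predicate, handler):
        """Register interest in events by content, not by channel name."""
        self._subscriptions.append((predicate, handler))

    def publish(self, event):
        for predicate, handler in self._subscriptions:
            if predicate(event):
                handler(event)


if __name__ == "__main__":
    bus = ContentBasedBus()
    # A collaborator only wants core-chemistry updates from one cruise.
    bus.subscribe(
        lambda e: e.get("topic") == "core_chemistry" and e.get("cruise") == "KN-195",
        lambda e: print("notify geochemist:", e),
    )
    bus.publish({"topic": "core_chemistry", "cruise": "KN-195", "note": "new smear slide"})
    bus.publish({"topic": "stratigraphy", "cruise": "KN-195", "note": "not delivered"})
```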
In summary, basic information technology is fast becoming a commodity. In research, we thus need to place more emphasis on issues such as security, reliability, and making it easier for people to adapt the technology. This is particularly pertinent for collaborative infrastructures.


III. Next Steps
A. Challenges
Both the OITI report (OITI Steering Committee, 2002) and the NSF/Geosciences Cyberinfrastructure report (NSF Blue Ribbon Advisory Panel on Cyberinfrastructure, 2003) highlighted many of the same issues regarding the importance of CI for the "domain sciences" (i.e., non-computer sciences), as well as the challenges in moving forward. CI, if properly implemented, will enable a new level of cross-disciplinary science and knowledge delivery. Moreover, the rapid pace of technological change has brought CI, which was once the domain of supercomputer centers, within the reach of the individual scientist.

The nature of the scientific questions in ocean sciences today requires a more wide-ranging approach that does not lend itself to the traditional pipeline processing model, where it is relatively easy to allocate specific functions to specific hardware and software components. In the past, we could collect our data (often recorded on paper charts or notebooks) at rates that were easily manageable by a single individual. Our analysis tools were generally home-grown (and written in FORTRAN). We published only in peer-reviewed journals, sometimes years after the data were collected. We are now asking complex, multidisciplinary questions, and the answers must be reconstituted to serve a much broader community than simply our disciplinary peers. The volume and complexity of the data sets go far beyond the capacity of a single researcher to manage, from either a data management or a data interpretation perspective. The entire process of collection, analysis, and publication is dispersed, complex, and often performed in near real time. With the advent of ocean observatories this trend will accelerate. Thus, our approach to CI must be much more dynamic and iterative; there are no more "point solutions."

Many of today's issues, ranging from climate-change research to homeland security, cross traditional disciplinary boundaries. They require synthesis of information from many sources, and they must adapt in response to changing requirements and policies. The information systems supporting these complex, multidisciplinary efforts must also be adaptable. However, our information models are generally static, pipelined, and product-focused.

Modern CI is technically challenging as the line between hardware and software functionality blurs and the role of the CPU expands beyond simply providing floating-point operations. We are effectively building complex "ecosystems" of CI, and as with all systems, there is emergent and often unexpected behavior. The ocean sciences community is struggling to develop flexible information systems, and sometimes lapses into the old pipeline model with defined, static workflows. Such an approach leads to data and knowledge silos, which are not suited to modern cross-disciplinary research. For example, optimizing a compute system to run a particular benchmark may not be effective for the ocean sciences community, where the models and the underlying hardware architecture change frequently. Our workflows must be understood and able to respond to changing science requirements, and we need solid hardware, libraries, and compilers that are well integrated to provide robust systems.

The concept of dynamic workflows that extend all the way from data collection through analysis, assimilation, and publication requires a new class of software that seeks to capture the basic essence of the scientific structure and dynamics.


Moreover, the sensor networks themselves can be dynamically reconfigured and adapted to the outputs of assimilation models. Developing the necessary software frameworks for such a vision will require that we identify object attributes and their relationships. In this case, sensors, analysis tools, and other components of the workflow are conceived as objects that have defined interfaces and methods.
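As a hedged illustration of this object view, the sketch below defines hypothetical Sensor and AnalysisTool interfaces (no existing observatory API is implied) and shows a trivial workflow composing them, along with the kind of model-driven reconfiguration call discussed above.

```python
# Hypothetical sketch only: sensors and analysis tools as objects with
# defined interfaces that a dynamic workflow could compose and reconfigure.

from abc import ABC, abstractmethod


class Sensor(ABC):
    @abstractmethod
    def read(self) -> dict:
        """Return one observation."""

    @abstractmethod
    def reconfigure(self, **settings) -> None:
        """Accept new settings, e.g., driven by an assimilation model's output."""


class AnalysisTool(ABC):
    @abstractmethod
    def process(self, observation: dict) -> dict:
        """Transform or validate an observation."""


class MooringThermistor(Sensor):
    def __init__(self):
        self.sample_interval_s = 60

    def read(self) -> dict:
        return {"variable": "temperature", "value": 8.4, "depth_m": 100}

    def reconfigure(self, **settings) -> None:
        # e.g., sample faster when the model flags an interesting event
        self.sample_interval_s = settings.get("sample_interval_s",
                                              self.sample_interval_s)


class RangeCheck(AnalysisTool):
    def process(self, observation: dict) -> dict:
        observation["qc_pass"] = -2.0 <= observation["value"] <= 35.0
        return observation


def run_workflow(sensor: Sensor, tools: list) -> dict:
    """A trivial 'workflow': read an observation, then apply each tool in turn."""
    obs = sensor.read()
    for tool in tools:
        obs = tool.process(obs)
    return obs


if __name__ == "__main__":
    thermistor = MooringThermistor()
    print(run_workflow(thermistor, [RangeCheck()]))
    thermistor.reconfigure(sample_interval_s=10)   # model-driven adaptation
```

The point of the abstraction is that a workflow engine can treat very different instruments and tools uniformly, and reconfiguration becomes just another method call on a well-defined interface.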
An ontology is one approach for developing the underlying conceptual models and the relationships among data objects. Furthermore, in the context of a knowledge management framework, new rule sets can be defined and extended for use by computational processes. Such an approach would begin to bridge the gap between the physical world (e.g., sensor networks associated with ocean observatories) and the digital world (e.g., shoreside analyses and publications). Methods such as ontologies and the semantic web will accomplish this bridging through the encapsulation of knowledge in software.

Along with the technical challenges of hardware and software, there are challenges on the human resource side of information infrastructure. Building and retaining a large pool of technically capable people is difficult for oceanographic institutions and programs. We do not necessarily need more programmers; we need individuals familiar with both ocean sciences and modern IT. Such individuals are increasingly difficult to locate and retain, given the soft-money nature of ocean sciences.

Information systems for the 21st century will require a systems approach to both hardware and software; our traditional focus on CPU performance, while it continues to be important, is no longer the central concern. Issues of distribution and scalability will become dominant. Our approach will need to be based on a firm understanding of the processes involved, instead of the tiered architecture of compute, storage, and networking. The challenge is that this introduces a higher level of development complexity; typical hardware and software vendors cannot work at this level of integration, which means that the scientist must be able to define requirements and identify strategies. Thus, the widespread proliferation of networks presents both an opportunity and a challenge to the ocean sciences community. Technical issues such as authentication and security clearly require considerable thought and development, but there are also science issues, such as data provenance, that will require the community's involvement.

B. Pilot Projects and Ocean Information Technology Infrastructure


A common theme that emerged during the oceans ITI workshop was the imperative for cyberinfrastructure projects that focus on domain-specific applications of new information technology. Both the academic and private-sector computer scientists emphasized the need to develop new IT capabilities in the context of real-world problems. This is consistent with the NSF cyberinfrastructure report (NSF Blue Ribbon Advisory Panel on Cyberinfrastructure, 2003) and with the results of the NCAR Environmental Research and Education workshop on cyberinfrastructure (NCAR, 2003). The level of discussion and interaction between computer scientists and oceanographers at the oceans ITI workshop demonstrated that such partnerships can be an effective means to achieve the goals of the NSF cyberinfrastructure program.


An initiative in ocean information technology infrastructure would be a timely investment that positions the ocean sciences community to confront the impending data deluge. The amount of sensor data and simulation data expected in the next five years will dwarf all current systems in size and diversity. The ocean sciences community needs to act now to develop the necessary data management infrastructure. The challenges that must be met include not only data management and preservation, but also data analysis to generate new knowledge. Information management technologies will be part of the emerging cyberinfrastructure. The generation of information and knowledge will require the ability to analyze entire data collections. The implication is that ocean information technology will need to be linked to the NSF Teragrid to support comprehensive analyses of ocean data and numerical models.

Two types of projects emerged during the workshop discussions: precursor projects and pilot projects. Precursor projects are those where substantial progress can be made over a two- to three-year period on a community-wide issue. For example, definition of semantics for oceanographic data could be one such precursor project. Pilot projects are based on more-exploratory concepts where the research agendas of both computer science and oceanography will be advanced. The workshop established the following criteria for selecting such precursor and pilot projects:

• Precursor projects should achieve demonstrable success in two to three years,
• Pilot projects should significantly advance both ocean and computer sciences,
• Both types of projects should address important community-wide problems,
• Both should build bridges between ocean sciences and computer sciences,
• Both should justify increasing financial support over time,
• Both should have a quick, substantial buy-in from the ocean sciences community.

Pilot and precursor projects should test each component of the information management infrastructure. The projects will be implemented most rapidly by building upon existing systems that already support data collections for Earth systems science projects. Many of the pilot projects should be demonstrations of either the integration of appropriate systems to provide better management and access, or the application of an existing system to a new collection to show generic utility.

C. Some Possible Pilot Project Areas


There are numerous possible pilot projects that will demonstrate either the integration or application of emerging information technologies for use by ocean scientists.

Project Area 1: Integrate modern data-access mechanisms on top of the emerging technologies for management of sensor data. Projects would demonstrate a standard data manipulation method for selected data set formats for sensor data. The data would be retrieved from the sensor networks, which in turn would access either a system containing the most recent data or an archive. The combined environment would demonstrate the integration of data-collecting technology with digital library technology for access, data grid technology for distributed data storage, and persistent archive technology for long-term preservation.
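A hedged sketch of what such a standard access layer might look like from the user's side follows. The readers, registry, and store layout below are invented for illustration and are not an existing ocean-data API; in practice this role would be played by community protocols such as OpenDAP rather than by toy code. The point is simply that the same call serves data from the recent-data system or the archive, regardless of the underlying format.

```python
# Illustrative sketch only; the readers, registry, and store layout are
# hypothetical, not an existing ocean-data API.

import csv
import io
import json


# Two 'selected data set formats' hidden behind a common reader signature.
def _read_json_records(text):
    return json.loads(text)


def _read_csv_records(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


READERS = {"json": _read_json_records, "csv": _read_csv_records}

# Stand-ins for a system holding the most recent data and a long-term archive.
RECENT_STORE = {
    "adcp-007": ("json", '[{"bin": 1, "u_cm_s": 12.5}, {"bin": 2, "u_cm_s": 10.1}]'),
}
ARCHIVE_STORE = {
    "ctd-042": ("csv", "depth_m,temp_c\n5,11.2\n150,6.8\n"),
}


def open_sensor_dataset(dataset_id):
    """Uniform access: look in the recent-data system first, then the archive,
    and hide the underlying format behind one record-oriented interface."""
    for store in (RECENT_STORE, ARCHIVE_STORE):
        if dataset_id in store:
            fmt, payload = store[dataset_id]
            return READERS[fmt](payload)
    raise KeyError(f"dataset {dataset_id!r} not found in any store")


if __name__ == "__main__":
    print(open_sensor_dataset("adcp-007"))   # served from the recent-data system
    print(open_sensor_dataset("ctd-042"))    # served from the archive
```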


Project Area 2: Integrate multiple existing collections through a common data federation mechanism. Within the ocean sciences community, multiple data collections are being formed, both on the basis of individual projects and on the basis of different originating institutions. A demonstration is needed of the ability to provide uniform services across separate collections, as well as the ability to federate disjointed collections into a community library. Possible approaches are to use peer-to-peer federation mechanisms such as those implemented in the SRB, or knowledge-based mediators that impose a domain-specific concept space across the collections. A very aggressive project would seek to integrate the knowledge-based mediation approach with the peer-to-peer federation technologies, ensuring both semantic and physical access to the data.
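A hedged sketch of the federation idea: two independently managed collections are wrapped behind one query interface, with a small concept map standing in for the knowledge-based mediation layer. The collection contents and vocabulary mappings below are invented for illustration and are not drawn from the SRB or any existing mediator.

```python
# Hypothetical sketch: federate two separately managed collections behind a
# single query interface, with a tiny concept map as the 'mediation' layer.

# Each institution names things its own way.
INSTITUTION_A = [
    {"var": "sea_water_temperature", "cruise": "A-2003-07", "file": "a_0001.nc"},
]
INSTITUTION_B = [
    {"parameter": "temp", "expedition": "B-112", "object": "b_ctd_112.dat"},
]

# A domain-specific concept space imposed across the collections.
CONCEPT_MAP = {
    "temperature": {"sea_water_temperature", "temp"},
}


def _matches(record, concept):
    names = CONCEPT_MAP.get(concept, {concept})
    return any(value in names for value in record.values())


def federated_search(concept):
    """Return matching records from every peer collection, tagged by source,
    so the user sees one community library instead of disjointed holdings."""
    results = []
    for source, collection in (("A", INSTITUTION_A), ("B", INSTITUTION_B)):
        for record in collection:
            if _matches(record, concept):
                results.append({"source": source, **record})
    return results


if __name__ == "__main__":
    for hit in federated_search("temperature"):
        print(hit)
```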
Project Area 3: Develop standard approaches for processing systems. A standard approach to the multitude of processing threads is needed to help manage workflow in an increasingly complex computing environment. Workflow management may have special application to HPC, as tools such as multidisciplinary data assimilation, or the development of nested models that involve complex links, are increasingly employed by oceanographers.

Project Area 4: Develop standard services for analyzing and manipulating data. These services could be applied to all elements of a data set. For example, the ability to generate images in the same fashion for all data studies—all data sets—would be a great boon for those attempting to bring research tools into the classroom. Real-time monitoring of remote instruments, or of instruments in laboratories on ships, would profit from a standard set of protocols. (A sketch of such a uniform service follows Project Area 7.)

Project Area 5: Develop "off-the-shelf" modules that can be easily extended and provide simple-to-use interfaces. Such a "collection in a box" (perhaps a "model in a box" or "data in a box") or "data bricks" (commodity-based disk caches, also called "grid bricks" or "cyber bricks") would be widely available and robust, with easily accessible documentation and formal training of some kind for the user. Moreover, they would be designed to be used in a parallel environment so that the highest-end computing facilities are not necessary. An example of such a "collection in a box" might be a modest community model, or a data set for a multidisciplinary program in its early stages of development.

Project Area 6: Develop new methods for data preservation, especially for data sets not associated with federal archives. This effort goes well beyond mere archiving; it involves thoughtful replication of the data resources, ready access via online facilities, embedded manipulation tools for ease of use and interpretation, and authentication so that the user may have confidence in the data he or she is accessing. Every large data set in the ocean sciences, existing or conceived, faces these challenges.

Project Area 7: Develop tools to foster knowledge discovery and collaboration. Widely available tools that are interactive and could be linked readily to specific scientific questions would be welcomed. There is another area of interest to all, extending well beyond the borders of ocean sciences—"the Web." Concerns include (but are not limited to) security, access control, service and discovery functions, resource management, authentication, data authenticity, and updating the community on available solutions.
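Returning to Project Area 4, the hedged sketch below shows what a single image-generation service applied uniformly to any record-oriented data set could look like. The function name and sample records are invented; a real service would sit behind a community-agreed protocol rather than a local function, and matplotlib is used here only as a convenient plotting backend.

```python
# Hypothetical sketch: one image-generation service applied uniformly to any
# record-oriented data set (names and sample data are invented).

import matplotlib
matplotlib.use("Agg")                      # render to file; no display needed
import matplotlib.pyplot as plt


def render_profile_image(records, x_field, y_field, out_path, title=""):
    """Generate an image in the same fashion for any data set that exposes
    simple records, whether it comes from a cruise, a mooring, or a model."""
    xs = [r[x_field] for r in records]
    ys = [r[y_field] for r in records]
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel(x_field)
    ax.set_ylabel(y_field)
    ax.set_title(title)
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


if __name__ == "__main__":
    ctd = [{"temp_c": 11.2, "depth_m": 5}, {"temp_c": 6.8, "depth_m": 150}]
    model = [{"temp_c": 10.9, "depth_m": 0}, {"temp_c": 7.1, "depth_m": 140}]
    # The same service call works for observations and for model output.
    render_profile_image(ctd, "temp_c", "depth_m", "ctd_profile.png", "CTD cast")
    render_profile_image(model, "temp_c", "depth_m", "model_profile.png", "Model")
```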


D. Challenges in Developing an OITI Management Structure
A management structure must be developed for OITI to ensure its success. Although the traditional NSF approach of workshops and community input works for many large programs, there may not be sufficient commonality in the requirements to allow this approach to work. CI is neither a ship nor a program. CI is wide-ranging in its capabilities and in its implementation. CI has the potential to be used for a wide range of NSF-funded activities, not just research.

Many IT projects have failed in the past because of an inability to define requirements sufficiently (or correctly), especially when coupled with rapid technology evolution. NASA's EOSDIS is a classic example. As it was being defined, Web access to data was becoming an accepted practice in the scientific community, yet the centralized, large, industry-supported EOSDIS struggled to provide this level of service.

The challenge to NSF is twofold. NSF is comfortable funding large, focused pieces of infrastructure such as HPC centers. It is also comfortable funding individual investigators. The result may be centers that are struggling to find their place in a commodity IT world, coupled with a dispersed and uncoordinated set of local resources provided to individual investigators. Neither strategy leads to a sustainable program in CI. The center approach cannot keep up with the rapid changes at the fringes of the system in computation, data management, and data analysis, while the principal-investigator approach focuses solely on the IT needs of a specific scientific project.

The University National Oceanographic Laboratory System (UNOLS) model is one possible approach to managing and developing CI resources. It could enhance the capabilities of individual investigators as well as ensure the continuing support of local and regional infrastructure within the national capability for CI. However, particular attention would need to be paid to research and development so that new capabilities can be incorporated into the overall CI. The focus should be on providing services to the geosciences community, where flexibility and evolution are key requirements. Fundamentally, there are no "solutions" to the community's needs for CI. A research and development program that focuses on the needs of the geosciences community could ensure that new technology is identified and incorporated into CI.

A project office to manage programmatic elements of the NSF/OCE cyberinfrastructure initiative would ensure a successful, community-wide effort. The office would develop a five-year plan for community-wide elements of the program (the plan would be revised at regular intervals), leaving individual projects to the traditional NSF process. The project office might operate under a cooperative agreement, along the lines of the Integrated Ocean Drilling Program, one of a number of possible structures that have proven successful in large-program management.


Acronyms
AG2.0.............. AccessGrid 2.0
APIs ................ Application Programmer Interfaces
CI .................... Cyberinfrastructure
CLIVAR .......... Climate Variability and Predictability
CODMAC ...... Committee on Data Management and Computation
CPU ................ Central Processing Unit
DOE ............... Department of Energy
DODS ............. Distributed Oceanographic Data System
DTDs .............. Document Type Definitions
EOSDIS.......... Earth Observing System Data & Information System
ERE ................ Environmental Research & Education
GODAE ......... Global Ocean Data Assimilation Experiment
HPC ................ High-Performance Computing
IBM................. International Business Machines Corporation
IOOS .............. Integrated Ocean Observing System
IT .................... Information Technology
ITI ................... Information Technology Infrastructure
MREFC .......... Major Research Equipment and Facilities Construction
NASA ............. National Aeronautics and Space Administration
NAT ................ Network Address Translation
NCAR ............. National Center for Atmospheric Research
NIH................. National Institutes of Health
NODC ............ National Oceanographic Data Center
NOPP ............. National Oceanographic Partnership Program
NRC ................ National Research Council
NSF ................. National Science Foundation
NSF/OCE....... NSF Division of Ocean Sciences
OCCC............. Ocean Carbon and Climate Change
OITI................ Ocean Information Technology Infrastructure
ONR ............... Office of Naval Research
OpenDAP ....... Open Data Access Protocol
ORB ................ Object Ring Buffer
ORION ........... Ocean Research Interactive Observatory Network
ROADNet....... Real-time Observatories, Applications, and Data management Network
SDSC .............. San Diego Supercomputer Center
SETI ............... Search for Extraterrestrial Intelligence
SHRDs ............ Scalable High Resolution Displays
SPARC ............ Space Physics and Aeronomy Research Collaboratory
SRB ................. Storage Resource Broker
SMP ................ Symmetric Multiprocessors
TB ................... Terabyte
TCP/IP ........... Transmission Control Protocol/Internet Protocol
UARC ............. Upper Atmosphere Research Collaboratory
UNOLS .......... University National Oceanographic Laboratory System
VORB ............. Virtual Object Ring Buffer
XML ............... Extensible Markup Language
XSL ................. Extensible Stylesheet Language
WWW ............ World Wide Web


References
OITI Steering Committee. 2002. An Information Technology Infrastructure Plan
to Advance Ocean Sciences. 80 pp., www.geo-prose.com/oiti/report.html
NSF Blue Ribbon Advisory Panel on Cyberinfrastructure. 2003. Revolutionizing
Science and Engineering through Cyberinfrastructure. Arlington, VA, National
Science Foundation, 84 pp., www.cise.nsf.gov/evnt/reports/toc.cfm
NCAR. 2003. Cyberinfrastructure for Environmental Research and Education,
Boulder, CO, 15 pp., www.ncar.ucar.edu/cyber

Editing and design by Geosciences Professional Services, Inc.


www.geo-prose.com/oceans_iti_trends

October 2004


