Data Management in Climate Research
by Kerstin Kleese
Parallel environmental models are currently some of the most demanding
codes we have. They push the machines available to their limit and are
still in need of more resources. However, the bottleneck in running these
codes is not so much the code performance as the data handling strategies
employed. The High Performance Computing Initiative Centre at CLRC, Daresbury
Laboratory (UK) is setting up a new project to investigate this important
issue. Besides analysing existing solutions, the project will also look
for new, more flexible and portable approaches.
The UK environmental research community relies heavily on High Performance
Computing (HPC) to facilitate its studies. Fortunately the environmental
model codes have a large potential for parallelization, which has been
well explored by numerous research groups in the UK (e.g. UGAMP and OCCAM).
Still, increased resolutions, longer runs or larger ensembles have a significant
impact on the HPC resources required and can easily overwhelm current systems.
Often data handling has the most dramatic influence on model run times
or determines whether a model can be run at all. Thus optimal data handling
is vital for this type of application. Unfortunately the problems do not
stop with the end of a successful model run. These codes produce vast amounts
of data which have to be archived for future analyses. Current data storage
systems leave something to be desired in speed and ease-of-use. The situation
for data retrieval is even worse. Tedious searching for archived data,
long waiting times and no selective extraction possibilities are common
problems for modern scientists. Sometimes it is faster to run a model again
instead of retrieving data from a previous run.
It is already clear that the data requirements of the community will
increase even more over the coming years. Big centres like the European
Centre for Medium-Range Weather Forecasts expect the volume of their data
archive to double every 18 months. New machine architectures potentially
allow larger models to be run. Data handling presents a severe bottleneck for
today's science; without new strategies it might prevent future progress.
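To make the quoted growth rate concrete, the following is a minimal sketch
of what doubling every 18 months implies, assuming steady exponential growth.
The function name and the sample time horizons are illustrative assumptions,
not figures from any centre.

```python
def projected_volume(initial_volume: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Archive volume after `months`, assuming it doubles every `doubling_months`."""
    return initial_volume * 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Growth factor relative to today's archive size (initial volume = 1).
    for years in (1.5, 3, 6, 9):
        factor = projected_volume(1.0, years * 12)
        print(f"after {years} years: {factor:.0f} times today's volume")
```

Under this assumption an archive grows 16-fold in six years and 64-fold in
nine, so storage and retrieval have to scale accordingly.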
First steps
For our project it was decided to take a holistic approach, identifying
four main areas of interest:
- data handling within the model codes
- file access during run time (on different platforms)
- data archival
- data retrieval.
Although all these areas have been investigated separately, and some
interesting in-house solutions exist, little has been done to offer a portable
solution that can be easily adapted to the actual requirements of different
sites.
The first step is to analyse the current situation.
Research results of leading scientific groups concerning data handling
strategies within model codes have been examined. A lot of work has been
done in this area over the past years, and we can certainly benefit from
that. We would like to compare the different approaches, trying to find
similarities, differences and tendencies that could be useful for the community.
Secondly, a list is in preparation covering the different file access mechanisms
during run time on various systems. This gives a clear overview of what
is available, how fast it is and what the user can do to make the most
of it. This information will serve as a basis for further investigations.
In cooperation with vendors and other sites we have started to gather more
information about the data archival and retrieval systems that are in use
today. The clear message so far is that there is a desperate need for more
intelligent solutions.
Future
Our project will continue to analyse existing solutions. It will test
which data handling strategies within model codes are best for which type
of application. We will regularly investigate new machines or relevant
changes to existing system architectures. A collaboration with a leading
systems house has just started, to determine which off-the-shelf products
could be used to provide more flexible data archival and retrieval mechanisms
for scientific data.
Please contact:
Kerstin Kleese - CLRC
Tel: +44 1 925 60 3207
E-mail: k.kleese@dl.ac.uk