Scientific Computing – Data Services Group
The Data Services group runs datastores for storing, archiving, preserving, analysing, and backing up scientific data, with a nominal capacity of well over 100 petabytes. Most of the data comes from the Large Hadron Collider; climate modelling is currently the second largest source by volume, and data from STFC’s own facilities is also growing.
As many of us are ourselves scientists, we also participate in projects and other research into high end data management. The aim is to increase the knowledge and capability of data management supporting research globally, to improve the services we run by making use of research, and to increase the economic and societal impact of our data by providing expertise and facilities for using open data.
The data services include:
- Tape-backed storage, with 1, 2, or 3 copies on tape as required – the most paranoid users have one copy in a tape robot, one in a fire safe, and one off site. Most of the tapestore capacity is based on CASTOR, the same storage system run by CERN, but we also run DMF and our own in-house data service.
- Database services, mainly based on Oracle and MySQL. We run 13 Oracle RAC databases in production, over about 38 nodes, serving over 10,000 calls per second on average.
- Preservation services – we are one of the first science users of Preservica Enterprise edition, which we run for long term preservation of science data from ISIS. It is also available to other science customers.
Our datastores provide a range of interfaces to enable data to be deposited and read back:
- Storage Resource Manager (SRM) serves the LHC and GridPP in particular, and other global grid communities, driving data with GridFTP.
- xroot is also used to move data internally. Together, xroot, GridFTP, and CASTOR’s native RFIO protocol (and, eventually, WebDAV) routinely deliver up to ten gigabytes per second for LHC alone, most of it going into our own LCG Tier 1 clusters, with the rest copied to Tier 2s or other Tier 1s across the world. As moving data is critically important to these services, we spend a lot of time fine-tuning transfer parameters to optimise the transfer rate.
- We run both SRB and iRODS services. We have run SRB for many years, and iRODS since it was developed; iRODS currently serves mainly the EUDAT project, where it is seeing increasing production use.
- We provide GlobusOnline endpoints – currently to disk-only storage.
- Some of our data is available over the web, either via dedicated web server endpoints or data portals.
- We have, of course, internal interfaces to the datastore, such as NFS.
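As an illustration of the kind of transfer tuning mentioned above, one quantity commonly considered for wide-area TCP transfers is the bandwidth-delay product: the number of bytes that must be in flight to keep a link full. Dividing it across parallel streams gives a rough per-stream buffer size. The sketch below is a generic calculation under stated assumptions, not a description of our production settings; the function names, the 10 Gbit/s link, the 20 ms round-trip time, and the stream count are all illustrative.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes in flight needed to keep a link of this bandwidth full.

    Integer arithmetic is used throughout to avoid floating-point rounding.
    """
    if bandwidth_bits_per_s <= 0 or rtt_ms <= 0:
        raise ValueError("bandwidth and round-trip time must be positive")
    # bits/s * ms -> bits * 1000; divide by 8 (bits per byte) and 1000 (ms per s)
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


def per_stream_buffer_bytes(bandwidth_bits_per_s: int, rtt_ms: int, streams: int) -> int:
    """Rough TCP buffer size per stream when the load is split over parallel streams."""
    if streams < 1:
        raise ValueError("at least one stream is required")
    bdp = bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms)
    return -(-bdp // streams)  # ceiling division, so the streams together cover the BDP


if __name__ == "__main__":
    # Hypothetical link: 10 Gbit/s with a 20 ms round-trip time, 8 parallel streams.
    link = 10_000_000_000
    print(bandwidth_delay_product_bytes(link, 20))
    print(per_stream_buffer_bytes(link, 20, 8))
```

In practice the best values also depend on packet loss, disk throughput at both ends, and how many transfers run concurrently, which is why tuning remains largely empirical.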
The group’s expertise lies in providing high end scientific data services to support research, and in the research that supports building new data services. We are often involved in projects, where we contribute expertise in high end data management and data security. The group’s expertise includes:
- Big (volume) data for research – working repository, archiving, preservation.
- High availability services.
- Data security – while data integrity is our main security concern, we have considerable expertise in practical data security, including single sign-on.
- Data for specific areas of research: high energy physics, astronomy, fusion.
- Scaling and scalability testing.
- EUDAT – delivering a shared data e-Infrastructure for a diverse range of user communities.
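Since data integrity is listed above as the main security concern, a minimal sketch of the usual approach may be useful: compute a checksum when a file is deposited, store it, and compare it when the file is read back. The example uses Adler-32 from Python’s standard `zlib` module purely for illustration; the function names and the chunk size are hypothetical, and this is not a description of how any particular datastore implements its checks.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so that large files need not fit in memory.
    """
    checksum = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            # Passing the running value continues the checksum across chunks.
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def verify_file(path: str, expected_hex: str) -> bool:
    """Compare a file's current checksum with the value recorded at deposit time."""
    return adler32_of_file(path) == expected_hex.strip().lower()
```

A lightweight checksum such as this detects accidental corruption; it does not protect against deliberate tampering, for which a cryptographic hash would be the appropriate choice.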