Data Curation

Data Curation: Managing data to ensure they are fit for contemporary use and available for discovery and reuse.

Archiving and preservation are subsets of curation, which is a broader, planned, and interactive process. The curation process is well understood by the UC San Diego Libraries and has long been applied to non-digital scholarly materials.

Archiving: Ensuring that data are properly selected, appraised, stored, and made accessible, and that their logical and physical integrity (including security and authenticity) is maintained.

Preservation: Ensuring that items or collections remain accessible and viable in subsequent technology environments.

Why is data curation an important enterprise for UC San Diego?

Data curation is critically important for a research institution because it provides two vital services: (1) data are not merely stored, but are preserved in order to overcome technical obsolescence inherent in any storage system; and (2) data are documented in such a way that they can be referenced in, and linked from, scientific publications and meet the requirements of funding agencies.

Members of the UC San Diego research community regularly produce large amounts of data that need to be stored, analyzed, and preserved. These data sets and their derivative output (e.g., publications and visualizations) represent the intellectual capital of the University. They have inherent and enduring value and must be preserved and made accessible for reuse by future researchers.

Today’s interdisciplinary research challenges cannot be addressed without the ability to combine data from disparate disciplines. Researchers need to know what relevant data exist, as well as how to retrieve, combine, mine, and analyze data using the latest tools.

Granting agencies understand this fundamental need and are increasingly making it a condition of funding that researchers have a plan for preserving their data and for making them discoverable and available for reuse.

If UC San Diego is to remain competitive, we need to invest in baseline data services that respond to these new realities.

How is data curation being done at UC San Diego?

Curation services at UC San Diego are provided jointly by staff of the UC San Diego Libraries and the San Diego Supercomputer Center (SDSC).

The Libraries provide curatorial oversight, bibliographic control, and integration services. SDSC staff provide the back-end technology services needed to actively store and maintain data. Staff from both organizations provide the metadata services necessary to ensure that data remain discoverable and accessible.

The data themselves are housed on campus in the UC San Diego cyberinfrastructure campus storage facility. This is merely the first level of storage needed. True long-term preservation requires planning for additional levels of storage.

Because data are subject to loss caused by environmental, organizational, or technological disruptions, it is imperative that campus research data be replicated in at least two remote sites that are geographically, organizationally, and technically independent of each other. The entire enterprise must also be supported by a reliable source of revenue, as even a temporary interruption of proactive curation can lead to irreparable loss.

For this reason, another layer of service is required that stores exact duplicates of the data offsite. This important service would be modeled on Chronopolis, a groundbreaking project started by the UC San Diego Libraries and SDSC and initially funded by the Library of Congress.

Chronopolis is now a partnership of SDSC, the UC San Diego Libraries, the National Center for Atmospheric Research (NCAR), and the University of Maryland. These sites have joined to provide the largest-scale preservation environment in the US, with 100 terabytes of federated storage that is actively monitored, maintained, and managed at widely dispersed sites on differing hardware platforms.
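As an illustration only, active monitoring of replicated data is commonly done with periodic fixity checks: a checksum is recorded when a file is ingested, and each replica is later re-hashed and compared against that record. The Python sketch below shows the general idea. It is not a description of how Chronopolis or the campus storage facility is implemented, and the function names, site labels, and choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_replicas(expected_digest: str, replicas: dict[str, Path]) -> dict[str, str]:
    """Compare each replica against the digest recorded at ingest.

    `replicas` maps a hypothetical site label to the local path of that
    site's copy. Returns one status per site: "ok", "corrupt", or "missing".
    """
    report = {}
    for site, path in replicas.items():
        if not path.is_file():
            report[site] = "missing"
        elif sha256_of(path) == expected_digest:
            report[site] = "ok"
        else:
            report[site] = "corrupt"
    return report
```

In a scheme like this, a replica reported as "corrupt" or "missing" would be restored from a copy at another site that still matches the recorded digest, which is one reason independent replicas matter.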

For more information, please contact David Minor at minor@sdsc.edu.