Open source data repository technologies

The AOSP landscape study made clear that open access institutional repositories are well established on the African continent, with 179 African repositories registered on OpenDOAR. The majority of these repositories use the open source DSpace software, and there is considerable capacity among African system administrators and librarians. Other data repository software options include Invenio 3 (open source, built for large-scale repositories and highly scalable, up to 100+ million records and petabytes of files) and Dataverse (an open source research data repository with many features; see https://github.com/IQSS/dataverse/releases/tag/v4.10 ), as well as the technology options mentioned in the World Bank Toolkit (a great resource to guide you in setting up your data repository service). Ideally a data repository should form part of a science gateway, which brings together shared equipment and instruments, computational services, advanced software applications, collaboration capabilities, data repositories, and networks.

For those interested in DSpace as an option, the following information on how DSpace can serve as a data repository was recently shared on the DSpace mailing list by Bram Luyten:

Examples of DSpace used as data repositories

University of Exeter https://ore.exeter.ac.uk/repository/handle/10871/14881 (single item, multiple TBs of data)

University of Nottingham Research Data Management Repository https://rdmc.nottingham.ac.uk/

Swiss Federal Institute of Technology in Zurich (ETH Zurich) https://www.research-collection.ethz.ch/

University of Cambridge https://www.repository.cam.ac.uk/browse?type=type&value=Dataset

DRYAD https://datadryad.org/

Indiana University https://dataworks.iupui.edu/

Smithsonian Libraries https://repository.si.edu/handle/10088/27850

Strengths of DSpace as a data repository

– File type agnostic. You’re not limited to any specific file type or particular size.

– No theoretical file size limit. Limits may exist elsewhere (the OS or underlying software), but DSpace itself has no known limit on data size.

– Flexible metadata schemas, allowing you to align with DataCite and other metadata schemas.

– DOI integration with DataCite (connected with DataCite for automatic DOI minting).

– Different workflows and rules are possible on a per collection basis, giving an excellent starting point for a mixed Publication/Data set repository.

– Advancing URLs, e.g. where a researcher wants a permanent URL for their data set to send to publishers, but would also like the dataset submission to refer to the permanent URL of the published paper. DSpace-CRIS can generate the links to the datasets while the publication is being submitted, and it generates the reciprocal link (from the dataset to the publication) automatically, without the need for a repository administrator to reopen the dataset items and manually add the link to the publication.
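The reciprocal linking in the last point can be pictured with a small sketch. This is not how DSpace-CRIS implements it; the Item class, the link_items function and the handle values are all hypothetical and only illustrate the idea that a single action records the relation on both the publication and the dataset.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Item:
    """A hypothetical repository item (a publication or a dataset)."""
    handle: str  # permanent identifier; placeholder values are used below
    title: str
    related: Set[str] = field(default_factory=set)  # handles of linked items


def link_items(publication: Item, dataset: Item) -> None:
    """Record the relation on both items in one step.

    Because the dataset receives the publication's handle here, nobody
    has to reopen the dataset later to add the link by hand.
    """
    publication.related.add(dataset.handle)
    dataset.related.add(publication.handle)


# Placeholder handles, not real identifiers.
dataset = Item(handle="example/dataset-1", title="Survey data")
paper = Item(handle="example/paper-1", title="Survey paper")

# Linking happens once, at the time the publication is submitted.
link_items(paper, dataset)
```

After link_items runs, each item lists the other's handle in its related set, which is the behaviour the bullet describes as the automatically generated reciprocal link.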

DSpace-CRIS consists of a data model describing objects of interest to Research and Development and a set of tools to manage the data. Standard DSpace is used to deal with publications and data sets, whereas DSpace-CRIS involves other CRIS entities: Researcher Pages, Projects, Organization Units and Second Level Dynamic Objects (single entities specialized by a profile, such as Journal, Prize or Event; they are dynamic because each profile can define its own set of properties and nested objects). For more info, see: https://dspace-cris.4science.it/handle/123456789/15