Migrating EPrints data into DSpace
1 About this document
- Author
- Corey Wallis, RUBRIC Technical Officer
- Purpose
- This technical report outlines how to use the EPrints data migration scripts to ingest MARC records into a DSpace repository.
- Audience
- RUBRIC Project Partners and other users of EPrints and DSpace
- Requirements
- A running instance of DSpace
- An XML export from an EPrints repository
- Python 2.4 installed on the system to run the migration
- The libxml2 library, including the Python bindings, installed on the system to run the migration
- The libxslt library, including the Python bindings, installed on the system to run the migration
- References
- Official DSpace websitehttp://www.dspace.org/
- Official DSpace Item Importer documentationhttp://dspace.org/technology/system-docs/application.html#itemimporter
- Official DSpace Dublin Core specificationhttp://www.dspace.org/technology/metadata.html
- Official EPrints websitehttp://www.eprints.org/software/
- Official Eprints documentation on the export_xml commandhttp://www.eprints.org/documentation/tech/php/export_xml.php
- Official Python websitehttp://www.python.org/
- Official libxml2 websitehttp://xmlsoft.org/python.html
- Official libxslt websitehttp://xmlsoft.org/XSLT/python.html
- Official py.test tool and library websitehttp://codespeak.net/py/current/doc/test.html
- Official utf-x websitehttp://utf-x.sourceforge.net/
- Official Subversion websitehttp://subversion.tigris.org/
- Notes
- The EPrints to DSpace migration script has been developed on a Linux based system. Python is a cross platform programming language and therefore the scripts should also run under Microsoft Windows, and the OSX operating systems. This has not been tested.
- The installation of the Python programming language, the libxml2 and libxslt libraries, including Python bindings, is outside the scope of this technical report. Many Linux distributions, such as Ubuntu, will have these already installed.
- The EPrints to DSpace metadata transformation was developed using sample data from the University of Southern Queensland. Therefore adjustments may need to be made to the XSL transformation to take into account other institutions requirements.
- Due to the modular design of the EPrints to DSpace data migration, these adjustments can be made without the need to modify the Python script, or the other supporting files. If modifications are to be made it is strongly suggested that a copy of the development files be checked out via subversion. This will provide you with the utf-x tests used to develop the original files and provide a basis for customisation using the same utf-x framework.
2 Background Information
A component of the work undertaken at RUBRIC-Central is the development of various data migration strategies. These strategies are designed to assist RUBRIC Project Partners to migrate data into, and out of, various systems. The data migrations specifically target the three institutional repository solutions under consideration as part of the project.
Interest was expressed in being able to migrate records, from an Eprints repository, into a DSpace repository. This technical report, and the associated Python scripts, comprise the strategy to achieve this goal.
The Python script creates a DSpace simple archive that can then be ingested into a DSpace repository. The output of the script has been tested using DSpace version 1.4 alpha. It is possible that the output can be ingested into other versions of DSpace, at the time of writing this had not been tested.
The Python scripts have been developed using a unit testing approach using the testing framework provided by the py.test tool and library. More information about the library is available at http://codespeak.net/py/current/doc/test.html. These scripts have been developed using Python version 2.4, and may work with earlier versions. However this has not been tested.
The conversion of the EPrints metadata into the Dublin Core supported by DSpace is achieved using XSL transformations. The stylesheets have been developed using a unit testing approach using the framework provided by the utf-x suite. More information about the utf-x suite is available at http://utf-x.sourceforge.net/.
3 The Python Scripts
The data migration work is carried out by a script written in the Python programming language. The following files make up the script used in the data migration.
- config.xml
- The configuration file defines options used by the eprints_to_dspace.py script
- eprints_to_dspace.py
- The Python script that does all of the work
- ./xsl/eprints_to_dspace.xsl
- The XSL transformation used to convert the EPrints metadata into DSpace compliant Dublin Core
- ./xsl/ASRC.xml
- An XML file containing all of the ASRC subject codes. It is used by the eprints_to_dspace.xsl file to replace the ASRC subject number with the complete description. For more information on this file see the RUBRIC Technical Report: ASRC Subject codes for DSpace
4 Downloading the Python Script
All of the data migration scripts, and associated code libraries, modules and files, are made available via a publicly accessibly website at the following URL.
https://rubric-central.usq.edu.au/svn/Public/code
It is possible to download files from this website in two different ways. The first is via a standard Internet browser. The second is via a subversion client.
4.1 Download the Distribution Package
The Python script, and associated files have been packaged for ease of downloaded as a tar.gz file. The file is available via the following URL:
The procedure for configuring, and using, the scripts is outlined in section 5 of this technical report.
4.2 Download the Python Script via Subversion
If you have the subversion client installed you can download the Python scripts, test files, and other files used during development. The URL that you will need to check out is as follows:
https://rubric-central.usq.edu.au/svn/Public/code/legacy_code/eprints-to-dspace/development/
5 How to Migrate the EPrints data
The following sections of this technical report outline the procedure for using the Python script to migrate the EPrints data into a DSpace simple archive and ingest this archive into a DSpace repository.
It is assumed that you have installed Python, the libxml2 and libxslt libraries, including Python bindings. Specific installation instructions for these components is outside the scope of this technical report.
5.1 Installing the Python Script
Download the eprints_to_dspace.tar.gz file from the URL outlined in section 4.11.https://rubric-central.usq.edu.au/svn/Public/code/legacy_code/eprints-to-dspace/eprints_to_dspace.tar.gz
Extract the contents of the file
tar -xzf ~/eprints-to-dspace.tar.gz
The Python script, and supporting files, will be extracted into the following directory
~/eprints-to-dspace
5.2 Configuring the Python Script
The configuration options for the Python script is stored in the config.xml file. Each option is outlined below, adjust the values to your local configuration.
- doXSLTransformation
- If set to True, this option instructs the script to conduct the XSL transformation and convert the EPrints metadata into DSpace compliant Dublin Core
- doCreateArchive
- If set to True, this option instructs the script to create the DSpace simple archive, using the output of the XSL transformation
- xslLocation
- The location of the XSL transformation used to convert the EPrints metadata into DSpace compliant Dublin Core. Leave this configuration at the default value
- xslOutputLocation
- This configuration option defines where the output of the XSL transformation should be stored. If the createDirectory attribute is set to True, the directory will be created if it does not exist, and deleted then recreated if it does exist.
- eprintsMetadataLocation
- This configuration option defines where the XML file created by the EPrints xml_export command is stored
- dspaceLocation
- This configuration option defines where the DSpace simple archive will be created. If the createDirectory attribute is set to True, the directory will be created if it does not exist, and deleted then recreated if it does exist.
Once all of the configuration changes have been made, save the file and exit out of your editor.
5.3 Creating the DSpace Simple Archive
The creation of the DSpace Simple Archive is a two step process. First it is necessary to convert the Eprints metadata into DSpace compliant Dublin Core. The second step takes the Dublin Core and creates a DSpace Simple Archive.
5.3.1 Converting the EPrints metadata
Ensure the doXSLTransformation configuration option is set to True
Ensure the doCreateArchive configuration option is set to False
Ensure all of the other configuration options are correct
Invoke the script using the following command
./eprints_to_dspace.py
Upon completion a series of XML files will be created in the directory specified by the xslOutputLocation configuration option
5.3.2 Creating the DSpace Simple Archive
Ensure the doXSLTransformation configuration option is set to False
Ensure the doCreateArchive configuration option is set to True
Ensure all of the other configuration options are correct
Invoke the script using the following command
./eprints_to_dspace.py
Upon completion a the DSpace simple archive will be created in the directory specified by the dspaceLocation configuration option
5.3.3 Creating the DSpace Simple Archive in one step
Ensure the doXSLTransformation configuration option is set to True
Ensure the doCreateArchive configuration option is set to True
Ensure all of the other configuration options are correct
Invoke the script using the following command
./eprints_to_dspace.py
Upon completion a series of XML files will be created in the directory specified by the xslOutputLocation configuration option
Upon completion a the DSpace simple archive will be created in the directory specified by the dspaceLocation configuration option
5.4 Ingesting the DSpace Simple Archive
Ingesting the DSpace simple archive is a three step process. The first is to run a test ingest, if successful continue with a true ingest, and finally re-index the repository to ensure the indexes are updated with the new objects.
If for any reason it is necessary to delete the imported objects, section 5.4.4 outlines this process.
Before ingesting the items it is necessary to identify the handle of the collection that these items are to be associated with. To achieve this, complete the following procedure:
Visit the DSpace homepage in your favourite Internet browser
Click on the Communities & Collections link in the browse menu
Hover the mouse over the link for the collection you wish to import into
Determine the handle for the collection by examining the URL displayed in the status bar of your Internet browser. The numbers after the word handle are the handle for this collection.
5.4.1 Ingesting the DSpace Simple Archive – Test
If necessary copy the DSpace Simple Archive to the server running DSpace
Ensure the DSpace Simple Archive is accessible to the dspace user
Change to the dspace user
Navigate to the bin directory of the DSpace installation. For example on RUBRIC systems this directory is at the following location
/usr/local/dspace/bin
Invoke the following command replacing:
[eperson] with the email address associated with a DSpace eperson with sufficient privilages to add items to the repository
[collection] with the collection handle identified in section 5.5
[source] with the location of the DSpace simple archive
[map] with a suitable location for the map file
./dsrun org.dspace.app.itemimport.ItemImport –add--eperson=[eperson] –collection=[collection]--source=[source] --mapfile=[map] --test
If the ingest is successful the DSpace utility will display an appropriate success message.
If the ingest fails for any reason, examine the error message and take the appropriate action to resolve the error condition.
5.4.2 Ingesting the DSpace Simple Archive – Live
Ensure the DSpace Simple Archive is accessible to the dspace user
Change to the dspace user
Navigate to the bin directory of the DSpace installation. For example on RUBRIC systems this directory is at the following location
/usr/local/dspace/bin
Invoke the following command replacing:
[eperson] with the email address associated with a DSpace eperson with sufficient privilages to add items to the repository
[collection] with the collection handle identified in section 5.5
[source] with the location of the DSpace simple archive
[map] with a suitable location for the map file
./dsrun org.dspace.app.itemimport.ItemImport --add
--eperson=[eperson] --collection=[collection]
--source=[source] --mapfile=[map]
If the ingest is successful the DSpace utility will display an appropriate success message.
If the ingest fails for any reason, examine the error message and take the appropriate action to resolve the error condition.
5.4.3 Re-Indexing the DSpace repository
To minimise impact to users of the system, ensure the DSpace repository isn't under any heavy load
Ensure you are in the bin directory of the DSpace installation. For example on RUBRIC systems this directory is at the following location
/usr/local/dspace/bin
Invoke the following command
./index-all
Visit the DSpace homepage in your favourite Internet browser
Click on the Communities & Collections link in the browse menu
Click on the link for the collection that the new items were ingested into
Explore the collection to ensure the ingest completed successfully
5.4.4 Deleting the Imported Items
If it is necessary to remove the items from the DSpace repository, complete the following procedure.
Ensure the map file created in section 5.5.2 is still available
Ensure the map file is accessible to the dspace user
Change to the dspace user
Navigate to the bin directory of the DSpace installation. For example on RUBRIC systems this directory is at the following location
/usr/local/dspace/bin
Invoke the following command replacing:
[eperson] with the email address associated with a DSpace eperson with sufficient privilages to add items to the repository
[map] with the location of the map file created in section 5.5.2
./dsrun org.dspace.app.itemimport.ItemImport --delete
--eperson=[eperson] --mapfile=[map]
If the deletion is successful the DSpace utility will display an appropriate success message
If the delete fails for any reason, examine the error message and take the appropriate action to resolve the error condition




