Harvesting ADT data to VITAL
1 About this document
- Author
- Bron Dye, RUBRIC Technical Officer
- Tim McCallum, RUBRIC Technical Officer
- Purpose
- This technical report outlines how to use the ADT Harvest script to harvest items from an ADT repository for import to VITAL3.1
- Audience
- RUBRIC Project Partners and other users of ADT
- Requirements
- Access to an ADT website
- Python 2.4 installed on the system running the harvest and on the system containing the Fedora repository (where the ingest will take place)
- The libxml2 library, including the Python bindings, installed on the system to run the harvest
- The pdf2text module installed
- An instance of VITAL3.1 for ingest
- References
- Official ADT website
- http://adt.caul.edu.au/
- Official VITAL website at VTLS
- http://www.vtls.com/Products/vital.shtml
- Documentation on the FOXML (Fedora Object XML) specification
- http://www.fedora.info/download/2.1.1/userdocs/digitalobjects/introFOXML.html
- Official Python website
- http://www.python.org/
- Official libxml2 website
- http://xmlsoft.org/python.html
- Official py.test tool and library website
- http://codespeak.net/py/current/doc/test.html
- Official Subversion website
- http://subversion.tigris.org/
- Notes
- The ADT harvesting scripts were developed on a Linux-based system. Python is a cross-platform programming language, so the scripts should also run under Microsoft Windows and Mac OS X. This has not been tested.
- Installing the Python programming language and the libxml2 library, including the Python bindings, is outside the scope of this technical report. Many Linux distributions, such as Ubuntu, have these installed already.
2 Background Information
A component of the work undertaken at RUBRIC-Central is the development of various data migration strategies. These strategies are designed to assist RUBRIC Project Partners to migrate data into, and out of, various systems. The data migrations specifically target the three institutional repository solutions under consideration as part of the project.
Interest was expressed in being able to migrate items from an Australian Digital Thesis (ADT) repository into other repositories, such as VITAL or DSpace. This technical report, and the associated Python scripts, comprise part of the strategy to achieve this goal.
The Python scripts create an archive directory similar in structure to a DSpace simple archive. Within it, the scripts create one directory per ADT item. Each item directory holds a temporary xml file storing dublin core metadata, all files relevant to the ADT item, and a file listing those files. This structure forms the basis of further migration and can be reused for other repositories.
The Python scripts have been developed using a unit testing approach, with the testing framework provided by the py.test tool and library. More information about the library is available at the website listed in the references section of this technical report. These scripts have been developed using Python version 2.4 and may work with earlier versions; however, this has not been tested.
The Python scripts are modular in nature and use functionality provided by modules that have been used in other migration strategies. It is anticipated that this type of architecture will allow modification and customisation as required.
3 The Python Scripts
The data migration work is carried out by scripts written in the Python programming language. For more information about Python, see the official Python website listed in the references section of this technical report. The following files make up the scripts used in the data migration.
- get_adt_html.py
- This python script is located in the adt directory of the migration tool kit and is used to harvest items out of an ADT repository.
- xsl_transform.py
- This python script is located in the top level of the migration tool kit directory.
- A Python module that transforms metadata from one form to another depending on the stylesheet used. In the ADT to foxml process, this script is used to convert the harvested ADT metadata into marc metadata and also to convert the marc file to dublin core metadata.
- dspace_archive.py
- This python script is located in the dspace archive directory of the migration tool kit.
- A Python module that provides a utility class for the creation of dspace archive objects.
- pdf_to_full_text.py
- This python script is located in the top level of the migration tool kit directory.
- A Python module that converts all pdf files within each item directory to a single fulltext file within the individual item directory.
- archive_to_foxml.py
- This python script is located in the top level of the migration tool kit directory.
- A Python module that creates foxml objects: it adds a prefix to the title, merges all datastreams, and exports the foxml objects into a single output directory.
- insert_xmlns_xsi.py
- This python script is located in the top level of the migration tool kit directory.
- A python script to insert a correct marc namespace into a completed foxml object.
- create_marc_controlfield_tag_008.py
- This python script is located in the top level of the migration tool kit directory.
- A python script to create a controlfield tag within each marc record to reflect the current date and language used within the item.
4 Download the Python Scripts via Subversion
If you have the subversion client installed you can download the Python scripts, test files, and other files used during development. The URL that you will need to check out is as follows:
https://rubric-central.usq.edu.au/svn/Public/code/migration_toolkit
5 How to Harvest the ADT Data
The following sections of this technical report outline the procedure for using the Python scripts to harvest items from the ADT repository.
It is assumed that you have Python and the libxml2 library, including the Python bindings, already installed.
5.1 Harvesting the Items in the ADT Repository
The get_adt_html.py script is the Python script that harvests all of the items from ADT. The script captures the metadata between the head tags of the individual ADT items and converts the tags to lowercase.
To invoke the script enter the python get_adt_html.py command in the following directory: migration_toolkit/adt/python
Script structure:
python get_adt_html.py [ADT URL]
Argument Definitions:
[ADT URL] the URL of the ADT web page to be harvested.
Example:
python get_adt_html.py http://adt.usq.edu.au/adt-QUSQ/public/index.html
Note: Any "HTML parser error: Opening and ending tag mismatch" messages can be ignored.
This script creates a temporary DSpace archive, dspaceArchive, in the current directory. Files that the script is unable to locate are not created, and the user is notified of any missing files.
Each item is represented by one item directory within dspaceArchive. These directories are numbered consecutively: the first is called 00, the next 01, and so on.
The datastreams for each object are stored in the corresponding numbered directory. For example, the 00 directory holds two PDF datastreams, 01front.pdf and 02whole.pdf, together with a file named contents that lists the files attached to the item and a temporary xml file, dc_temp.xml, containing basic dublin core metadata.
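The metadata capture described at the start of this section can be illustrated with a minimal sketch. It is not the code of get_adt_html.py. It assumes that the metadata of interest sits in meta elements (with name and content attributes) between the head tags, and it uses the Python 3 standard library parser rather than the libxml2 bindings and Python 2.4 that the toolkit targets. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect (name, content) pairs from meta elements inside <head>.

    HTMLParser reports tag and attribute names in lowercase, which mirrors
    the lowercasing step described in the report.
    """

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            # Assumption: only meta elements carrying both attributes matter.
            if name is not None and content is not None:
                self.meta.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_meta(html_text):
    """Return the list of (name, content) pairs found in the head section."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

A caller would pass the HTML of one ADT item page to extract_head_meta and receive the metadata pairs in document order.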
5.1.1 DSpace simple archive example
dspaceArchive/
    00/
        01front.pdf
        02whole.pdf
        contents
        dc_temp.xml
    01/
        01front.pdf
        02whole.pdf
        03appendix.pdf
        contents
        dc_temp.xml
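To check a harvested archive against the layout above, a short sketch such as the following could be used. It is not part of the migration tool kit; the function name is hypothetical, and it is written for Python 3 rather than the Python 2.4 the scripts target.

```python
import os


def list_archive_items(archive_path):
    """Return a dict mapping each item directory name to its sorted file names."""
    items = {}
    for name in sorted(os.listdir(archive_path)):
        item_path = os.path.join(archive_path, name)
        if not os.path.isdir(item_path):
            continue  # ignore stray files at the top level of the archive
        items[name] = sorted(
            entry
            for entry in os.listdir(item_path)
            if os.path.isfile(os.path.join(item_path, entry))
        )
    return items
```

For example, list_archive_items("dspaceArchive") would return one entry per numbered item directory, which makes it easy to spot items with missing files.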
6 Preparing the harvested data for ingest
6.1 Convert basic metadata to Marc
The xsl_transform.py script converts the dc_temp.xml files in the individual item directories of the dspaceArchive directory to VITAL-compliant marc, marc.xml.
Script structure:
python xsl_transform.py [InputFile] [XslFilePath] [OutputFile] [archiveName] [RemoveInputFile]
Argument Definitions:
[InputFile] the filename of the input xml file to be converted; this file is found within the item directory of the archive
[XslFilePath] the file path to the stylesheet to be used for the conversion
[OutputFile] the filename to be used following the conversion; this file will be found in the item directory within the archive
[archiveName] the name of the archive to be accessed
[RemoveInputFile] whether to remove (True) or retain (False) the input file
Example:
python xsl_transform.py dc_temp.xml adt/xsl/adt_html_to_marc.xsl marc.xml adt/python/dspaceArchive True
6.2 Add Controlfield tag
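The basic metadata does not have a MARC controlfield tag 008 that is applicable to the contents of each record. The create_marc_controlfield_tag_008.py script adds this tag to each marc record in the archive, reflecting the current date and the language used within the item.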
Script structure:
python create_marc_controlfield_tag_008.py [archiveName]
Argument Definitions:
[archiveName] the name of the archive to be accessed.
Example:
python create_marc_controlfield_tag_008.py adt/python/dspaceArchive
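The report does not show how create_marc_controlfield_tag_008.py lays out the field, so the following is only a sketch of what a date-and-language 008 value could look like. It assumes the MARC 21 layout of a 40-character field in which positions 00-05 hold the date entered on file (yymmdd) and positions 35-37 hold the three-letter language code, with every other position set to the fill character. The function name is hypothetical and the code is written for Python 3.

```python
import datetime


def build_008(language_code, today=None):
    """Build a 40-character 008 value carrying only a date and a language.

    Assumption: all positions other than 00-05 (date entered on file) and
    35-37 (language) are left as the MARC fill character '|'.
    """
    if len(language_code) != 3:
        raise ValueError("language code must be three characters, e.g. 'eng'")
    if today is None:
        today = datetime.date.today()
    date_entered = today.strftime("%y%m%d")  # positions 00-05
    filler = "|"
    # positions 06-34 (29 characters), language in 35-37, positions 38-39
    return date_entered + filler * 29 + language_code + filler * 2
```

A real record would normally fill in more positions (publication date, place of publication, and so on); this sketch covers only the two values the report mentions.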
6.3 Convert Marc to Dublin Core stream
The xsl_transform.py script is the Python script that converts the marc metadata into a dublin core stream ready for ingest into VITAL.
- Note:
- This script requires that the namespace of the marc xml be declared at the beginning of the file, not included in the individual tags inside the marc record.
Script structure:
python xsl_transform.py [InputFile] [XslFilePath] [OutputFile] [archiveName] [RemoveInputFile]
Argument Definitions:
[InputFile] the filename of the input xml file to be converted; this file is found within the item directory of the archive
[XslFilePath] the file path to the stylesheet to be used for the conversion
[OutputFile] the filename to be used following the conversion; this file will be found in the item directory within the archive
[archiveName] the name of the archive to be accessed
[RemoveInputFile] whether to remove (True) or retain (False) the input file
Example:
python xsl_transform.py marc.xml xsl/marc_dc.xsl dublin_core.xml adt/python/dspaceArchive False
This script converts the marc.xml files in the individual item directories of the dspaceArchive directory to VITAL-compliant dublin core, dublin_core.xml.
6.4 Create a full text version of the pdf files
The pdf_to_full_text.py script converts the harvested pdf files to fulltext. Replace [archiveName] with the full path to the archive created during this migration.
Script structure:
python pdf_to_full_text.py [archiveName]
Argument Definitions:
[archiveName] the name of the archive to be accessed.
Example:
python pdf_to_full_text.py adt/python/dspaceArchive
Once this script has run, an additional file called fulltext will be found in the item directory. It contains the fulltext of all the pdf files belonging to the item.
dspaceArchive/
    00/
        01front.pdf
        02whole.pdf
        contents
        dublin_core.xml
        marc.xml
        fulltext
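The merging step can be sketched as follows. This is not the code of pdf_to_full_text.py: the text extraction itself is passed in as a caller-supplied function (extract_text, a hypothetical name) because the report does not describe the interface of the pdf2text module. The sketch is written for Python 3.

```python
import os


def write_fulltext(item_dir, extract_text, output_name="fulltext"):
    """Concatenate the text of every PDF in item_dir into one file.

    extract_text is a caller-supplied function that takes the path of a
    PDF file and returns its text as a string (an assumption of this sketch).
    """
    pdf_names = sorted(
        name for name in os.listdir(item_dir) if name.lower().endswith(".pdf")
    )
    parts = [extract_text(os.path.join(item_dir, name)) for name in pdf_names]
    output_path = os.path.join(item_dir, output_name)
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(parts))
    return output_path
```

Calling write_fulltext once per item directory would leave a single fulltext file beside the PDF files, as in the listing above.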
6.5 Create foxml objects
The archive_to_foxml.py script is the Python script that will create a directory of foxml objects.
Script structure:
python archive_to_foxml.py [archiveName] [startNum] [PIDPrefix]
[outputDirectory] [labelPrefix] [foxmlObjectState] [MARCFileName]
[MARCDataStreamState] [DCFileName] [DCDataStreamState] [URLforNonXmlDataStreams]
Argument Definitions:
[archiveName] the full path to the archive to be accessed
[startNum] the starting number for the PID increment
[PIDPrefix] the prefix of the PID
[outputDirectory] the name of the directory for storing the foxml objects
[labelPrefix] a prefix to be added to the title for reference, for example Imported Item:
[foxmlObjectState] set this to Active (A), Inactive (I) or Deleted (D)
[MARCFileName] the name of the MARC xml file contained in the archive
[MARCDataStreamState] set this to Active (A), Inactive (I) or Deleted (D)
[DCFileName] the name of the Dublin Core file contained in the archive
[DCDataStreamState] set this to Active (A), Inactive (I) or Deleted (D)
[URLforNonXmlDataStreams] the URL from which the non xml datastreams in the archive (PDF or full text) can be retrieved during ingest. If you use the Python simple server, use http://localhost:8000. If you use an existing server, enter the URL path on that server where the non xml datastreams will be made available (see 7 Making the Datastreams Available via a Web Server). If no pdf files or fulltext datastreams exist, set this to FALSE.
Example:
python archive_to_foxml.py adt/python/dspaceArchive 0 vital
exportedItems Imported_Items A marc.xml A dublin_core.xml A
http://servername/directoryname
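The way [startNum], [PIDPrefix] and [URLforNonXmlDataStreams] might combine can be shown with a small sketch. It rests on two assumptions that the report does not confirm: that each PID is the prefix and a running number joined by a colon (the usual Fedora namespace:identifier form), and that each datastream URL is the base URL followed by the item directory and the file name. Both function names are hypothetical, and the code is written for Python 3.

```python
from urllib.parse import quote


def generate_pids(pid_prefix, start_num, count):
    """Return count PIDs such as 'vital:0', 'vital:1', ... (assumed format)."""
    return ["%s:%d" % (pid_prefix, start_num + offset) for offset in range(count)]


def datastream_url(base_url, item_dir, file_name):
    """Build the URL a datastream would be fetched from (assumed layout)."""
    # quote() escapes characters such as spaces that are not valid in a URL path
    return "/".join([base_url.rstrip("/"), quote(item_dir), quote(file_name)])
```

With the example arguments above, generate_pids("vital", 0, 2) gives the PIDs for the first two items, and datastream_url("http://servername/directoryname", "00", "01front.pdf") gives the address the ingest would need to reach under the assumed layout.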
6.6 Insert name space into foxml objects
This script ensures that the correct marc namespace appears within the foxml object. The insertion is done at this point because doing so earlier in the process would change the formatting, and the marc viewed within the VITAL repository would then not validate correctly.
Script structure:
python insert_xmlns_xsi.py [outputDirectory]
Argument Definitions:
[outputDirectory] the name of the directory for storing the foxml objects
Example:
python insert_xmlns_xsi.py exportedItems
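The steps of section 6 can be chained from a small driver. The sketch below only assembles the command lines shown in the examples of sections 6.1 to 6.6 and, optionally, runs them in order. The function names and the toolkit_dir parameter are hypothetical, the argument values are the example values from this report, and the code is written for Python 3.

```python
import subprocess


def build_pipeline_commands(archive, output_dir, datastream_url):
    """Return the section 6 commands, in order, as argument lists."""
    return [
        ["python", "xsl_transform.py", "dc_temp.xml",
         "adt/xsl/adt_html_to_marc.xsl", "marc.xml", archive, "True"],
        ["python", "create_marc_controlfield_tag_008.py", archive],
        ["python", "xsl_transform.py", "marc.xml",
         "xsl/marc_dc.xsl", "dublin_core.xml", archive, "False"],
        ["python", "pdf_to_full_text.py", archive],
        ["python", "archive_to_foxml.py", archive, "0", "vital", output_dir,
         "Imported_Items", "A", "marc.xml", "A", "dublin_core.xml", "A",
         datastream_url],
        ["python", "insert_xmlns_xsi.py", output_dir],
    ]


def run_pipeline(commands, toolkit_dir):
    """Run each command from the top level of the migration tool kit.

    check_call raises CalledProcessError on the first failing step, so a
    failed conversion stops the run before later steps are attempted.
    """
    for command in commands:
        subprocess.check_call(command, cwd=toolkit_dir)
```

Stopping at the first failure is a deliberate choice here: each step consumes the files written by the previous one, so continuing after an error would only produce incomplete foxml objects.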
7 Making the Datastreams Available via a Web Server
By design, the Fedora ingest applications expect to retrieve non XML datastreams (PDF and Full Text items in the archive) via HTTP from a web server.
XML datastreams can be incorporated into the XML that represents the FOXML object. It is anticipated that in a future release of the Fedora software it will be possible to encode binary data, such as PDF files, and include them in the FOXML object.
There are two mechanisms for making the datastreams available via a web server. The first is to use the basic HTTP server that comes with Python. The other is to use an existing web server.
7.1 Using the Python Web Server
To use the base HTTP server that comes with Python follow these steps:
To prepare the data for ingest, copy only the contents of the [outputDirectory] into the [archiveName].
Copy the [archiveName] onto the server that is running VITAL
Change to the directory one level above the [archiveName] directory, for example /opt/vtls/vital/applications/fedora/client/bin
Read the following notes before executing the command.
- Note: If the & is not placed after the command as shown below, the user will lose the command prompt when the server starts and will be forced to open another console window.
- If using the & after the command, simply press return twice after issuing the command to pass the server process into the background and regain control of the command prompt.
python -c "import SimpleHTTPServer;SimpleHTTPServer.test()"&
This command will invoke Python and start the SimpleHTTPServer. The server will be able to provide access to files from the current directory, and any sub directories. The port used by the web server is 8000.
Please note that the SimpleHTTPServer will not be able to service requests other than those from the local machine, and therefore this process will only work when the datastreams are on the same server as the VITAL repository.
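The command above uses the SimpleHTTPServer module of Python 2. On a Python 3 interpreter that module is named http.server, and a comparable server could be started from a script as sketched below. The function name is hypothetical, and binding to 127.0.0.1 is a choice made in this sketch to match the local-only use described above.

```python
import functools
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler


def start_datastream_server(directory, port=8000):
    """Serve the files under directory over HTTP from a background thread.

    Returns the server object; call shutdown() and server_close() on it
    once the ingest has finished.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Listen on the loopback address only, so the server is reachable
    # solely from the machine it runs on.
    server = HTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Running the server in a background thread plays the same role as the trailing & in the shell command: the caller keeps control while files are being served.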
7.2 Using an Existing Web Server
Copy the [archiveName] directory into a new directory on an existing server. Ensure the directory name and structure are preserved, including the names of each file.
Copy the [outputDirectory] onto the server that is running VITAL.
8 Ingesting the Items into VITAL
To ingest the items into VITAL complete the following procedure:
Ensure that the datastream files are downloadable, either via the Python SimpleHTTPServer or via an existing web server
Ensure that the FOXML object files are accessible to the dbadmin user
Change to the dbadmin user
Navigate to the following directory on the server
/opt/vtls/vital/applications/fedora/client/bin
Ensure the FEDORA_HOME and JAVA_HOME shell variables exist. If they do not exist, sample commands are outlined below
export FEDORA_HOME=/opt/vtls/vital/applications/fedora
export JAVA_HOME=/opt/vtls/java
Invoke the fedora-ingest command. This may take some time to complete
- Script structure:
- ./fedora-ingest.sh d [foxmlObjectLocation] foxml1.0 O localhost:8080 fedoraAdmin [password]
- Argument Definitions:
- [foxmlObjectLocation] the name of the directory for storing the foxml objects
- * For a simple server use the [archiveName] see 7.1 Using the Python Web Server
- * For an existing webserver use [outputDirectory] see: 7.2 Using an Existing Web Server
- [password] fedoraAdmin password
- Example:
./fedora-ingest.sh d dspaceArchive foxml1.0 O localhost:8080 fedoraAdmin fedoraAdminpassword
Further information on the Fedora ingest utilities is available at the following URL:
http://www.fedora.info/download/2.1.1/userdocs/client/cmd-line/index.html#ingest
Once the ingest is complete, check the XML log file, as specified by the output of the program, for any errors.
If the new objects are to be made available via the VITAL portal, ensure sufficient time has elapsed to allow the VITAL indexer to become aware of the additional objects.
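The environment check and the ingest command from the procedure above can be sketched as follows. This is an illustration rather than part of the tool kit: the function names are hypothetical, the argument order is copied from the script structure shown in this section, and the code is written for Python 3.

```python
import os

REQUIRED_VARIABLES = ("FEDORA_HOME", "JAVA_HOME")


def missing_variables(environ=None):
    """Return the names of required shell variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


def build_ingest_command(foxml_dir, password,
                         host="localhost:8080", user="fedoraAdmin"):
    """Assemble the fedora-ingest.sh directory ingest command as a list."""
    return ["./fedora-ingest.sh", "d", foxml_dir, "foxml1.0", "O",
            host, user, password]
```

A wrapper script could refuse to start the ingest while missing_variables() returns a non-empty list, then pass the result of build_ingest_command to subprocess.check_call from the fedora client bin directory.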





