Harvesting ADT data to VITAL

1 About this document

Author
Bron Dye, RUBRIC Technical Officer
Tim McCallum, RUBRIC Technical Officer
Purpose
This technical report outlines how to use the ADT Harvest script to harvest items from an ADT repository for import into VITAL3.1.
Audience
RUBRIC Project Partners and other users of ADT
Requirements
Access to an ADT website
Python 2.4 installed on the system running the harvest and on the system containing the Fedora repository (where the ingest will take place)
The libxml2 library, including the Python bindings, installed on the system to run the harvest
The pdf2text module installed
An instance of VITAL3.1 for ingest
References
Official ADT website
http://adt.caul.edu.au/
Official VITAL website at VTLS
http://www.vtls.com/Products/vital.shtml
Documentation on the FOXML (Fedora Object XML) specification
http://www.fedora.info/download/2.1.1/userdocs/digitalobjects/introFOXML.html
Official Python website
http://www.python.org/
Official libxml2 website
http://xmlsoft.org/python.html
Official py.test tool and library website
http://codespeak.net/py/current/doc/test.html
Official Subversion website
http://subversion.tigris.org/
Notes
The ADT harvesting scripts have been developed on a Linux-based system. Python is a cross-platform programming language, so the scripts should also run under the Microsoft Windows and Mac OS X operating systems; however, this has not been tested.
The installation of the Python programming language and the libxml2 library, including the Python bindings, is outside the scope of this technical report. Many Linux distributions, such as Ubuntu, will have these already installed.

2 Background Information

A component of the work undertaken at RUBRIC-Central is the development of various data migration strategies. These strategies are designed to assist RUBRIC Project Partners to migrate data into, and out of, various systems. The data migrations specifically target the three institutional repository solutions under consideration as part of the project.

Interest was expressed in being able to migrate items from an Australian Digital Thesis (ADT) repository into other repositories, such as VITAL or DSpace. This technical report, and the associated Python scripts, comprise part of the strategy to achieve this goal.

The Python scripts create an archive directory, similar in structure to a DSpace simple archive. Within this directory the scripts create one directory per ADT item, each containing a temporary xml file storing dublin core metadata, all files relevant to the ADT item, and a file listing all relevant files attached to the ADT item. This structure forms the basis of further migration and can then be used for various other repositories.

The Python scripts have been developed using a unit testing approach, with the testing framework provided by the py.test tool and library. More information about the library is available at the website listed in the references section of this technical report. These scripts have been developed using Python version 2.4 and may work with earlier versions; however, this has not been tested.

The Python scripts are modular in nature and use functionality provided by modules that have been used in other migration strategies. It is anticipated that this type of architecture will allow modification and customisation as required.

3 The Python Scripts

The data migration work is carried out by scripts written in the Python programming language. For more information about Python, see the official Python website listed in the references section of this technical report. The following files make up the scripts used in the data migration.

get_adt_html.py
This python script is located in the adt directory of the migration tool kit and is used to harvest items out of an ADT repository.
xsl_transform.py
This python script is located in the top level of the migration tool kit directory.
A Python module that transforms metadata from one form to another depending on the stylesheet used. In the ADT to foxml process, this script is used to convert the harvested ADT metadata into marc metadata and also to convert the marc file to dublin core metadata.
dspace_archive.py
This python script is located in the dspace archive directory of the migration tool kit.
A Python module that provides a utility class for the creation of dspace archive objects.
pdf_to_full_text.py
This python script is located in the top level of the migration tool kit directory.
A Python module that converts all pdf files within each item directory to a single fulltext file within the individual item directory.
archive_to_foxml.py
This python script is located in the top level of the migration tool kit directory.
A Python module that creates foxml objects by adding a prefix to the title and merging all datastreams, then exports the foxml objects into a single output directory.
insert_xmlns_xsi.py
This python script is located in the top level of the migration tool kit directory.
A python script to insert a correct marc namespace into a completed foxml object.
create_marc_controlfield_tag_008.py
This python script is located in the top level of the migration tool kit directory.
A python script to create a controlfield tag within each marc record to reflect the current date and language used within the item.

4 Download the Python Scripts via Subversion

If you have the subversion client installed you can download the Python scripts, test files, and other files used during development. The URL that you will need to check out is as follows:

https://rubric-central.usq.edu.au/svn/Public/code/migration_toolkit

5 How to Harvest the ADT Data

The following sections of this technical report outline the procedure for using the Python scripts to harvest items from the ADT repository.

It is assumed that you have Python and the libxml2 library, including the Python bindings, already installed.

5.1 Harvesting the Items in the ADT Repository

The get_adt_html.py script is the Python script that harvests all of the items from ADT. The script captures the metadata between the head tags of the individual ADT items and converts the tags to lowercase.
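The following is a minimal sketch of the kind of head-metadata capture described above, not the code used by get_adt_html.py. It assumes the metadata is held in meta elements with name and content attributes, and it uses the Python 3 standard library rather than Python 2.4 and libxml2. The class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect (name, content) pairs from meta elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = []  # list of (lowercased name, content) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                # Assumption: the metadata name is lowercased as well.
                self.meta.append((name.lower(), content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_head_metadata(html_text):
    """Return the metadata pairs found between the head tags."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Hypothetical page used only to illustrate the behaviour.
SAMPLE_PAGE = """<HTML><HEAD>
<TITLE>Sample thesis</TITLE>
<META NAME="DC.Title" CONTENT="A Sample Thesis">
<META NAME="DC.Creator" CONTENT="Smith, Jane">
</HEAD><BODY><META NAME="ignored" CONTENT="x"></BODY></HTML>"""
```

Calling extract_head_metadata(SAMPLE_PAGE) returns only the two pairs from the head section; the meta element in the body is skipped.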

To invoke the script enter the python get_adt_html.py command in the following directory: migration_toolkit/adt/python

Script structure:

  • python get_adt_html.py [ADT URL]

Argument Definitions:

  • [ADT URL] the URL of the ADT web page to be harvested.

Example:

python get_adt_html.py http://adt.usq.edu.au/adt-QUSQ/public/index.html

Note: ignore any "HTML parser error:Opening and ending tag mismatch" error messages.

This script creates a temporary DSpace archive, dspaceArchive, in the current directory. Any files the script is unable to locate will not be created, and the user is notified of the missing files.

Each item is represented by one item directory within dspaceArchive. These directories are numbered consecutively: the first is called 00, the next 01, and so on.

The datastreams for each object are stored in the corresponding numbered directory. For example, the 00 directory contains the two PDF files, 01front.pdf and 02whole.pdf, a file named contents that lists the files attached to the item, and a temporary xml file, dc_temp.xml, containing basic dublin core metadata.

5.1.1 DSpace simple archive example

dspaceArchive/
    00/
        01front.pdf
        02whole.pdf
        contents
        dc_temp.xml
    01/
        01front.pdf
        02whole.pdf
        03appendix.pdf
        contents
        dc_temp.xml
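As a quick check of the harvested archive, the layout above can be walked with a few lines of Python. This is a sketch only, written for Python 3; the function names are hypothetical, and treating the contents file as required in every item directory is an assumption.

```python
import os


def list_archive_items(archive_path):
    """Return {item directory name: sorted list of file names}."""
    items = {}
    for entry in sorted(os.listdir(archive_path)):
        item_dir = os.path.join(archive_path, entry)
        if os.path.isdir(item_dir):
            items[entry] = sorted(
                name for name in os.listdir(item_dir)
                if os.path.isfile(os.path.join(item_dir, name))
            )
    return items


def missing_files(archive_path, required=("contents",)):
    """Return {item: [required files that are absent]} for incomplete items.

    Assumption: which files count as required is up to the caller; the
    default of "contents" is only an illustration.
    """
    problems = {}
    for item, files in list_archive_items(archive_path).items():
        absent = [name for name in required if name not in files]
        if absent:
            problems[item] = absent
    return problems
```

For example, missing_files("dspaceArchive", required=("contents", "dc_temp.xml")) would report any item directory that lacks either file.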

6 Preparing the harvested data for ingest

6.1 Convert basic metadata to Marc

The xsl_transform.py script converts the dc_temp.xml files in the individual item directories of the dspaceArchive directory to VITAL-compliant marc, marc.xml.

Script structure:

  • python xsl_transform.py [InputFile] [XslFilePath] [OutputFile] [archiveName] [RemoveInputFile]

Argument Definitions:

  • [InputFile] the filename of the input xml file to be converted; this file is found within the item directory of the archive

  • [XslFilePath] the file path to the stylesheet to be used for the conversion

  • [OutputFile] the filename to be used following the conversion; this file will be found in the item directory within the archive

  • [archiveName] the name of the archive to be accessed

  • [RemoveInputFile] whether to remove (True) or retain (False) the input file

Example:

python xsl_transform.py dc_temp.xml adt/xsl/adt_html_to_marc.xsl marc.xml adt/python/dspaceArchive True

6.2 Add Controlfield tag

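The basic metadata does not have a MARC controlfield tag 008 that is applicable to the contents of each record. The create_marc_controlfield_tag_008.py script creates this tag within each marc record produced by the xsl transformation, reflecting the current date and the language used within the item.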

Script structure:

  • python create_marc_controlfield_tag_008.py [archiveName]

Argument Definitions:

  • [archiveName] the name of the archive to be accessed.

Example:

python create_marc_controlfield_tag_008.py adt/python/dspaceArchive
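To illustrate what an 008 value carries, the sketch below builds one under the MARC 21 layout, in which the field is 40 characters long, positions 00-05 hold the date entered on file (yymmdd) and positions 35-37 hold the language code. It is not the logic of create_marc_controlfield_tag_008.py: the function names are hypothetical, and filling every other position with the "|" fill character is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_008(entered, language="eng", fill="|"):
    """Return a 40-character 008 value.

    Positions 00-05: date entered on file, as yymmdd.
    Positions 35-37: three-letter language code.
    All other positions: the fill character (an assumption here).
    """
    if len(language) != 3:
        raise ValueError("language must be a three-letter code")
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    field = [fill] * 40
    field[0:6] = entered.strftime("%y%m%d")
    field[35:38] = language.lower()
    return "".join(field)


def controlfield_xml(value):
    """Wrap an 008 value in a controlfield element (no namespace prefix)."""
    element = ET.Element("controlfield", {"tag": "008"})
    element.text = value
    return ET.tostring(element, encoding="unicode")
```

A call such as controlfield_xml(build_008(date.today(), "eng")) gives an element stamped with the current date.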

6.3 Convert Marc to Dublin Core stream

The xsl_transform.py script is the Python script that converts the marc metadata into a dublin core stream ready for ingest into VITAL.

Note:
This script requires that the namespace of the marc xml be declared at the beginning of the file, not included in the individual tags inside the marc record.

Script structure:

  • python xsl_transform.py [InputFile] [XslFilePath] [OutputFile] [archiveName] [RemoveInputFile]

Argument Definitions:

  • [InputFile] the filename of the input xml file to be converted; this file is found within the item directory of the archive

  • [XslFilePath] the file path to the stylesheet to be used for the conversion

  • [OutputFile] the filename to be used following the conversion; this file will be found in the item directory within the archive

  • [archiveName] the name of the archive to be accessed

  • [RemoveInputFile] whether to remove (True) or retain (False) the input file

Example:

python xsl_transform.py marc.xml xsl/marc_dc.xsl dublin_core.xml adt/python/dspaceArchive False

This script converts the marc.xml files in the individual item directories of the dspaceArchive directory to VITAL compliant dublin core, dublin_core.xml.

6.4 Create a full text version of the pdf files

The pdf_to_full_text.py script is the Python script that converts the harvested pdf files to fulltext. Replace [archiveName] with the full path to the archive created during this migration.

Script structure:

  • python pdf_to_full_text.py [archiveName]

Argument Definitions:

  • [archiveName] the name of the archive to be accessed.

Example:

python pdf_to_full_text.py adt/python/dspaceArchive

Once this script has run, an additional file called fulltext will be found in the item directory. This file contains the fulltext of all the pdf files belonging to the item.

dspaceArchive/
    00/
        01front.pdf
        02whole.pdf
        contents
        dublin_core.xml
        marc.xml
        fulltext

6.5 Create foxml objects

The archive_to_foxml.py script is the Python script that will create a directory of foxml objects.

Script structure:

  • python archive_to_foxml.py [archiveName] [startNum] [PIDPrefix] [outputDirectory] [labelPrefix] [foxmlObjectState] [MARCFileName] [MARCDataStreamState] [DCFileName] [DCDataStreamState] [URLforNonXmlDataStreams]

Argument Definitions:

  • [archiveName] the full path to the archive to be accessed

  • [startNum] the starting number for the PID increment

  • [PIDPrefix] the prefix to be used for the PID

  • [outputDirectory] the name of the directory for storing the foxml objects

  • [labelPrefix] the prefix to be added to the title for reference, e.g. Imported Item:

  • [foxmlObjectState] set this to Active (A), Inactive (I) or Deleted (D)

  • [MARCFileName] the name of the MARC xml file contained in the archive

  • [MARCDataStreamState] set this to Active (A), Inactive (I) or Deleted (D)

  • [DCFileName] the name of the Dublin Core file contained in the archive

  • [DCDataStreamState] set this to Active (A), Inactive (I) or Deleted (D)

  • [URLforNonXmlDataStreams]

    If non xml datastreams exist in the archive (PDF or full text), a URL is required to access them. If you use the Python simple server, use http://localhost:8000. If you are using an existing server, enter the URL path on that server where the non xml datastreams can be made available during ingest (see 7 Making the Datastreams Available via a Web Server).

    If no pdf files or fulltext datastreams exist, set this to FALSE.

Example:

python archive_to_foxml.py adt/python/dspaceArchive 0 vital exportedItems Imported_Items A marc.xml A dublin_core.xml A http://servername/directoryname
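The way [startNum], [PIDPrefix] and [URLforNonXmlDataStreams] could combine is sketched below. This illustrates the idea only and is not the behaviour of archive_to_foxml.py: the function names are hypothetical, the prefix:number PID form follows the usual Fedora convention, and the URL layout (base URL, then item directory, then file name) is an assumption.

```python
from urllib.parse import quote


def assign_pids(item_dirs, start_num, pid_prefix):
    """Pair each item directory with a PID of the form prefix:number.

    Assumption: items are numbered in sorted directory order, starting
    at start_num and increasing by one.
    """
    return [
        ("%s:%d" % (pid_prefix, start_num + offset), item)
        for offset, item in enumerate(sorted(item_dirs))
    ]


def datastream_url(base_url, item_dir, file_name):
    """Build the URL from which a non xml datastream would be fetched.

    Assumption: files are served under base_url/item_dir/file_name.
    """
    return "/".join([base_url.rstrip("/"), quote(item_dir), quote(file_name)])
```

With a start number of 0 and a prefix of vital, the item directories 00 and 01 would be paired with vital:0 and vital:1 under these assumptions.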

6.6 Insert name space into foxml objects

This script ensures that the correct marc namespace appears within the foxml object. The namespace is inserted at this point because doing so earlier in the process would change the formatting, and the marc viewed within the VITAL repository would then not validate correctly.

Script structure:

  • python insert_xmlns_xsi.py [outputDirectory]

Argument Definitions:

  • [outputDirectory] the name of the directory for storing the foxml objects

Example:

python insert_xmlns_xsi.py exportedItems 
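The steps in sections 6.1 to 6.6 can be chained from a small driver. The sketch below simply replays the example commands from this report in order and stops at the first failure. It assumes the commands are run from the top level of the migration tool kit directory, and the function names are hypothetical.

```python
import subprocess


def build_commands(archive="adt/python/dspaceArchive",
                   output_dir="exportedItems",
                   url="http://servername/directoryname"):
    """Return the example commands from sections 6.1 to 6.6, in order."""
    return [
        ["python", "xsl_transform.py", "dc_temp.xml",
         "adt/xsl/adt_html_to_marc.xsl", "marc.xml", archive, "True"],
        ["python", "create_marc_controlfield_tag_008.py", archive],
        ["python", "xsl_transform.py", "marc.xml",
         "xsl/marc_dc.xsl", "dublin_core.xml", archive, "False"],
        ["python", "pdf_to_full_text.py", archive],
        ["python", "archive_to_foxml.py", archive, "0", "vital",
         output_dir, "Imported_Items", "A", "marc.xml", "A",
         "dublin_core.xml", "A", url],
        ["python", "insert_xmlns_xsi.py", output_dir],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in turn.

    With the default runner, check=True makes a non-zero exit status
    raise subprocess.CalledProcessError, so later steps are skipped.
    """
    for command in commands:
        runner(command, check=True)
```

Passing a different runner makes it possible to print or log the commands without executing them.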

7 Making the Datastreams Available via a Web Server

By design, the Fedora ingest applications expect to retrieve non XML datastreams (PDF and Full Text items in the archive) via HTTP from a web server.

XML datastreams can be incorporated into the XML that represents the FOXML object. It is anticipated that in a future release of the Fedora software it will be possible to encode binary data, such as PDF files, and include them in the FOXML object.

There are two mechanisms available to make the datastreams available via a web server. The first is to use the basic HTTP server that comes with Python. The other is to use an existing web server.

7.1 Using the Python Web Server

To use the base HTTP server that comes with Python follow these steps:

  1. To prepare the data for ingest, copy only the contents of the [outputDirectory] into the [archiveName].

    • [Figure: example of the file structure of the [archiveName]]

  2. Copy the [archiveName] onto the server that is running VITAL

  3. Change to the directory one level above the [archiveName] directory, i.e. /opt/vtls/vital/applications/fedora/client/bin

  4. Read the following note before executing the command below.

Note: Without the & after the command, as shown below, the user will lose the command prompt when the server starts and be forced to open another console window. If using the &, simply press return twice after issuing the command to pass the server process into the background and regain control of the command prompt.

python -c "import SimpleHTTPServer;SimpleHTTPServer.test()" &

  • This command will invoke Python and start the SimpleHTTPServer. The server will be able to provide access to files from the current directory, and any sub directories. The port used by the web server is 8000.

  • Please note that the SimpleHTTPServer will not be able to service requests other than those from the local machine, and therefore this process will only work when the datastreams are on the same server as the VITAL repository.
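The SimpleHTTPServer module belongs to Python 2; in Python 3 the same handler lives in the standard http.server module. For readers on a newer Python, the sketch below serves a directory on the local machine only, running the server in a background thread instead of using the & shell operator. It assumes Python 3.7 or later, and the function name is hypothetical.

```python
import functools
import http.server
import threading


def start_local_server(directory, port=8000):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread.

    Returns the server object; call shutdown() and server_close() on it
    to stop serving.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Binding to 127.0.0.1 restricts access to the local machine.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Because the thread is a daemon, the server stops when the calling program exits, so a long-running ingest needs the calling program to stay alive until the ingest has finished.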

7.2 Using an Existing Web Server

  1. Copy the [archiveName] directory into a new directory on an existing server. Ensure the directory name and structure are preserved, including the names of each file.

  2. Copy the [outputDirectory] onto the server that is running VITAL

8 Ingesting the Items into VITAL

To ingest the items into VITAL complete the following procedure:

  1. Ensure that the datastream files are downloadable, either via the Python SimpleHTTPServer or via an existing web server

  2. Ensure that the FOXML object files are accessible via the dbadmin user

  3. Change to the dbadmin user

  4. Navigate to the following directory on the server

    /opt/vtls/vital/applications/fedora/client/bin
  5. Ensure the FEDORA_HOME and JAVA_HOME shell variables exist. If they do not exist, sample commands are outlined below

    export FEDORA_HOME=/opt/vtls/vital/applications/fedora
    export JAVA_HOME=/opt/vtls/java
  6. Invoke the fedora-ingest command; this may take some time to complete

    Script structure:
    ./fedora-ingest.sh d [foxmlObjectLocation] foxml1.0 O localhost:8080 fedoraAdmin [password]
    Argument Definitions:
    [foxmlObjectLocation] the name of the directory for storing the foxml objects
    * For a simple server use the [archiveName] see 7.1 Using the Python Web Server
    * For an existing webserver use [outputDirectory] see: 7.2 Using an Existing Web Server
    [password] fedoraAdmin password
    Example:
    ./fedora-ingest.sh d dspaceArchive foxml1.0 O localhost:8080 fedoraAdmin fedoraAdminpassword
  7. Further information on the Fedora ingest utilities is available at the following URL:

    http://www.fedora.info/download/2.1.1/userdocs/client/cmd-line/index.html#ingest

  8. Once the ingest is complete, check the XML log file, as specified by the output of the program, for any errors

  9. If the new objects are to be made available via the VITAL portal, ensure sufficient time has elapsed to allow the VITAL indexer to become aware of the additional objects
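Steps 5 and 6 of the procedure above can be checked from Python before the ingest is started. The sketch below reports which of the two shell variables are unset and assembles the fedora-ingest.sh argument list in the order given in the script structure. The function names are hypothetical, and nothing here runs the ingest itself.

```python
import os

REQUIRED_VARIABLES = ("FEDORA_HOME", "JAVA_HOME")


def missing_variables(environ=None):
    """Return the required shell variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


def build_ingest_command(foxml_dir, password,
                         host="localhost:8080", user="fedoraAdmin"):
    """Return the fedora-ingest.sh argument list for a directory ingest."""
    return ["./fedora-ingest.sh", "d", foxml_dir, "foxml1.0", "O",
            host, user, password]
```

The resulting list could be passed to subprocess.run with cwd set to the fedora client bin directory once missing_variables() returns an empty list.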