Voyager XML data to Vital3.1.1
- 1 About this document
- 2 Background Information
- 3 The Python Scripts
- 4 Downloading the Python Scripts
- 5 Preparing the Voyager data for ingest
- 5.1 Creating the marcxml
- 5.2 Extracting the XML items from the larger XML file
- 5.3 Clean temp.xml
- 5.4 Obtain pdf files for archive
- 5.5 Convert temp.xml into marc.xml
- 5.6 Create a full text version of the pdf files
- 5.7 Remove 856 marc tags from marc.xml (optional)
- 5.8 Convert split record (marc.xml) to Dublin Core stream
- 6 Create foxml objects
- 7 Ingesting the Items into VITAL
- 8 Ingest procedure
1 About this document
- Author
- Bron Dye, RUBRIC Technical Officer
- Tim McCallum, RUBRIC Technical Officer
- Purpose
- This technical report outlines how to Voyager library system xml can be used for ingest as foxml objects.
- Audience
- RUBRIC Project Partners and other users of Fedora repositories
- Requirements
- xml import file
- Python 2.4 installed on the system to run the harvest
- The libxml2 library, including the Python bindings, installed on the system to run the harvest
- An instance of VITAL3.1.1 for ingest
- A working installation of “marc-edit 5.0”
- References
- Official VITAL website at VTLShttp://www.vtls.com/Products/vital.shtml
- Documentation on the FOXML (Fedora Object XML) specificationhttp://www.fedora.info/download/2.1.1/userdocs/digitalobjects/introFOXML.html
- Official Python websitehttp://www.python.org/
- Official libxml2 websitehttp://xmlsoft.org/python.html
- Official py.test tool and library websitehttp://codespeak.net/py/current/doc/test.html
- Official Subversion websitehttp://subversion.tigris.org/
- Marc-edit site .exe file
- http://oregonstate.edu/~reeset/marcedit/software/development/MarcEdit50_Setup.exe
- Notes
- The scripts have been developed on a Linux based system. Python is a cross platform programming language and therefore the scripts should also run under Microsoft Windows, and the OSX operating systems. This has not been tested.
- The installation of the Python programming language and the libxml2 library, including Python bindings is outside the scope of this technical report. Many Linux distributions, such as Ubuntu, will have these already installed.
2 Background Information
A component of the work undertaken at RUBRIC-Central is the development of various data migration strategies. These strategies are designed to assist RUBRIC Project Partners to migrate data into, and out of, various systems. The data migrations specifically target the three institutional repository solutions under consideration as part of the project.
Interest was expressed in being able to migrate exported files produced from local library systems into other repositories, such as VITAL or DSpace. This technical report, and the associated Python scripts, comprise the strategy to achieve this goal.
The Python scripts create an archive or directory, similar in structure to a DSpace Archive. Within this directory are created item directories each with a temporary xml file storing dublin core metadata, all files relevant to the xml item and a file listing all relevant files attached to the exported item.
The Python scripts have been developed using a unit testing approach using the testing framework provided by the py.test tool and library. More information about the library is available at the website listed in the references section of this technical report. These scripts have been developed using Python version 2.4, and may work with earlier versions. However this has not been tested.
The Python scripts are modular in nature and use functionality provided by modules that have been used in other migration strategies. It is anticipated that this type of architecture will allow modification and customisation as required.
3 The Python Scripts
The data migration work is carried out by a script written in the Python programming language, for more information about Python see the official Python website listed in the references section of this technical report. The following files make up the scripts used in the data migration.
- split_xml_into_archive.py
- The python script used to split a large xml file into items.
- xsl_transform.py
- A Python module that converts temporary xml file into dublin core xml or marc xml files.
- clean_marc_metadata.py
- A Python script that checks if date and title metadata in the input file have extra trailing characters. It removes these from the temp.xml.
- dspace_archive.py
- A Python module that provides a utility class for the creation of dspace archive objects.
- archive_to_foxml_3_1.py
- A Python module that provides creates a foxml object by adding a prefix to the title, merges all datastreams and exports the foxml objects into a single output directory.
- obtain_files_for_archive.py
- A Python module that loops through the archive, reading the temp.xml file. When a link to an external file is found the script fetches the pdf and loads it into the archive.
- pdf_to_full_text
- A python script that converts pdf files found in the archive into full text files
- wrap_namespace_around_file.py
- This python script is used to convert the bare split record to marc.xml.
- remove_xml_node.py
- A python script that removes an entire xml node and its contents.
4 Downloading the Python Scripts
All of the data migration scripts, and associated code libraries, modules and files, are made available via a publicly accessibly website at the following URL.
https://rubric-central.usq.edu.au/svn/Public/migration_toolkit
5 Preparing the Voyager data for ingest
The following sections of this technical report outline the procedure converting the Voyager output file to marcxml and then using the Python scripts to extract the xml objects from a larger xml file.
It is assumed that you have Python and the Libxml2 library, including the Python bindings, already installed.
5.1 Creating the marcxml
Download and open marc edit.
Select marc maker
Set the input file as the Voyager file
Set the output file
Click on marc to marc21 in the marc functions
Click execute to create the MarcXML
Open MarcXML in a text editor.
Remove the marc namespaces from it, Replace <marc: with < and replace </marc: with </
5.2 Extracting the XML items from the larger XML file
To split the large xml file into individual items, the Python script split_xml_into_archive.py is used. This script creates a temporary DSpace archive, in the current directory. Replace the parameters as follows:
Script structure:
- python split_xml_into_archive.py [dataFileName] [archiveName] [xpath] [streamName][templateFile]
Argument definitions:
[dataFileName] filename of the large input xml file created using marcedit
[archiveName] name of the Archive that will be created
[xpath] match the node for each item in the larger xml file (eg //record) If there is a namespace for example marc:record use the following
“//*[local-name()='record']”
[streamName] filename of the output xml file that will be created
[templateFile] (opt) the filepath to a wrapping file if a namespace is required. If there is NO namespace requirement, use “False” If unsure set this to “False”
Example:
python split_xml_into_archive.py temp_data/ethesis.xml voyagerArchive //record temp.xml False
This script will create the following simple Dspace archive, where each item is represented by one item directory. These files are numbered consecutively; first directory is called 0, 1 and so on.
dspaceArchive/
0/
contents
temp.xml
1/
contents
temp.xml
5.3 Clean temp.xml
The clean_marc_metadata.py script is used to check if date and title metadata have extra trailing characters. For example: A date field may have a trailing “.” or “]”. A title field may have a trailing “:”. The script removes the extra characters and updates the temp.xml.
Script structure:
- python clean_marc_metadata.py [archiveName][tempFileName][recordType]
Argument definitions:
Replace the following parameters:
[archiveName] name of the archive (eg voyagerArchive)
[tempFileName] filename of the temp.xml file that will be cleaned
[recordType] the type of record Ethesis(use E) or Brunner (use B) Library Publications (use L) Working Papers (use W)
Example:
python clean_marc_metadata.py voyagerArchive temp.xml E
5.4 Obtain pdf files for archive
If the data contains pdfs to be harvested then carry out the following instructions otherwise proceed to the next step
The obtain_files_for_archive.py script is a Python script that searches the temp.xml files looking for URL links to external pdf files. Once found the fiiles are then downloaded by the script and put into the archive. Use optional protocol, username and password arguments when files to be harvested require authentication.
- Script structure:
- python obtain_files_for_archive.py [archiveName][targetFileName][fileType][protocol][username][password]
- Argument definitions
[archiveName] name of the archive to be accessed
[targetFileName] name of the file in each archive to be accessed
[fileType] file extension of external file to be accessed and downloaded
[protocol] protocol used to access file(optional)
[username] username required to access file(optional)
[password] password required to access file(optional)
python obtain_files_for_archive.py voyagerArchive temp.xml pdf http:// username password
5.5 Convert temp.xml into marc.xml
The wrap_namespace_around_file.py script will be used to convert the bare split record to marc.xml.
Script structure:
- python wrap_namespace_around_file.py [fileName][wrapperFile][archiveName] [newFile]
Argument definitions:
[fileName] name of the file inside the archive that needs wrapping
[wrapperFile] the full path to the wrapper file (eg wrapper_file_for_marc_file)
[archiveName] name of the archive (eg voyagerArchive)
[newFile] the name of the output file ie: marc.xml
Example:
python wrap_namespace_around_file.py temp.xml wrapper_file_for_marc_file voyagerArchive marc.xml
5.6 Create a full text version of the pdf files
The pdf_to_full_text.py script is the Python script that will converts the harvested pdf files to fulltext. Replace [ArchiveName] with the name of the archive to be accessed
Script structure:
- python pdf_to_full_text.py [ArchiveName]
Argument definitions:
[ArchiveName] name of the archive (eg voyagerArchive)
Example:
python pdf_to_full_text.py [ArchiveName]
eg python pdf_to_full_text.py voyagerArchive
Once this script has run, an additional file will be found in the item directory called fulltext this will contain the fulltext of ALL the pdf files belonging to the item.
dspaceArchive/
0/
01front.pdf
02whole.pdf
contents
dc.xml
marc.xml
fulltext
5.7 Remove 856 marc tags from marc.xml (optional)
The remove_xml_node.py script is the Python script that removes unwanted nodes, in this case from the marc.xml.
Script structure:
python remove_xml_node.py [archiveName] [fileName][xpath]
Argument definitions:
[archiveName] name of the archive to be accessed
[fileName] name of the file from which the node is removed, in this case marc.xml
[xpath] the xpath to locate the unwanted node
Example:
python remove_xml_node.py dspaceArchive marc.xml “//*[local-name()='datafield'][@tag='856']”
5.8 Convert split record (marc.xml) to Dublin Core stream
The xsl_transform.py script is the Python script that converts the marc metadata into a dublin core stream ready for ingest into VITAL.
Script structure:
- python xsl_transform.py [InputFile][XslFilePath][OutputFile][ArchiveName] [RemoveInputFile]
Argument definitions:
[InputFile] filename of the input xml file to be converted.
[XslFilePath] file path to the stylesheet used for the conversion
[OutputFile] filename created by conversion – dublin_core.xml in this case
[ArchiveName] name of the archive to be accessed
[RemoveInputFile] Remove the input file? - set to False
Example:
python xsl_transform.py marc.xml
xsl/marc_dc_macquarie.xsl dublin_core.xml voyagerArchive False
6 Create foxml objects
The archive_to_foxml_3_1.py script is the Python script that will create a directory of foxml objects. We have included the mods datastream in this script. A mods.xml datastream can be created and added to the VoyagerArchive at an earlier stage, if desired.
Script structure:
python archive_to_foxml_3_1.py [archiveName] [startNum][PIDPrefix]
[outputDirectory][labelPrefix][foxmlObjectState][MARCFileName]
[MARCDataStreamState][DCFileName] [DCDataStreamState][MODSFileName][MODSObjectState] [URLforNonXmlDataStreams]
Argument definitions:
[archiveName] full path to name of the archive to be accessed
[startNum] starting number for the PID increment
[PIDPrefix] name of the PID
[outputDirectory] name of the directory for storing the foxml objects
[labelPrefix] Prefix to be added to title for reference. Eg Imported Item:
[foxmlObjectState] set this to Active (A), Inactive(I) or Deleted (D).
[MARCFileName] name of the MARC xml file contained in the Archive.
[MARCDataStreamState] set this to Active (A), Inactive(I) or Deleted (D).
[DCFileName] name of the Dublin Core file contained in the Archive
[DCDataStreamState] set this to Active (A), Inactive(I) or Deleted (D).
MODSFileName] name of the MODS file contained in the Archive
[MODSObjectState] set this to Active (A), Inactive(I) or Deleted (D).
[URLforNonXmlDataStreams]
If non xml data streams exist in the archive (PDF or full text), a URL is required to access them. If you use python simple server use http://localhost:8000 or if you are using an existing server enter a URL path to the existing server where the non xml data streams can be made available during ingest.
If no pdf files or fulltext datastreams exist, set this toFALSE
Note: Remove any trailing slashes (created by tab completion) from arguments before executing script
Example:
python archive_to_foxml_3_1.py voyagerArchive 0 vital
exportedItems Imported_Items A marc.xml A dublin_core.xml A mods.xml I
http://servername/directoryname
6.1 Insert name space in to foxml objects
This script ensures that the correct marc namespace appears within the foxml object. Insertion is completed at this point because to do so earlier in the process would change formatting and mean that the marc viewed within VITAL repository was not validating correctly.
Script structure:
- python insert_xmlns_xsi.py [outputDirectory]
Argument definitions:
[outputDirectory] the name of the directory for storing the foxml objects
Example:
python insert_xmlns_xsi.py voyagerItems
7 Ingesting the Items into VITAL
7.1 Making the Datastreams Available via a Web Server
By design the Fedora ingest applications expect to retrieve non XML datastreams via HTTP from a web server. XML datastreams can be incorporated into the XML that represents the FOXML object. It is anticipated that in a future release of the Fedora software it will be possible to encode binary data, such as PDF files, and include them in the FOXML object.
There are two mechanisms available to make the datastreams available via a web server. The first is to the use the basic HTTP server that comes with Python. The other is to use an existing web server.
7.1.1 Using the Python Web Server
To use the simple HTTP server that comes with Python follow these steps:
Log into the machine that VITAL is running on
Start a new terminal session (or place the & after the command to make the server run in the background)
Copy the contents of [archiveName] directory (created in archive to foxml procedure above) into the [outputDirectory] (named in archive to foxml procedure above) If these directories are currently on another machine carrying out the harvest, ie development machine then carry out this step on the other machine and secure copy the [outputDirectory] over in one step. A good place copy it to is the /usr/vtls/vital/fedora/client/bin directory on the machine running VITAL
Change to the directory one level above the [outputDirectory] (/usr/vtls/vital/fedora/client/bin)
Execute the following command
python -c "import SimpleHTTPServer;SimpleHTTPServer.test()"&
This command will invoke Python and start the SimpleHTTPServer. The server will be able to provide access to files from the current directory, and any sub directories. The port used by the web server is 8000.
Please note that the SimpleHTTPServer will not be able to service requests other than those from the local machine, and therefore this process will only work when the datastreams are on the same server as the VITAL repository.
7.1.2 Using an Existing Web Server
Using an existing web server is possible. The archive_to_foxml_3_1.py script contains the http://localhost:8000 default server, simply change this path to the location on the server where you will be uploading the datastream files. The Python script will append the directory and file names to this path.
When uploading the datastream files onto the server ensure the directory structure is preserved, including the names of each file.
8 Ingest procedure
To ingest the items into VITAL complete the following procedure:
Copy the [outputDirectory] onto the server that is running VITAL
Ensure that the contents of the [outputDirectory] are accessible via the dbadmin user
Change to the dbadmin user
Navigate to the following directory on the server
/opt/vtls/vital/applications/fedora/client/bin
Ensure the FEDORA_HOME and JAVA_HOME shell variables exist. If they do not exist, sample commands are outlined below
export FEDORA_HOME=/opt/vtls/vital/applications/fedora
export JAVA_HOME=/opt/vtls/java
Invoke the following command to start the fedora-ingest utility, where [outputDirectory]is the location of the FOXML files and [password] is the fedoraAdmin password
Note: This may take some time to complete
./fedora-ingest.sh d outputDirectory foxml1.0 O localhost:8080 fedoraAdmin password http “”
Further information on the Fedora ingest utilities is available at the following URL:
http://www.fedora.info/download/2.1.1/userdocs/client/cmd-line/index.html#ingest
Once the ingest is complete, check the XML log file, as specified by the output of the program, for any errors
If the new objects are to made available via the VITAL portal, ensure sufficient time has elapsed to allow the VITAL indexer to become aware of the additional objects




