.. role:: gocode(code)
:language: go
PMH server framework
====================
The “pmh” package provides a framework for building an OAI-PMH_ server. It
lets others harvest arbitrary datasets in arbitrary metadata formats.
.. _OAI-PMH: http://www.openarchives.org/OAI/openarchivesprotocol.html
The basic idea is to store the datasets in a MongoDB. Then, you need to write
a Go function that converts a MongoDB document into a Go struct ready to be
serialised into Dublin Core XML. For each additional metadata format that your
PMH server should support, you write an additional function.
The server listens on port 8000.
You may configure it with the following environment variable:
PMH_LOG_PATH
Path to the directory of the log file. The log file is called “pmh.log”. If
this environment variable is empty or not set, logging goes to stderr.
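The behaviour described for ``PMH_LOG_PATH`` can be pictured with the following minimal sketch. It is not the package's actual code: the helper names ``logFilePath`` and ``openLog`` are hypothetical, and only the variable name, the file name “pmh.log”, and the stderr fallback are taken from the description above.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// logFilePath returns the path of the log file for a given value of
// PMH_LOG_PATH, or "" if logging should go to stderr.
// (Hypothetical helper, not part of the pmh package.)
func logFilePath(logDir string) string {
	if logDir == "" {
		return ""
	}
	return filepath.Join(logDir, "pmh.log")
}

// openLog returns the writer that log output would be sent to.
func openLog() (io.Writer, error) {
	path := logFilePath(os.Getenv("PMH_LOG_PATH"))
	if path == "" {
		return os.Stderr, nil
	}
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	w, err := openLog()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot open log file:", err)
		os.Exit(1)
	}
	fmt.Fprintln(w, "pmh server starting")
}
```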
Example
-------
A complete example is the `GEPRIS PMH server`_, in particular, in the directory
`cmd/pmh`_.
.. _GEPRIS PMH server: https://jugit.fz-juelich.de/fdm/gepris-crawler
.. _cmd/pmh: https://jugit.fz-juelich.de/fdm/gepris-crawler/-/tree/master/cmd/pmh
Let’s take a tour through the files, starting with `server.go`_. It contains
the ``main`` function, which does only one thing: it calls ``pmh.Serve`` and
passes the `options`_ (see below).
The file `common.go`_ contains only the type ``grantDocument``, which is used in
all other files that define metadata formats (in this example,
``dublin_core.go`` and ``marcxml.go``). ``grantDocument`` defines the fields
that each MongoDB document has. Note that because Go only exports identifiers
that start with an upper-case letter, the first letter of each field name is
capitalised. Each MongoDB document *must* have the fields ``identifier`` and
``lastModification``. However, you need to include only those fields here that
you actually need for the conversion to XML.
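As an illustration, a document type of this kind might look like the sketch below. The type name ``grantDocumentSketch`` and the ``Title`` field are assumptions for illustration, and the real ``grantDocument`` may be declared differently. The ``bson`` struct tags are the usual way to tell the MongoDB Go driver which document key a field maps to; the helper ``bsonKeys`` merely prints them and is not part of the pmh package.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// grantDocumentSketch is a hypothetical stand-in for the type in common.go.
// Field names are capitalised so that decoders can set them; the tags name
// the keys used in the MongoDB document.
type grantDocumentSketch struct {
	Identifier       string    `bson:"identifier"`
	LastModification time.Time `bson:"lastModification"`
	Title            string    `bson:"title"` // assumed field, for illustration only
}

// bsonKeys lists the MongoDB key that each field of a struct value maps to.
func bsonKeys(v any) []string {
	t := reflect.TypeOf(v)
	keys := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		keys = append(keys, t.Field(i).Tag.Get("bson"))
	}
	return keys
}

func main() {
	fmt.Println(bsonKeys(grantDocumentSketch{}))
}
```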
The file `dublin_core.go`_ does the conversion to Dublin Core XML. The three
type definitions ``grantEntryDC``, ``metadataDC``, and ``recordDC`` represent
the nested structure of the resulting XML: ``<record>`` contains
``<metadata>``, which contains ``<dc>``. Each of these three XML elements
also contains many simple string elements or XML attributes, which are
represented by the other struct fields. See the `Go XML package`_ for further
information, in particular about the meaning of the ``xml:"…"`` tags.
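The nesting can be sketched with three small struct types. This is a simplified illustration, not the content of ``dublin_core.go``: the type names, the ``header>…`` paths, and the two Dublin Core fields are assumptions, and namespaces are left out for brevity.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// dcSketch stands for the innermost <dc> element (hypothetical fields).
type dcSketch struct {
	Title   string `xml:"title"`
	Creator string `xml:"creator"`
}

// metadataSketch stands for <metadata>, which wraps <dc>.
type metadataSketch struct {
	DC dcSketch `xml:"dc"`
}

// recordSketch stands for <record>. The "a>b" tag syntax of encoding/xml
// nests an element in a parent without needing a separate struct.
type recordSketch struct {
	XMLName    xml.Name       `xml:"record"`
	Identifier string         `xml:"header>identifier"`
	Datestamp  string         `xml:"header>datestamp"`
	Metadata   metadataSketch `xml:"metadata"`
}

// exampleRecord returns a record filled with made-up values.
func exampleRecord() recordSketch {
	return recordSketch{
		Identifier: "https://example.org/grant/1",
		Datestamp:  "2024-01-01T00:00:00Z",
		Metadata: metadataSketch{
			DC: dcSketch{Title: "Example grant", Creator: "Jane Doe"},
		},
	}
}

// marshalRecord serialises a record into compact XML.
func marshalRecord(r recordSketch) (string, error) {
	out, err := xml.Marshal(r)
	return string(out), err
}

func main() {
	out, err := marshalRecord(exampleRecord())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Adjacent fields that name the same parent (here ``header``) are enclosed in a single parent element by ``encoding/xml``.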
The file `marcxml.go`_ does the conversion to MARCXML. The five type
definitions ``grantEntry``, ``metadata``, ``record``, ``datafield``, and
``subfield`` represent the nested structure of the resulting XML: ``<record>``
contains ``<metadata>``, which contains ``<record>`` (but this time in the MARC
namespace), which contains ``<datafield>``, which contains ``<subfield>``.
Again, each of these five XML elements also contains many simple string
elements or XML attributes, which are represented by the other struct fields.
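The deeper MARCXML nesting, including attributes and character data, might be modelled as in the following sketch. The type names and the single title field (MARC tag 245, subfield ``a``) are illustrative assumptions and not the actual definitions in ``marcxml.go``. The namespace is set through a plain ``xmlns`` attribute here, which is only one of several ways to do it.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// marcSubfield is a <subfield> with a "code" attribute and text content.
type marcSubfield struct {
	Code  string `xml:"code,attr"`
	Value string `xml:",chardata"`
}

// marcDatafield is a <datafield> with its tag, indicators and subfields.
type marcDatafield struct {
	Tag       string         `xml:"tag,attr"`
	Ind1      string         `xml:"ind1,attr"`
	Ind2      string         `xml:"ind2,attr"`
	Subfields []marcSubfield `xml:"subfield"`
}

// marcRecord is the inner <record> in the MARC namespace.
type marcRecord struct {
	XMLNS      string          `xml:"xmlns,attr"`
	Datafields []marcDatafield `xml:"datafield"`
}

// marcMetadata is <metadata>, wrapping the MARC record.
type marcMetadata struct {
	Record marcRecord `xml:"record"`
}

// marcEntry is the outer OAI-PMH <record>.
type marcEntry struct {
	XMLName  xml.Name     `xml:"record"`
	Metadata marcMetadata `xml:"metadata"`
}

// exampleMARC builds an entry with one title field and marshals it.
func exampleMARC() (string, error) {
	entry := marcEntry{Metadata: marcMetadata{Record: marcRecord{
		XMLNS: "http://www.loc.gov/MARC21/slim",
		Datafields: []marcDatafield{{
			Tag: "245", Ind1: "0", Ind2: "0",
			Subfields: []marcSubfield{{Code: "a", Value: "Example grant"}},
		}},
	}}}
	out, err := xml.Marshal(entry)
	return string(out), err
}

func main() {
	out, err := exampleMARC()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```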
.. _server.go: https://jugit.fz-juelich.de/fdm/gepris-crawler/-/tree/master/cmd/pmh/server.go
.. _common.go: https://jugit.fz-juelich.de/fdm/gepris-crawler/-/tree/master/cmd/pmh/common.go
.. _dublin_core.go: https://jugit.fz-juelich.de/fdm/gepris-crawler/-/tree/master/cmd/pmh/dublin_core.go
.. _marcxml.go: https://jugit.fz-juelich.de/fdm/gepris-crawler/-/tree/master/cmd/pmh/marcxml.go
.. _Go XML package: https://pkg.go.dev/encoding/xml
Options
-------
``MongoOpts`` (required)
Options related to the MongoDB itself, see `MongoDB options`_ below.
``Name`` (required)
Human-readable name for the PMH repository, which will be exposed to
harvesters.
``URL`` (required)
The base URL of the repository.
``AdminEmail`` (required)
The email address of the admin of the PMH server.
``Filter``
Additional filters for datasets to be offered for harvesting. In the GEPRIS
example above, this is used to limit harvesting to projects after a
certain year (or with no year set at all). See `Specify a Query`_ in the
MongoDB Go driver documentation for how to build such filters.
``MetadataFormats`` (required)
Slice with all metadata formats that the server supports. Dublin Core is
mandatory, and its ``Prefix``, ``Schema``, and ``Namespace`` must be exactly
as given in the GEPRIS example. ``BuildFunc`` is the function that converts
a MongoDB document to an XML entry in that metadata format, see `Build
functions`_ below. All four fields are required.
.. _Specify a query: https://www.mongodb.com/docs/drivers/go/current/fundamentals/crud/read-operations/query-document/
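To show how the options fit together, here is a self-contained sketch. The types ``Options``, ``MongoOpts``, and ``MetadataFormat`` are local stand-ins that only mirror the field names documented here; their exact types in the pmh package may differ. The helper ``missingFields``, the database and collection names, and the ``year`` field are hypothetical. The Dublin Core prefix, schema, and namespace are the values defined by the OAI-PMH specification. In real code, the filter would typically be built with the driver's ``bson`` types rather than plain maps.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented option fields (types assumed).
type MongoOpts struct {
	Host, DBName, CollName string
	AddIndices             []string
}

type MetadataFormat struct {
	Prefix, Schema, Namespace string
	BuildFunc                 func(decodeFunc func(any) error) (any, error)
}

type Options struct {
	MongoOpts             MongoOpts
	Name, URL, AdminEmail string
	Filter                any
	MetadataFormats       []MetadataFormat
}

// buildDC is a placeholder build function for this sketch.
func buildDC(decodeFunc func(any) error) (any, error) { return nil, nil }

// exampleOptions returns a filled-in option value with made-up names.
func exampleOptions() Options {
	return Options{
		MongoOpts: MongoOpts{
			Host:       "mongodb://localhost",
			DBName:     "grants",
			CollName:   "projects",
			AddIndices: []string{"year"}, // index for the filtered field
		},
		Name:       "Example repository",
		URL:        "https://example.org/oai",
		AdminEmail: "admin@example.org",
		// Projects after 2014, or with no year set at all.
		Filter: map[string]any{"$or": []any{
			map[string]any{"year": map[string]any{"$gt": 2014}},
			map[string]any{"year": nil},
		}},
		MetadataFormats: []MetadataFormat{{
			Prefix:    "oai_dc",
			Schema:    "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
			Namespace: "http://www.openarchives.org/OAI/2.0/oai_dc/",
			BuildFunc: buildDC,
		}},
	}
}

// missingFields reports which required options are not set.
func missingFields(o Options) []string {
	var missing []string
	check := func(name string, ok bool) {
		if !ok {
			missing = append(missing, name)
		}
	}
	check("MongoOpts.Host", o.MongoOpts.Host != "")
	check("MongoOpts.DBName", o.MongoOpts.DBName != "")
	check("MongoOpts.CollName", o.MongoOpts.CollName != "")
	check("Name", o.Name != "")
	check("URL", o.URL != "")
	check("AdminEmail", o.AdminEmail != "")
	hasDC := false
	for i, f := range o.MetadataFormats {
		complete := f.Prefix != "" && f.Schema != "" && f.Namespace != "" && f.BuildFunc != nil
		check(fmt.Sprintf("MetadataFormats[%d]", i), complete)
		hasDC = hasDC || f.Prefix == "oai_dc"
	}
	check("MetadataFormats (Dublin Core)", hasDC)
	return missing
}

func main() {
	fmt.Println("missing:", missingFields(exampleOptions()))
}
```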
Build functions
---------------
A build function converts a MongoDB document into an XML entry. It does not do
the serialisation into XML but only the construction of a data structure that
can be serialised properly.
The rough structure of a build function is the following:
.. code-block:: go

   func BuildDC(decodeFunc func(any) error) (any, error) {
       var (
           doc   document
           entry entryDC
       )
       if err := decodeFunc(&doc); err != nil {
           return nil, err
       }
       entry.Identifier = doc.Identifier
       entry.Datestamp = doc.LastModification.In(time.UTC).
           Format(time.RFC3339)
       /*
         Set all the other fields in entry.Metadata.Record,
         according to what’s found in doc.
       */
       return entry, nil
   }
Here, ``doc`` is the source and ``entry`` is the destination. The
``Identifier`` and ``Datestamp`` fields are required by the OAI-PMH
specification. Beyond that, it is up to you to set the fields in
``entry.Metadata.Record``. You can name ``Record`` as you wish and set its XML
element name arbitrarily, but it will always end up in the XML hierarchy
as a child of ``<metadata>``. For example, in `dublin_core.go`_, ``Record``
becomes an ``<oai_dc:dc>`` element.
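To try out a build function without a running server, one can pass it a stand-in for ``decodeFunc``. The following sketch assumes simplified ``document`` and ``entryDC`` types with a single ``Title`` field; ``fakeDecode`` is a hypothetical helper that mimics the decoder the framework would pass in. None of these definitions are taken from the pmh package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

// document is a simplified source type (fields beyond the two required
// ones are assumptions).
type document struct {
	Identifier       string
	LastModification time.Time
	Title            string
}

// entryDC is a simplified destination type; Record ends up as a child
// of <metadata>.
type entryDC struct {
	XMLName    xml.Name `xml:"record"`
	Identifier string   `xml:"header>identifier"`
	Datestamp  string   `xml:"header>datestamp"`
	Metadata   struct {
		Record struct {
			Title string `xml:"title"`
		} `xml:"dc"`
	} `xml:"metadata"`
}

// BuildDC follows the rough structure of a build function.
func BuildDC(decodeFunc func(any) error) (any, error) {
	var (
		doc   document
		entry entryDC
	)
	if err := decodeFunc(&doc); err != nil {
		return nil, err
	}
	entry.Identifier = doc.Identifier
	entry.Datestamp = doc.LastModification.In(time.UTC).Format(time.RFC3339)
	entry.Metadata.Record.Title = doc.Title
	return entry, nil
}

// fakeDecode fills the target with fixed test data instead of decoding
// a real MongoDB document.
func fakeDecode(v any) error {
	doc, ok := v.(*document)
	if !ok {
		return fmt.Errorf("unexpected target type %T", v)
	}
	*doc = document{
		Identifier:       "https://example.org/grant/1",
		LastModification: time.Date(2024, 1, 2, 3, 4, 5, 0, time.FixedZone("CET", 3600)),
		Title:            "Example grant",
	}
	return nil
}

func main() {
	entry, err := BuildDC(fakeDecode)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, err := xml.MarshalIndent(entry, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The timestamp is converted to UTC before formatting, so a modification time of 03:04:05 in a zone one hour ahead of UTC becomes ``2024-01-02T02:04:05Z``.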
MongoDB options
---------------
``Host`` (required)
URI to the MongoDB server, e.g. “``mongodb://localhost``”
``DBName`` (required)
Name of the database on the MongoDB server to use.
``CollName`` (required)
Name of the collection in the database ``DBName`` that contains the datasets
that are supposed to be offered to harvesters. Each of these documents must
contain a unique ``identifier``, which needs to be a URI, and a
``lastModification`` field, which contains the timestamp of the last
modification of the respective document.
``AddIndices``
Field names that should get an index, in addition to the standard indices.
The standard indices are for “identifier” and “lastModification”. In
particular, if you add filters in the ``Filter`` field of the options above,
you should consider indices for the fields used there.
Filling the MongoDB
-------------------
So far, we’ve only talked about serving the datasets to others. But the data
must get into the MongoDB first. You are free to do that however you like,
as long as each dataset (“document” as MongoDB calls it) has an
“identifier” (a URI string) and a “lastModification” (a timestamp) field. The
identifiers must be unique.
That said, the pmh package helps you a little by providing an
initialisation function that sets things up (collections and indices) and
returns a client (which you should disconnect after use) and the working
collection:
.. code-block:: go
import "jugit.fz-juelich.de/fdm/pmh-server/mongo"
client, collection, err := mongo.InitializeMongo(
context.Background(), mongoOpts)
Here, ``mongoOpts`` is the value passed for the field ``MongoOpts`` in the
option value passed to the ``pmh.Serve`` function. You can copy it from there,
or store it as a variable in a package shared between PMH server and the
database-filling code. See `server.go`_ for an example of the latter.
Contact
-------
The author and maintainer of this PMH package is `Torsten Bronger`_.
.. _Torsten Bronger: mailto:[email protected]