ABCD - XML Primer

Abstract

This primer is intended to provide an easily readable background to ABCD and should take anyone with no knowledge of the standard to the very point where they would be able to understand the principles and the more detailed technical specification. Examples are given which are complemented by references to the normative texts.

Purpose

ABCD - Access to Biological Collection Data - Schema is a common data specification for biological collection units, including living and preserved specimens, along with field observations that did not produce voucher specimens. It is intended to support the exchange and integration of detailed primary collection and observation data.

All of the world's biological collections contain a number of data items including specimen specific (e.g. taxon, altitude, sex) and collection specific (e.g. holding institution) elements. The set of elements used varies from collection to collection. ABCD provides a standard set of element names and their definition for scientists and curators to use. It is not expected (or even possible) for any collection to use more than a fraction of the elements defined in the standard.

A design goal was to be both comprehensive and general, to include a broad array of concepts that might be available in a collection database, but to mandate only the bare minimum of elements required to make the specification functional. ABCD deliberately does not cover taxonomic data, such as synonymy, other than the use of names in identifications. Likewise, taxon-related information, such as distribution range, indicator values, etc., is also not included. The elements and concepts that are used provide as much compatibility as is possible with other standards in the field of biological collection data, such as HISPID, Darwin Core, and others. ABCD version 2 is a TDWG standard, which has been ratified by the annual TDWG meeting in September 2005. This standard is promoted by GBIF for use globally. The technical data specification is cast as an XML schema.

Top-Level Structure

The ABCD schema is highly structured in order to manage the large quantity of data that a record may contain.

The top level of the schema is arranged as follows:

<DataSets>
  <DataSet>
  GUID
  <Metadata>
    <Units> # Observations and Specimens
      <Unit>
        ...

A minimum ABCD record could look like this:

<?xml version='1.0' encoding='UTF-8'?>
<DataSets xmlns='http://www.tdwg.org/schemas/abcd/2.06'>
  <DataSet>
    <TechnicalContacts>
      <TechnicalContact>
        <Name>Gerd Müller</Name>
        <Email>gerd@dfb.de</Email>
      </TechnicalContact>
    </TechnicalContacts>
    <ContentContacts>
      <ContentContact>
        <Name>A Another</Name>
        <Email>a.another@fake.org</Email>
      </ContentContact>
    </ContentContacts>
    <Metadata>
      <Description>
        <Representation language='en'>
          <Title>PonTaurus collection</Title>
        </Representation>
      </Description>
      <RevisionData>
          <DateModified>2001-03-01T00:00:00</DateModified>
      </RevisionData>
    </Metadata>
    <Units>
      <Unit>
        <SourceInstitutionID>BGBM</SourceInstitutionID>
        <SourceID>PonTaurus</SourceID>
        <UnitID>1136</UnitID>
      </Unit>
    </Units>
  </DataSet>
</DataSets>

From this it can be seen that an XML document based on ABCD may contain records from several datasets, each of which is treated separately. Each dataset has a Globally Unique Identifier (GUID) along with information about who may be contacted for further details, for the content of the dataset and for technical information.

There are then two major groups, one holding metadata about the entire dataset and the other holding the actual data records.

The Metadata section holds information about an entire dataset and has the following structure:

METADATA
  - Description
  - Icon URI
  - Scope (Geo-ecological and Taxonomic)
  - Version
  - Revision data (Creator, Contributors, Creation and Modification dates)
  - Owners
  - Intellectual Property Rights (IPR) statements

The second major section, called UNITS, holds all the records selected and exported from the original dataset, each one of which is a UNIT. This is by far the largest component of ABCD and has the following high-level structure:

UNITS
    UNIT
      Here we can distinguish several areas.
      Most of these do not show up in the actual XML hierarchy,
      because ABCD 2.06 avoids using container elements that serve only to group items together: 
      - Unit-level metadata
      - Record basis and Kind Of Unit
      - Identifications
      - Collection domain-specific data
      - Unit relationships (Associations and Assemblages) 
      - Named collections and surveys
      - Gathering event and site characteristics
      - Measurements and Facts
      - Unit extension area

Usage Recommendations

Initially, ABCD may appear somehow complex to the new user, but as the principles of its design are known, it will be found out how the data held in biological collections fits so well with the structure defined. The ABCD is highly structured in order to manage the large quantity of data that a biological collection record may contain. Some of the decisions on where to provide what information will depend closely on previous curatorial decisions made upon the management of the information itself, although usually a dataset will correspond to the information of a collection and each UNIT within the dataset will be related to a record of information from a particular specimen or observation in the biological collection.

Almost no internal referencing and (almost) no recursive structures will be found inside ABCD. This means that ABCD could be seen as a single-root document that allows processing to be easier and faster, without the inherent inconvenience of many relational structures using IDs.

ABCD was designed to be comprehensive, aiming to define the semantics of all elements to provide a unified approach for the natural history collection community, to accept detailed information (where available) and to develop a proto-ontology, as a first step towards a collection ontology.

The variable atomisation followed in ABCD, should allow the provision of data in different degrees of detail and standardisation, accepting data from a wide variety of sources and enabling data integration.

Extensible Slots included into ABCD should not be used for individualised adaptations of the schema. They are rather intended for fast community support in case of missing elements in the current version, before definite integration into a subsequent version. The ABCD extensible slots also provide for the inclusion of third-party-schemas (or parts thereof), in order to prevent duplication of developments in other communities (e.g. geographical data)

All along ABCD, there are also flexible containers included to allow freely defined and repeatable data fields according to the discipline or characteristics of data (e.g., higher taxa, measurements, morphological features, among others). These take the form of Element-element or element-attribute couples.

Besides several particular complex type elements (like PersonName or Monomial elements), it is common to find, throughout ABCD, string elements of two particular types, the string extended with a language attribute (StringL)and the string extended with a preferred attribute (StringP),and the combination of both (StringLP)in different lengths (50, 255 and unbounded). The string extended with a language attribute is used to indicate in which language is the textual information contained being provided; while the string extended with a preferred attribute is provided to indicate that the textual value contained within the element is to be preferred among others available.

In addition, for some elements, textual data could be provided even when an atomised form is impractical or imposible to provide. To allow this, you may see there is a provision for free-text data next to the atomised version. An example of this can be found in the scientific name element.

SourceInstitutionID, SourceID and UnitID are the three elements that conform the unique Unit record identifier and they correspond, respectively, to the identifier of the institution holding the original data source, the name or code of the data source unique within the institution and a unique identifier for the unit record within the data source. Therefore these are currently the only mandatory information at the Unit level. For an example, look at the minimum ABCD record shown above.

All the usual gathering information should be registered within the gathering element in the Unit element. This includes, but is not limited to, agents (collectors), dates, method, locality, site coordinates and altitude. Additional information like permits, project, depth, height, images references, aspect and notes, among others, could be provided here.

The identification related information could be registered in the Identifications section of the Unit element. Here, both, the current identification(s) and the identification history can be registered. It is worth noting that the result of the identification event would fit into the Identificatons/Identification/Result/TaxonIdentified element, where the higher taxa and the scientific name (or an informal name, when the later is not available) can be included, either as the string of the full scientific name or as a name atomised with subtypes according to the corresponding Bacterial, Botanical, Zoological or Viral Code.

Glossary

BioCASE	Biological Collection Access Service for Europe http://www.biocase.org
BioCASe	Biological Collection Access Service http://www.biocase.org
BioCISE	Biological Collection Information Service in Europe http://www.bgbm.org/BioCise/
CODATA	International Council for Science: Committee on Data for Science and Technology http://www.codata.org/
Darwin Core	A simple set of data element definitions designed to support the sharing and integration of primary biodiversity data http://darwincore.calacademy.org/
DiGIR	Distributed Generic Information Retrieval http://www.digir.net
ENHSIN	European Natural History Specimen Information Network http://www.bgbm.org/BioDivInf/projects/ENHSIN/
GBIF	Global Biodiversity Information Facility
GML	Geography Markup Language http://www.opengeospatial.org/standards/gml
GUID	Global Unique Identifier http://en.wikipedia.org/wiki/GUID
HISPID	Herbarium Information Standards and Protocols for Interchange of Data http://plantnet.rbgsyd.nsw.gov.au/HISCOM/HISPID/HISPID3/hispidright.html
HTML	HyperText Markup Language http://www.w3.org/MarkUp/
ITF	The International Transfer Format for Botanic Garden Plant Records http://ww.bgbm.org/TDWG/acc/itf2-d32.doc
ITIS	Integrated Taxonomic Information System http://www.itis.usda.gov/
ITIS-CA	ITIS Canada http://www.cbif.gc.ca/pls/itisca/
JavaScript	JavaScript Scripting language (originally called LiveScript) developed by Netscape Communications for use with the Navigator browser http://www.mozilla.org/js/
LSID	Life Science Identifier http://lsid.sourceforge.net/
OGC	Open Geospatial Consortium, Inc. http://www.opengeospatial.org/
REMIB	Red Mundial de Información sobre Biodiversidad http://www.conabio.gob.mx/remib/doctos/remib_esp.html
SDD	Structure of Descriptive Data. An TDWG, XML-based interoperability standard for descriptive data
SYNTHESYS	Synthesis of Systematic Resources http://www.synthesys.info/
Species Analyst	Research project developing standards and software tools for access to the world's natural history collection and observation databases http://speciesanalyst.net
speciesLink	Distributed Information System integrating primary data from scientific biological collections http://splink.cria.org.br
UDDI	Universal Description, Discovery and Integration http://www.uddi.org
URI	Universal Resource Identifiers http://www.w3.org/Addressing/#background
UTF	Universal Transformation Format http://www.unicode.org/
WFS	Web Feature Service http://www.opengeospatial.org/standards/wfs
WMS	Web Map Service http://www.opengeospatial.org/standards/wms
XML DTD	Document Type Definition http://www.w3.org/TR/html4/sgml/dtd.html
XML Schema	A formal definition of the mandatory and optional structure and content of XML formatted documents within its domain.

References

http://rs.tdwg.org/abcd

Appendix: Examples

A typical example for a botanical specimen collected in Turkey:

<?xml version='1.0' encoding='UTF-8'?>
<DataSets xmlns='http://www.tdwg.org/schemas/abcd/2.06'>
  <DataSet>
    <TechnicalContacts>
      <TechnicalContact>
        <Name>Gerd Müller</Name>
        <Email>gerd@dfb.de</Email>
      </TechnicalContact>
    </TechnicalContacts>
    <ContentContacts>
      <ContentContact>
        <Name>A Another</Name>
        <Email>a.another@fake.org</Email>
      </ContentContact>
    </ContentContacts>
    <Metadata>
      <Description>
        <Representation language='en'>
          <Title>PonTaurus collection</Title>
        </Representation>
      </Description>
      <RevisionData>
          <DateModified>2001-03-01T00:00:00</DateModified>
      </RevisionData>
    </Metadata>
    <Units>
      <Unit>
        <SourceInstitutionID>BGBM</SourceInstitutionID>
        <SourceID>PonTaurus</SourceID>
        <UnitID>1136</UnitID>
        <DateLastEdited>2001-03-01T00:00:00</DateLastEdited>
        <Identifications>
          <Identification>
            <Result>
              <TaxonIdentified>
                <HigherTaxa>
                  <HigherTaxon>
                    <HigherTaxonName>Plumbaginaceae</HigherTaxonName>
                    <HigherTaxonRank>familia</HigherTaxonRank>
                  </HigherTaxon>
                </HigherTaxa>
                <ScientificName>
                  <FullScientificNameString>Acantholimon lycaonicum Boiss. & Heldr.</FullScientificNameString>
                  <NameAtomised>
                    <Botanical>
                      <GenusOrMonomial>Acantholimon</GenusOrMonomial>
                      <FirstEpithet>lycaonicum</FirstEpithet>
                      <AuthorTeam>Boiss. & Heldr.</AuthorTeam>
                    </Botanical>
                  </NameAtomised>
                </ScientificName>
              </TaxonIdentified>
            </Result>
            <Identifiers>
              <Identifier>
                <PersonName>
                  <FullName>Mrs Barcode</FullName>
                </PersonName>
              </Identifier>
            </Identifiers>
          </Identification>
        </Identifications>
        <RecordBasis>PreservedSpecimen</RecordBasis>
        <Gathering>
          <DateTime>
            <ISODateTimeBegin>1999-08-01T00:00:00</ISODateTimeBegin>
          </DateTime>
          <Agents>
            <GatheringAgent>
              <Person>
                <FullName>B Another</FullName>
              </Person>
            </GatheringAgent>
          </Agents>
          <Country>
            <Name>Turkey</Name>
          </Country>
          <Altitude>
            <MeasurementOrFactAtomised>
              <LowerValue>2620</LowerValue>
              <UnitOfMeasurement>meter</UnitOfMeasurement>
            </MeasurementOrFactAtomised>
          </Altitude>
          <Aspect>
            <Ordination>N</Ordination>
          </Aspect>
        </Gathering>
      </Unit>
    </Units>
  </DataSet>
</DataSets>

Abstract
Purpose
Top-Level Structure
Usage Recommendations
Glossary
References
Appendix: Examples