
Metadata Watch Report #2


Section 3 - Domain reports

3.1 Audio-visual sector
3.2 Educational sector
3.3 Academic sector
3.4 Geographical information sector
3.5 Publishing sector

3.3 Academic sector

Correspondent: Michael Day, UKOLN

 

1. CURRENT STATE OF DOMAIN

The scope of the first academic domain metadata watch report included World Wide Web Consortium metadata developments, Internet information gateway initiatives and the specification of metadata for record keeping and digital preservation. This second academic domain report will (briefly) update one of these topics, but will describe in more detail metadata standards developed for electronic literary and linguistic texts (the Text Encoding Initiative header), e-print services (the Open Archives initiative) and the more complex topic of biological diversity information.

2. MAIN ISSUES

1. Digital Preservation

The collaboration between the Research Libraries Group and OCLC on digital preservation has resulted in the creation of a small, but international, Digital Archive Attributes Working Group. This working group started work in June 2000 and will initially review the OAIS Reference Model, the various preservation initiatives of the National Library of Australia and the British Library, and the outcomes of research projects like NEDLIB and Cedars.

2. Literary and linguistic texts

Humanities scholars have used electronic texts in their research since the 1950s, when Roberto Busa began to compile a word index and concordance to the complete works of Thomas Aquinas. Over the succeeding years, scholars have created a large quantity of electronic texts for both literary and linguistic research. Electronic text centres, e.g. the University of Virginia Electronic Text Center and the Oxford Text Archive, have been set up to serve this constituency of scholars, and to encourage the long-term retention and reuse of texts. It has long been realised that the easy reuse of electronic texts is dependent upon the application of a standardised encoding (or markup). Defining this standardised encoding has been the main goal of the Text Encoding Initiative (TEI).

The TEI is an ongoing collaborative research effort concerned with developing generic guidelines for the representation of textual materials in electronic form in order to facilitate the preparation and interchange of electronic texts for scholarly research. Since 1990, the TEI has published several editions of its "Guidelines for Electronic Text Encoding and Interchange", the encoding scheme of which is formulated as an application of the Standard Generalized Markup Language (SGML). TEI-conformant Document Type Definitions (DTDs) can be built up from the tag-sets documented in the Guidelines.

Metadata - typically bibliographic-type information - is stored in the TEI header. The elements in the header are divided into four parts: file description (where most of the bibliographic information would be stored), encoding description, text profile and revision history. The file description element of the TEI header was broadly based on library cataloguing principles (e.g. the International Standard Bibliographic Description (ISBD) series) but did not mandate any particular content rules. In practice, however, electronic text centres creating TEI headers have often developed their own cataloguing guidelines. These tend to be compliant with library standards like the ISBD(ER) - the ISBD for electronic resources - and the 2nd edition of the Anglo-American Cataloguing Rules (AACR2). A good example of such rules is included in the University of Virginia Library's "Cataloging Procedures Manual".
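The four-part structure of the header can be illustrated with a minimal sketch. It uses the element names that the TEI Guidelines give to the four parts (fileDesc, encodingDesc, profileDesc and revisionDesc), but the function name, the sample values and the choice of child elements are illustrative assumptions only. The output is serialised as XML for convenience and has not been validated against any TEI DTD.

```python
import xml.etree.ElementTree as ET


def build_tei_header(title, author, publisher, source_note):
    """Build a skeletal TEI header with its four main parts (hypothetical helper)."""
    header = ET.Element("teiHeader")

    # 1. File description: most of the bibliographic information lives here.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "publisher").text = publisher
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # 2. Encoding description: how the text was transcribed and marked up.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = "Placeholder encoding note."

    # 3. Text profile: non-bibliographic context (left empty in this sketch).
    ET.SubElement(header, "profileDesc")

    # 4. Revision history: changes made to the electronic file (left empty here).
    ET.SubElement(header, "revisionDesc")

    return header


# Placeholder values, not taken from any real catalogue record.
header = build_tei_header(
    "Example Title", "Example Author", "Example Text Centre", "Example print source."
)
print(ET.tostring(header, encoding="unicode"))
```

In practice, the content placed in the file description would follow whatever cataloguing guidelines the electronic text centre has adopted, since the header itself does not mandate content rules.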

A membership consortium known as the TEI Consortium is now responsible for the development, maintenance and promotion of TEI. This consortium will be responsible for ensuring the future sustainability of the TEI (e.g., through developing training and consulting services) and for ensuring that the TEI Guidelines have a role to play in the development of XML-based tools.

3. E-print archives

Another metadata-related initiative with its roots in the academic sector is the interoperable system being developed by the Open Archives initiative (OAi). The Open Archives initiative is working towards the development of a universal service that will give access to author self-archived scholarly literature (e-prints). Following the early example of the 'e-print archive' hosted by the Los Alamos National Laboratory (now called arXiv.org), a large (and growing) number of e-print services have been set up and they increasingly form an important part of scholarly communication in some disciplines.

The Open Archives initiative aims to offer interoperability between these quite diverse e-print services. A meeting of the OAi in Santa Fe resulted in the publication of the "Santa Fe Convention", a set of interoperability agreements that are intended to aid the creation of e-print 'mediator services'. The Convention recognises the existence of format diversity but suggests that interoperability will depend upon the existence of a shared format for exchanging metadata. The proposed basic metadata set is called the Open Archive Metadata Set (oams) - the semantics of which has deliberately been kept simple in the interest of easy creation and widest applicability.
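The idea of a simple shared exchange format sitting alongside diverse native formats can be sketched as follows. This is an illustration of the general approach only: the shared field names, the archive names and the native field names are all hypothetical and are not the actual oams elements or any real service's schema.

```python
# Hypothetical shared fields; NOT the actual oams element names.
SHARED_FIELDS = ("title", "author", "identifier")

# Hypothetical mappings from each archive's native field names to the shared ones.
ARCHIVE_MAPPINGS = {
    "archive_a": {"Title": "title", "Creator": "author", "ID": "identifier"},
    "archive_b": {"ti": "title", "au": "author", "paper_no": "identifier"},
}


def to_shared_record(archive, native_record):
    """Map one native record onto the shared field set, dropping unmapped fields."""
    mapping = ARCHIVE_MAPPINGS[archive]
    shared = {field: None for field in SHARED_FIELDS}
    for native_name, value in native_record.items():
        shared_name = mapping.get(native_name)
        if shared_name is not None:
            shared[shared_name] = value
    return shared


def merge_for_mediator(harvested):
    """Combine (archive, native_record) pairs into one list of shared records."""
    return [to_shared_record(archive, record) for archive, record in harvested]


# Placeholder records from two archives with different native formats.
harvested = [
    ("archive_a", {"Title": "Paper one", "Creator": "Author A", "ID": "a/1"}),
    ("archive_b", {"ti": "Paper two", "au": "Author B", "paper_no": "b/2", "msc": "x"}),
]
for record in merge_for_mediator(harvested):
    print(record)
```

Keeping the shared set small means that each archive only needs a short mapping table, while richer native metadata remains available from the archive itself.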

4. Biological diversity information

A different area where there is a pressing need for interoperability and standardisation is that of biological information. This need has been emphasised by the growing perception of the importance of biological diversity (biodiversity) and the creation of Internet-based services like the Clearing House Mechanism (CHM) of the United Nations Convention on Biological Diversity (1992).

One of the challenges of integrating biological information is that there are many different types of it. For example, curatorial institutions like museums, herbaria, botanical gardens and zoological gardens would typically have large amounts of descriptive data about biological specimens and artefacts. There is no single standard for these descriptions and they are often not in any machine-readable form, e.g. they could be ancient hand-written labels stuck on glass jars or placed next to insects on pins. Other information is concerned with geographical distribution, biological nomenclature or publications. One key to any proposed biodiversity 'database' will be the digitisation and integration of this wide range of information types.

Digitising and integrating all of this information, however, will take a very long time. Even finding out what has been previously described is difficult - there is currently no master inventory of all of the species that have been described by taxonomists. Progress has been made by biologists in developing particular areas; e.g. the creation of nomenclatural databases like the International Legume Database & Information Service (ILDIS) or the Missouri Botanical Garden's VAST (VAScular Tropicos) database.

A variety of standards have been developed to support the interchange of biological information. These include the Association of Systematics Collections (ASC) Reference Model for Biological Collections and the Herbarium Information Standards and Protocols for Interchange of Data (HISPID). The DELTA (DEscription Language for TAxonomy) format can be used to record taxonomic descriptions and has been adopted as a standard for data exchange by the International Taxonomic Databases Working Group.

Generic approaches to taxonomic information have been taken by services like the Index to Organism Names (maintained by BIOSIS), the Integrated Taxonomic Information System (ITIS), and by the Species 2000 programme. Further developments might be facilitated by the 1999 agreement of the OECD's Global Science Forum (previously the Megascience Forum) to create a Global Biodiversity Information Facility (GBIF).

Smaller-scale progress is being made in two important areas: the development of standardised metadata formats for biological information, and the production of a uniform and validated index of the names of all known species (the Species 2000 programme).

Some progress in the metadata format area has been achieved by the agreement of the Federal Geographic Data Committee's (FGDC) Biological Metadata Profile. This is an enhancement of the FGDC's Content Standard for Digital Geospatial Metadata (CSDGM). It includes all of the CSDGM elements but adds other elements that can be used to document biological information about taxonomy and nomenclature. Metadata created according to this profile can be added to biological metadata clearinghouses like the US NBII (National Biological Information Infrastructure) Metadata Clearinghouse.

The Species 2000 programme aims to create an index to all of the world's known species that could be used as a tool in inventorying and monitoring biodiversity worldwide. It is an international initiative, based on a federation of existing taxonomic databases that will create a range of global species databases covering all of the major groups of organisms. The Species 2000 programme is likely to take a long time. In the interim, regional approaches to indexing species are being developed, of which ITIS (a partnership of U.S., Canadian, and Mexican agencies) covers North America. In the UK, the National Biodiversity Network (NBN) will be developing relevant data (and metadata) standards and a dictionary of species.

3. TRENDS

Some trends can be identified:

First, to state the obvious, much of the impetus towards developing standardised metadata schemes is based on a need to share metadata or otherwise interoperate. Previous standardisation, however, has tended to be based within one particular sector; e.g. the sharing of MARC records between libraries. Increasingly, there is a need for some standardisation of data across sectors. Good examples are the way in which electronic text centres have developed content rules for TEI headers based on existing library standards and the interaction between biological and geographical information made possible by the development of the FGDC Biological Metadata Profile.

Secondly, successful academic sector metadata initiatives will need to be collaborative ventures between academic institutions and other organisations. This will be especially important in the biological diversity information area, where universities, curatorial institutions, government agencies, supra-national organisations, commercial database providers, professional societies, etc. will all have an interest in the development of standards.

Thirdly, it is becoming clear that short-term research and development projects will often need to evolve into something more sustainable. A good example of this is the development of the TEI from a development project funded by professional societies and other funding organisations into a non-profit membership consortium.



Maintained by: UK Office for Library and Information Networking (UKOLN)
Last updated: 07 August 2001