1.2 XML Specification

The source files were parsed, record by record, and exported into an XML file with the following specification:

<?xml version="1.0"?>
<!DOCTYPE RECORD [
      <!ELEMENT RECORD (ARPID, CITATION, BIBLIO, KEYTERMS)>
      <!ELEMENT ARPID (#PCDATA)>
      <!ELEMENT CITATION (REFID*)>
            <!ELEMENT REFID (#PCDATA)>
      <!ELEMENT BIBLIO (TITLE, ENUM*)>
            <!ELEMENT TITLE (#PCDATA)>
            <!ELEMENT ENUM (#PCDATA)>
            <!ATTLIST ENUM TYPE CDATA #REQUIRED>
      <!ELEMENT KEYTERMS (KEYW*)>
            <!ELEMENT KEYW (#PCDATA)>
            <!ATTLIST KEYW TYPE CDATA #REQUIRED>
]>

An example of an XML record for an article in the source files:

<RECORD>
      <ARPID>ISSN_1013-9826_1998_137_55</ARPID>
      <CITATION>
            <REFID>ISSN_0032-3861_1987_28_1489 <~>
            ISSN_0022-2461_1994_29_3377 <~> ISSN_0032-3861_1994_35_3948 <~>
            ISSN_0032-3861_1995_36_4587 <~> ISSN_0032-3861_1995_36_4605 <~>
            ISSN_0032-3861_1995_36_4621 <~> ISSN_0022-2461_1989_24_298 <~>
            ISSN_0022-2461_1989_24_2454 <~> ISSN_0021-8936_1983_50_1042 <~>
            ISSN_0032-3888_1993_33_819 <~> ISSN_0032-3861_1992_33_268 <~>
            ISSN_0032-3861_1992_33_284 <~> ISSN_0032-3861_1990_31_2267 <~>
            ISSN_0032-3861_1980_21_466 <~> ISSN_0032-3861_1985_26_1855
            </REFID>
      </CITATION>
      <BIBLIO>
            <TITLE>Effect of rubber functionality on mechanical and fracture properties of impact-modified nylon 6,6/polypropylene blends</TITLE>
            <ENUM TYPE="ENDPAGE">62</ENUM>
      </BIBLIO>
      <KEYTERMS>
            <KEYW TYPE="TITLE">rubber <~> properti <~> nylon <~> mechan <~> impact-modifi <~> function <~> fractur <~> blend <~> /polypropylen</KEYW>
            <KEYW TYPE="KEYW_AU">PA6,6/PP blends <~> rubber-toughened nylon <~> rubber-toughened polypropylene <~> mechanical properties <~> fracture toughness <~> J(c) <~> fractography</KEYW>
            <KEYW TYPE="KEYW_ISI">FILLED COMPOSITE-MATERIALS <~> POLYPROPYLENE BLENDS <~> BLOCK-COPOLYMERS <~> PREDICTIVE MODEL <~> COMPATIBILIZATION <~> POLYAMIDES <~> MORPHOLOGY <~> CAVITATION <~> PARTICLES</KEYW>
            <KEYW TYPE="AUTHOR">Wong, SC <~> Mai, YW</KEYW>
      </KEYTERMS>
</RECORD>

This format is intended to be as general as possible to enable us to import or update record information from other databases in the future. Below we describe the XML markup tags in more detail.
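
As a rough illustration of how a record such as the one above could be read back, the following Python sketch turns one RECORD block into a dictionary. It is not the ARP parser. It assumes that the bare <~> delimiter, which a standard XML parser rejects as malformed, can be escaped before parsing, and that the text contains no other unescaped markup characters such as "&". The function name parse_record, the dictionary keys, and the shortened SAMPLE record are choices made for this example only.

```python
import xml.etree.ElementTree as ET

DELIM = "<~>"  # list delimiter used inside REFID and KEYW content


def _split(value):
    """Split a <~>-delimited string into a list of stripped, non-empty items."""
    return [item.strip() for item in (value or "").split(DELIM) if item.strip()]


def parse_record(text):
    """Parse one <RECORD> block into a plain dictionary (illustrative only)."""
    # "<~>" is not well-formed XML, so escape it before handing the text
    # to the parser; the parser restores it in the element text.
    root = ET.fromstring(text.replace(DELIM, "&lt;~&gt;"))
    record = {
        "arpid": root.findtext("ARPID", default="").strip(),
        "citations": [],
        "title": root.findtext("BIBLIO/TITLE", default="").strip(),
        "enum": {},      # TYPE attribute -> value, e.g. "ENDPAGE" -> "62"
        "keyterms": {},  # TYPE attribute -> list of keywords
    }
    for refid in root.iterfind("CITATION/REFID"):
        record["citations"].extend(_split(refid.text))
    for enum in root.iterfind("BIBLIO/ENUM"):
        record["enum"][enum.get("TYPE")] = (enum.text or "").strip()
    for keyw in root.iterfind("KEYTERMS/KEYW"):
        record["keyterms"].setdefault(keyw.get("TYPE"), []).extend(_split(keyw.text))
    return record


# Shortened copy of the example record (two citations, one keyword type).
SAMPLE = """<RECORD>
  <ARPID>ISSN_1013-9826_1998_137_55</ARPID>
  <CITATION>
    <REFID>ISSN_0032-3861_1987_28_1489 <~>
    ISSN_0022-2461_1994_29_3377</REFID>
  </CITATION>
  <BIBLIO>
    <TITLE>Effect of rubber functionality on mechanical and fracture properties of impact-modified nylon 6,6/polypropylene blends</TITLE>
    <ENUM TYPE="ENDPAGE">62</ENUM>
  </BIBLIO>
  <KEYTERMS>
    <KEYW TYPE="AUTHOR">Wong, SC <~> Mai, YW</KEYW>
  </KEYTERMS>
</RECORD>"""

if __name__ == "__main__":
    print(parse_record(SAMPLE))
```

Escaping the delimiter keeps the sketch within the standard library; a stricter alternative would be to store each list item in its own element, but that would change the record format described here.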

1.2.1 XML Tags and Attributes

  1. ARPID tag. This tag contains the unique record identifier that we have created for ARP: the ARPID. This identification string is built from the following information:
    1. Type of record: article, book, or chapter.
    2. ISSN or ISBN (the ISnumber).
    3. Bibliographic information: publication year, volume, and start page.
    The ARPID has the form type_isnumber_year_vol_startpage, e.g. ISSN_1063-651X_1997_55_R6315. It is unique across all records that carry ISSN or ISBN information. We do not consider records without such identification.
  2. CITATION tag. This tag contains a list, enclosed in a REFID subtag, of ARPIDs (same form as the record ARPID) separated by the <~> delimiter. Each element of the list uniquely identifies a document cited by the record; that document may or may not itself be a record in the XML repository.
  3. BIBLIO tag. Within the BIBLIO tag we keep a number of subtags that hold bibliographic information not already specified in the ARPID. These subtags (ENUM in the specification above) can have different TYPE attributes, such as "ISSUE", "PART", and "ENDPAGE", depending on which extra information is provided by the records in different databases.
  4. KEYTERMS tag. This tag contains one or more KEYW subtags, each qualified by a TYPE attribute such as "AUTHOR", "KEYW_ISI", "KEYW_AU", or "TITLE". Each KEYW tag contains a list of keywords of the given TYPE, separated by the <~> delimiter. Records typically define several types of keywords. In particular, ISI records specify two kinds: author-assigned keywords (TYPE="KEYW_AU") and ISI+ keywords (TYPE="KEYW_ISI"), the latter standardized by ISI. Many records in the ISI database, however, contain neither type. We have therefore derived another class of keywords (TYPE="TITLE") from the title by removing a list of the most common words and stemming the remaining words (see details below). We have also included the names of authors as a type of keyword (TYPE="AUTHOR"). In the future we can easily add other classes of keywords, such as those derived from abstract text. Indeed, the flexibility to add types of information not currently available in the records is one of the important advantages of the XML specification.
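
The ARPID form and the <~>-delimited lists described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used to build the repository: the helper names (Arpid, build_arpid, parse_arpid, split_list) are hypothetical, the type component is taken to be the literal prefix seen in the examples (ISSN), and no component is assumed to contain an underscore.

```python
from typing import NamedTuple


class Arpid(NamedTuple):
    """Components of an ARPID; all kept as strings (start pages may be 'R6315')."""
    kind: str        # prefix as it appears in the examples, e.g. "ISSN"
    isnumber: str    # the ISSN or ISBN itself, e.g. "1063-651X"
    year: str
    volume: str
    start_page: str


def build_arpid(kind, isnumber, year, volume, start_page):
    """Join the five components with underscores: type_isnumber_year_vol_startpage."""
    return "_".join(str(part) for part in (kind, isnumber, year, volume, start_page))


def parse_arpid(arpid):
    """Split an ARPID back into its components.

    Assumes no component contains an underscore; raises ValueError otherwise.
    """
    parts = arpid.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts, got {len(parts)}: {arpid!r}")
    return Arpid(*parts)


def split_list(content, delimiter="<~>"):
    """Split the content of a REFID or KEYW tag into its individual items."""
    return [item.strip() for item in content.split(delimiter) if item.strip()]


if __name__ == "__main__":
    arpid = build_arpid("ISSN", "1063-651X", 1997, 55, "R6315")
    print(arpid)
    print(parse_arpid(arpid))
    print(split_list("Wong, SC <~> Mai, YW"))
```

Keeping every component as a string avoids losing information in start pages such as R6315, which are not purely numeric.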

1.2.2 Index File

To better manage the XML record repository, especially in the future when more databases are added, we created an index file that holds additional information about each record's presence in different information resources, or corpora. The index file exists to avoid in-place editing of records in the XML repository. When a document occurs in several databases, we record each occurrence as a separate entry in the index file, a simple ASCII file with one entry per line. This way we only need to append new index lines, and subsequently sort them by ARPID, to keep track of a record's existence in multiple databases. Each entry stores a record's handle to one information resource, so a record that is present in n information resources has n entries in the index file. Each entry keeps the following items:

Item         Contents
Record       ARPID
Resource     Corpus name (e.g. ISI)
Resource ID  ID of the record in the specific resource
Filename     XML filename
Line Number  Line number where the record is found in the XML file
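
The append-then-sort workflow can be illustrated with a short Python sketch. The field separator is not stated here, so the sketch assumes tab-separated fields in the order listed above; the function names, the second corpus name (OTHERDB), the resource IDs, and the file names are all made up for the example.

```python
import io
from itertools import groupby
from operator import itemgetter

# Field order follows the items listed above; the tab separator is an assumption.
FIELDS = ("arpid", "resource", "resource_id", "filename", "line_number")


def append_entry(index_file, arpid, resource, resource_id, filename, line_number):
    """Append one single-line entry; existing lines are never edited."""
    fields = [arpid, resource, resource_id, filename, str(line_number)]
    index_file.write("\t".join(fields) + "\n")


def load_index(index_file):
    """Read all entries from an index file into a list of dictionaries."""
    entries = []
    for line in index_file:
        line = line.rstrip("\n")
        if not line:
            continue
        entry = dict(zip(FIELDS, line.split("\t")))
        entry["line_number"] = int(entry["line_number"])
        entries.append(entry)
    return entries


def group_by_arpid(entries):
    """Sort entries by ARPID and group them, one group per document."""
    ordered = sorted(entries, key=itemgetter("arpid"))
    return {arpid: list(group) for arpid, group in groupby(ordered, key=itemgetter("arpid"))}


if __name__ == "__main__":
    index = io.StringIO()  # stands in for the on-disk index file
    append_entry(index, "ISSN_1013-9826_1998_137_55", "ISI", "isi-0001", "isi_part1.xml", 120)
    append_entry(index, "ISSN_0032-3861_1987_28_1489", "ISI", "isi-0002", "isi_part1.xml", 164)
    append_entry(index, "ISSN_1013-9826_1998_137_55", "OTHERDB", "odb-77", "other.xml", 9)
    index.seek(0)
    for arpid, group in group_by_arpid(load_index(index)).items():
        print(arpid, [entry["resource"] for entry in group])
```

A document found in n resources then appears as a group of n entries, which matches the one-entry-per-resource rule without touching the XML records themselves.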

For more information contact Luis Rocha at rocha@lanl.gov
Last Modified: July 14, 2004