The source files were parsed, record by record, and exported into an XML file with the following specification:
"<?xml version="1.0"?>
<!DOCTYPE RECORD [
<!ELEMENT ARPID (#PCDATA)>
<!ELEMENT CITATION (REFID*)>
<!ELEMENT REFID (#PCDATA)>
<!ELEMENT BIBLIO (TITLE, ENUM*)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT ENUM (#PCDATA)>
<!ATTLIST ENUM TYPE CDATA #REQUIRED>
<!ELEMENT KEYTERMS (KEYW*)>
<!ELEMENT KEYW (#PCDATA)>
<!ATTLIST KEYW TYPE CDATA #REQUIRED>
]>
"
An example of an XML record for an article in the source files:
<RECORD>
<ARPID>ISSN_1013-9826_1998_137_55</ARPID>
<CITATION>
<REFID>ISSN_0032-3861_1987_28_1489 <~>
ISSN_0022-2461_1994_29_3377 <~>ISSN_0032-3861_1994_35_3948 <~>
ISSN_0032-3861_1995_36_4587 <~> ISSN_0032-3861_1995_36_4605 <~>
ISSN_0032-3861_1995_36_4621 <~> ISSN_0022-2461_1989_24_298 <~>
ISSN_0022-2461_1989_24_2454 <~> ISSN_0021-8936_1983_50_1042 <~>
ISSN_0032-3888_1993_33_819 <~> ISSN_0032-3861_1992_33_268 <~>
ISSN_0032-3861_1992_33_284 <~> ISSN_0032-3861_1990_31_2267 <~>
ISSN_0032-3861_1980_21_466 <~> ISSN_0032-3861_1985_26_1855
</REFID>
</CITATION>
<BIBLIO>
<TITLE>Effect of rubber functionality on mechanical and fracture properties of
impact-modified nylon 6,6/polypropylene blends</TITLE>
<ENUM TYPE="ENDPAGE">62</ENUM>
</BIBLIO>
<KEYTERMS>
<KEYW TYPE="TITLE">rubber <~> properti <~> nylon <~> mechan <~>
impact-modifi <~> function <~> fractur <~> blend <~> /polypropylen</KEYW>
<KEYW TYPE="KEYW_AU">PA6,6/PP blends <~> rubber-toughened nylon <~>
rubber-toughened polypropylene <~> mechanical properties <~> fracture toughness <~> J(c)
<~> fractography</KEYW>
<KEYW TYPE="KEYW_ISI">FILLED COMPOSITE-MATERIALS <~>
POLYPROPYLENE BLENDS <~> BLOCK-COPOLYMERS <~> PREDICTIVE MODEL <~>
COMPATIBILIZATION <~> POLYAMIDES <~> MORPHOLOGY <~> CAVITATION <~>
PARTICLES</KEYW>
<KEYW TYPE="AUTHOR">Wong, SC <~> Mai, YW</KEYW>
</KEYTERMS>
</RECORD>
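Because the `<~>` separator is not itself well-formed XML, a standard parser rejects the record as written. The following is a minimal sketch of one way to read such a record in Python, assuming the separator is escaped before parsing and split out afterwards. The function name `parse_record` and the shape of the returned dictionary are illustrative choices, not part of the original system.

```python
import xml.etree.ElementTree as ET

SEP = "<~>"  # multi-value separator used inside REFID and KEYW elements


def parse_record(text):
    """Parse one RECORD string into a plain dictionary (illustrative layout)."""
    # "<~>" would be read as a malformed tag, so escape it before parsing.
    root = ET.fromstring(text.replace(SEP, "&lt;~&gt;"))

    def values(elem):
        # ElementTree unescapes the text again, so split on the raw separator
        # and collapse the line breaks inside each value.
        raw = elem.text or ""
        return [" ".join(v.split()) for v in raw.split(SEP) if v.strip()]

    return {
        "arpid": root.findtext("ARPID").strip(),
        "title": " ".join(root.findtext("BIBLIO/TITLE").split()),
        "refs": [v for ref in root.iter("REFID") for v in values(ref)],
        # One list of terms per TYPE attribute (TITLE, KEYW_AU, AUTHOR, ...).
        "keyterms": {k.get("TYPE"): values(k) for k in root.iter("KEYW")},
    }


# Shortened copy of the example record, used only to exercise the function.
SAMPLE = """<RECORD>
<ARPID>ISSN_1013-9826_1998_137_55</ARPID>
<CITATION>
<REFID>ISSN_0032-3861_1987_28_1489 <~>
ISSN_0022-2461_1994_29_3377 <~>ISSN_0032-3861_1994_35_3948
</REFID>
</CITATION>
<BIBLIO>
<TITLE>Effect of rubber functionality on mechanical and fracture properties of
impact-modified nylon 6,6/polypropylene blends</TITLE>
<ENUM TYPE="ENDPAGE">62</ENUM>
</BIBLIO>
<KEYTERMS>
<KEYW TYPE="AUTHOR">Wong, SC <~> Mai, YW</KEYW>
</KEYTERMS>
</RECORD>"""

rec = parse_record(SAMPLE)
print(rec["arpid"], len(rec["refs"]), rec["keyterms"]["AUTHOR"])
```

The sketch keeps only the last KEYW element for any given TYPE, which is sufficient for the example record, where each TYPE occurs once.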
This format is intended to be as general as possible to enable us to import or update record information from other databases in the future. Below we describe the XML markup tags in more detail.
To better manage the XML record repository, especially as more databases are added in the future, we created an index file that holds additional information about each record's presence in different information resources, or corpora. The index file exists to avoid in-place editing of records in the XML repository. It is a simple ASCII file with one entry per line, and a document that occurs in several databases is recorded as multiple entries. This way we only need to append new index lines and then sort them by ARPID to keep track of a record's existence in multiple databases. Each entry stores a record's handle to one information resource, so a record that is present in n information resources has n entries in the index file. Each entry holds the following items:
Record | Resource | Resource ID | Filename | Line Number |
ARPID | Corpus Name (e.g. ISI) | ID of record in specific resource | XML Filename | Line Number where record is found in XML file |
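The append-then-sort workflow for the index file can be sketched as follows. This is a minimal illustration under stated assumptions: the `|` field delimiter, the helper names (`append_entry`, `sort_index`, `lookup`), the resource name `OTHERDB`, the resource IDs, the XML filenames, and the line numbers are all hypothetical and are not taken from the original system.

```python
import os
import tempfile

FIELD_SEP = "|"  # assumed delimiter; the actual one is not specified


def append_entry(index_path, arpid, resource, resource_id, xml_file, line_no):
    """Append one index line; existing lines are never edited in place."""
    fields = [arpid, resource, resource_id, xml_file, str(line_no)]
    with open(index_path, "a", encoding="ascii") as fh:
        fh.write(FIELD_SEP.join(fields) + "\n")


def sort_index(index_path):
    """Rewrite the index sorted by ARPID (the first field of each line)."""
    with open(index_path, encoding="ascii") as fh:
        lines = [ln for ln in fh if ln.strip()]
    # list.sort is stable, so entries for the same ARPID keep their order.
    lines.sort(key=lambda ln: ln.split(FIELD_SEP, 1)[0])
    with open(index_path, "w", encoding="ascii") as fh:
        fh.writelines(lines)


def lookup(index_path, arpid):
    """Return every index entry (as a list of fields) for one ARPID."""
    hits = []
    with open(index_path, encoding="ascii") as fh:
        for ln in fh:
            fields = ln.rstrip("\n").split(FIELD_SEP)
            if fields[0] == arpid:
                hits.append(fields)
    return hits


# Demonstration with made-up resource IDs, filenames, and line numbers.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.txt")
    append_entry(index_path, "ISSN_1013-9826_1998_137_55", "ISI",
                 "hypothetical-id-1", "records_a.xml", 120)
    append_entry(index_path, "ISSN_0032-3861_1987_28_1489", "ISI",
                 "hypothetical-id-2", "records_a.xml", 4)
    # The same article found in a second (hypothetical) resource.
    append_entry(index_path, "ISSN_1013-9826_1998_137_55", "OTHERDB",
                 "hypothetical-id-3", "records_b.xml", 37)
    sort_index(index_path)
    hits = lookup(index_path, "ISSN_1013-9826_1998_137_55")

print(len(hits), [h[1] for h in hits])
```

A record present in two resources yields two entries here, matching the n-entries-for-n-resources rule, and the XML repository itself is never touched.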