Using XSLT on Bioinformatic XML Data
When data changes frequently

For the biologist, the bioinformatic analysis of genes requires the compilation of tables of gene characteristics. To do this, data is often taken manually out of databases in an ad hoc fashion. Different databases (TIGR, MIPS, BLAIR, and NCBI, for example) give different outputs in different formats. We would like to be able to extract information from the databases in a common, structured file format in a way that allows for easy rearranging and processing of the data.

The Extensible Markup Language (XML) is increasingly used to represent semi-structured data and transmit it over the Internet. XML data is marked up by tags in a manner similar to the Hypertext Markup Language (HTML). All the main biological databases on the Internet now give the user the option of choosing XML output, and the National Center for Biotechnology Information (NCBI) Web site is one of them. For example, the following code, taken from the NCBI site, shows one way of using XML to mark up the protein with accession number "BAA03739.1".


<GBQualifier>
	<GBQualifier_name>protein_id</GBQualifier_name>
	<GBQualifier_value>BAA03739.1</GBQualifier_value>
</GBQualifier>

HTML markup is used to format and present data; XML is used to organize and structure data. It is the user who defines the choice of elements and attributes, the types of data contained in them, and the way the elements nest within each other. One of the principal characteristics of XML is the separation of the data itself, which is the XML document, from the formatting instructions in the style sheet. The rules governing the document structure and data types are also kept separate in a document called the schema.
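
As a rough illustration of how such markup carries structure rather than presentation, the GBQualifier fragment above can be read by any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is our choice for illustration and is not a tool used in the article.

```python
import xml.etree.ElementTree as ET

# The GBQualifier fragment from the text, held as a string.
fragment = """
<GBQualifier>
    <GBQualifier_name>protein_id</GBQualifier_name>
    <GBQualifier_value>BAA03739.1</GBQualifier_value>
</GBQualifier>
"""

# Parse the fragment and read the text of the two child elements by name.
qualifier = ET.fromstring(fragment)
name = qualifier.findtext("GBQualifier_name")
value = qualifier.findtext("GBQualifier_value")
print(name, value)
```

The element names alone are enough to locate the data; nothing in the fragment says how it should be displayed.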

To illustrate the use of XML as an open data format for exchanging and processing biological data, data files describing rice starch synthase genes were downloaded from the NCBI database in an XML format (GBSeqXML). One of the standard XML tools, Extensible Stylesheet Language Transformations (XSLT), was used to process the required information and transform it into a table within a text document. XSLT can transform XML documents from one XML format into another, and can also transform XML into HTML for presentation or into text for documentation.

Case Study: Using XSLT on a Bioinformatic XML Data File
The NCBI Internet Web site is a huge biological database. The authors were interested in information on the rice soluble starch synthase genes. These genes are stored as "nucleotide" records, and each gene has a unique accession number in the NCBI database. Thus a search was done under the keyword "nucleotide" for accession number "D16202". The output was downloaded in the "GBSeqXML" format and saved as an XML file on our hard drive. The truncated file, saved as "file1.xml," is shown in Listing 1.

The top line is the XML declaration; it identifies the document as an XML document for an XML processor. The declaration begins with the text "<?xml", which signals to the parser that an XML declaration follows, and its version attribute, as in "<?xml version="1.0"?>", states that the version of the XML specification being used in the document is "1.0". The encoding value in the first line identifies the character encoding used in the document. Because different languages have traditionally used different encoding schemes, this declaration allows XML to support different languages. When no encoding is declared, an XML processor assumes the Unicode encoding "UTF-8" (or "UTF-16" if the file starts with a byte-order mark). In this article we use "UTF-16," a 16-bit encoding of Unicode, the international character set. It is important to realize that XML is case sensitive. This is a critical difference from HTML, with which many biologists are familiar and in which element and attribute names are not case sensitive.
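
The effect of the declaration and of case sensitivity can be seen with a small test. This is a sketch in Python's standard xml.etree.ElementTree module (our choice, not the article's tooling); the element content "EXAMPLE" is a made-up placeholder, not data from Listing 1.

```python
import xml.etree.ElementTree as ET

# A tiny document with an XML declaration that names its encoding.
text = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<GBSet><GBSeq_locus>EXAMPLE</GBSeq_locus></GBSet>"
)

# Encode to UTF-16 bytes (Python adds a byte-order mark) so that the
# bytes on disk would agree with the declared encoding.
data = text.encode("utf-16")
root = ET.fromstring(data)

# Element names are case sensitive: only the exact spelling matches.
exact = root.find("GBSeq_locus")
wrong_case = root.find("gbseq_locus")
print(exact is not None, wrong_case is None)
```

The lookup with the lowercase name finds nothing, which is the behavior a reader used to HTML might not expect.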

The second line links the XML document to the schema. A schema can be thought of as a set of rules that establishes the format and structure created for the document. A schema describes precisely the permitted elements and attributes that are available within a given XML document, along with the relationships between the elements. We can think of a schema as a legal contract between the person who created the markup language and the person who will create documents using that language. Each document that conforms to the schema is referred to as an instance of the schema, and within the rules of the schema a wide variation in instance documents is possible. Not all instance documents will contain the same information.

When it comes to creating schemas, two different approaches can be taken:

  • Document type definitions (DTDs)
  • XML Schema definitions (XSDs)

The second line in Listing 1 is the document type declaration for a public external DTD. A document type declaration is a line of code that identifies the DTD being used; in this case it contains a URL that gives the DTD's location as "http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd." The big distinction here is that the definition (DTD) actually describes the markup language, whereas the declaration connects the document to the DTD, which may be located on a remote server.

The goal is to make an XML document a valid document. Document validity is extremely important because it guarantees that the data within the document conforms to a standard set of guidelines, as laid out in a schema. Our example in Listing 1 uses a DTD. In a valid XML document, all elements and attributes match the logical structure and data types defined in the DTD. Not all XML documents have to be valid. To validate an XML document, the parser must read the DTD, check the document against it, and report any violations to the XML application. Because this takes time, applications that use XML only to code small chunks of data may skip the thorough validation made possible by a schema or DTD.

Even if the XML document is not valid, it must be well-formed. A well-formed XML document conforms to the syntax rules of the World Wide Web Consortium (W3C) XML 1.0 specification. Rules for well-formed XML documents include matching every start tag with an end tag and giving a quoted value to every attribute used. A well-formed XML document contains one or more elements and has a single root element. In Listing 1 the root element is "GBSet," all other elements are properly nested under it, and each of the parsed entities is referenced correctly. An XML document is a structure of elements, attributes, and text, all nested within the root element. Well-formed XML documents do not require a DTD, but valid XML documents do.
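
The well-formedness rules above can be checked mechanically, because a conforming parser must reject any document that breaks them. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are ours, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<GBSet><GBSeq><GBSeq_length>10</GBSeq_length></GBSeq></GBSet>"
mismatched = "<GBSet><GBSeq></GBSet></GBSeq>"   # end tags in the wrong order
two_roots = "<GBSet/><GBSet/>"                  # more than one root element
unquoted = "<GBSet id=1/>"                      # attribute value not quoted

results = [is_well_formed(t) for t in (good, mismatched, two_roots, unquoted)]
print(results)
```

Note that this checks well-formedness only. Validation against a DTD is a separate step that this module does not perform.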

XSLT Implementation
XSLT is used to transform XML documents into other documents, such as HTML or text. XSLT processors parse the input XML document and then process the instructions found in the XSLT stylesheet, using the elements from the input XML document. The familiar markup structure, using the less-than "<" and greater-than ">" symbols, makes its syntax readily identifiable and easier for some people to use than a procedural language. During the processing of the XSLT instructions, which are themselves written as XML elements, a structured text output is created. XSLT instructions can also use XML attributes to access and process the content of the elements in the XML input document.

Using XSLT to perform transformations on XML is easier than writing a custom application in a procedural language, because the design of XSLT is based on the recognition that such transformation programs are all very similar. It should therefore be possible to do the processing using the declarative XSLT language rather than by writing a program from scratch in Java or some other programming language. The required transformation can be expressed as a set of rules, and the rules are defined by the output we want to generate from particular patterns that occur in the input. The language is declarative because one describes the required transformation as a set of rules, rather than by creating a sequence of procedures to be run in a given order. The process is simpler because XSLT only has to describe the required transformation, yet it is a complete programming language in itself.
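
The idea of a transformation written as a set of rules can be imitated in a few lines of ordinary code. The sketch below is a Python analogue of pattern-and-template matching, not XSLT and not the authors' stylesheet; the rule table is our own construction, and the "mRNA" value is an illustrative placeholder.

```python
import xml.etree.ElementTree as ET

# Each rule pairs an element name (the "pattern") with a function that
# produces output for a matching element (the "template").
rules = {
    "GBSeq_primary-accession": lambda e: "Accession: " + e.text,
    "GBSeq_moltype": lambda e: "Molecule type: " + e.text,
}

def transform(root):
    # Walk the tree in document order and apply whichever rule matches.
    return [rules[e.tag](e) for e in root.iter() if e.tag in rules]

source = ET.fromstring(
    "<GBSet><GBSeq>"
    "<GBSeq_moltype>mRNA</GBSeq_moltype>"
    "<GBSeq_primary-accession>D16202</GBSeq_primary-accession>"
    "</GBSeq></GBSet>"
)
lines = transform(source)
print(lines)
```

The order in which the rules are written does not matter; the output order follows the input document. That is the property the paragraph above describes as declarative.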

Typically, genetic databases produce data files containing millions of records. It would be faster to parse this data with a procedural language, because parsing large XML documents and running XSLT over them requires a lot of processing power. The major disadvantage of a custom parsing approach is that when data formats change, as they often do, the parsers stop working and the procedural code must be rewritten or heavily modified. Take, for instance, the recent change in which UniGene started appending the version number to its NCBI GenBank accession numbers. Using XML technology, the programmer only adds a tag to extract the new form of the GenBank accession number. The advantage of XML over raw flat files is that the existing code does not have to be rewritten.
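
The robustness claim can be illustrated with two versions of a record, one of which gains an extra element. This is a sketch in Python's standard xml.etree.ElementTree module rather than XSLT; the value "D16202.1" and the two helper functions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

old_format = (
    "<GBSeq>"
    "<GBSeq_primary-accession>D16202</GBSeq_primary-accession>"
    "</GBSeq>"
)
# The same record after a format change adds a versioned accession.
new_format = (
    "<GBSeq>"
    "<GBSeq_primary-accession>D16202</GBSeq_primary-accession>"
    "<GBSeq_accession-version>D16202.1</GBSeq_accession-version>"
    "</GBSeq>"
)

def accession(text):
    # Existing extraction: addresses the element by name, so extra
    # elements elsewhere in the record do not affect it.
    return ET.fromstring(text).findtext("GBSeq_primary-accession")

def accession_version(text):
    # Added later; returns None for records that lack the element.
    return ET.fromstring(text).findtext("GBSeq_accession-version")

print(accession(old_format), accession(new_format))
print(accession_version(old_format), accession_version(new_format))
```

The original lookup keeps working on both versions, and supporting the new field means adding one lookup rather than rewriting a positional parser.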

By using XSLT on the downloaded data of rice starch synthase genes, which were saved as XML files, we can construct a table that includes characteristics such as "accession number," "molecular type," "protein number," etc. XSLT can extract any piece of data from the XML file, process it using built-in functions, and format it on the page in any way desired. Listing 2 shows many of these features.

Listing 2 has been truncated for space reasons. Some of the XSLT code used is adapted from XSLT Cookbook by Mangano, and further features of the code are described there. An XML document may be visualized as a tree structure of elements, attributes, text, comments, etc. XSLT is a mapping from the source tree into the result tree. Each node that is to be mapped (each branch and leaf of the tree) has a rule associated with it, called a "template," that describes how the node is to be transformed. At the top of the tree is an imaginary node called the "document root," denoted by "/". It represents the document as a whole and is the parent of the topmost element, or "root element," which here is "GBSet." A node is addressed using the path language XPath. For example, "/GBSet/GBSeq/GBSeq_primary-accession" refers to the element "GBSeq_primary-accession," which is a child element of "GBSeq," which is itself a child element of the root element "GBSet." In our code the line "document($file)/GBSet/GBSeq/GBSeq_primary-accession" is shortened to "document($file)//GBSeq_primary-accession." The two slashes, "//," select descendants at any depth, so the shortened path yields the set of nodes consisting of every GBSeq_primary-accession element in the tree.
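
The difference between the full path and the "//" shorthand can be tried out with any XPath-aware library. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath and expects paths relative to the element being searched (hence the leading "."); the second accession, "EXAMPLE2," is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# A cut-down GBSet with two records.
root = ET.fromstring(
    "<GBSet>"
    "<GBSeq><GBSeq_primary-accession>D16202</GBSeq_primary-accession></GBSeq>"
    "<GBSeq><GBSeq_primary-accession>EXAMPLE2</GBSeq_primary-accession></GBSeq>"
    "</GBSet>"
)

# Step-by-step path from the root element GBSet down to the accession.
by_full_path = [
    e.text for e in root.findall("./GBSeq/GBSeq_primary-accession")
]

# "//" shorthand: every matching element at any depth below the root.
by_descendant = [
    e.text for e in root.findall(".//GBSeq_primary-accession")
]

print(by_full_path, by_descendant)
```

Both forms return the same elements here. The shorthand is more convenient, but on a large document it may force the processor to search the whole tree.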

Any of the elements and attributes in the downloaded XML files can be extracted by XSLT code and formatted into a text or HTML file. But XSLT does more than just extract data from the XML source: it processes data into information. For example, applying the built-in string function "string-length(...)" to the amino acid sequence gives the number of amino acids (entry Amino_Acid, Table 1). Similarly, the molecular weight (in Daltons) can be calculated using the XSLT arithmetic features.
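
The two derived values can be sketched as follows. This is a Python analogue, not the authors' XSLT, and the article does not state which formula was used for the molecular weight: the figure of 110 Da per residue is a common rough average that we assume here, and the ten-letter sequence is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: rough average mass of one amino acid residue, in Daltons.
# The article does not give the formula behind its molecular weights.
AVERAGE_RESIDUE_MASS_DA = 110.0

# A qualifier holding a short, hypothetical protein sequence.
record = ET.fromstring(
    "<GBQualifier>"
    "<GBQualifier_name>translation</GBQualifier_name>"
    "<GBQualifier_value>MSALTTSQLA</GBQualifier_value>"
    "</GBQualifier>"
)

sequence = record.findtext("GBQualifier_value")

# Counterpart of XPath's string-length(): one letter per amino acid.
amino_acid_count = len(sequence)

# Simple arithmetic on the extracted value.
approx_weight_da = amino_acid_count * AVERAGE_RESIDUE_MASS_DA
print(amino_acid_count, approx_weight_da)
```

A more careful calculation would sum the mass of each individual residue; the flat average is used only to keep the arithmetic visible.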

Results
Ten GBSeqXML files on rice starch synthase proteins were downloaded from the NCBI Web site and saved as XML files. Our XSLT program was written to extract and format the data and present it as Table 1. The first column in Table 1 is the primary accession number, which uniquely identifies each file. Our XSLT looped through the ten files and extracted the data shown in Table 1. Any data in the XML files could have been extracted. The first five columns were obtained by simple extraction, and the last two columns were calculated from the amino acid sequence. The Altova XMLSpy XSLT processor was used.
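
The loop over files can be pictured with the sketch below. It is a Python analogue of the process, not the authors' XSLT, and it reproduces only two columns. The documents are held in memory under hypothetical file names, and apart from "D16202" the values are placeholders rather than results from Table 1.

```python
import xml.etree.ElementTree as ET

# Stand-ins for downloaded files: file name -> XML text.
records = {
    "file1.xml": (
        "<GBSet><GBSeq>"
        "<GBSeq_primary-accession>D16202</GBSeq_primary-accession>"
        "<GBSeq_moltype>mRNA</GBSeq_moltype>"
        "</GBSeq></GBSet>"
    ),
    "file2.xml": (
        "<GBSet><GBSeq>"
        "<GBSeq_primary-accession>EXAMPLE2</GBSeq_primary-accession>"
        "<GBSeq_moltype>DNA</GBSeq_moltype>"
        "</GBSeq></GBSet>"
    ),
}

# One table column per element name.
COLUMNS = ["GBSeq_primary-accession", "GBSeq_moltype"]

def table_rows(documents):
    rows = []
    for text in documents.values():
        root = ET.fromstring(text)
        # Missing elements become empty cells instead of errors.
        rows.append(
            [root.findtext(".//" + column, default="") for column in COLUMNS]
        )
    return rows

# Tab-separated text table, one line per input document.
table = "\n".join("\t".join(row) for row in table_rows(records))
print(table)
```

Adding a column means adding one element name to COLUMNS, which mirrors the way a new value is added to the stylesheet.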

Conclusion
This article has shown how the declarative language XSLT can be used to extract and format data into a table. No procedural programming is needed to produce output like that shown in Table 1; only XSLT was used. Since data formats such as NCBI's change frequently, using XML technology is an advantage because the code does not have to be rewritten. Only the necessary declarative tags need to be added to extract the new data, if needed. Using a procedural language to parse the data would most likely require major modifications to the code. The advantage of using a procedural language over a declarative language is that processing is faster.

For our work here, the XML output files were obtained directly by downloading them to the hard drive. XML processing could be combined with Java programming to automate the process. For example, a Java program could have been written to work in conjunction with the gene output from the NCBI Web site to automatically generate the XML output files and process them into the text document. This would have required procedural programming skills; what we have done here uses only the built-in declarative features of the XSLT tool.

References

  • Baxevanis, A. and Ouellette, F. (2001) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, John Wiley & Sons.
  • Cagle, K., Corning, M., Diamond, J., and Duynstee, T. (2001) Professional XSL, Wrox Press Ltd.
  • Castro, E. (2001) XML for the World Wide Web: Visual QuickStart Guide, Peachpit Press.
  • Birbeck, M., Diamond, J., Duckett, J., and Gudmundsson, O.G. (2001) Professional XML, Wrox Press Ltd.
  • Van der Vlist, E. (2002) XML Schema, O'Reilly & Associates.
  • NCBI Web site: www.ncbi.nlm.nih.gov
  • Kay, M. (2001) XSLT Programmer's Reference, Wrox Press Ltd.
  • XML Web site: www.w3.org/TR/REC-xml or www.w3.org/XML/Core
  • Gardner, J.R. and Rendon, Z.L. (2002) XSLT and XPath: A Guide to XML Transformations, Prentice Hall.
  • Mangano, S. (2002) XSLT Cookbook, O'Reilly & Associates.

About Philip Burton
Philip Burton earned his Ph.D. in Mathematical Physics from the University of Queensland in 1996. He is an assistant professor in the Department of Information Science at the University of Arkansas at Little Rock. Philip is a member of the Institute of Electrical and Electronics Engineers (IEEE), the American Mathematical Society (AMS), and the American Physical Society (APS).

About Russel Bruhn
Russel Bruhn earned his Ph.D. in electrical engineering from Washington State University in 1997. He is an associate professor and chair of the Department of Information Science at the University of Arkansas at Little Rock. His research interests are in the areas of creating innovative curriculum, computers and education, XML, and applications of XML with SVG graphics.

About Gary A. Thompson
Gary A. Thompson earned his Ph.D. in plant genetics from the Department of Botany and Plant Pathology at Purdue University in 1989. He is a professor in the Department of Applied Science at the University of Arkansas at Little Rock. Gary is jointly appointed with the University of Arkansas Division of Agriculture and was formerly on the faculty in the Department of Plant Sciences at the University of Arizona. He is a member of the American Society of Plant Biologists. His research interests are in the areas of plant molecular biology and genomics.
