Using XSLT on Bioinformatic XML Data
When data changes frequently
Aug. 3, 2004 12:00 AM
For the biologist, the bioinformatic analysis of genes requires compiling tables of gene characteristics. To do this, data is often extracted manually from databases in an ad hoc fashion. Different databases (TIGR, MIPS, BLAIR, and NCBI, for example) give different outputs in different formats. We would like to be able to extract information from the databases in a common, structured file format that allows the data to be rearranged and processed easily.

The Extensible Markup Language (XML) is increasingly used to represent semi-structured data and to transmit it over the Internet. XML data is marked up by tags in a manner similar to the Hypertext Markup Language (HTML). All the main biological databases on the Internet now give the user the option of choosing output as XML. For example, the following code, taken from the National Center for Biotechnology Information (NCBI) Web site, shows one way of using XML to mark up the protein with accession number "BAA03739.1":

    <GBQualifier>
      <GBQualifier_name>protein_id</GBQualifier_name>
      <GBQualifier_value>BAA03739.1</GBQualifier_value>
    </GBQualifier>

HTML markup is used to format and present data; XML is used to organize and structure data. It is the user who defines the choice of elements and attributes, the types of data contained in them, and the way the elements nest within each other. One of the principal characteristics of XML is the separation of the data itself, which is the XML document, from the formatting instructions, which are kept in the style sheet. The rules governing the document structure and data types are also kept separate, in a document called the schema.

To illustrate the use of XML as an open data format for exchanging and processing biological data, data files describing rice starch synthase genes were downloaded from the NCBI database in an XML format (GBSeqXML).
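To show how directly tagged data of this kind can be read by a program, here is a minimal Python sketch that parses the GBQualifier fragment quoted above with the standard library. The function name `read_qualifier` is our own and is not part of any NCBI tool; it is meant only as an illustration of reading element content by tag name.

```python
import xml.etree.ElementTree as ET

# The GBQualifier fragment quoted in the article, held in a string.
FRAGMENT = """
<GBQualifier>
  <GBQualifier_name>protein_id</GBQualifier_name>
  <GBQualifier_value>BAA03739.1</GBQualifier_value>
</GBQualifier>
"""


def read_qualifier(xml_text):
    """Return the (name, value) pair stored in one GBQualifier element."""
    qualifier = ET.fromstring(xml_text.strip())
    name = qualifier.findtext("GBQualifier_name")
    value = qualifier.findtext("GBQualifier_value")
    return name, value


name, value = read_qualifier(FRAGMENT)
print(f"{name} = {value}")  # prints "protein_id = BAA03739.1"
```

Because each value is looked up by its tag name, the code does not depend on where the value sits in the file, which is the property the rest of the article relies on.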
One of the standard XML tools, Extensible Stylesheet Language Transformations (XSLT), was used to process the required information and transform it into a table within a text document. XSLT can transform XML documents from one XML format into another, and can also transform XML into HTML for presentation or into text for documentation.

Case Study: Using XSLT on a Bioinformatic XML Data File

The top line of Listing 1 is the XML declaration; it identifies the document as an XML document for an XML processor. The declaration begins with the text "<?xml", which signals to the parser that an XML declaration follows, and version="1.0" states that the document uses version 1.0 of the XML specification. The encoding value in the first line identifies the character encoding used in the document. Because different languages use different encoding schemes, this declaration allows XML to support different languages. When no encoding is declared, the default is "UTF-8," an 8-bit Unicode encoding that is compact for English-language text. In this article we use "UTF-16," the 16-bit Unicode encoding.

It is important to realize that XML is case sensitive. This is a critical difference from HTML, with which many biologists are familiar and which is not case sensitive.

The second line links the XML document to the schema. A schema can be thought of as a set of rules that establishes the format and structure of the document. It describes precisely which elements and attributes are permitted within a given XML document, along with the relationships between the elements. We can think of a schema as a legal contract between the person who created the markup language and the person who will create documents using that language. Each document that conforms to the schema is referred to as an instance of the schema, and within the rules of the schema a wide variation in instance documents is possible. Not all instance documents will contain the same information.
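The practical effect of case sensitivity can be seen with a short Python sketch. The two fragments below are made up for illustration (the element names are borrowed from the GBSeq format, and the value 609 is a placeholder); the helper name `parses_cleanly` is also our own.

```python
import xml.etree.ElementTree as ET


def parses_cleanly(xml_text):
    """Return True if the XML parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Start and end tags agree exactly, including case.
matching = "<GBSeq><GBSeq_length>609</GBSeq_length></GBSeq>"

# The end tag differs from the start tag only in case, which XML treats
# as a different name, so the element is never properly closed.
mismatched = "<GBSeq><GBSeq_length>609</gbseq_length></GBSeq>"

print(parses_cleanly(matching))    # prints True
print(parses_cleanly(mismatched))  # prints False
```

An HTML browser would typically accept the second fragment; an XML parser rejects it outright.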
When it comes to creating schemas, two different approaches can be taken: a Document Type Definition (DTD) or an XML Schema document.
The goal is to make an XML document a valid document. Document validity is important because it guarantees that the data within the document conforms to a standard set of guidelines, as laid out in a schema. Our example in Listing 1 uses a DTD. In a valid XML document, all elements and attributes match the logical structure and data types defined in the DTD. Not all XML documents have to be valid. To validate an XML document, the parser must read the DTD, check the document against it, and report any violations to the XML application. Because this takes time, some XML applications that use XML to code small chunks of data skip the thorough validation made possible by a schema or DTD.

Even if an XML document is not valid, it must be well-formed. A well-formed XML document conforms to the syntax rules of the World Wide Web Consortium (W3C) XML 1.0 specification. These rules include matching every start tag with an end tag and giving a value to every attribute used. A well-formed XML document contains one or more elements and has a single root element. In Listing 1 the root element is "GBSet"; all other elements are properly nested under it, and each of the parsed entities is referenced correctly. An XML document is thus a structure of elements, attributes, and text, all nested within the root element. Well-formed XML documents do not require a DTD, but valid XML documents do.

XSLT Implementation

Using XSLT to perform transformations on XML is easier than writing a custom application in a procedural language, because the design of XSLT recognizes that these XML documents are all very similar. It should be possible to do the processing in the declarative XSLT language rather than by writing a program from scratch in Java or some other programming language. The required transformation can be expressed as a set of rules.
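The idea of a transformation expressed as a set of rules can be imitated in a few lines of Python. This is only an analogy to the XSLT approach, not XSLT itself: the sample record, its placeholder values, and the names `RULES` and `apply_rules` are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record. The element names follow the GBSeq format
# discussed in the article; the values are placeholders, not real data.
SAMPLE = """
<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>EXAMPLE01</GBSeq_primary-accession>
    <GBSeq_moltype>mRNA</GBSeq_moltype>
  </GBSeq>
</GBSet>
"""

# Each rule pairs a pattern (an element name) with the output to produce
# whenever that pattern occurs in the input. The order of the rules in
# this table does not affect the result.
RULES = {
    "GBSeq_moltype": lambda el: f"molecular type: {el.text}",
    "GBSeq_primary-accession": lambda el: f"accession number: {el.text}",
}


def apply_rules(xml_text, rules):
    """Walk the document and emit one line for every element a rule matches."""
    root = ET.fromstring(xml_text.strip())
    return [rules[el.tag](el) for el in root.iter() if el.tag in rules]


for line in apply_rules(SAMPLE, RULES):
    print(line)
```

The output follows the order of the elements in the document, not the order in which the rules were written, which is the sense in which such a description is declarative.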
The rules are defined by the output we want to generate from particular patterns that occur in the input. The language is declarative because one describes the required transformation as a set of rules rather than as a sequence of procedures executed in a given order. The process is simpler because the style sheet only has to describe the required transformation, yet XSLT is a complete programming language in itself.

Typically, genetic databases produce data files containing millions of records. Parsing this data with a program written in a procedural language would be faster, because processing large XML documents with XSLT requires a lot of processing power. The major disadvantage of the custom parsing approach is that when data formats change, as they often do, the parser stops working and the procedural code must be heavily modified or completely rewritten. Take, for instance, the recent change in which Unigene started appending the version number to its NCBI GenBank Accession Number. Using XML technology, the programmer adds a tag to extract the GenBank Accession Number. The advantage of XML over raw flat files is that no parsing code has to be rewritten.

By using XSLT on the downloaded data of rice starch synthase genes, which were saved as XML files, we can construct a table that includes characteristics such as "accession number," "molecular type," "protein number," etc. XSLT can extract any piece of data from the XML file, process it using built-in functions, and format it on the page in any way desired. Listing 2 shows many of these features. Listing 2 has been truncated for space reasons. Some of the XSLT code used is adapted from XSLT Cookbook by Mangano, and further features of the code are described there.

An XML document may be visualized as a tree structure of elements, attributes, text, comments, etc. XSLT is a mapping from the source tree into the result tree.
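The tree view of a document can be made concrete with a small Python sketch that prints one indented line per element. The sample record is hypothetical (GBSeq element names, placeholder values), and the function name `outline` is our own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the values are placeholders, not real data.
SAMPLE = """
<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>EXAMPLE01</GBSeq_primary-accession>
    <GBSeq_moltype>mRNA</GBSeq_moltype>
  </GBSeq>
</GBSet>
"""


def outline(element, depth=0):
    """Return indented lines, one per element, showing how the elements nest."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(ET.fromstring(SAMPLE.strip()))))
# prints:
# GBSet
#   GBSeq
#     GBSeq_primary-accession
#     GBSeq_moltype
```

Each line of the outline is one node of the source tree; a transformation decides what, if anything, each of those nodes contributes to the result.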
Each node that is to be mapped (each branch and leaf of the tree) has a rule associated with it, called a "template," that describes how the node is to be transformed. At the top of the tree is an imaginary node called the "document root," denoted by "/". It represents the document as a whole and is the parent of the topmost element.

A node is addressed using the path language XPath. For example, "/GBSet/GBSeq/GBSeq_primary-accession" refers to the element "GBSeq_primary-accession," which is a child element of "GBSeq," which is itself a child element of "GBSet." This in turn is a child of the document root and is the topmost element, or "root element." In our code the line "document($file)/GBSet/GBSeq/GBSeq_primary-accession" is shortened to "document($file)//GBSeq_primary-accession." The two slashes, "//," select descendants at any depth, so the shortened expression yields the set of nodes consisting of every GBSeq_primary-accession element in the tree.

Any of the elements and attributes in the downloaded XML files can be extracted by XSLT code and formatted into a text or HTML file. But XSLT does more than just extract data from the XML source: it processes data into information. For example, using the built-in string function "string-length(...)", the number of amino acids (entry Amino_Acid, Table 1) can be found. Similarly, the molecular weight (in daltons) can be calculated using the XSLT arithmetic features.

Results

Conclusion

For our work here, the XML output files were obtained directly by downloading them to the hard drive. XML processing could be combined with Java programming to automate the process. For example, a Java program could have been written to work in conjunction with the gene output from the NCBI Web site to automatically generate the XML output files and process them into the text document. This would have required procedural programming skills; what we have done here uses only the built-in declarative features of the XSLT tool.
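As a cross-check on the XPath addressing and the string-length calculation described in the XSLT Implementation section, the following Python sketch performs the same steps with the standard library. Several things here are assumptions and not the authors' method: the sample records and their values are placeholders, the nesting of GBQualifier directly under GBSeq is simplified, and the molecular weight uses a rough average of 110 daltons per residue because the article does not state the formula used in Listing 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical records with placeholder values and simplified nesting.
SAMPLE = """
<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>EXAMPLE01</GBSeq_primary-accession>
    <GBQualifier>
      <GBQualifier_name>translation</GBQualifier_name>
      <GBQualifier_value>MSALTTSQ</GBQualifier_value>
    </GBQualifier>
  </GBSeq>
  <GBSeq>
    <GBSeq_primary-accession>EXAMPLE02</GBSeq_primary-accession>
    <GBQualifier>
      <GBQualifier_name>translation</GBQualifier_name>
      <GBQualifier_value>MKV</GBQualifier_value>
    </GBQualifier>
  </GBSeq>
</GBSet>
"""

# Assumption: rough average mass of one amino acid residue, in daltons.
AVERAGE_RESIDUE_MASS = 110.0


def accessions(root, path):
    """Return the text of every element selected by the given path."""
    return [el.text for el in root.findall(path)]


def summarize(root):
    """Return (accession, amino acid count, approximate weight) per record."""
    rows = []
    for seq in root.iter("GBSeq"):
        accession = seq.findtext(".//GBSeq_primary-accession")
        for qualifier in seq.iter("GBQualifier"):
            if qualifier.findtext("GBQualifier_name") == "translation":
                protein = qualifier.findtext("GBQualifier_value")
                # len() plays the role of XPath's string-length().
                count = len(protein)
                rows.append((accession, count, count * AVERAGE_RESIDUE_MASS))
    return rows


# root is the GBSet element, so paths are written relative to it.
root = ET.fromstring(SAMPLE.strip())

# The explicit path and the "//" shorthand select the same elements.
print(accessions(root, "GBSeq/GBSeq_primary-accession"))
print(accessions(root, ".//GBSeq_primary-accession"))

for row in summarize(root):
    print(row)
```

With real GBSeqXML files the qualifiers sit deeper in the tree, but the descendant search finds them without the path having to spell out every intermediate element.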