Abstract
The purpose of this project is to parse an HTML file and present it in XML format. Two operations are performed. First, an executable file is called to clean up the HTML file according to the XML specification, and a resulting XHTML file is generated. Second, an XML parser is called to process the XHTML file and produce a text presentation of all XML elements.
I. XML vs. HTML
As the Internet continues to evolve, many of the original technologies are replaced by newer and more powerful tools. Hyper Text Markup Language (HTML) was the prevailing markup language for creating web pages in the late 1990s because HTML documents are simple text documents with markup elements embedded in the text, which means they are completely portable among platforms and programs.
HTML has its limitations. It has a limited set of elements and cannot be extended, i.e. you cannot add new syntax to it. A large portion of its elements define formatting rather than describe the properties of the data. On the other hand, today's companies have added many dynamic functions to their web sites to encourage user interaction. It is desirable for a web page to interpret data transferred from elsewhere, which HTML cannot support. Searching HTML is also ineffective because HTML does not record what the data means [1].
XML stands for Extensible Markup Language. It gained popularity quickly after its debut. In addition to portability, XML has many other advanced features. The most exciting one is that it separates data from its presentation. It allows users to create their own tags, a start tag and an end tag in a pair such as <myTag> and </myTag>, to describe the data. This brings the overhead of creating a style sheet that tells the web browser how to present the user-defined tags. However, this overhead is minuscule compared with the advantages XML brings.
With XML, a data structure can be defined however you like, and data can be presented with as many style sheets as you want. This means that data exchange between heterogeneous databases over the network becomes as easy as streaming strings between files. XML is not only for the web but also for any kind of data, and it works with many types of programs and devices. When a web browser interprets an XML file, it is able to organize data according to its properties. For example, two presentations of data on two platforms are as follows:
Format I
<student>
<name>..</name>
<dateOfBirth>..</dateOfBirth>
<grade_1>..</grade_1>
</student>
Format II
<student>
<university>..</university>
<grade_1>..</grade_1>
<grade_2>..</grade_2>
<grade_3>..</grade_3>
</student>
It is very convenient to merge data in different formats. On one of the platforms, say Machine I, whose file has Format I, you retrieve the file in Format II from Machine II and use an XML parser to generate Format III during the first pass.
Format III
<student>
<name>..</name>
<dateOfBirth>..</dateOfBirth>
<university>..</university>
<grade_1>..</grade_1>
<grade_2>..</grade_2>
<grade_3>..</grade_3>
</student>
In the second pass, the parser generates the final XML documents. You can also design the XML parser to complete the structure generation and the data presentation in one pass. The interesting thing is that the XML parser does all this for us, and a parser is nothing more than a program that processes string patterns. This shows how convenient XML is for data exchange across networks.
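The merge of Format I and Format II can be sketched with the DOM parser that ships with Java. This is only an illustration under stated assumptions, not the project's code: the class name StudentMergeSketch and its methods are hypothetical, and the rule "append every child element whose tag name is not yet present" is one simple choice among many.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: merges two <student> records by tag name.
final class StudentMergeSketch {

    // Parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Lists the tag names of the direct child elements, in document order.
    static List<String> childNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    // Appends to the base record every child element of the extra record
    // whose tag name the base record does not already contain.
    static Document merge(String baseXml, String extraXml) {
        Document base = parse(baseXml);
        Element baseRoot = base.getDocumentElement();
        Element extraRoot = parse(extraXml).getDocumentElement();
        for (Node n = extraRoot.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && !childNames(baseRoot).contains(n.getNodeName())) {
                // importNode copies the node into the base document.
                baseRoot.appendChild(base.importNode(n, true));
            }
        }
        return base;
    }

    public static void main(String[] args) {
        String formatI = "<student><name>..</name><dateOfBirth>..</dateOfBirth>"
                + "<grade_1>..</grade_1></student>";
        String formatII = "<student><university>..</university><grade_1>..</grade_1>"
                + "<grade_2>..</grade_2><grade_3>..</grade_3></student>";
        System.out.println(childNames(merge(formatI, formatII).getDocumentElement()));
    }
}
```

The merged record contains the same six elements as Format III. Because the sketch simply appends new elements, university ends up after grade_1 rather than before it; reproducing the exact order of Format III would need an explicit ordering rule.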
In order to do all this, XML imposes rules that are strict by HTML standards. A document that uses HTML elements while following XML syntax rules is called XHTML [1]. HTML does not require every start tag to have a matching end tag, but XML does. For instance, a lone <p> is acceptable in HTML, while XML requires <p>….</p>; otherwise an XML parser rejects the document as not well-formed. HTML is not case-sensitive, but XML is, and XHTML requires all element and attribute names to be in lowercase. All attribute values have to be quoted. The <head>..</head> element has to contain a <title> element. XML supports further facilities such as the Document Object Model (DOM) and document type definitions (DTD). These go beyond what this project covers and are left for future work. As the Internet develops, there will be growing demand for converting traditional HTML to XML, and XML may eventually replace HTML.
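The well-formedness rules above can be checked with the SAX parser bundled with Java. The following is a minimal sketch; the class name WellFormedCheck and the sample strings are illustrative and not taken from the project.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a fragment is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser raises a fatal error for any violation of the rules.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>text</p>"));        // matching tags
        System.out.println(isWellFormed("<p>text"));            // missing end tag
        System.out.println(isWellFormed("<P>text</p>"));        // case mismatch
        System.out.println(isWellFormed("<a href=refs>x</a>")); // unquoted attribute value
    }
}
```

Only the first fragment is accepted; the other three each break one of the rules listed above.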
II. Event-Driven Program Design
Java (JDK 1.4) is used to develop the graphical user interface and event-driven processes. (See appendix for flow chart.) Java has been widely used in the telecommunications area because of its platform independence and security management. So far, many companies, such as IBM and Motorola, have developed their own Java packages to support applications in the next generation of products.
The user interface is shown as follows. The program performs four operations, and the four buttons are named after them. For best results, press the buttons in order from left to right.
Display
The user can specify a file name in the text field at the top. When the Return key or Display is pressed, the file content is loaded into the text area below the text field.
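The Display behaviour maps naturally onto Swing's action events, since a text field fires an action event when Return is pressed. The sketch below shares one listener between the text field and the button. The class and method names are hypothetical, and it uses java.nio and a lambda, which are newer than the JDK 1.4 used in the project.

```java
import java.awt.event.ActionListener;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import javax.swing.JButton;
import javax.swing.JTextArea;
import javax.swing.JTextField;

// Hypothetical wiring for the Display operation.
final class DisplayActionSketch {

    // Registers one listener on both the text field (fired by Return)
    // and the Display button (fired by a click).
    static void wireDisplay(JTextField fileField, JButton displayButton, JTextArea area) {
        ActionListener load = event -> {
            try {
                byte[] bytes = Files.readAllBytes(Paths.get(fileField.getText().trim()));
                // UTF-8 is an assumption; the project does not state an encoding.
                area.setText(new String(bytes, StandardCharsets.UTF_8));
            } catch (IOException | InvalidPathException ex) {
                area.setText("Cannot read file: " + ex.getMessage());
            }
        };
        fileField.addActionListener(load);
        displayButton.addActionListener(load);
    }
}
```

Sharing the listener keeps the two triggers consistent: whatever the button does, pressing Return does as well.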
Tidy
When Tidy is pressed, the executable Tidy.exe is called through the Java runtime environment. It cleans up an HTML file by converting tags to lowercase, adding missing start and end tags, and correcting misspelled tags. Its output is piped to a file called temp_tidy.html, which is ready to be converted to an XML file. At the same time, the content of the intermediate file temp_tidy.html is shown in the text area.

Figure1. Scenario snapshot when Display button is pressed
Tidy.exe needs an input file on disk, even though Java can capture its output as a stream. Because of this limitation, we have to perform disk I/O several times. This is the price of calling Tidy.exe directly instead of writing our own cleanup software, which would demand a great deal of work.
The following table shows some of the typical corrections Tidy.exe makes [2]:
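Calling an external executable from Java and capturing its output in a file can be sketched as follows. This is not the project's code: it uses ProcessBuilder, which arrived after JDK 1.4 (where Runtime.exec would be used), and it assumes that Tidy is started with the -asxml option and writes the cleaned markup to standard output. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the external Tidy executable.
final class TidyRunnerSketch {

    // Builds the command line; "-asxml" (an assumption about the project's
    // settings) asks Tidy to emit well-formed XHTML.
    static List<String> tidyCommand(String tidyExe, String inputHtml) {
        return Arrays.asList(tidyExe, "-asxml", inputHtml);
    }

    // Runs Tidy on a file that must exist on disk and redirects the cleaned
    // markup from standard output into outputFile. Returns the exit code.
    static int runTidy(String tidyExe, String inputHtml, String outputFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(tidyCommand(tidyExe, inputHtml));
        builder.redirectOutput(new File(outputFile));
        // Let Tidy's diagnostics appear on this program's error stream.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // File names follow the text; the executable must be on the PATH.
        int exitCode = runTidy("Tidy.exe", "input.html", "temp_tidy.html");
        System.out.println("Tidy exit code: " + exitCode);
    }
}
```

The input still has to be a file on disk, which is the limitation described above; only the output side is handled as a stream.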
| Before | After |
| --- | --- |
| <h1>heading <h2>subheading</h3> | <h1>heading</h1> <h2>subheading</h2> |
| <p>here is a para <b>bold <i>bold italic</b> bold?</i> normal? | <p>here is a para <b>bold <i>bold italic</i> bold?</b> normal? |
| <h1><i>italic heading</h1> | <h1><i>italic heading</i></h1> |
| <i><h1>heading</h1></i> <p>new paragraph <b>bold text | <h1><i>heading</i></h1> <p>new paragraph <b>bold text</b> |
| <a href="#refs">References<a> | <a href="#refs">References</a> |
| <body> <li>1st list item <li>2nd list item | <body> <ul> <li>1st list item</li> <li>2nd list item</li> </ul> |
Conversion
When Convert is pressed, the file specified in the text field is converted to an XML file. Its name is shown in the text field and content is shown in the text area. Note that XML tags are interpreted as “starting element: html” and “ending element: html” instead of “<html>” and “</html>”. This is because a parser is called to process the clean XML file by parsing characters and listing all resulting elements and their attributes recursively.
The parser works as follows. It reads one character at a time and processes it according to certain rules, much like a lexical analyzer. For example, the first character has to be “<”, which is pushed onto the stack. The second character can be “?”, “!” or a letter, and a mode may be set according to this character. Typical modes include COMMENT, DOCTYPE, ATTRIBUTE, ENTITY and CDATA. Note that after parsing “<” and “!” we still cannot tell whether we are in COMMENT mode or DOCTYPE mode, so the second character is pushed onto the stack and the third character is read to resolve this. Reading continues until an unambiguous mode is determined. The parser keeps reading characters and pushing them onto the stack. Once a space is encountered, the parser knows that the name has ended, and the tag name is built by popping the characters off the stack.
Note that we are now inside an open tag. When the space is encountered, the mode changes to ATTRIBUTE. In a similar way, an individual attribute name is built by parsing characters until an “=” is encountered, and then the mode switches to ATTRIBUTE_VALUE. An attribute value is built by parsing a double quote, the value characters, and another double quote. When we read “>”, we know that the start tag has ended, and all characters read before the next “<” are treated as the text content of the tag. Remember that text content is plain text and contains no HTML-like presentational tags such as <font> or <b>. So when the next “<” is encountered, we can be fairly sure that we are beginning to process the end tag. After conversion of the entire input file is complete, the text field is set to the name of the XML file.
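The mode-switching idea can be illustrated with a much reduced scanner. This is a sketch under assumptions, not the parser used in the project: it handles only start tags, double-quoted attributes, text and end tags, it omits the COMMENT, DOCTYPE, ENTITY and CDATA modes as well as empty-element tags such as <br />, and it collects characters in a StringBuilder instead of a stack. The class name, the mode names other than ATTRIBUTE and ATTRIBUTE_VALUE, and the "attribute:" and "text:" event labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced character-by-character scanner.
final class TinyXmlScanner {

    private enum Mode { TEXT, TAG_NAME, END_TAG, ATTRIBUTE, ATTRIBUTE_VALUE }

    static List<String> scan(String xml) {
        List<String> events = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        String attrName = "";
        boolean inQuote = false;
        Mode mode = Mode.TEXT;
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            switch (mode) {
                case TEXT:
                    if (c == '<') {
                        String text = buf.toString().trim();
                        if (!text.isEmpty()) events.add("text: " + text);
                        buf.setLength(0);
                        // Look one character ahead to tell a start tag from an end tag.
                        boolean closing = i + 1 < xml.length() && xml.charAt(i + 1) == '/';
                        if (closing) i++;
                        mode = closing ? Mode.END_TAG : Mode.TAG_NAME;
                    } else {
                        buf.append(c);
                    }
                    break;
                case TAG_NAME:
                    if (c == ' ' || c == '>') {
                        events.add("starting element: " + buf);
                        buf.setLength(0);
                        mode = (c == ' ') ? Mode.ATTRIBUTE : Mode.TEXT;
                    } else {
                        buf.append(c);
                    }
                    break;
                case END_TAG:
                    if (c == '>') {
                        events.add("ending element: " + buf);
                        buf.setLength(0);
                        mode = Mode.TEXT;
                    } else {
                        buf.append(c);
                    }
                    break;
                case ATTRIBUTE:
                    if (c == '=') {
                        attrName = buf.toString().trim();
                        buf.setLength(0);
                        mode = Mode.ATTRIBUTE_VALUE;
                    } else if (c == '>') {
                        buf.setLength(0);
                        mode = Mode.TEXT;
                    } else {
                        buf.append(c);
                    }
                    break;
                case ATTRIBUTE_VALUE:
                    if (c == '"' && !inQuote) {
                        inQuote = true;               // opening quote
                    } else if (c == '"') {
                        events.add("attribute: " + attrName + " = " + buf);
                        buf.setLength(0);
                        inQuote = false;              // closing quote
                        mode = Mode.ATTRIBUTE;
                    } else if (inQuote) {
                        buf.append(c);
                    }
                    break;
            }
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : scan("<html><a href=\"#refs\">References</a></html>")) {
            System.out.println(event);
        }
    }
}
```

For the sample input, the scanner reports the html and a start tags, the href attribute, the text "References", and then the two end tags in reverse order of opening.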
Show File in Browser
Whenever ShowIE is pressed, the Java runtime environment calls IExplore.exe, and the file specified in the text field is opened in IE. IE makes no distinction between an HTML file and an XHTML file, but an XML file is displayed correctly only if the original HTML file has been tidied. If an arbitrary file is specified in the text field, IE reports an error. See the following images for these scenarios.

Figure2. Scenario snapshot when Tidy button is pressed

Figure3. Scenario snapshot when Convert button is pressed
It is interesting to notice how IE interprets an XML file. The style sheet instructs the browser to use a specific typeface and color for the text. The nesting hierarchy reflects the function of each piece of data within the document and its relationship to other pieces of data. Child elements are displayed with deeper indentation. All tag names appear as brown words inside blue angle brackets. Attribute names also appear as brown words, with the attribute value in a pair of quotes. Attribute values and text content appear as black characters.
The red sign before a tag tells us that the element has children. The element is collapsed if the sign is a plus and expanded in detail if it is a minus. Note that if there is no text content between a pair of tags, such as <title></title>, IE displays it as <title />. If the text content is not empty, IE displays the tags as a regular pair.

Figure4. Scenario snapshot when Show IE button is pressed

Figure5. Scenario snapshot if unclean XML is loaded and shown
Interface
Java Swing components are used to construct the user interface. Swing components are lightweight. “A heavyweight component is one that is associated with its own native screen resource (commonly known as a peer). A lightweight component is one that borrows the screen resource of an ancestor (which means it has no native resource of its own -- so it's lighter).”[5]. In addition, Swing components behave more consistently across platforms. Swing also gives cleaner look-and-feel integration than traditional AWT components.
The text field for file name input is placed at the top of the frame. The command buttons are placed at the bottom. The remaining space is left for the text area, which contains the content of the file. The background is a grayed picture of a cell phone. The image is loaded into a buffer when the text area is initialized and is then associated with the text area.
The text area is placed inside a scroll pane. This is needed because the Swing text area does not provide scroll bars of its own. To view large content without one, the developer would have to implement the methods of the scroll bar interface and specify what each method should do. Swing retains the intention of this design: a scroll pane component called JScrollPane contains two scroll bars and is designed to contain other components. Therefore, once the text area is put inside the JScrollPane, it is associated with the two scroll bars, and large content can be displayed neatly.
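The layout described in this section can be sketched with a BorderLayout. This is an assumed reconstruction for illustration only: the class name, the window title, the choice of layout managers and the text area size are not taken from the project, and the background image is omitted.

```java
import java.awt.BorderLayout;
import java.awt.FlowLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.JTextField;
import javax.swing.SwingUtilities;

// Hypothetical reconstruction of the window layout.
final class LayoutSketch {

    // Text field at the top, scrollable text area in the middle,
    // command buttons at the bottom.
    static JPanel buildContent(JTextArea area) {
        JPanel content = new JPanel(new BorderLayout());
        content.add(new JTextField(), BorderLayout.NORTH);
        // Wrapping the text area in a JScrollPane supplies both scroll bars.
        content.add(new JScrollPane(area), BorderLayout.CENTER);
        JPanel buttons = new JPanel(new FlowLayout());
        for (String label : new String[] {"Display", "Tidy", "Convert", "ShowIE"}) {
            buttons.add(new JButton(label));
        }
        content.add(buttons, BorderLayout.SOUTH);
        return content;
    }

    public static void main(String[] args) {
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("HTML to XML");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setContentPane(buildContent(new JTextArea(20, 60)));
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```

The scroll bars appear only when the content outgrows the visible area, which is the default JScrollPane policy.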
III. Test
The general procedure is as follows:
• Specify a file name in the text field. Any type of file can be loaded and displayed, but it has to be an HTML file to be tidied.
• Display the file.
• Tidy the file.
• Convert the file to an XML file. The name of the XML file is set in the text field.
• Showing the file in IE is optional, but the result is pleasant to read there.
IV. Future Work
Because the Tidy.exe module reads its input from disk, at this point we cannot implement a client/server architecture in which everything is done over the air. If a Java version of Tidy becomes available, we can set up a connection with the target website. A complete pipe will then be formed, and our program will carry less of a burden because the required buffer can be very small. In particular, for applications in wireless communication we cannot tolerate big files on the device's storage. The pipe is shown below:

On the other hand, J2ME, the micro edition of Java, is one of the most popular tools for developing wireless applications. The functions in this program are compatible with J2ME, but the package javax.microedition.* cannot be imported and compiled on the platform in the computer lab of the College of Engineering. So for now the program stays in J2SE, and hopefully it can be migrated to a cell phone that supports J2ME in the near future.
References
[1] Hughes, Cheryl M., The Web Wizard’s Guide to XML, QA76.76.H94 H84, 2003.
[2] Raggett, Dave, Clean up your web pages with HTML Tidy, http://www.w3.org/People/Raggett/tidy/.
[3] Brandt, Steven R., Create a quick-and-dirty XML parser, http://www.javaworld.com/javaworld/javatips/jw-javatip128.html
[4] Jepsen, Thomas C., Java in Telecommunications: Solutions for Next Generation Networks, QA76.73.J38 J368, 2001.
[5] Sun, http://java.sun.com
Appendices
Flow Chart
User Interface Commands
Source Code
Test File