Abstract
Traditionally, parsers for XML and HTML have been littered with semantic hacks in the parsing code to deal with the oddities of these languages: HTML accepts unbalanced tags and tags that do not match in case, whereas XML is less forgiving. To date, detecting the well-formedness of XML documents has required semantic analysis outside of the grammar specification. We present a grammar-only (HT|X)ML parser which, upon detecting that it is parsing XML, modifies itself dynamically to ensure that the document conforms to XML's stricter rules. Our grammar detects unbalanced tags in XML, as well as mismatched case in otherwise balanced tags, and requires attribute values in XML document tags to be quoted, while still accepting the looser attribute syntax in HTML documents. On a 733 MHz Windows 2000 machine, our parser performed a well-formedness-detecting parse of XML documents such as the KJV Old Testament at a rate of 92 Kb/second, Austen's Pride and Prejudice at 108 Kb/second, and Wolfgang May's Mondial 3.0 database at 149 Kb/second.
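To make the three checks named in the abstract concrete (balanced tags, case-sensitive tag matching, and quoted attribute values in XML only), here is a minimal Python sketch. It is deliberately a conventional stack-based check done in code, that is, the kind of out-of-grammar analysis the abstract contrasts itself with; it is not the grammar-only technique of the paper. The function names, the regular expressions, and the rule that a document counts as XML when it begins with an `<?xml` declaration are all assumptions made for illustration.

```python
import re

# Illustrative sketch only: a conventional stack-based check, NOT the
# grammar-only parser described in the abstract. All names are hypothetical.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)([^<>]*?)(/?)>")
ATTR = re.compile(r"""([A-Za-z_:][\w:.-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def check_tags(text, xml_mode):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    stack = []  # names of currently open elements
    for match in TAG.finditer(text):
        closing, name, attrs, self_closing = match.groups()
        # XML tag names are case-sensitive; HTML tag names are not.
        key = name if xml_mode else name.lower()
        if closing:
            if stack and stack[-1] == key:
                stack.pop()
            elif xml_mode:
                problems.append(f"mismatched closing tag </{name}>")
            elif key in stack:
                # HTML mode: tolerate unclosed inner tags such as <br>.
                while stack.pop() != key:
                    pass
            continue
        if xml_mode:
            for attr_name, value in ATTR.findall(attrs):
                if value[0] not in "\"'":
                    problems.append(
                        f"unquoted attribute value: {attr_name}={value}")
        if not self_closing:
            stack.append(key)
    if xml_mode and stack:
        problems.append(f"unclosed tags: {stack}")
    return problems


def check_document(text):
    # Assumption: treat the input as XML only if it starts with an XML
    # declaration. The abstract does not say how XML is detected.
    xml_mode = text.lstrip().startswith("<?xml")
    return check_tags(text, xml_mode)
```

Under these assumptions, `check_document('<?xml version="1.0"?><Item></item>')` reports a mismatch because of the differing case, while `check_document('<P class=x>text<br></p>')` returns an empty list because the input is treated as HTML. The sketch ignores comments, CDATA sections, and `>` characters inside attribute values.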
- {Bosak} XML documents marked up by Jon Bosak, downloaded on 15 May 2002 from the URL: http://www.ibiblio.org/bosak/. Documents included: KJV Old Testament, KJV New Testament, The Quran, The Book of Mormon, and Shakespeare's Hamlet.
- {Boullier} Pierre Boullier, "Dynamic Grammars and Semantic Analysis," INRIA Research Report 2322, August 1994.
- {Burshteyn} Boris Burshteyn, "USSA - Universal Syntax and Semantics Analyzer," ACM SIGPLAN Notices, 27 (1), January 1992, pp. 42--60.
- {Bray et al.} Tim Bray, Jean Paoli, & C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, W3C Recommendation REC-xml-19980210, World-Wide Web Consortium, February 10, 1998. URL: http://www.w3.org/TR/1998/REC-xml-19980210.
- {Christiansen} Henning Christiansen, "A Survey of Adaptable Grammars," ACM SIGPLAN Notices, 25 (11), November 1990, pp. 35--44.
- {Gutenberg} XML documents marked up by a variety of individuals, downloaded on 16 May 2002 from the URL: http://gutenberg.hwg.org/checkdoc1.html. Documents included: Great Expectations (Dickens), Pride and Prejudice (Austen), and Moby Dick (Melville).
- {Jackson 1999} Quinn Tyler Jackson, "PAISLEI: Towards Grammar-Only Parsing of C++," research results posted at the URL: http://qtj.n3.net/writing_editing/articles/PAISLEIGrammarOnly.html
- {Jackson 2000} Quinn Tyler Jackson, "Disambiguation as a Quantifiable Computational Process," Perfection, No. 3, Villeurbanne, France, March 2000.
- {May} Wolfgang May, "Information Extraction and Integration with Florid: The Mondial Case Study," Universität Freiburg, Institut für Informatik, No. 131, downloaded on 16 May 2002 from the URL: http://www.informatik.uni-freiburg.de/~may/Mondial/.
- {Shutt} John N. Shutt, Recursive Adaptable Grammars, Master's Thesis, Worcester Polytechnic Institute, August 1993.
- {Shutt & Rubinstein} John Shutt & Roy Rubinstein, "Self-Modifying Finite Automata," in B. Pehrson and I. Simon, editors, Technology and Foundations: Information Processing '94 Vol. I: Proc. 13th IFIP World Computer Congress, Amsterdam: North-Holland, 1994, pp. 493--498.
- {STG} STG's XML 1.0 Reference Validator, Scholarly Technology Group, Brown University, Rhode Island, at the URL: http://www.stg.brown.edu/pub/xmlvalid/Xml.tr98.2.shtml as of 14 May 2002.
Index Terms
- Efficient formalism-only parsing of XML/HTML using the §-calculus