Efficient formalism-only parsing of XML/HTML using the §-calculus

Published: 01 February 2003

Abstract

Traditionally, parsers for XML and HTML have been littered with semantic hacks in the parsing code to deal with the oddities of these languages: HTML accepts unbalanced tags and tags that do not match in case, whereas XML is less forgiving. To date, detecting the well-formedness of XML documents has required semantic analysis outside of the grammar specification. We present a grammar-only (HT|X)ML parser which, upon detecting that it is parsing XML, modifies itself dynamically in order to ensure that the document conforms to XML's stricter rules. Our grammar detects unbalanced tags in XML, as well as mismatched case in otherwise balanced tags. At the same time, it requires attribute values in XML document tags to be in quotes, while accepting the looser attribute syntax in HTML documents. On a 733 MHz Windows 2000 machine, our parser performed a well-formedness-detecting parse on XML documents such as the KJV Old Testament at a rate of 92 Kb/second, Austen's Pride and Prejudice at a rate of 108 Kb/second, and Wolfgang May's Mondial 3.0 database at a rate of 149 Kb/second.
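
For contrast, the three checks named in the abstract (balanced tags, matching case, quoted attribute values) are conventionally done with hand-written logic outside the grammar, which is the style of code the paper aims to replace. The sketch below is not the §-calculus grammar and is not taken from the paper; the function name `check_tags`, its `xml` flag, and both regular expressions are assumptions made for illustration. It ignores comments, CDATA sections, and processing instructions.

```python
import re

# Hypothetical tokenizer (an assumption, not from the paper): matches a start,
# end, or self-closing tag and captures its name and raw attribute text.
TAG = re.compile(r"""<(/?)([A-Za-z][\w:.-]*)((?:"[^"]*"|'[^']*'|[^>"'])*?)(/?)>""")
# One attribute whose value is enclosed in single or double quotes.
QUOTED_ATTR = re.compile(r"""\s*[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*')""")


def check_tags(text, xml=True):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    stack = []  # names of elements that are currently open
    for match in TAG.finditer(text):
        closing, name, attrs, self_closing = match.groups()
        if not xml:
            # HTML mode: unbalanced tags, case differences, and unquoted
            # attribute values are all tolerated in this sketch.
            continue
        if not closing and QUOTED_ATTR.sub("", attrs).strip():
            # Something other than quoted attributes remains in the tag.
            problems.append(f"unquoted attribute value in <{name}>")
        if self_closing:
            continue
        if not closing:
            stack.append(name)
        elif not stack:
            problems.append(f"unexpected </{name}>")
        else:
            open_name = stack.pop()
            if open_name == name:
                continue
            if open_name.lower() == name.lower():
                problems.append(f"case mismatch: <{open_name}> closed by </{name}>")
            else:
                problems.append(f"unbalanced: <{open_name}> closed by </{name}>")
    problems.extend(f"unclosed <{n}>" for n in reversed(stack))
    return problems
```

Calling `check_tags(document_text)` applies the strict XML rules, while `check_tags(document_text, xml=False)` accepts the looser HTML conventions. The explicit stack and the mode flag are exactly the kind of out-of-grammar state that a self-modifying grammar is meant to make unnecessary.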

References

  1. {Bosak} XML documents marked up by John Bosak, downloaded on 15 May 2002 from the URL: http://www.ibiblio.org/bosak/. Documents included: KJV Old Testament, KJV New Testament, The Quran, The Book of Mormon, and Shakespeare's Hamlet.
  2. {Boullier} Pierre Boullier, "Dynamic Grammars and Semantic Analysis," INRIA Research Report 2322, August 1994.
  3. {Burshteyn} Boris Burshteyn, "USSA - Universal Syntax and Semantics Analyzer," ACM SIGPLAN Notices, 27 (1), January 1992, pp. 42--60.
  4. {Bray et al.} Tim Bray, Jean Paoli, & C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, W3C Recommendation REC-xml-19980210, World-Wide Web Consortium, February 10, 1998. URL: http://www.w3.org/TR/1998/REC-xml-19980210.
  5. {Christiansen} Henning Christiansen, "A Survey of Adaptable Grammars," ACM SIGPLAN Notices, 25 (11), November 1990, pp. 35--44.
  6. {Gutenberg} XML documents marked up by a variety of individuals, downloaded on 16 May 2002 from the URL: http://gutenberg.hwg.org/checkdoc1.html. Documents included: Great Expectations (Dickens), Pride and Prejudice (Austen), and Moby Dick (Melville).
  7. {Jackson 1999} Quinn Tyler Jackson, "PAISLEI: Towards Grammar-Only Parsing of C++," research results posted at the URL: http://qtj.n3.net/writing_editing/articles/PAISLEIGrammarOnly.html
  8. {Jackson 2000} Quinn Tyler Jackson, "Disambiguation as a Quantifiable Computational Process," Perfection, No. 3, Villeurbanne, France, March 2000.
  9. {May} Wolfgang May, "Information Extraction and Integration with Florid: The Mondial Case Study," Universität Freiburg, Institut für Informatik, No. 131, downloaded on 16 May 2002, from the URL: http://www.informatik.uni-freiburg.de/~may/Mondial/.
  10. {Shutt} John N. Shutt, Recursive Adaptable Grammars, Master's Thesis, Worcester Polytechnic Institute, August 1993.
  11. {Shutt & Rubinstein} John Shutt & Roy Rubinstein, "Self-Modifying Finite Automata," in B. Pehrson and I. Simon, editors, Technology and Foundations: Information Processing '94 Vol. I: Proc. 13th IFIP World Computer Congress, Amsterdam: North-Holland, 1994, pp. 493--498.
  12. {STG} STG's XML 1.0 Reference Validator, Scholarly Technology Group, Brown University, Rhode Island, at the URL: http://www.stg.brown.edu/pub/xmlvalid/Xml.tr98.2.shtml as of 14 May 2002.

    • Published in

      ACM SIGPLAN Notices, Volume 38, Issue 2
      February 2003
      53 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/772970

      Copyright © 2003 Author

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • article
