Wednesday, May 16, 2007

Extreme Markup Languages 2007

The Markup Theory & Practice Conference will take place on August 7-10 2007 in Montréal, Canada. On August the 6th the International Workshop on Markup of Overlapping Structures will take place at the same location.

I saw some interesting talks in the program that was just published and I listed below the talks that seem interesting to me in person. I hope to be able to get a copy of the papers or some other form of publication from the authors, as I doubt that I'll be able to attend.

Here's a copy&paste of the abstracts I find most attractive:

Writing an XSLT optimizer in XSLT

Michael Kay, Saxonica

In principle, XSLT is ideally suited to the task of writing an XSLT or XQuery optimizer. After all, optimizers consist of a set of rules for rewriting a tree representation of the query or stylesheet, and XSLT is specifically designed as a language for rule-based tree rewriting. The paper illustrates how the abstract syntax tree representing a query or stylesheet can be expressed as an XML data structure making it amenable to XSLT processing, and shows how a selection of rewrites can be programmed in XSLT. The key question determining whether the approach is viable in practice is performance. Some simple measurements suffice to demonstrate that there is a significant performance penalty, but not an insurmountable one: further work is needed to see whether it can be reduced to an acceptable level.

Streaming validation of schemata: the Lazy Typing discipline

Paolo Marinelli, Fabio Vitali, Stefano Zacchiroli, University of Bologna

Assertions, identity constraints, and conditional type assignments are (planned) features of XML Schema which rely on XPath evaluation. The XPath subset exploitable in those features is limited, for several reasons, including (apparently) to avoid buffering in evaluation of an expression. We divide XPath into subsets with varying streamability characteristics. We also identify the larger XPath subset which is compatible with the typing discipline we believe underlies some of the choices currently present in the XML Schema specification. Such a discipline requires that the type of an element be decided when its start tag is encountered and its validity when its end tag is encountered. An alternative “lazy typing” discipline is proposed in which both type assignment and validity assessment are fired as soon as they are available. Our approach is more flexible, giving schema authors control over the trade-off between using larger XPath subsets (and thus increasing buffering requirements) and expeditiousness.

Localization of schema languages

Felix Sasaki, World Wide Web Consortium

Internationalization is the process of making a product ready for global use. Localization is the adaptation of a product to a specific locale (e.g., country, region, or market). Localization of XML schemas (XSD, DTD, Relax NG) can include translation of element and attribute names, modification of data types, and content or locale-specific modifications such as currency and dates. Combining the TEI ODD (One Document Does it all) approach for renaming and adaptation of documentation, the Common Locale Data Registry (CLDR) for the modification of data types, and the new Internationalization Tag Set (W3C 2007), the authors have produced an implementation that will take as input a schema without any localization and some external localization parameters (such as the locale, the schema language, any localization annotations, and the CLDR data) and produce a localized schema for XSD and Relax NG. For a DTD, the implementation produces a Schematron document for validation of the modified data types that can be used with a separate renaming stylesheet to generate a localized DTD.

Applying structured content transformation techniques to software source code

Roy Amodeo, Stilo International

In structured content processing, benefits of modeling information content rather than presentation include the ability to automate the publication of information in many formats, tailored for different audiences. Software programs are a form of content, usually authored by humans and “published” by compilers to the computer that runs these programs. However, programs are not written solely for use by machines. If they were, programming languages would have no need for comments or programming style guidelines. The application developers and maintainers themselves are also an audience. Modeling software programs as XML instances is not a new idea. This paper takes a fresh look at the challenge of producing XML markup from programming languages by recasting it as a content processing problem using tools developed in the same way as any other content-processing application. The XML instances we generate can be used to craft transformation and analysis tools useful for software engineering by leveraging the marked up structure of the program rather than th native syntax.

Characterizing XQuery implementations: Categories and key features

Liam Quin, World Wide Web Consortium

XQuery 1.0 was published as a W3C Recommendation in January 2007, and there are fifty or more XQuery implementations. The XQuery Public Web page at W3C lists them but gives little or no guidance about choosing among them. The author proposes a simple ontology (taxonomy) to characterize XQuery implementations based on emergent patters of the features appearing in implementations and suggests some ways to choose among those implementations. The result is a clearer view of how XQuery is being used and also provides insights that will help in designing system architectures that incorporate XQuery engines. Although specific products are not endorsed in this paper, actual examples are given. With XML in use in places as diverse as automobile engines and encyclopedias, the most important part of investigating an XML tool’s suitability to task is often the tool’s intended usage environment. It is not unreasonable to suppose that most XQuery implementations are useful for something. Let's see!

Building a C++ XSLT processor for large documents and high performance

Kevin Jones, Jianhui Li, & Lan Yi, Intel

Some current XML users require an XSLT processor capable of handling documents up to 2 gigabytes. To produce a high-speed processor for such large documents, the authors employed a data representation that supports minimal inter-record linking to provide a small, in-memory representation. XML documents are represented as a sequence of records; these records can be viewed as binary encodings of events produced by an XML parser based on the XPath data model. The format is designed to support documents in excess of the 32-bit boundary; its current theoretical limit is 32 gigabytes. To offset the slower navigation speed for a records-based data format, the processor uses a new Path Map algorithm for simultaneous XPath processing. The authors carried out a series of experiments comparing their newly constructed XSLT processor to an object-model-based XSLT processor (the Intel® XSLT Accelerator Software library).

Converting into pattern-based schemas: A formal approach

Antonina Dattolo, University of Napoli Federico II
Angelo Di Iorio, Silvia Duca, Antonio Angelo Feliziani, & Fabio Vitali, University of Bologna

A traditional distinction among markup languages is how descriptive or prescriptive they are. We identify six levels along the descriptive/prescriptive spectrum. Schemas at a specific level of descriptiveness that we call "Descriptive No Order" (DNO) specify a list of allowable elements, their number and requiredness, but do not impose any order upon them. We have defined a pattern-based model based on a set of named patterns, each of which is an object and its composition rule (content model); we show that any schema can be converted into a pattern-based schema without loss of information at the DNO level. We present a formal analysis of lossless conversions of arbitrary schemas as a demonstration of the correctness and completeness of our pattern model. Although all examples are given in DTD syntax, the results should apply equality to XSD, Relax NG, or other schema languages.

Declarative specification of XML document fixup

Henry S. Thompson, University of Edinburgh

The historical and social complications of the development of the HTML family of languages defy easy analysis. In the recent discussion of the future of the family, one question has stood out: should ‘the next HTML’ have a schema or indeed any form of formal definition? One major constituency has vocally rejected the use of any form of schema, maintaining that the current behavior of deployed HTML browsers cannot usefully be described in any declarative notation. But a declarative approach, based on the Tag Soup work of John Cowan, proves capable of specifying the repair of ill-formed HTML and XHTML in a way that approximates the behavior of existing HTML browsers. A prototype implementation named PYXup demonstrates the capability; it operates on the PYX output produced by the Tag Soup scanner and fixes up well-formedness errors and some structural problems commonly found in HTML in the wild based on an easily understood declarative specification.

No comments:

Post a Comment