Description
The parse-xml function processes XML input, returning a list of XML tags,
attributes, and text. Here is a simple example:
(parse-xml "<item1><item2 att1='one'/>this is some text</item1>")
-->
((item1 ((item2 att1 "one")) "this is some text"))
The output format is known as LXML format.
Here is a description of LXML:
LXML is a list representation of XML tags and content.
Each list member may be:
a. a string containing text content, such as "Here is some text with a "
b. a list representing an XML tag with associated attributes and/or content,
such as ('item1 "text") or (('item1 :att1 "help.html") "link"). If the XML tag
does not have associated attributes, then the first list member will be a
symbol representing the XML tag, and the other elements will
represent the content, which can be a string (text content), a symbol (XML
tag with no attributes or content), or list (nested XML tag with
associated attributes and/or content). If there are associated attributes,
then the first list member will be a list containing a symbol
followed by two list members for each associated attribute; the first member is a
symbol representing the attribute, and the next member is a string corresponding
to the attribute value.
c. XML comments and/or processing instructions; see the more detailed example below for
further information.
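The three cases above can be handled with a few small accessors. The following is a minimal sketch, assuming only the LXML shapes described here; the names lxml-tag, lxml-attributes, lxml-content, and lxml-text are hypothetical helpers and are not part of parse-xml.

```lisp
;; Hypothetical helpers for walking LXML; not part of the parser itself.

(defun lxml-tag (node)
  "Return the tag symbol of NODE, or NIL if NODE is text content."
  (cond ((stringp node) nil)                          ; case a: text
        ((symbolp node) node)                         ; bare tag, no attributes or content
        ((consp (first node)) (first (first node)))   ; ((tag attr value ...) content ...)
        (t (first node))))                            ; (tag content ...)

(defun lxml-attributes (node)
  "Return NODE's attributes as a property list of symbol and string pairs."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lxml-content (node)
  "Return the list of NODE's children: strings, symbols, or nested lists."
  (if (consp node) (rest node) nil))

(defun lxml-text (node)
  "Concatenate all strings found in NODE, depth first."
  (cond ((stringp node) node)
        ((consp node)
         (apply #'concatenate 'string
                (mapcar #'lxml-text (lxml-content node))))
        (t "")))
```

Applied to the first item of the simple example's output, (lxml-text '(item1 ((item2 att1 "one")) "this is some text")) returns "this is some text", and (getf (lxml-attributes '((item2 att1 "one"))) 'att1) returns "one". Note that the sketch treats comment and processing-instruction entries like any other list, so lxml-text would include their strings.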
Parse-xml is a non-validating XML parser, although it does detect input that is not well-formed. When
processing valid XML input, parse-xml can optionally produce the same output as a validating
parser would, including the processing of an external DTD subset and external entity declarations.
By default, parse-xml outputs a DTD parse along with the parsed XML contents; the DTD parse can
be suppressed. The following example shows DTD parsed output components:
(defvar *xml-example-external-url*
"")
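(defun example-callback (var-name token &optional public)
  (declare (ignorable token public))
  ;; The path of the URI names a Lisp variable whose value is the entity text.
  (setf var-name (uri-path var-name))
  (if* (equal var-name "null")
     then nil
     else ;; Look up the variable's value and return it as an input stream.
          (let ((string (eval (intern var-name (find-package :user)))))
            (make-string-input-stream string))))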
(defvar *xml-example-string*
"
]>
&ext1;")
(pprint (parse-xml *xml-example-string* :external-callback 'example-callback))
-->
((:xml :version "1.0" :encoding "utf-8")
 (:comment " the following XML input is well-formed but may or may not be valid ")
 (:pi :piexample "this is an example processing instruction tag ")
 (:DOCTYPE :example
  (:[ (:ELEMENT :item1 (:choice (:* :item2) (:seq (:+ :item3) :item4)))
   (:ELEMENT :item2 :ANY)
   (:ELEMENT :item3 :PCDATA) (:ELEMENT :item4 :PCDATA)
   (:ATTLIST item1 (att1 :CDATA :FIXED "att1-default") (att2 :ID :REQUIRED)
    (att3 (:enumeration :one :two :three) "one")
    (att4 (:NOTATION :four :five) "four"))
   (:ENTITY :param1 :param "text")
   (:ENTITY :nentity :SYSTEM "null" :NDATA :somedata)
   (:NOTATION :notation :SYSTEM "notation-processor"))
  (:external (:ENTITY :ext1 "this is some external entity text")))
 ((item1 att1 "att1-default" att2 "1" att3 "one" att4 "four")
  (item3 "this is some external entity text")))
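In the output above, the XML declaration, comment, processing instruction, and DOCTYPE entries are lists headed by a keyword, while the document element is headed by an ordinary symbol or by an attribute list. A minimal sketch of separating the two, assuming that convention holds for the whole result; lxml-special-entry-p, lxml-doctype, and lxml-elements are hypothetical names, not parser functions.

```lisp
;; Hypothetical helpers for splitting a parse result; not part of the parser.

(defun lxml-special-entry-p (item)
  "True when ITEM is a list headed by a keyword, such as a :comment entry."
  (and (consp item) (keywordp (first item))))

(defun lxml-doctype (parse)
  "Return the DOCTYPE entry of PARSE, or NIL when there is none.
The name is compared case-insensitively so that the check does not
depend on the reader's case mode."
  (find-if (lambda (item)
             (and (lxml-special-entry-p item)
                  (string-equal (symbol-name (first item)) "DOCTYPE")))
           parse))

(defun lxml-elements (parse)
  "Return the entries of PARSE that are not keyword-headed lists."
  (remove-if #'lxml-special-entry-p parse))
```

With the example result, lxml-doctype would return the (:DOCTYPE :example ...) entry and lxml-elements would return a one-element list holding the item1 element. Any top-level strings in a result are kept by lxml-elements, since only keyword-headed lists are filtered out.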
Usage Notes:
1. The parse-xml function has been compiled and tested only in a
modern ACL. Its successful operation depends on both the mixed
case support and wide character support found in modern ACL.
2. The parser uses the keyword package for DTD tokens and other
special XML tokens. Since element and attribute token symbols are usually interned
in the current package, it is not recommended to execute parse-xml
when the current package is the keyword package.
3. The XML parser supports the XML Namespaces specification. The parser
recognizes an "xmlns" attribute and attribute names starting with "xmlns:".
As per the specification, the parser expects that the associated value
is a URI string. The parser then associates XML Namespace prefixes with a
Lisp package provided via the parse-xml :uri-to-package option or, if
necessary, a package created on the fly. The following example demonstrates
this behavior:
(setf *xml-example-string4*
"
A Tale of Two Cities
UK Library
1999
1999
")
(setf *uri-to-package* nil)
(setf *uri-to-package*
      (acons (parse-uri "http://www.bibliography.org/XML/bib.ns")
             (make-package "bib") *uri-to-package*))
(setf *uri-to-package*
      (acons (parse-uri "urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
             (make-package "royal") *uri-to-package*))
(setf *uri-to-package*
      (acons (parse-uri "http://www.franz.com/XML/bib.ns")
             (make-package "franz-ns") *uri-to-package*))
(pprint (multiple-value-list
         (parse-xml *xml-example-string4*
                    :uri-to-package *uri-to-package*)))
-->
((((bibliography |xmlns:bib| "http://www.bibliography.org/XML/bib.ns" xmlns
"urn:royal-mail.gov.uk/XML/ns/postal.ns,1999")
"
"
((bib::book royal::owner "Smith") "
" (bib::title "A Tale of Two Cities") "
"
((bib::bibliography royal::|xmlns:bib| "http://www.franz.com/XML/bib.ns" royal::xmlns
"urn:royal-mail2.gov.uk/XML/ns/postal.ns,1999")
"
" ((franz-ns::library net.xml.namespace.0::branch "Main") "UK Library") "
" ((franz-ns::date net.xml.namespace.0::calendar "Julian") "1999") "
")
"
" ((bib::date royal::calendar "Julian") "1999") "
")
"
"))
((# . #)
(# . #)
(# . #)
(# . #)))
In the absence of XML Namespace attributes, element and attribute symbols are interned
in the current package. Note that this implies that attributes and elements referenced
in DTD content will be interned in the current package.
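Building the :uri-to-package alist by hand, as in the example above, signals a correctable error if a package of the same name already exists, because make-package is called unconditionally. A minimal sketch of a helper that reuses an existing package instead; add-namespace-package is a hypothetical name, and the :key argument is an assumption added so that the sketch runs without parse-uri (to match the example above, pass the parse-uri function as the key).

```lisp
;; Hypothetical helper; not part of the parser.
(defun add-namespace-package (alist uri package-name &key (key #'identity))
  "Return ALIST extended with an entry mapping URI to the package PACKAGE-NAME.
KEY is applied to URI before it is stored. The package is created only
if it does not already exist."
  (acons (funcall key uri)
         (or (find-package package-name)
             (make-package package-name))
         alist))

;; Usage: build the alist one entry at a time.
(defvar *example-namespaces* nil)
(setf *example-namespaces*
      (add-namespace-package *example-namespaces*
                             "http://www.bibliography.org/XML/bib.ns"
                             "bib"))
```

Because the package lookup comes first, evaluating the usage form a second time adds another alist entry but does not attempt to create the "bib" package again.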
4. The ACL 6.0 beta does not contain a little-endian Unicode external format. To
process XML input containing Unicode characters correctly:
a. Place the following in a file called ef-fat-little.cl in the ACL code
directory:
(provide :ef-fat-little)
(in-package :excl)
(def-external-format :fat-little-base
    :size 2)
(def-char-to-octets-macro :fat-little-base (char
                                            state
                                            &key put-next-octet external-format)
  (declare (ignore external-format state))
  `(let ((code (char-code ,char)))
     (,put-next-octet (ldb (byte 8 0) code))
     (,put-next-octet (ldb (byte 8 8) code))))
(def-octets-to-char-macro :fat-little-base (state-loc
                                            &key get-next-octet external-format
                                                 octets-count-loc unget-octets)
  (declare (ignore external-format state-loc unget-octets))
  `(let ((lo ,get-next-octet)
         (hi (progn (incf ,octets-count-loc)
                    ,get-next-octet)))
     (code-char (+ (ash hi 8) lo))))
(create-newline-ef :name :fat-little :base-name :fat-little-base
                   :nicknames '(:unicode-little))
b. Compile the file using a modern ACL.
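The two macros in the file split each character code into a low octet followed by a high octet, and reassemble codes from octets read in the same order. A minimal sketch of that arithmetic as ordinary functions, for 16-bit codes only; code-to-octets-le and octets-to-code-le are hypothetical names used for illustration and do not replace the external-format definitions.

```lisp
;; Hypothetical illustration of little-endian packing of 16-bit codes.

(defun code-to-octets-le (code)
  "Return the two octets of the 16-bit CODE as a list, low octet first."
  (list (ldb (byte 8 0) code)    ; bits 0-7
        (ldb (byte 8 8) code)))  ; bits 8-15

(defun octets-to-code-le (lo hi)
  "Rebuild a 16-bit code from its low octet LO and high octet HI."
  (+ (ash hi 8) lo))
```

For example, (code-to-octets-le #x20AC) returns (172 32), that is, the octets #xAC and #x20, and (octets-to-code-le #xAC #x20) returns 8364, which is #x20AC.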
5. The parse-xml function has been tested using the OASIS conformance test suite (see
details below). The test suite has wide coverage across possible XML and DTD syntax,
but there may be some syntax paths that have not yet been tested or completely
supported. Here is a list of currently known syntax parsing issues:
a. ACL does not support 4-byte Unicode scalar values, so input containing such data
will not be processed correctly. (Note, however, that parse-xml does correctly detect
and process wide Unicode input.)
b. The OASIS tests that contain wide Unicode all use a little-endian encoded Unicode.
Changes to the unicode-check function are required to also support big-endian encoded
Unicode. (Note also that this issue may be resolved by an ACL 6.0 final release change.)
c. An initial