From a1d0f28e9281bfc2d70b43b62e178f3d0da1114b Mon Sep 17 00:00:00 2001
From: "Kevin M. Rosenberg"
Date: Tue, 15 Oct 2002 12:23:31 +0000
Subject: [PATCH] r3028: *** empty log message ***

---
 phtml.htm | 254 -----------------------------------
 pxml.htm  | 387 ------------------------------------------------------
 2 files changed, 641 deletions(-)
 delete mode 100644 phtml.htm
 delete mode 100644 pxml.htm

diff --git a/phtml.htm b/phtml.htm
deleted file mode 100644
index 255dcf2..0000000
--- a/phtml.htm
+++ /dev/null
@@ -1,254 +0,0 @@

A Lisp Based HTML Parser


Introduction/Simple Example
-LHTML  parse output format
-Case mode notes
-Parsing HTML comments
-Parsing <SCRIPT> and <STYLE> tags
-Parsing SGML <! tags
-Parsing Illegal and Deprecated Tags
-Default Attribute Values
-Parsing Interleaved Character Formatting Tags
-parse-html reference
-   methods
-   phtml-internal


The parse-html generic function processes HTML input, returning a list of HTML tags, attributes, and text. Here is a simple example:
-
-(parse-html "<HTML>
-             <HEAD>
-             <TITLE>Example HTML input</TITLE>
-             <BODY>
-             <P>Here is some text with a <B>bold</B> word<br>and a <A HREF=\"help.html\">link</P>
-             </HTML>")


generates:
-
-((:html (:head (:title "Example HTML input"))
-  (:body (:p "Here is some text with a " (:b "bold") " word" :br "and a "
-             ((:a :href "help.html") "link")))))
-


The output format is known as LHTML format; it is the same format that the
-aserve htmlgen macro accepts.
-
-LHTML format
-
-LHTML is a list representation of HTML tags and content.
-
-Each list member may be:
-
-  1. a string containing text content, such as "Here is some text with a "
-  2. a keyword package symbol representing an HTML tag with no associated attributes
-     or content, such as :br.
-  3. a list representing an HTML tag with associated attributes and/or content,
-     such as (:b "bold") or ((:a :href "help.html") "link"). If the HTML tag
-     does not have associated attributes, then the first list member will be a
-     keyword package symbol representing the HTML tag, and the other elements will
-     represent the content, which can be a string (text content), a keyword package symbol
-     (HTML tag with no attributes or content), or a list (nested HTML tag with
-     associated attributes and/or content). If there are associated attributes,
-     then the first list member will be a list containing a keyword package symbol
-     followed by two list members for each associated attribute; the first member is a keyword
-     package symbol representing the attribute, and the next member is a string corresponding
-     to the attribute value.
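The three shapes of LHTML list member described above can be told apart with a few lines of plain Common Lisp. The following is a minimal sketch, not part of phtml: the names lhtml-tag, lhtml-attributes, and lhtml-content are hypothetical helpers introduced only for illustration, and they operate on quoted LHTML data rather than calling the parser.

```lisp
;; Hypothetical helpers (not provided by phtml) for picking apart
;; one LHTML list member.

(defun lhtml-tag (node)
  "Return the tag keyword of NODE, or NIL when NODE is text content."
  (cond ((stringp node) nil)                         ; text content
        ((symbolp node) node)                        ; bare tag such as :br
        ((consp (first node)) (first (first node)))  ; ((:a :href "...") ...)
        (t (first node))))                           ; (:b "bold")

(defun lhtml-attributes (node)
  "Return the attribute property list of NODE, or NIL when it has none."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lhtml-content (node)
  "Return the list of content members of NODE, or NIL for text and bare tags."
  (if (consp node)
      (rest node)
      nil))
```

Because the attributes are stored as alternating keyword and string members, the standard getf function can look one up, for example (getf (lhtml-attributes '((:a :href "help.html") "link")) :href).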

Case Mode and LHTML


If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be
-in upper case; otherwise, they will be in lower case.


HTML Comments


HTML comments are represented using a :comment symbol. For example,
-
-(parse-html "<!-- this is a comment-->")
-
---> ((:comment " this is a comment"))


HTML <SCRIPT> and <STYLE> tags


<SCRIPT> and <STYLE> content is not parsed; it is returned as text content.
-
-For example,
-
-(parse-html "<SCRIPT>this <B>will not</B> be parsed</SCRIPT>")
-
---> ((:script "this <B>will not</B> be parsed"))


XML and SGML <! tags


Since some HTML pages contain special XML/SGML tags, non-comment tags
-starting with '<!' are treated specially:
-
-(parse-html "<!doctype this is some text>")
-
---> ((:!doctype " this is some text"))


Illegal and Deprecated HTML


There is plenty of illegal and deprecated HTML on the web that popular browsers
-nonetheless successfully display. The parse-html parser is generous - it will not
-raise an error condition upon encountering most input. In particular, it does not
-maintain a list of legal HTML tags and will successfully parse nonsense input.
-
-For example,
-
-(parse-html "<this> <is> <some> <nonsense> <input>")
-
---> ((:this (:is (:some (:nonsense :input)))))
-
-In some situations, you may prefer a two-pass parse that minimizes the
-deep nesting caused by unrecognized tags:
-
-(let ((string "<this> <is> <some> <nonsense> </some> <input>"))
-  (multiple-value-bind (res rogues)
-      (parse-html string :collect-rogue-tags t)
-    (declare (ignorable res))
-    (parse-html string :no-body-tags rogues)))
-
---> (:this :is (:some (:nonsense)) :input)
-
-See the :collect-rogue-tags and :no-body-tags argument descriptions in the reference
-section below for more information.


Default Attribute values


As per the HTML 4.0 specification, attributes without specified values are given a lower case
-string value that matches the attribute name.
-
-For example,
-
-(parse-html "<P here ARE some attributes>")
-
---> (((:p :here "here" :are "are" :some "some" :attributes "attributes")))


Interleaved Character Formatting Tags


Existing HTML pages often have character format tags that are interleaved among
-other tags. Such interleaving is removed in a manner consistent with the HTML 4.0
-specification.
-
-For example,
-
-(parse-html "<P>Here is <B>bold text<P>that spans</B>two paragraphs")
-
---> ((:p "Here is " (:b "bold text")) (:p (:b "that spans") "two paragraphs"))


parse-html Reference
-
-parse-html [Generic function]
-
-Arguments: input-source &key callbacks callback-only
-            collect-rogue-tags no-body-tags
-
-Returns LHTML output, as described above.
-
-The callbacks argument, if non-nil, should be an association list. The car (first) of
-each list member is a keyword package symbol, and the cdr (rest) is a function object
-or a symbol naming a function. The function should expect one argument. It is invoked
-each time the HTML tag corresponding to the keyword package symbol is encountered in
-the HTML input; the argument is an LHTML list containing the tag, along with its
-associated attributes and content. The default callbacks argument value is nil.
-
-The callback-only argument, if non-nil, directs parse-html to not generate a complete LHTML
-output. Instead, LHTML lists will only be generated when necessary as arguments for functions
-specified in the callbacks association list. This results in faster parser execution. The default
-callback-only argument value is nil.
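As an illustration of the callbacks association list, here is a minimal sketch of a callback for :a tags. The names collect-href, *hrefs*, and *callbacks* are hypothetical and not part of phtml. Since the sketch is meant to run without the parser loaded, it invokes the callback by hand on the ((:a :href "help.html") "link") node from the introductory example; the parse-html call appears only in a comment.

```lisp
;; Sketch only: COLLECT-HREF, *HREFS*, and *CALLBACKS* are hypothetical names.
(defvar *hrefs* '()
  "HREF attribute values seen so far, most recent first.")

(defun collect-href (node)
  "Callback for :a tags. NODE is an LHTML list for one tag."
  (let ((head (first node)))
    ;; Attributes are present only when the first member is itself a list.
    (when (consp head)
      (let ((href (getf (rest head) :href)))
        (when href
          (push href *hrefs*))))))

;; One cons per tag of interest: (keyword . function).
(defparameter *callbacks* (list (cons :a #'collect-href)))

;; With phtml loaded, the alist would be passed along these lines:
;;   (parse-html input :callbacks *callbacks* :callback-only t)
;; Here the callback is invoked directly on a literal LHTML node instead.
(funcall (cdr (assoc :a *callbacks*)) '((:a :href "help.html") "link"))
```

Combining the alist with a non-nil callback-only value, as in the commented call, is the case described above where the complete LHTML output is not generated.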
-
-The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional value,
-a list containing any unrecognized tags closed by the end of input.
-
-The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if
-encountered, will be treated as a tag with no body or content, and thus, no associated end
-tag. Typically, the argument is a list or modified list resulting from an earlier parse-html
-execution with the :collect-rogue-tags argument specified as non-nil.
-
-parse-html Methods
-
-parse-html (p stream) &key callbacks callback-only
-            collect-rogue-tags no-body-tags
-
-parse-html (str string) &key callbacks callback-only
-            collect-rogue-tags no-body-tags
-
-parse-html (file t) &key callbacks callback-only
-            collect-rogue-tags no-body-tags
-
-The t method assumes the argument is a pathname suitable
-for use with the with-open-file macro.
-
-
-phtml-internal [Function]
-
-Arguments: stream read-sequence-func callback-only callbacks
-collect-rogue-tags no-body-tags
-
-This function may be used when more control is needed for supplying
-the HTML input. The read-sequence-func argument, if non-nil, should be a function
-object or a symbol naming a function. When phtml-internal requires another buffer
-of HTML input, it will invoke the read-sequence-func function with two arguments:
-the first argument is an internal buffer character array and the second argument is
-the phtml-internal stream argument. If read-sequence-func is nil, phtml-internal
-will invoke read-sequence to fill the buffer. The read-sequence-func function must
-return the number of character array elements successfully stored in the buffer.
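A read-sequence-func that satisfies this contract can be quite small. The sketch below is an assumption-laden illustration: counting-read-sequence and *refill-count* are hypothetical names, and the function simply delegates to the standard read-sequence while counting how many buffers were requested.

```lisp
;; Hypothetical read-sequence-func: counts refills, then delegates to
;; the standard READ-SEQUENCE.
(defvar *refill-count* 0
  "Number of times the parser (or a caller) asked for another buffer.")

(defun counting-read-sequence (buffer stream)
  "Fill BUFFER from STREAM and return the number of characters stored."
  (incf *refill-count*)
  ;; READ-SEQUENCE returns the index of the first element it did not
  ;; update; when filling from index 0 that is the count of characters stored.
  (read-sequence buffer stream))

;; Following the argument list above, it would be supplied as the second
;; argument, along these lines:
;;   (phtml-internal stream #'counting-read-sequence nil nil nil nil)
```

The function can be exercised on its own with a string stream, for example (with-input-from-string (s "<P>hello") (counting-read-sequence (make-string 4096) s)), which returns 8 because eight characters are available.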
-
-
-
-
-
-
-
-

-
diff --git a/pxml.htm b/pxml.htm
deleted file mode 100644
index 2cf26d5..0000000
--- a/pxml.htm
+++ /dev/null
@@ -1,387 +0,0 @@

A Lisp Based XML Parser


Introduction/Simple Example
-LXML parse output format
-parse-xml non-validating parser properties
-case and international character support issues
-parse-xml and packages
-parse-xml, the XML Namespace specification, and packages
-ACL does not support Unicode 4 byte scalar values
-only little-endian Unicode tested in ACL 6.0 beta
-debugging aids
-XML Conformance test results
-Compiling and Loading the parser
-parse-xml reference


The parse-xml generic function processes XML input, returning a list of XML tags,
-attributes, and text. Here is a simple example:
-
-(parse-xml "<item1><item2 att1='one'/>this is some text</item1>")
-
--->
-
-((item1 ((item2 att1 "one")) "this is some text"))
-
-The output format is known as LXML format.
-
-LXML Format
-
-LXML is a list representation of XML tags and content.
-
-Each list member may be:
-
-a. a string containing text content, such as "Here is some text with a "
-
-b. a list representing an XML tag with associated attributes and/or content,
-such as ('item1 "text") or (('item1 :att1 "help.html") "link"). If the XML tag
-does not have associated attributes, then the first list member will be a
-symbol representing the XML tag, and the other elements will
-represent the content, which can be a string (text content), a symbol (XML
-tag with no attributes or content), or a list (nested XML tag with
-associated attributes and/or content). If there are associated attributes,
-then the first list member will be a list containing a symbol
-followed by two list members for each associated attribute; the first member is a
-symbol representing the attribute, and the next member is a string corresponding
-to the attribute value.
-
-c. XML comments and/or processing instructions; see the more detailed example below for
-further information.
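Because LXML is ordinary list structure, it can be walked with plain Common Lisp. The following sketch gathers the text content of an element node; lxml-text is a hypothetical helper, not part of pxml, and it is meant for element nodes only (it makes no attempt to treat comment, processing instruction, or DTD members specially).

```lisp
;; Hypothetical helper (not provided by pxml): collect the text content
;; of an LXML element node and of the elements nested inside it.
(defun lxml-text (node)
  "Return the concatenated text content of the LXML element NODE."
  (cond ((stringp node) node)            ; text content
        ((consp node)
         ;; (rest node) skips the first member, which is either the tag
         ;; symbol or the (tag attribute value ...) list.
         (with-output-to-string (out)
           (dolist (child (rest node))
             (write-string (lxml-text child) out))))
        (t "")))                         ; bare symbol: tag without content
```

Applied to the element from the simple example, (lxml-text '(item1 ((item2 att1 "one")) "this is some text")) returns "this is some text"; the attribute value "one" is skipped because it lives in the first member of the item2 list.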


Non Validating Parser Properties


Parse-xml is a non-validating XML parser. It will detect non-well-formed XML input. When
-processing valid XML input, parse-xml will optionally produce the same output as a validating
-parser would, including the processing of an external DTD subset and external entity declarations.
-
-By default, parse-xml outputs a DTD parse along with the parsed XML contents. The DTD parse may
-be optionally suppressed. The following example shows DTD parsed output components:


(defvar *xml-example-external-url*
-   "<!ENTITY ext1 'this is some external entity %param1;'>")
-
-(defun example-callback (var-name token &optional public)
-  (declare (ignorable token public))
-  (setf var-name (uri-path var-name))
-  (if* (equal var-name "null") then nil
-    else
-      (let ((string (eval (intern var-name (find-package :user)))))
-      (make-string-input-stream string))))
-
-(defvar *xml-example-string*
-"<?xml version='1.0' encoding='utf-8'?>
-<!-- the following XML input is well-formed but its validity has not been checked ... --->
-<?piexample this is an example processing instruction tag ?>
-<!DOCTYPE example SYSTEM '*xml-example-external-url*' [
-   <!ELEMENT item1 (item2* | (item3+ , item4))>
-   <!ELEMENT item2 ANY>
-   <!ELEMENT item3 (#PCDATA)>
-   <!ELEMENT item4 (#PCDATA)>
-   <!ATTLIST item1
-      att1 CDATA #FIXED 'att1-default'
-      att2 ID #REQUIRED
-      att3 ( one | two | three ) 'one'
-      att4 NOTATION ( four | five ) 'four' >
-   <!ENTITY % param1 'text'>
-   <!ENTITY nentity SYSTEM 'null' NDATA somedata>
-   <!NOTATION notation SYSTEM 'notation-processor'>
-   ]>
-<item1 att2='1'><item3>&ext1;</item3></item1>")
-
-(pprint (parse-xml *xml-example-string* :external-callback 'example-callback))
-
--->
-
-((:xml :version "1.0" :encoding "utf-8")
-  (:comment " the following XML input is well-formed but may or may not be valid ")
-  (:pi :piexample "this is an example processing instruction tag ")
-  (:DOCTYPE :example
-    (:[ (:ELEMENT :item1 (:choice (:* :item2) (:seq (:+ :item3) :item4)))
-        (:ELEMENT :item2 :ANY)
-        (:ELEMENT :item3 :PCDATA) (:ELEMENT :item4 :PCDATA)
-        (:ATTLIST item1 (att1 :CDATA :FIXED "att1-default") (att2 :ID :REQUIRED)
-             (att3 (:enumeration :one :two :three) "one")
-             (att4 (:NOTATION :four :five) "four"))
-        (:ENTITY :param1 :param "text")
-        (:ENTITY :nentity :SYSTEM "null" :NDATA :somedata)
-        (:NOTATION :notation :SYSTEM "notation-processor"))
-    (:external (:ENTITY :ext1 "this is some external entity text")))
-   ((item1 att1 "att1-default" att2 "1" att3 "one" att4 "four")
-       (item3 "this is some external entity text")))
-
-
-Usage Notes
-
-

-  1. The parse-xml function has been primarily compiled and tested in a modern ACL.
-     However, in an ANSI Lisp with wide character support, it does pass the valid
-     component of the conformance suite in the same manner as it does in a Modern Lisp.
-     The parser's successful operation in all potential situations depends on wide
-     character support.
-
-  2. The parser uses the keyword package for DTD tokens and other special XML tokens.
-     Since element and attribute token symbols are usually interned in the current
-     package, it is not recommended to execute parse-xml when the current package is
-     the keyword package.
-
-  3. The XML parser supports the XML Namespaces specification. The parser recognizes
-     an "xmlns" attribute and attribute names starting with "xmlns:". As per the
-     specification, the parser expects that the associated value is a URI string.
-     The parser then associates XML Namespace prefixes with a Lisp package provided
-     via the parse-xml :uri-to-package option or, if necessary, a package created on
-     the fly. The following example demonstrates this behavior:

    (setf *xml-example-string4*
    -   "<bibliography
    -      xmlns:bib='http://www.bibliography.org/XML/bib.ns'
    -      xmlns='urn:com:books-r-us'>
    -   <bib:book owner='Smith'>
    -      <bib:title>A Tale of Two Cities</bib:title>
    -      <bib:bibliography
    -         xmlns:bib='http://www.franz.com/XML/bib.ns'
    -         xmlns='urn:com:books-r-us'>
    -      <bib:library branch='Main'>UK Library</bib:library>
    -      <bib:date calendar='Julian'>1999</bib:date>
    -      </bib:bibliography>
    -   <bib:date calendar='Julian'>1999</bib:date>
    -   </bib:book>
    -</bibliography>")
    -
    -(setf *uri-to-package* nil)
    -(setf *uri-to-package*
    -   (acons (parse-uri "http://www.bibliography.org/XML/bib.ns")
    -      (make-package "bib") *uri-to-package*))
    -(setf *uri-to-package*
    -   (acons (parse-uri "urn:com:books-r-us")
    -      (make-package "royal") *uri-to-package*))
    -(setf *uri-to-package*
    -   (acons (parse-uri "http://www.franz.com/XML/bib.ns")
    -      (make-package "franz-ns") *uri-to-package*))
    -(pprint (multiple-value-list
    -             (parse-xml *xml-example-string4*
    -                  :uri-to-package *uri-to-package*)))
    -
    --->
    -((((bibliography |xmlns:bib| "http://www.bibliography.org/XML/bib.ns"
    -     xmlns "urn:com:books-r-us")
    -    "
    -    "
    -   ((bib::book royal::owner "Smith") "
    -        " (bib::title "A Tale of Two Cities") "
    -        "
    -    ((bib::bibliography royal::|xmlns:bib|
    -      "http://www.franz.com/XML/bib.ns" royal::xmlns
    -      "urn:com:books-r-us")
    -     "
    -         " ((franz-ns::library royal::branch "Main") "UK Library") "
    -         " ((franz-ns::date royal::calendar "Julian") "1999") "
    -         ")
    -     "
    -         " ((bib::date royal::calendar "Julian") "1999") "
    -         ")
    -     "
    -         "))
    -((#<uri http://www.franz.com/XML/bib.ns> . #<The franz-ns package>)
    -  (#<uri urn:com:books-r-us> . #<The royal package>)
    -  (#<uri http://www.bibliography.org/XML/bib.ns> . #<The bib package>)))
    -
    -

-  4. In the absence of XML Namespace attributes, element and attribute symbols are interned
-     in the current package. Note that this implies that attributes and elements referenced
-     in DTD content will be interned in the current package.
-
-  5. The parse-xml function has been tested using the OASIS conformance test suite (see
-     details below). The test suite has wide coverage across possible XML and DTD syntax,
-     but there may be some syntax paths that have not yet been tested or completely
-     supported. Here is a list of currently known syntax parsing issues:
-
-       • ACL does not support 4 byte Unicode scalar values, so input containing such data
-         will not be processed correctly. (Note, however, that parse-xml does correctly detect
-         and process wide Unicode input.)
-       • The OASIS tests that contain wide Unicode all use a little-endian encoded Unicode.
-         Changes to the unicode-check function are required to also support big-endian encoded
-         Unicode. (Note also that this issue may be resolved by an ACL 6.0 final release change.)
-       • An initial <?xml declaration in external entity files is skipped without a check
-         being made to see if the <?xml declaration is itself incorrect.
-
-  6. When investigating possible parser errors or examining more closely where the parser
-     determined that the input was non-well-formed, the net.xml.parser internal symbols
-     *debug-xml* and *debug-dtd* are useful. When not bound to nil, these variables cause
-     lexical analysis and intermediate parsing results to be output to *standard-output*.
-
-  7. It is necessary to load the pxml module before using it.
-     Typically this can be done by evaluating (require :pxml).
-
-XML Conformance Test Suite
-
-Using the OASIS test suite (http://www.oasis-open.org),
-here are the current parse-xml results:
-
-xmltest/invalid:    Not tested, since parse-xml is a non-validating parser
-
-not-wf/
-
-    ext-sa: 3 tests; all pass
-    not-sa: 8 tests; all pass
-    sa: 186 tests; the following fail:
-
-        170.xml: fails because ACL does not support 4 byte Unicode scalar values
-
-valid/
-
-    ext-sa: 14 tests; all pass
-    not-sa: 31 tests; all pass
-    sa: 119 tests; the following fail:
-
-        052.xml, 064.xml, 089.xml: fail because ACL does not support 4 byte
-                    Unicode scalar values
-
-Compiling and Loading
-
-Loading build.cl into a modern ACL session produces a pxml.fasl file that can subsequently be
-loaded in a modern ACL to provide XML parsing functionality.
-
--------------------------------------------------------------------------------------------
-
-parse-xml reference
-
-parse-xml            [Generic function]
-
-Arguments: input-source &key external-callback content-only
-            general-entities parameter-entities
-            uri-to-package
-
-Returns multiple values:
-
-  1. LXML and parsed DTD output, as described above.
-  2. An association list containing the uri-to-package argument conses (if any)
-     and conses associated with any XML Namespace packages created during the
-     parse (see uri-to-package argument description, below).
-
-The external-callback argument, if specified, is a function object or symbol
-that parse-xml will execute when encountering an external DTD subset
-or external entity DTD declaration. Here is an example which shows what
-arguments the function should expect, and the value it should return:
-
-(defun file-callback (uri-object token &optional public)
-  ;; The uri-object is an ACL URI object created from
-  ;; the XML input. In this example, this function
-  ;; assumes that all uri's will be file specifications.
-  ;;
-  ;; The token argument identifies what token is associated
-  ;; with the external parse (for example :DOCTYPE for external
-  ;; DTD subset)
-  ;;
-  ;; The public argument contains the associated PUBLIC string,
-  ;; when present
-  ;;
-  (declare (ignorable token public))
-  ;; An open stream is returned on success,
-  ;; a nil return value indicates that the external
-  ;; parse should not occur.
-  ;; Note that parse-xml will close the open stream before exiting.
-  (ignore-errors (open (uri-path uri-object))))
-
-

-The general-entities argument is an association list containing general entity symbol
-and replacement text pairs. The entity symbols should be in the keyword package.
-Note that this option may be useful in generating desirable parse results in
-situations where you do not wish to parse external entities or the external DTD subset.

-The parameter-entities argument is an association list containing parameter entity symbol
-and replacement text pairs. The entity symbols should be in the keyword package.
-Note that this option may be useful in generating desirable parse results in
-situations where you do not wish to parse external entities or the external DTD subset.

-The uri-to-package argument is an association list containing uri objects and package
-objects. Typically, the uri objects correspond to XML Namespace attribute values, and
-the package objects correspond to the desired package for interning symbols associated
-with the uri namespace. If the parser encounters a uri object not contained in this list,
-it will generate a new package. The first generated package will be named
-net.xml.namespace.0, the second will be named net.xml.namespace.1, and so on.

parse-xml methods

-
-(parse-xml (p stream) &key
-                      external-callback content-only
-                      general-entities
-                      parameter-entities
-                      uri-to-package)
-
-(parse-xml (str string) &key
-                        external-callback content-only
-                        general-entities
-                        parameter-entities
-                        uri-to-package)
-
-An easy way to parse a file containing XML input:
-
-(with-open-file (p "example.xml")
-  (parse-xml p :content-only t))
-
-

net.xml.parser unexported special variables:

-

-*debug-xml*
-
-When true, parse-xml generates XML lexical state and intermediary
-parse result debugging output.

-*debug-dtd*
-
-When true, parse-xml generates DTD lexical state and intermediary
-parse result debugging output.
-- 
2.34.1