X-Git-Url: http://git.kpe.io/?p=xmlutils.git;a=blobdiff_plain;f=phtml.htm;fp=phtml.htm;h=0000000000000000000000000000000000000000;hp=255dcf201ff014417a41489805c166ae6021195d;hb=a1d0f28e9281bfc2d70b43b62e178f3d0da1114b;hpb=b5da6339c28ee272d0a32eb5c26a9f7446e71d9f diff --git a/phtml.htm b/phtml.htm deleted file mode 100644 index 255dcf2..0000000 --- a/phtml.htm +++ /dev/null @@ -1,254 +0,0 @@ - - - -A Lisp Based HTML Parser - - - - - -

A Lisp Based HTML Parser

- -

Introduction/Simple Example
-LHTML  parse output format
-Case mode notes
-Parsing HTML comments
-Parsing <SCRIPT> and <STYLE> tags
-Parsing SGML <! tags
-Parsing Illegal and Deprecated Tags
-Default Attribute Values
-Parsing Interleaved Character Formatting Tags
-parse-html reference
-   methods
-   phtml-internal

- -

The parse-html generic function processes HTML -input, returning a list of HTML tags, attributes, and text. Here is a simple example:
-
-(parse-html "<HTML>
-                    -<HEAD>
-                    -<TITLE>Example HTML input</TITLE>
-                    -<BODY>
-                    -<P>Here is some text with a <B>bold</B> word<br>and a <A -HREF=\"help.html\">link</P>
-                    -</HTML>")

- -

generates:
-
-((:html (:head (:title "Example HTML input"))
-  (:body (:p "Here is some text with a " (:b "bold") " -word" :br "and a "
-                  -((:a :href "help.html") "link")))))
-

- -

The output format is known as LHTML format; it is the same format that the
-aserve htmlgen macro accepts.
-
-LHTML format
-
-LHTML is a list representation of HTML tags and content.
-
-Each list member may be: - -

    -
  1. a string containing text content, such as "Here is some text with a "
    -
  2. -
  3. a keyword package symbol representing a HTML tag with no associated attributes
    - or content, such as :br.
    -
  4. -
  5. a list representing an HTML tag with associated attributes and/or content,
    - such as (:b "bold") or ((:a :href "help.html") "link"). If - the HTML tag
    - does not have associated attributes, then the first list member will be a
    - keyword package symbol representing the HTML tag, and the other elements will
    - represent the content, which can be a string (text content), a keyword package symbol - (HTML
    - tag with no attributes or content), or list (nested HTML tag with
    - associated attributes and/or content). If there are associated attributes,
    - then the first list member will be a list containing a keyword package symbol
    - followed by two list members for each associated attribute; the first member is a keyword
    - package symbol representing the attribute, and the next member is a string corresponding
    - to the attribute value.
    -
  6. -
- -

Case Mode and LHTML

- -

If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be
-in upper case; otherwise, they will be in lower case.

- -

HTML Comments

- -

HTML comments are represented use a :comment symbol. For example,
-
-(parse-html "<!-- this is a comment-->")
-
---> ((:comment " this is a comment"))

- -

HTML <SCRIPT> and <STYLE> tags

- -

All <SCRIPT> and <STYLE> content is not parsed; it is returned as text -content.
-
-For example,
-
-(parse-html "<SCRIPT>this <B>will not</B> be -parsed</SCRIPT>")
-
---> ((:script "this <B>will not</B> be parsed"))

- -

XML and SGML <! tags

- -

Since, some HTML pages contain special XML/SGML tags, non-comment tags
-starting with '<!' are treated specially:
-
-(parse-html "<!doctype this is some text>")
-
---> ((:!doctype " this is some text"))

- -

Illegal and Deprecated HTML

- -

There is plenty of illegal and deprecated HTML on the web that popular browsers
-nonetheless successfully display. The parse-html parser is generous - it will not
-raise an error condition upon encountering most input. In particular, it does not
-maintain a list of legal HTML tags and will successfully parse nonsense input.
-
-For example,
-
-(parse-html "<this> <is> <some> <nonsense> -<input>")
-
---> ((:this (:is (:some (:nonsense :input)))))
-
-In some situations, you may prefer a two-pass parse that results in a parse where
-deep nesting related to unrecognized tags is minimized:
-
-(let ((string "<this> <is> <some> <nonsense> </some> -<input>"))
-        (multiple-value-bind (res rogues)
-          (parse-html string -:collect-rogue-tags t)
-            (declare (ignorable -res))
-            (parse-html string -:no-body-tags rogues)))
-
---> (:this :is (:some (:nonsense)) :input)
-
-See the :collect-rogue-tags and :no-body-tags argument -descriptions in the reference
-section below for more information.

- -

Default Attribute values

- -

As per the HTML 4.0 specification, attributes without specified values are given a -lower case
-string value that matches the attribute name.
-
-For example,
-
-(parse-html "<P here ARE some attributes>")
-
---> (((:p :here "here" :are "are" :some "some" -:attributes "attributes")))

- -

Interleaved Character Formatting Tags

- -

Existing HTML pages often have character format tags that are interleaved among
-other tags. Such interleaving is removed in a manner consistent with the HTML 4.0
-specification.
-
-For example,
-
-(parse-html "<P>Here is <B>bold text<P>that spans</B>two -paragraphs")
-
---> ((:p "Here is " (:b "bold text")) (:p (:b "that -spans") "two paragraphs"))

- -
- -

parse-html Reference
-
-parse-html [Generic function]
-
-Arguments: input-source &key callbacks callback-only
-            collect-rogue-tags -no-body-tags
-
-Returns LHTML output, as described above.
-
-The callbacks argument, if non-nil, should be an association list. Each list member's
-car (first) element specifies a keyword package symbol, and each list member's cdr (rest)
-element specifies a function object or a symbol naming a function. The function should
-expect one argument. The function will be invoked once for each time the HTML tag
-corresponding to the specified keyword package symbol is encountered in the HTML input; -the
-argument will be an LHTML list containing the tag, along with associated attributes and
-content. The default callbacks argument value is nil.
-
-The callback-only argument, if non-nil, directs parse-html to not generate a complete -LHTML
-output. Instead, LHTML lists will only be generated when necessary as arguments for -functions
-specified in the callbacks association list. This results in faster parser execution. The -default
-callback-only argument value is nil.
-
-The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional -value,
-a list containing any unrecognized tags closed by the end of input.
-
-The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if
-encountered, will be treated as a tag with no body or content, and thus, no associated end
-tag. Typically, the argument is a list or modified list resulting from an earlier -parse-html
-execution with the :collect-rogue-tags argument specified as non-nil.
-
-parse-html Methods
-
-parse-html (p stream) &key callbacks callback-only
-            collect-rogue-tags -no-body-tags
-
-parse-html (str string) &key callbacks callback-only
-            collect-rogue-tags -no-body-tags
-
-parse-html (file t) &key callbacks callback-only
-            collect-rogue-tags -no-body-tags
-
-The t method assumes the argument is a pathname suitable
-for use with the with-open-file macro.
-
-
-phtml-internal [Function]
-
-Arguments: stream read-sequence-func callback-only callbacks
-collect-rogue-tags no-body-tags
-
-This function may be used when more control is needed for supplying
-the HTML input. The read-sequence-func argument, if non-nil, should be a function
-object or a symbol naming a function. When phtml-internal requires another buffer
-of HTML input, it will invoke the read-sequence-func function with two arguments -
-the first argument is an internal buffer character array and the second argument is
-the phtml-internal stream argument. If read-sequence-fun is nil, phtml-internal
-will invoke read-sequence to fill the buffer. The read-sequence-func function must
-return the number of character array elements successfully stored in the buffer.
-
-
-
-
-
-
-
-

- -