A Lisp Based HTML Parser

X-Git-Url: http://git.kpe.io/?a=blobdiff_plain;ds=sidebyside;f=phtml.htm;fp=phtml.htm;h=4a1608351a482875b41721bb48c89171f637f2ca;hb=1e8aa1df433841c85c5a0b44fbd92964672e18b5;hp=0000000000000000000000000000000000000000;hpb=2d40f4169cc89aaecf1a762cae1e2d7cd55587ab;p=xmlutils.git diff --git a/phtml.htm b/phtml.htm new file mode 100644 index 0000000..4a16083 --- /dev/null +++ b/phtml.htm @@ -0,0 +1,257 @@ + + + +A Lisp Based HTML Parser + + + + + +

A Lisp Based HTML Parser

+ +

Introduction/Simple Example
+LHTML parse output format
+Case mode notes
+Parsing HTML comments
+Parsing <SCRIPT> and <STYLE> tags
+Parsing SGML <! tags
+Parsing Illegal and Deprecated Tags
+Default Attribute Values
+Parsing Interleaved Character Formatting Tags
+parse-html reference
+ methods
+ phtml-internal

+ +

The parse-html generic function processes HTML +input, returning a list of HTML tags, attributes, and text. Here is a simple example:
+
+(parse-html "<HTML>
+                    +<HEAD>
+                    +<TITLE>Example HTML input</TITLE>
+                    +<BODY>
+                    +<P>Here is some text with a <B>bold</B> word<br>and a <A +HREF=\"help.html\">link</P>
+                    +</HTML>")

+ +

generates:
+
+((:html (:head (:title "Example HTML input"))
+ (:body (:p "Here is some text with a " (:b "bold") " +word" :br "and a "
+ +((:a :href "help.html") "link")))))
+

+ +

The output format is known as LHTML format; it is the same format that the
+aserve htmlgen macro accepts.
+
+LHTML format
+
+LHTML is a list representation of HTML tags and content.
+
+Each list member may be: + +

a string containing text content, such as "Here is some text with a "
+
a keyword package symbol representing a HTML tag with no associated attributes
+ or content, such as :br.
+
a list representing an HTML tag with associated attributes and/or content,
+ such as (:b "bold") or ((:a :href "help.html") "link"). If + the HTML tag
+ does not have associated attributes, then the first list member will be a
+ keyword package symbol representing the HTML tag, and the other elements will
+ represent the content, which can be a string (text content), a keyword package symbol + (HTML
+ tag with no attributes or content), or list (nested HTML tag with
+ associated attributes and/or content). If there are associated attributes,
+ then the first list member will be a list containing a keyword package symbol
+ followed by two list members for each associated attribute; the first member is a keyword
+ package symbol representing the attribute, and the next member is a string corresponding
+ to the attribute value.
+

+ +

Case Mode and LHTML

+ +

If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be
+in upper case; otherwise, they will be in lower case.

+ +

HTML Comments

+ +

HTML comments are represented use a :comment symbol. For example,
+
+(parse-html "")
+
+--> ((:comment " this is a comment"))

+ +

HTML <SCRIPT> and <STYLE> tags

+ +

All <SCRIPT> and <STYLE> content is not parsed; it is returned as text +content.
+
+For example,
+
+(parse-html "<SCRIPT>this <B>will not</B> be +parsed</SCRIPT>")
+
+--> ((:script "this <B>will not</B> be parsed"))

+ +

XML and SGML <! tags

+ +

Since, some HTML pages contain special XML/SGML tags, non-comment tags
+starting with '<!' are treated specially:
+
+(parse-html "<!doctype this is some text>")
+
+--> ((:!doctype " this is some text"))

+ +

Illegal and Deprecated HTML

+ +

There is plenty of illegal and deprecated HTML on the web that popular browsers
+nonetheless successfully display. The parse-html parser is generous - it will not
+raise an error condition upon encountering most input. In particular, it does not
+maintain a list of legal HTML tags and will successfully parse nonsense input.
+
+For example,
+
+(parse-html "<this> <is> <some> <nonsense> +<input>")
+
+--> ((:this (:is (:some (:nonsense :input)))))
+
+In some situations, you may prefer a two-pass parse that results in a parse where
+deep nesting related to unrecognized tags is minimized:
+
+(let ((string "<this> <is> <some> <nonsense> </some> +<input>"))
+        (multiple-value-bind (res rogues)
+          (parse-html string +:collect-rogue-tags t)
+            (declare (ignorable +res))
+            (parse-html string +:no-body-tags rogues)))
+
+--> (:this :is (:some (:nonsense)) :input)
+
+See the :collect-rogue-tags and :no-body-tags argument +descriptions in the reference
+section below for more information.

+ +

Default Attribute values

+ +

As per the HTML 4.0 specification, attributes without specified values are given a +lower case
+string value that matches the attribute name.
+
+For example,
+
+(parse-html "<P here ARE some attributes>")
+
+--> (((:p :here "here" :are "are" :some "some" +:attributes "attributes")))

+ +

Interleaved Character Formatting Tags

+ +

Existing HTML pages often have character format tags that are interleaved among
+other tags. Such interleaving is removed in a manner consistent with the HTML 4.0
+specification.
+
+For example,
+
+(parse-html "<P>Here is <B>bold text<P>that spans</B>two +paragraphs")
+
+--> ((:p "Here is " (:b "bold text")) (:p (:b "that +spans") "two paragraphs"))

+ +

parse-html Reference
+
+parse-html [Generic function]
+
+Arguments: input-source &key callbacks callback-only
+ collect-rogue-tags +no-body-tags parse-entities
+
+Returns LHTML output, as described above.
+
+The callbacks argument, if non-nil, should be an association list. Each list member's
+car (first) element specifies a keyword package symbol, and each list member's cdr (rest)
+element specifies a function object or a symbol naming a function. The function should
+expect one argument. The function will be invoked once for each time the HTML tag
+corresponding to the specified keyword package symbol is encountered in the HTML input; +the
+argument will be an LHTML list containing the tag, along with associated attributes and
+content. The default callbacks argument value is nil.
+
+The callback-only argument, if non-nil, directs parse-html to not generate a complete +LHTML
+output. Instead, LHTML lists will only be generated when necessary as arguments for +functions
+specified in the callbacks association list. This results in faster parser execution. The +default
+callback-only argument value is nil.
+
+The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional +value,
+a list containing any unrecognized tags closed by the end of input.
+
+The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if
+encountered, will be treated as a tag with no body or content, and thus, no associated end
+tag. Typically, the argument is a list or modified list resulting from an earlier +parse-html
+execution with the :collect-rogue-tags argument specified as non-nil.

+ +

If the parse-entities argument is true then entities are converted to the character +they name. Thus for example the < entity is converted to the less than sign.
+
+parse-html Methods
+
+parse-html (p stream) &key callbacks callback-only
+            collect-rogue-tags +no-body-tags parse-entities
+
+parse-html (str string) &key callbacks callback-only
+            collect-rogue-tags +no-body-tags parse-entities
+
+parse-html (file t) &key callbacks callback-only
+            collect-rogue-tags +no-body-tags parse-entities
+
+The t method assumes the argument is a pathname suitable
+for use with the with-open-file macro.
+
+
+phtml-internal [Function]
+
+Arguments: stream read-sequence-func callback-only callbacks
+collect-rogue-tags no-body-tags parse-entities
+
+This function may be used when more control is needed for supplying
+the HTML input. The read-sequence-func argument, if non-nil, should be a function
+object or a symbol naming a function. When phtml-internal requires another buffer
+of HTML input, it will invoke the read-sequence-func function with two arguments -
+the first argument is an internal buffer character array and the second argument is
+the phtml-internal stream argument. If read-sequence-fun is nil, phtml-internal
+will invoke read-sequence to fill the buffer. The read-sequence-func function must
+return the number of character array elements successfully stored in the buffer.
+
+
+
+
+
+
+
+

+ +