<title>A Lisp Based HTML Parser</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<p><big><strong><big>A Lisp Based HTML Parser</big></strong></big></p>
<p><a href="#intro">Introduction/Simple Example</a><br>
<a href="#lhtml">LHTML parse output format</a><br>
<a href="#case">Case mode notes</a><br>
<a href="#comment">Parsing HTML comments</a><br>
<a href="#script">Parsing <SCRIPT> and <STYLE> tags</a><br>
<a href="#sgml">Parsing SGML <! tags</a><br>
<a href="#illegal">Parsing Illegal and Deprecated Tags</a><br>
<a href="#default">Default Attribute Values</a><br>
<a href="#char">Parsing Interleaved Character Formatting Tags</a><br>
<a href="#reference">parse-html reference</a><br>
<a href="#methods">methods</a><br>
<a href="#internal">phtml-internal</a></p>
<p><a name="intro"></a>The <strong>parse-html</strong> generic function processes HTML
input, returning a list of HTML tags, attributes, and text. Here is a simple example:<br>
(parse-html "<HTML><br>
<HEAD><br>
<TITLE>Example HTML input</TITLE><br>
<BODY><br>
<P>Here is some text with a <B>bold</B> word<br>and a <A HREF=\"help.html\">link</P><br>
</HTML>")</p>
((:html (:head (:title "Example HTML input"))<br>
(:body (:p "Here is some text with a " (:b "bold") " word" :br "and a "<br>
((:a :href "help.html") "link")))))<br>
<p>The output format is known as LHTML format; it is the same format that the
aserve htmlgen macro accepts.<br>
<a name="lhtml"></a><strong><big>LHTML format</big></strong><br>
LHTML is a list representation of HTML tags and content.<br>
Each list member may be:
<li>a string containing text content, such as "Here is some text with a "<br>
<li>a keyword package symbol representing an HTML tag with no associated attributes
or content, such as :br.<br>
<li>a list representing an HTML tag with associated attributes and/or content,
such as (:b "bold") or ((:a :href "help.html") "link"). If the tag
does not have associated attributes, then the first list member will be a
keyword package symbol representing the HTML tag, and the other elements will
represent the content, which can be a string (text content), a keyword package symbol (HTML
tag with no attributes or content), or a list (nested HTML tag with
associated attributes and/or content). If there are associated attributes,
then the first list member will be a list containing a keyword package symbol
followed by two list members for each associated attribute; the first member is a keyword
package symbol representing the attribute, and the next member is a string corresponding
to the attribute value.<br>
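<p>The three node shapes described above can be taken apart with a few small helpers. The following is only a sketch in portable Common Lisp; the function names (lhtml-tag, lhtml-attributes, lhtml-content, lhtml-text) are hypothetical and are not part of the parser.</p>

```lisp
;; Sketch only: these helper names are hypothetical, not part of the parser.

(defun lhtml-tag (node)
  "Return the tag keyword of NODE, or NIL if NODE is a text string."
  (cond ((stringp node) nil)
        ((keywordp node) node)                       ; bare tag, e.g. :br
        ((consp (first node)) (first (first node)))  ; ((:a :href "x") ...)
        (t (first node))))                           ; (:b "bold")

(defun lhtml-attributes (node)
  "Return the attribute property list of NODE, or NIL if it has none."
  (if (and (consp node) (consp (first node)))
      (rest (first node))
      nil))

(defun lhtml-content (node)
  "Return the list of content items nested inside NODE."
  (if (consp node) (rest node) nil))

(defun lhtml-text (node)
  "Concatenate all strings found in NODE, depth first."
  (cond ((stringp node) node)
        ((consp node)
         (apply #'concatenate 'string
                (mapcar #'lhtml-text (lhtml-content node))))
        (t "")))                                     ; bare keyword: no text
```

<p>For instance, applying lhtml-attributes to ((:a :href "help.html") "link") yields the property list (:href "help.html"), which can then be searched with getf.</p>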
<p><a name="case"></a><strong>Case Mode and LHTML</strong></p>
<p>If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be
in upper case; otherwise, they will be in lower case.</p>
<p><a name="comment"></a><strong>HTML Comments</strong></p>
<p>HTML comments are represented using a :comment symbol. For example,<br>
(parse-html "<!-- this is a comment-->")<br>
--> ((:comment " this is a comment"))</p>
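<p>Because comments appear as ordinary (:comment ...) lists, they can be filtered out of a parse result with a short recursive walk. The sketch below assumes only the LHTML shapes described in this document; the names comment-node-p and remove-comments are hypothetical.</p>

```lisp
;; Sketch only: hypothetical helpers, not part of the parser.

(defun comment-node-p (node)
  "True if NODE is an LHTML comment such as (:comment \" text\")."
  (and (consp node) (eq (first node) :comment)))

(defun remove-comments (forms)
  "Return a copy of the LHTML list FORMS with all comment nodes dropped."
  (loop for item in forms
        unless (comment-node-p item)
          collect (if (consp item)
                      ;; Keep the tag (or tag-plus-attributes) head as is
                      ;; and filter the nested content.
                      (cons (first item) (remove-comments (rest item)))
                      item)))
```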
<p><a name="script"></a><strong>HTML <SCRIPT> and <STYLE> tags</strong></p>
<p>The content of <SCRIPT> and <STYLE> tags is not parsed; it is returned as text:<br>
(parse-html "<SCRIPT>this <B>will not</B> be parsed</SCRIPT>")<br>
--> ((:script "this <B>will not</B> be parsed"))</p>
<p><a name="sgml"></a><strong>XML and SGML <! tags</strong></p>
<p>Since some HTML pages contain special XML/SGML tags, non-comment tags
starting with '<!' are treated specially:<br>
(parse-html "<!doctype this is some text>")<br>
--> ((:!doctype " this is some text"))</p>
<p><a name="illegal"></a><strong>Illegal and Deprecated HTML</strong></p>
<p>There is plenty of illegal and deprecated HTML on the web that popular browsers
nonetheless successfully display. The parse-html parser is generous: it will not
raise an error condition upon encountering most input. In particular, it does not
maintain a list of legal HTML tags and will successfully parse nonsense input.<br>
(parse-html "<this> <is> <some> <nonsense> <input>")<br>
--> ((:this (:is (:some (:nonsense :input)))))<br>
In some situations, you may prefer a two-pass parse whose result minimizes the
deep nesting caused by unrecognized tags:<br>
(let ((string "<this> <is> <some> <nonsense> </some> <input>"))<br>
(multiple-value-bind (res rogues)<br>
(parse-html string :collect-rogue-tags t)<br>
(declare (ignorable res))<br>
(parse-html string :no-body-tags rogues)))<br>
--> (:this :is (:some (:nonsense)) :input)<br>
See the <strong>:collect-rogue-tags</strong> and <strong>:no-body-tags</strong> argument
descriptions in the reference section below for more information.</p>
<p><a name="default"></a><strong>Default Attribute values</strong></p>
<p>As per the HTML 4.0 specification, attributes without specified values are given a
string value that matches the attribute name.<br>
(parse-html "<P here ARE some attributes>")<br>
--> (((:p :here "here" :are "are" :some "some" :attributes "attributes")))</p>
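<p>Since attributes are stored as alternating keyword and string members, a value can be looked up with getf. The sketch below uses hypothetical helper names (attribute-value, defaulted-attribute-p). Note that the second helper cannot tell a defaulted value apart from one that was written out explicitly with the same text.</p>

```lisp
;; Sketch only: hypothetical helpers, not part of the parser.

(defun attribute-value (node name)
  "Return the string value of attribute NAME in the LHTML NODE, or NIL."
  (when (and (consp node) (consp (first node)))
    (getf (rest (first node)) name)))

(defun defaulted-attribute-p (node name)
  "True if the value of NAME matches the attribute name, ignoring case.
This is what an attribute written without a value looks like, although an
explicitly written value with the same text gives the same result."
  (let ((value (attribute-value node name)))
    (and value
         (string-equal value (symbol-name name)))))
```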
<p><a name="char"></a><strong>Interleaved Character Formatting Tags</strong></p>
<p>Existing HTML pages often have character format tags that are interleaved among
other tags. Such interleaving is removed in a manner consistent with the HTML 4.0
specification.<br>
(parse-html "<P>Here is <B>bold text<P>that spans</B>two paragraphs")<br>
--> ((:p "Here is " (:b "bold text")) (:p (:b "that spans") "two paragraphs"))</p>
<p><a name="reference"></a><strong><big>parse-html Reference</big></strong><br>
parse-html [Generic function]<br>
Arguments: input-source &key callbacks callback-only
collect-rogue-tags no-body-tags<br>
Returns LHTML output, as described above.<br>
The callbacks argument, if non-nil, should be an association list. Each list member's
car (first) element specifies a keyword package symbol, and each list member's cdr (rest)
element specifies a function object or a symbol naming a function. The function should
expect one argument. The function will be invoked once each time the HTML tag
corresponding to the specified keyword package symbol is encountered in the HTML input;
the argument will be an LHTML list containing the tag, along with associated attributes and
content. The default callbacks argument value is nil.<br>
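As a rough illustration of the callbacks argument, the sketch below builds an association list that maps :a to a function collecting HREF values. The names note-link, *links*, and *callbacks* are hypothetical, and the final form invokes the callback by hand on a sample LHTML list rather than through the parser.<br>

```lisp
;; Sketch only: NOTE-LINK, *LINKS*, and *CALLBACKS* are hypothetical names.

(defvar *links* '()
  "HREF values collected by NOTE-LINK, most recent first.")

(defun note-link (element)
  "Callback taking one argument: an LHTML list for an anchor tag,
for example ((:a :href \"help.html\") \"link\")."
  (let ((head (first element)))
    (when (consp head)                      ; the tag carries attributes
      (let ((href (getf (rest head) :href)))
        (when href
          (push href *links*))))))

;; Each entry pairs a tag keyword (car) with a function designator (cdr).
(defparameter *callbacks*
  (list (cons :a 'note-link)))

;; With the parser loaded, the list would be supplied as the :callbacks
;; argument of parse-html. Here one invocation is simulated by hand.
(funcall (cdr (assoc :a *callbacks*))
         '((:a :href "help.html") "link"))
```

After the last form runs, *links* holds the single string "help.html".<br>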
The callback-only argument, if non-nil, directs parse-html to not generate a complete
LHTML output. Instead, LHTML lists will only be generated when necessary as arguments for
functions specified in the callbacks association list. This results in faster parser execution. The
default callback-only argument value is nil.<br>
The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional
value, a list containing any unrecognized tags closed by the end of input.<br>
The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if
encountered, will be treated as a tag with no body or content, and thus, no associated end
tag. Typically, the argument is a list or modified list resulting from an earlier
parse-html execution with the :collect-rogue-tags argument specified as non-nil.<br>
<a name="methods"></a><strong>parse-html Methods</strong><br>
parse-html (p stream) &key callbacks callback-only
collect-rogue-tags no-body-tags<br>
parse-html (str string) &key callbacks callback-only
collect-rogue-tags no-body-tags<br>
parse-html (file t) &key callbacks callback-only
collect-rogue-tags no-body-tags<br>
The t method assumes the argument is a pathname suitable
for use with the with-open-file macro.<br>
<a name="internal"></a><strong>phtml-internal [Function]</strong><br>
Arguments: stream read-sequence-func callback-only callbacks
collect-rogue-tags no-body-tags<br>
This function may be used when more control is needed for supplying
the HTML input. The read-sequence-func argument, if non-nil, should be a function
object or a symbol naming a function. When phtml-internal requires another buffer
of HTML input, it will invoke the read-sequence-func function with two arguments:
the first argument is an internal buffer character array and the second argument is
the phtml-internal stream argument. If read-sequence-func is nil, phtml-internal
will invoke read-sequence to fill the buffer. The read-sequence-func function must
return the number of character array elements successfully stored in the buffer.<br>
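A function meeting this contract can be quite small. The sketch below wraps the standard read-sequence function and keeps a running total of characters read; the names counting-read-sequence and *characters-read* are hypothetical. It assumes the buffer is always filled from index 0, in which case the value returned by read-sequence equals the number of elements stored.<br>

```lisp
;; Sketch only: hypothetical names, written against the two-argument
;; contract described above (buffer first, stream second).

(defvar *characters-read* 0
  "Running total of characters delivered to the parser.")

(defun counting-read-sequence (buffer stream)
  "Fill BUFFER from STREAM and return the number of characters stored."
  ;; READ-SEQUENCE returns the index of the first element it did not
  ;; update; starting from index 0, that is the count of stored elements.
  (let ((count (read-sequence buffer stream)))
    (incf *characters-read* count)
    count))
```

The symbol counting-read-sequence, or its function object, would then be supplied as the read-sequence-func argument of phtml-internal.<br>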