phtml.htm

   1 <html>
   2
   3 <head>
   4 <title>A Lisp Based HTML Parser</title>
   5 <meta name="GENERATOR" content="Microsoft FrontPage 3.0">
   6 </head>
   7
   8 <body>
   9
  10 <p><big><strong><big>A Lisp Based HTML Parser</big></strong></big></p>
  11
  12 <p><a href="#intro">Introduction/Simple Example</a><br>
  13 <a href="#lhtml">LHTML parse output format</a><br>
  14 <a href="#case">Case mode notes</a><br>
  15 <a href="#comment">Parsing HTML comments</a><br>
  16 <a href="#script">Parsing &lt;SCRIPT&gt; and &lt;STYLE&gt; tags</a><br>
  17 <a href="#sgml">Parsing SGML &lt;! tags</a><br>
  18 <a href="#illegal">Parsing Illegal and Deprecated Tags</a><br>
  19 <a href="#default">Default Attribute Values</a><br>
  20 <a href="#char">Parsing Interleaved Character Formatting Tags</a><br>
  21 <a href="#reference">parse-html reference</a><br>
  22 &nbsp;&nbsp; <a href="#methods">methods</a><br>
  23 &nbsp;&nbsp; <a href="#internal">phtml-internal</a></p>
  24
  25 <p><a name="intro"></a>The <strong>parse-html</strong> generic function processes HTML
  26 input, returning a list of HTML tags, attributes, and text. Here is a simple example:<br>
  27 <br>
  28 (parse-html &quot;&lt;HTML&gt;<br>
  29 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  30 &lt;HEAD&gt;<br>
  31 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  32 &lt;TITLE&gt;Example HTML input&lt;/TITLE&gt;<br>
  33 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  34 &lt;BODY&gt;<br>
  35 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  36 &lt;P&gt;Here is some text with a &lt;B&gt;bold&lt;/B&gt; word&lt;br&gt;and a &lt;A
  37 HREF=\&quot;help.html\&quot;&gt;link&lt;/P&gt;<br>
  38 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  39 &lt;/HTML&gt;&quot;)</p>
  40
  41 <p>generates:<br>
  42 <br>
  43 ((:html (:head (:title &quot;Example HTML input&quot;))<br>
  44 &nbsp; (:body (:p &quot;Here is some text with a &quot; (:b &quot;bold&quot;) &quot;
  45 word&quot; :br &quot;and a &quot; <br>
  46 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  47 ((:a :href &quot;help.html&quot;) &quot;link&quot;)))))<br>
  48 </p>
  49
  50 <p>The output format is known as LHTML format; it is the same format that the<br>
  51 aserve htmlgen macro accepts. <br>
  52 <br>
  53 <a name="lhtml"></a><strong><big>LHTML format</big></strong><br>
  54 <br>
  55 LHTML is a list representation of HTML tags and content.<br>
  56 <br>
  57 Each list member may be:
  58
  59 <ol>
  60   <li>a string containing text content, such as &quot;Here is some text with a &quot;<br>
  61   </li>
  62   <li>a keyword package symbol representing an HTML tag with no associated attributes <br>
  63     or content, such as :br.<br>
  64   </li>
  65   <li>a list representing an HTML tag with associated attributes and/or content,<br>
  66     such as (:b &quot;bold&quot;) or ((:a :href &quot;help.html&quot;) &quot;link&quot;). If
  67     the HTML tag<br>
  68     does not have associated attributes, then the first list member will be a<br>
  69     keyword package symbol representing the HTML tag, and the other elements will <br>
  70     represent the content, which can be a string (text content), a keyword package symbol
  71     (HTML<br>
  72     tag with no attributes or content), or list (nested HTML tag with<br>
  73     associated attributes and/or content). If there are associated attributes,<br>
  74     then the first list member will be a list containing a keyword package symbol<br>
  75     followed by two list members for each associated attribute; the first member is a keyword<br>
  76     package symbol representing the attribute, and the next member is a string corresponding<br>
  77     to the attribute value.<br>
  78   </li>
  79 </ol>
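<p>The three cases above can be told apart with a few small helper functions. The
following is a minimal sketch in portable Common Lisp; the names lhtml-tag,
lhtml-attributes, lhtml-content, and lhtml-text are hypothetical helpers introduced here
for illustration and are not part of the parser.</p>

<pre>
(defun lhtml-tag (node)
  "Return the tag keyword of NODE, or NIL when NODE is a text string."
  (cond ((stringp node) nil)
        ((keywordp node) node)                       ; bare tag such as :br
        ((consp (first node)) (first (first node)))  ; ((:a :href "x") "link")
        (t (first node))))                           ; (:b "bold")

(defun lhtml-attributes (node)
  "Return the attribute property list of NODE, or NIL if it has none."
  (when (and (consp node) (consp (first node)))
    (rest (first node))))

(defun lhtml-content (node)
  "Return the child items of NODE: strings, keywords, or nested lists."
  (when (consp node)
    (rest node)))

(defun lhtml-text (node)
  "Concatenate every text string found in NODE, depth first."
  (with-output-to-string (out)
    (labels ((walk (item)
               (cond ((stringp item) (write-string item out))
                     ;; Skip the tag (and its attributes); visit the content.
                     ((consp item) (mapc #'walk (rest item))))))
      (walk node))))
</pre>

<p>For example, (lhtml-text '(:p &quot;Here is &quot; (:b &quot;bold&quot;) &quot;
text&quot;)) returns &quot;Here is bold text&quot;. Attribute values are not included
because they live in the first list member, which the walk skips. Note that this sketch
makes no exception for any tag, so text held in lists such as (:comment ...) or
(:script ...) is collected as well.</p>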
  80
  81 <p><a name="case"></a><strong>Case Mode and LHTML</strong></p>
  82
  83 <p>If excl:*current-case-mode* is :CASE-INSENSITIVE-UPPER, keyword package symbols will be<br>
  84 in upper case; otherwise, they will be in lower case.</p>
  85
  86 <p><a name="comment"></a><strong>HTML Comments</strong></p>
  87
  88 <p>HTML comments are represented using a :comment symbol. For example,<br>
  89 <br>
  90 (parse-html &quot;&lt;!-- this is a comment--&gt;&quot;)<br>
  91 <br>
  92 --&gt; ((:comment &quot; this is a comment&quot;))</p>
  93
  94 <p><a name="script"></a><strong>HTML &lt;SCRIPT&gt; and &lt;STYLE&gt; tags</strong></p>
  95
  96 <p>The content of &lt;SCRIPT&gt; and &lt;STYLE&gt; tags is not parsed; it is returned as text
  97 content.<br>
  98 <br>
  99 For example,<br>
 100 <br>
 101 (parse-html &quot;&lt;SCRIPT&gt;this &lt;B&gt;will not&lt;/B&gt; be
 102 parsed&lt;/SCRIPT&gt;&quot;)<br>
 103 <br>
 104 --&gt; ((:script &quot;this &lt;B&gt;will not&lt;/B&gt; be parsed&quot;))</p>
 105
 106 <p><a name="sgml"></a><strong>XML and SGML &lt;! tags</strong></p>
 107
 108 <p>Since some HTML pages contain special XML/SGML tags, non-comment tags<br>
 109 starting with '&lt;!' are treated specially:<br>
 110 <br>
 111 (parse-html &quot;&lt;!doctype this is some text&gt;&quot;)<br>
 112 <br>
 113 --&gt; ((:!doctype &quot; this is some text&quot;))</p>
 114
 115 <p><a name="illegal"></a><strong>Illegal and Deprecated HTML</strong></p>
 116
 117 <p>There is plenty of illegal and deprecated HTML on the web that popular browsers<br>
 118 nonetheless successfully display. The parse-html parser is generous - it will not<br>
 119 raise an error condition upon encountering most input. In particular, it does not<br>
 120 maintain a list of legal HTML tags and will successfully parse nonsense input.<br>
 121 <br>
 122 For example,<br>
 123 <br>
 124 (parse-html &quot;&lt;this&gt; &lt;is&gt; &lt;some&gt; &lt;nonsense&gt;
 125 &lt;input&gt;&quot;)<br>
 126 <br>
 127 --&gt; ((:this (:is (:some (:nonsense :input)))))<br>
 128 <br>
 129 In some situations, you may prefer a two-pass parse that minimizes the deep<br>
 130 nesting caused by unrecognized tags:<br>
 131 <br>
 132 (let ((string &quot;&lt;this&gt; &lt;is&gt; &lt;some&gt; &lt;nonsense&gt; &lt;/some&gt;
 133 &lt;input&gt;&quot;))<br>
 134 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (multiple-value-bind (res rogues)<br>
 135 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (parse-html string
 136 :collect-rogue-tags t)<br>
 137 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (declare (ignorable
 138 res))<br>
 139 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (parse-html string
 140 :no-body-tags rogues)))<br>
 141 <br>
 142 --&gt; (:this :is (:some (:nonsense)) :input)<br>
 143 <br>
 144 See the <strong>:collect-rogue-tags</strong> and <strong>:no-body-tags</strong> argument
 145 descriptions in the reference<br>
 146 section below for more information.</p>
 147
 148 <p><a name="default"></a><strong>Default Attribute Values</strong></p>
 149
 150 <p>As per the HTML 4.0 specification, attributes without specified values are given a
 151 lower-case<br>
 152 string value that matches the attribute name.<br>
 153 <br>
 154 For example,<br>
 155 <br>
 156 (parse-html &quot;&lt;P here ARE some attributes&gt;&quot;)<br>
 157 <br>
 158 --&gt; (((:p :here &quot;here&quot; :are &quot;are&quot; :some &quot;some&quot;
 159 :attributes &quot;attributes&quot;)))</p>
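<p>Code that consumes the parse result may want to spot attributes whose value looks
defaulted. The sketch below is one way to do that in portable Common Lisp;
defaulted-attribute-p and defaulted-attributes are hypothetical names, not part of the
parser. It cannot distinguish a defaulted value from one the page author spelled out
identically, such as here=&quot;here&quot;.</p>

<pre>
(defun defaulted-attribute-p (name value)
  "True when VALUE is the lower-case spelling of the attribute NAME."
  (string= value (string-downcase (symbol-name name))))

(defun defaulted-attributes (node)
  "Return the attribute keywords of NODE whose value matches their name."
  (when (and (consp node) (consp (first node)))
    ;; The attribute list alternates keyword, string, keyword, string ...
    (loop for (name value) on (rest (first node)) by #'cddr
          when (defaulted-attribute-p name value)
            collect name)))
</pre>

<p>Applied to ((:p :here &quot;here&quot; :are &quot;are&quot; :id &quot;x&quot;)), the
second function returns (:here :are).</p>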
 160
 161 <p><a name="char"></a><strong>Interleaved Character Formatting Tags</strong></p>
 162
 163 <p>Existing HTML pages often have character format tags that are interleaved among<br>
 164 other tags. Such interleaving is removed in a manner consistent with the HTML 4.0<br>
 165 specification.<br>
 166 <br>
 167 For example,<br>
 168 <br>
 169 (parse-html &quot;&lt;P&gt;Here is &lt;B&gt;bold text&lt;P&gt;that spans&lt;/B&gt;two
 170 paragraphs&quot;)<br>
 171 <br>
 172 --&gt; ((:p &quot;Here is &quot; (:b &quot;bold text&quot;)) (:p (:b &quot;that
 173 spans&quot;) &quot;two paragraphs&quot;))</p>
 174
 175 <hr>
 176
 177 <p><a name="reference"></a><strong><big>parse-html Reference</big></strong><br>
 178 <br>
 179 parse-html [Generic function]<br>
 180 <br>
 181 Arguments: input-source &amp;key callbacks callback-only<br>
 182 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; collect-rogue-tags
 183 no-body-tags parse-entities<br>
 184 <br>
 185 Returns LHTML output, as described above.<br>
 186 <br>
 187 The callbacks argument, if non-nil, should be an association list. Each list member's<br>
 188 car (first) element specifies a keyword package symbol, and each list member's cdr (rest)<br>
 189 element specifies a function object or a symbol naming a function. The function should<br>
 190 expect one argument. The function will be invoked once for each time the HTML tag<br>
 191 corresponding to the specified keyword package symbol is encountered in the HTML input;
 192 the<br>
 193 argument will be an LHTML list containing the tag, along with associated attributes and<br>
 194 content. The default callbacks argument value is nil.<br>
 195 <br>
 196 The callback-only argument, if non-nil, directs parse-html not to generate complete
 197 LHTML<br>
 198 output. Instead, LHTML lists will only be generated when necessary as arguments for
 199 functions<br>
 200 specified in the callbacks association list. This results in faster parser execution. The
 201 default<br>
 202 callback-only argument value is nil.<br>
 203 <br>
 204 The collect-rogue-tags argument, if non-nil, directs parse-html to return an additional
 205 value, <br>
 206 a list containing any unrecognized tags closed by the end of input.<br>
 207 <br>
 208 The no-body-tags argument, if non-nil, should be a list containing unknown tags that, if<br>
 209 encountered, will be treated as a tag with no body or content, and thus, no associated end<br>
 210 tag. Typically, the argument is a list or modified list resulting from an earlier
 211 parse-html<br>
 212 execution with the :collect-rogue-tags argument specified as non-nil.</p>
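<p>As an illustration of the callbacks argument described above, the following sketch
collects link targets. The names *links*, collect-link, and *link-callbacks* are
hypothetical, and the parse-html call is shown only in a comment because it assumes the
parser is already loaded; the callback itself is plain Common Lisp and can be tried on an
LHTML list directly.</p>

<pre>
(defvar *links* '()
  "Hypothetical accumulator for href values seen by COLLECT-LINK.")

(defun collect-link (node)
  "Callback sketch: NODE is an LHTML list such as ((:a :href \"x\") \"text\")."
  (when (consp node)
    (let ((head (first node)))
      (when (consp head)                 ; the tag carries attributes
        (let ((href (getf (rest head) :href)))
          (when href
            (push href *links*)))))))

;; The association list pairs a tag keyword with the function to call.
(defvar *link-callbacks* (list (cons :a 'collect-link)))

;; With the parser loaded, a call would look like this (not run here):
;;   (parse-html html-string :callbacks *link-callbacks* :callback-only t)

;; The callback can be exercised directly on an LHTML list:
(collect-link '((:a :href "help.html") "link"))
</pre>

<p>After the last form runs, *links* holds (&quot;help.html&quot;). Combining the
callback with a non-nil callback-only argument fits this use, since only the values
gathered by the callback are wanted and not the full LHTML output.</p>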
 213
 214 <p>If the parse-entities argument is true, entities are converted to the characters
 215 they name. For example, the &amp;lt; entity is converted to the less-than sign.<br>
 216 <br>
 217 <a name="methods"></a><strong>parse-html Methods</strong><br>
 218 <br>
 219 parse-html (p stream) &amp;key callbacks callback-only<br>
 220 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; collect-rogue-tags
 221 no-body-tags parse-entities<br>
 222 <br>
 223 parse-html (str string) &amp;key callbacks callback-only<br>
 224 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; collect-rogue-tags
 225 no-body-tags parse-entities<br>
 226 <br>
 227 parse-html (file t) &amp;key callbacks callback-only<br>
 228 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; collect-rogue-tags
 229 no-body-tags parse-entities<br>
 230 <br>
 231 The t method assumes the argument is a pathname suitable<br>
 232 for use with the with-open-file macro.<br>
 233 <br>
 234 <br>
 235 <a name="internal"></a><strong>phtml-internal [Function]</strong><br>
 236 <br>
 237 Arguments: stream read-sequence-func callback-only callbacks<br>
 238 collect-rogue-tags no-body-tags parse-entities<br>
 239 <br>
 240 This function may be used when more control is needed for supplying<br>
 241 the HTML input. The read-sequence-func argument, if non-nil, should be a function<br>
 242 object or a symbol naming a function. When phtml-internal requires another buffer<br>
 243 of HTML input, it will invoke the read-sequence-func function with two arguments -<br>
 244 the first argument is an internal buffer character array and the second argument is<br>
 245 the phtml-internal stream argument. If read-sequence-func is nil, phtml-internal<br>
 246 will invoke read-sequence to fill the buffer. The read-sequence-func function must<br>
 247 return the number of character array elements successfully stored in the buffer.<br>
 255 </p>
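<p>The read-sequence-func contract described above (two arguments, returning the number of
characters stored) can be met by a thin wrapper around read-sequence. The sketch below
keeps a running total of the characters supplied; *chars-read* and counting-read-sequence
are hypothetical names. The call to phtml-internal itself is not shown.</p>

<pre>
(defvar *chars-read* 0
  "Hypothetical running total of characters handed to the parser.")

(defun counting-read-sequence (buffer stream)
  "Fill BUFFER from STREAM and return the number of characters stored."
  ;; READ-SEQUENCE returns the index of the first element it did not
  ;; update; filling starts at index 0, so that index is the count.
  (let ((count (read-sequence buffer stream)))
    (incf *chars-read* count)
    count))

;; Exercising the function on its own, without the parser:
(with-input-from-string (in "some input text")
  (let ((buffer (make-string 8)))
    (counting-read-sequence buffer in)))   ; returns 8 on the first call
</pre>

<p>A return value of 0 means that no characters were stored, which is what read-sequence
reports once the stream is exhausted.</p>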
 256 </body>
 257 </html>