"""A parser for HTML and XHTML."""

# This file is based on sgmllib.py, but the API is slightly different.

# XXX There should be a way to distinguish between PCDATA (parsed
# character data -- the normal case), RCDATA (replaceable character
# data -- only char and entity references and end tags are special)
# and CDATA (character data -- only end tags are special).


import re
import _markupbase

from html import unescape


__all__ = ['HTMLParser']

# Regular expressions used for parsing

interesting_normal = re.compile('[&<]')
incomplete = re.compile('&[a-zA-Z#]')

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')

starttagopen = re.compile('<[a-zA-Z]')
piclose = re.compile('>')
commentclose = re.compile(r'--\s*>')
# Note:
#  1) if you change tagfind/attrfind remember to update locatestarttagend too;
#  2) if you change tagfind/attrfind and/or locatestarttagend the parser will
#     explode, so don't do it.
# see http://www.w3.org/TR/html5/tokenization.html#tag-open-state
# and http://www.w3.org/TR/html5/tokenization.html#tag-name-state
tagfind_tolerant = re.compile(r'([a-zA-Z][^\t\n\r\f />\x00]*)(?:\s|/(?!>))*')
attrfind_tolerant = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*')
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
        \s*                          # possibly followed by a space
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')
# the HTML 5 spec, section 8.1.2.2, doesn't allow spaces between
# </ and the tag name, so maybe this should be fixed
endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')



class HTMLParser(_markupbase.ParserBase):
    """Find tags and other markup and call handler functions.

    Usage:
        p = HTMLParser()
        p.feed(data)
        ...
        p.close()

    Start tags are handled by calling self.handle_starttag() or
    self.handle_startendtag(); end tags by self.handle_endtag().  The
    data between tags is passed from the parser to the derived class
    by calling self.handle_data() with the data as argument (the data
    may be split up in arbitrary chunks).  If convert_charrefs is
    True the character references are converted automatically to the
    corresponding Unicode character (and self.handle_data() is no
    longer split in chunks), otherwise they are passed by calling
    self.handle_entityref() or self.handle_charref() with the string
    containing respectively the named or numeric reference as the
    argument.
    """

    CDATA_CONTENT_ELEMENTS = ("script", "style")

    def __init__(self, *, convert_charrefs=True):
        """Initialize and reset this instance.

        If convert_charrefs is True (the default), all character references
        are automatically converted to the corresponding Unicode characters.
        """
        self.convert_charrefs = convert_charrefs
        self.reset()

    def reset(self):
        """Reset this instance.  Loses all unprocessed data."""
        self.rawdata = ''
        self.lasttag = '???'
        self.interesting = interesting_normal
        self.cdata_elem = None
        _markupbase.ParserBase.reset(self)

    def feed(self, data):
        r"""Feed data to the parser.

        Call this as often as you want, with as little or as much text
        as you want (may include '\n').
        """
        self.rawdata = self.rawdata + data
        self.goahead(0)

    def close(self):
        """Handle any buffered data."""
        self.goahead(1)

    __starttag_text = None

    def get_starttag_text(self):
        """Return full source of start tag: '<...>'."""
        return self.__starttag_text

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)

    def clear_cdata_mode(self):
        self.interesting = interesting_normal
        self.cdata_elem = None

    # Internal -- handle data as far as reasonable.  May leave state
    # and data to be processed by a subsequent call.  If 'end' is
    # true, force handling all data as if followed by EOF marker.
    def goahead(self, end):
        rawdata = self.rawdata
        i = 0
        n = len(rawdata)
        while i < n:
            if self.convert_charrefs and not self.cdata_elem:
                j = rawdata.find('<', i)
                if j < 0:
                    # if we can't find the next <, either we are at the end
                    # or there's more text incoming.  If the latter is True,
                    # we can't pass the text to handle_data in case we have
                    # a charref cut in half at end.  Try to determine if
                    # this is the case before proceeding by looking for an
                    # & near the end and see if it's followed by a space or ;.
                    amppos = rawdata.rfind('&', max(i, n-34))
                    if (amppos >= 0 and
                        not re.compile(r'[\s;]').search(rawdata, amppos)):
                        break  # wait till we get all the text
                    j = n
            else:
                match = self.interesting.search(rawdata, i) # < or &
                if match:
                    j = match.start()
                else:
                    if self.cdata_elem:
                        break
                    j = n
            if i < j:
                if self.convert_charrefs and not self.cdata_elem:
                    self.handle_data(unescape(rawdata[i:j]))
                else:
                    self.handle_data(rawdata[i:j])
            i = self.updatepos(i, j)
            if i == n: break
            startswith = rawdata.startswith
            if startswith('<', i):
                if starttagopen.match(rawdata, i): # < + letter
                    k = self.parse_starttag(i)
                elif startswith("</", i):
                    k = self.parse_endtag(i)
                elif startswith("<!--", i):
                    k = self.parse_comment(i)
                elif startswith("<?", i):
                    k = self.parse_pi(i)
                elif startswith("<!", i):
                    k = self.parse_html_declaration(i)
                elif (i + 1) < n:
                    self.handle_data("<")
                    k = i + 1
                else:
                    break
                if k < 0:
                    if not end:
                        break
                    k = rawdata.find('>', i + 1)
                    if k < 0:
                        k = rawdata.find('<', i + 1)
                        if k < 0:
                            k = i + 1
                    else:
                        k += 1
                    if self.convert_charrefs and not self.cdata_elem:
                        self.handle_data(unescape(rawdata[i:k]))
                    else:
                        self.handle_data(rawdata[i:k])
                i = self.updatepos(i, k)
            elif startswith("&#", i):
                match = charref.match(rawdata, i)
                if match:
                    name = match.group()[2:-1]
                    self.handle_charref(name)
                    k = match.end()
                    if not startswith(';', k-1):
                        k = k - 1
                    i = self.updatepos(i, k)
                    continue
                else:
                    if ";" in rawdata[i:]:  # bail by consuming &#
                        self.handle_data(rawdata[i:i+2])
                        i = self.updatepos(i, i+2)
                    break
            elif startswith('&', i):
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    k = match.end()
                    if not startswith(';', k-1):
                        k = k - 1
                    i = self.updatepos(i, k)
                    continue
                match = incomplete.match(rawdata, i)
                if match:
                    # match.group() will contain at least 2 chars
                    if end and match.group() == rawdata[i:]:
                        k = match.end()
                        if k <= i:
                            k = n
                        i = self.updatepos(i, i + 1)
                    # incomplete
                    break
                elif (i + 1) < n:
                    # not the end of the buffer, and can't be confused
                    # with some other construct
                    self.handle_data("&")
                    i = self.updatepos(i, i + 1)
                else:
                    break
            else:
                assert 0, "interesting.search() lied"
        # end while
        if end and i < n and not self.cdata_elem:
            if self.convert_charrefs and not self.cdata_elem:
                self.handle_data(unescape(rawdata[i:n]))
            else:
                self.handle_data(rawdata[i:n])
            i = self.updatepos(i, n)
        self.rawdata = rawdata[i:]

    # Internal -- parse html declarations, return length or -1 if not terminated
    # See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
    # See also parse_declaration in _markupbase
    def parse_html_declaration(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == '<!', ('unexpected call to '
                                        'parse_html_declaration()')
        if rawdata[i:i+4] == '<!--':
            # this case is actually already handled in goahead()
            return self.parse_comment(i)
        elif rawdata[i:i+3] == '<![':
            return self.parse_marked_section(i)
        elif rawdata[i:i+9].lower() == '<!doctype':
            # find the closing >
            gtpos = rawdata.find('>', i+9)
            if gtpos == -1:
                return -1
            self.handle_decl(rawdata[i+2:gtpos])
            return gtpos+1
        else:
            return self.parse_bogus_comment(i)

    # Internal -- parse bogus comment, return length or -1 if not terminated
    # see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
    def parse_bogus_comment(self, i, report=1):
        rawdata = self.rawdata
        assert rawdata[i:i+2] in ('<!', '</'), ('unexpected call to '
                                                'parse_comment()')
        pos = rawdata.find('>', i+2)
        if pos == -1:
            return -1
        if report:
            self.handle_comment(rawdata[i+2:pos])
        return pos + 1

    # Internal -- parse processing instr, return end or -1 if not terminated
    def parse_pi(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == '<?', 'unexpected call to parse_pi()'
        match = piclose.search(rawdata, i+2) # >
        if not match:
            return -1
        j = match.start()
        self.handle_pi(rawdata[i+2: j])
        j = match.end()
        return j

    # Internal -- handle starttag, return end or -1 if not terminated
    def parse_starttag(self, i):
        self.__starttag_text = None
        endpos = self.check_for_whole_start_tag(i)
        if endpos < 0:
            return endpos
        rawdata = self.rawdata
        self.__starttag_text = rawdata[i:endpos]

        # Now parse the data between i+1 and j into a tag and attrs
        attrs = []
        match = tagfind_tolerant.match(rawdata, i+1)
        assert match, 'unexpected call to parse_starttag()'
        k = match.end()
        self.lasttag = tag = match.group(1).lower()
        while k < endpos:
            m = attrfind_tolerant.match(rawdata, k)
            if not m:
                break
            attrname, rest, attrvalue = m.group(1, 2, 3)
            if not rest:
                attrvalue = None
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
            if attrvalue:
                attrvalue = unescape(attrvalue)
            attrs.append((attrname.lower(), attrvalue))
            k = m.end()

        end = rawdata[k:endpos].strip()
        if end not in (">", "/>"):
            self.handle_data(rawdata[i:endpos])
            return endpos
        if end.endswith('/>'):
            # XHTML-style empty tag: <span attr="value" />
            self.handle_startendtag(tag, attrs)
        else:
            self.handle_starttag(tag, attrs)
            if tag in self.CDATA_CONTENT_ELEMENTS:
                self.set_cdata_mode(tag)
        return endpos

    # Internal -- check to see if we have a complete starttag; return end
    # or -1 if incomplete.
    def check_for_whole_start_tag(self, i):
        rawdata = self.rawdata
        m = locatestarttagend_tolerant.match(rawdata, i)
        if m:
            j = m.end()
            next = rawdata[j:j+1]
            if next == ">":
                return j + 1
            if next == "/":
                if rawdata.startswith("/>", j):
                    return j + 2
                if rawdata.startswith("/", j):
                    # buffer boundary
                    return -1
                # else bogus input
                if j > i:
                    return j
                else:
                    return i + 1
            if next == "":
                # end of input
                return -1
            if next in ("abcdefghijklmnopqrstuvwxyz=/"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
                # end of input in or before attribute value, or we have the
                # '/' from a '/>' ending
                return -1
            if j > i:
                return j
            else:
                return i + 1
        raise AssertionError("we should not get here!")

    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
        match = endendtag.search(rawdata, i+1) # >
        if not match:
            return -1
        gtpos = match.end()
        match = endtagfind.match(rawdata, i) # </ + tag + >
        if not match:
            if self.cdata_elem is not None:
                self.handle_data(rawdata[i:gtpos])
                return gtpos
            # find the name: w3.org/TR/html5/tokenization.html#tag-name-state
            namematch = tagfind_tolerant.match(rawdata, i+2)
            if not namematch:
                # w3.org/TR/html5/tokenization.html#end-tag-open-state
                if rawdata[i:i+3] == '</>':
                    return i+3
                else:
                    return self.parse_bogus_comment(i)
            tagname = namematch.group(1).lower()
            # consume and ignore other stuff between the name and the >
            # Note: this is not 100% correct, since we might have things like
            # </tag attr=">">, but looking for > after the name should cover
            # most of the cases and is much simpler
            gtpos = rawdata.find('>', namematch.end())
            self.handle_endtag(tagname)
            return gtpos+1

        elem = match.group(1).lower() # script or style
        if self.cdata_elem is not None:
            if elem != self.cdata_elem:
                self.handle_data(rawdata[i:gtpos])
                return gtpos

        self.handle_endtag(elem)
        self.clear_cdata_mode()
        return gtpos

    # Overridable -- finish processing of start+end tag: <tag.../>
    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.handle_endtag(tag)

    # Overridable -- handle start tag
    def handle_starttag(self, tag, attrs):
        pass

    # Overridable -- handle end tag
    def handle_endtag(self, tag):
        pass

    # Overridable -- handle character reference
    def handle_charref(self, name):
        pass

    # Overridable -- handle entity reference
    def handle_entityref(self, name):
        pass

    # Overridable -- handle data
    def handle_data(self, data):
        pass

    # Overridable -- handle comment
    def handle_comment(self, data):
        pass

    # Overridable -- handle declaration
    def handle_decl(self, decl):
        pass

    # Overridable -- handle processing instruction
    def handle_pi(self, data):
        pass

    def unknown_decl(self, data):
        pass
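

# Illustrative usage sketch -- NOT part of the original module.
#
# The overridable handle_*() methods above are no-ops, so the parser only
# becomes useful once a subclass overrides them.  The names below
# (_ExampleLinkCollector, _example_collect_links) are hypothetical and exist
# only for this example; the sketch assumes the default convert_charrefs=True,
# so character references in both text and attribute values arrive unescaped.
class _ExampleLinkCollector(HTMLParser):
    """Hypothetical subclass: records <a href=...> values and text data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []   # href values, in document order
        self.text = []    # chunks passed to handle_data()

    def handle_starttag(self, tag, attrs):
        # tag and attribute names are already lowercased by parse_starttag();
        # an attribute without a value is reported with value None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def _example_collect_links(html_text):
    """Hypothetical helper: feed one document, return (links, text)."""
    parser = _ExampleLinkCollector()
    parser.feed(html_text)
    parser.close()  # flush any data still buffered in rawdata
    return parser.links, "".join(parser.text)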