Sisyphus repository
Last update: 1 october 2023 | SRPMs: 18631 | Visits: 37529140
en ru br
ALT Linux repos
S:1.0.0-alt1.hg20100824.1.1.2

Group :: Development/Python3
RPM: python3-module-sgmllib

 Main   Changelog   Spec   Patches   Sources   Download   Gear   Bugs and FR  Repocop 

pax_global_header00006660000000000000000000000064117404657160014525gustar00rootroot0000000000000052 comment=57d078f2d606acd053cd731b74c517f23cf14f86
python3-module-sgmllib-1.0.0/000075500000000000000000000000001174046571600160375ustar00rootroot00000000000000python3-module-sgmllib-1.0.0/.gear/000075500000000000000000000000001174046571600170335ustar00rootroot00000000000000python3-module-sgmllib-1.0.0/.gear/merge000064400000000000000000000000271174046571600200540ustar00rootroot00000000000000merge: upstream master
python3-module-sgmllib-1.0.0/.gear/rules000064400000000000000000000001141174046571600201040ustar00rootroot00000000000000spec: .gear/sgmllib.spec
tar: . name=@name@-@version@ base=@name@-@version@
python3-module-sgmllib-1.0.0/.gear/sgmllib.spec000064400000000000000000000017651174046571600213510ustar00rootroot00000000000000%define oname sgmllib
Name: python3-module-%oname
Version: 1.0.0
Release: alt1.hg20100824
Summary: Py3k port of the old stdlib module
License: BSD
Group: Development/Python3
Url: http://hg.hardcoded.net/sgmllib
Packager: Eugeny A. Rostovtsev (REAL) <real at altlinux.org>

# hg clone https://bitbucket.org/hsoft/sgmllib
Source: %name-%version.tar
BuildArch: noarch

BuildRequires(pre): rpm-build-python3
BuildPreReq: python3-devel python3-module-distribute

%description
sgmllib was dropped in Python 3. For those depending on it, that's
somewhat unfortunate. This is a quick and dirty port of this old module.
I just ran 2to3 on it and published it. I don't intend to maintain it,
so it might be a good idea to eventually think about finding another
module to use.

%prep
%setup

%build
%python3_build

%install
%python3_install

%files
%doc CHANGES README
%python3_sitelibdir/*

%changelog
* Mon Apr 09 2012 Eugeny A. Rostovtsev (REAL) <real at altlinux.org> 1.0.0-alt1.hg20100824
- Initial build for Sisyphus

python3-module-sgmllib-1.0.0/.hgignore000064400000000000000000000000351174046571600176400ustar00rootroot00000000000000syntax: glob

.DS_Store
*.pycpython3-module-sgmllib-1.0.0/CHANGES000064400000000000000000000001121174046571600170240ustar00rootroot00000000000000Version 1.0.0 -- 2010/08/24
---------------------------

* Initial Releasepython3-module-sgmllib-1.0.0/LICENSE000064400000000000000000000027711174046571600170530ustar00rootroot00000000000000Copyright (c) 2010, Hardcoded Software Inc., http://www.hardcoded.net
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Hardcoded Software Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.python3-module-sgmllib-1.0.0/README000064400000000000000000000006761174046571600167300ustar00rootroot00000000000000==================================================
sgmllib3k -- Py3k port of the old stdlib module
==================================================

sgmllib was dropped in Python 3. For those depending on it, that's somewhat unfortunate. This is a quick and dirty port of this old module. I just ran 2to3 on it and published it. I don't intend to maintain it, so it might be a good idea to eventually think about finding another module to use.
python3-module-sgmllib-1.0.0/setup.py000064400000000000000000000011431174046571600175500ustar00rootroot00000000000000from setuptools import setup

CLASSIFIERS = [
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'License :: OSI Approved :: BSD License',
'Programming Language :: Python :: 3',
]

setup(
name='sgmllib3k',
version='1.0.0',
author='Hardcoded Software',
author_email='hsoft@hardcoded.net',
py_modules=['sgmllib'],
scripts=[],
url='http://hg.hardcoded.net/sgmllib',
license='BSD License',
description='Py3k port of sgmllib.',
long_description=open('README').read(),
classifiers=CLASSIFIERS,
test_suite='test_sgmllib',
)python3-module-sgmllib-1.0.0/sgmllib.py000064400000000000000000000425741174046571600200560ustar00rootroot00000000000000"""A parser for SGML, using the derived class as a static DTD."""

# XXX This only supports those SGML features used by HTML.

# XXX There should be a way to distinguish between PCDATA (parsed
# character data -- the normal case), RCDATA (replaceable character
# data -- only char and entity references and end tags are special)
# and CDATA (character data -- only end tags are special). RCDATA is
# not supported at all.

import _markupbase
import re

__all__ = ["SGMLParser", "SGMLParseError"]

# Regular expressions used for parsing

interesting = re.compile('[&<]')
incomplete = re.compile('&([a-zA-Z][a-zA-Z0-9]*|#[0-9]*)?|'
'<([a-zA-Z][^<>]*|'
'/([a-zA-Z][^<>]*)?|'
'![^<>]*)?')

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
charref = re.compile('&#([0-9]+)[^0-9]')

starttagopen = re.compile('<[>a-zA-Z]')
shorttagopen = re.compile('<[a-zA-Z][-.a-zA-Z0-9]*/')
shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')
piclose = re.compile('>')
endbracket = re.compile('[<>]')
tagfind = re.compile('[a-zA-Z][-_.a-zA-Z0-9]*')
attrfind = re.compile(
r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[][\-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')


class SGMLParseError(RuntimeError):
"""Exception raised for all parse errors."""
pass


# SGML parser base class -- find tags and call handler functions.
# Usage: p = SGMLParser(); p.feed(data); ...; p.close().
# The dtd is defined by deriving a class which defines methods
# with special names to handle tags: start_foo and end_foo to handle
# <foo> and </foo>, respectively, or do_foo to handle <foo> by itself.
# (Tags are converted to lower case for this purpose.) The data
# between tags is passed to the parser by calling self.handle_data()
# with some data as argument (the data may be split up in arbitrary
# chunks). Entity references are passed by calling
# self.handle_entityref() with the entity reference as argument.

class SGMLParser(_markupbase.ParserBase):
# Definition of entities -- derived classes may override
entity_or_charref = re.compile('&(?:'
'([a-zA-Z][-.a-zA-Z0-9]*)|#([0-9]+)'
')(;?)')

def __init__(self, verbose=0):
"""Initialize and reset this instance."""
self.verbose = verbose
self.reset()

def reset(self):
"""Reset this instance. Loses all unprocessed data."""
self.__starttag_text = None
self.rawdata = ''
self.stack = []
self.lasttag = '???'
self.nomoretags = 0
self.literal = 0
_markupbase.ParserBase.reset(self)

def setnomoretags(self):
"""Enter literal mode (CDATA) till EOF.

Intended for derived classes only.
"""
self.nomoretags = self.literal = 1

def setliteral(self, *args):
"""Enter literal mode (CDATA).

Intended for derived classes only.
"""
self.literal = 1

def feed(self, data):
"""Feed some data to the parser.

Call this as often as you want, with as little or as much text
as you want (may include '\n'). (This just saves the text,
all the processing is done by goahead().)
"""

self.rawdata = self.rawdata + data
self.goahead(0)

def close(self):
"""Handle the remaining data."""
self.goahead(1)

def error(self, message):
raise SGMLParseError(message)

# Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
def goahead(self, end):
rawdata = self.rawdata
i = 0
n = len(rawdata)
while i < n:
if self.nomoretags:
self.handle_data(rawdata[i:n])
i = n
break
match = interesting.search(rawdata, i)
if match: j = match.start()
else: j = n
if i < j:
self.handle_data(rawdata[i:j])
i = j
if i == n: break
if rawdata[i] == '<':
if starttagopen.match(rawdata, i):
if self.literal:
self.handle_data(rawdata[i])
i = i+1
continue
k = self.parse_starttag(i)
if k < 0: break
i = k
continue
if rawdata.startswith("</", i):
k = self.parse_endtag(i)
if k < 0: break
i = k
self.literal = 0
continue
if self.literal:
if n > (i + 1):
self.handle_data("<")
i = i+1
else:
# incomplete
break
continue
if rawdata.startswith("<!--", i):
# Strictly speaking, a comment is --.*--
# within a declaration tag <!...>.
# This should be removed,
# and comments handled only in parse_declaration.
k = self.parse_comment(i)
if k < 0: break
i = k
continue
if rawdata.startswith("<?", i):
k = self.parse_pi(i)
if k < 0: break
i = i+k
continue
if rawdata.startswith("<!", i):
# This is some sort of declaration; in "HTML as
# deployed," this should only be the document type
# declaration ("<!DOCTYPE html...>").
k = self.parse_declaration(i)
if k < 0: break
i = k
continue
elif rawdata[i] == '&':
if self.literal:
self.handle_data(rawdata[i])
i = i+1
continue
match = charref.match(rawdata, i)
if match:
name = match.group(1)
self.handle_charref(name)
i = match.end(0)
if rawdata[i-1] != ';': i = i-1
continue
match = entityref.match(rawdata, i)
if match:
name = match.group(1)
self.handle_entityref(name)
i = match.end(0)
if rawdata[i-1] != ';': i = i-1
continue
else:
self.error('neither < nor & ??')
# We get here only if incomplete matches but
# nothing else
match = incomplete.match(rawdata, i)
if not match:
self.handle_data(rawdata[i])
i = i+1
continue
j = match.end(0)
if j == n:
break # Really incomplete
self.handle_data(rawdata[i:j])
i = j
# end while
if end and i < n:
self.handle_data(rawdata[i:n])
i = n
self.rawdata = rawdata[i:]
# XXX if end: check for empty stack

# Extensions for the DOCTYPE scanner:
_decl_otherchars = '='

# Internal -- parse processing instr, return length or -1 if not terminated
def parse_pi(self, i):
rawdata = self.rawdata
if rawdata[i:i+2] != '<?':
self.error('unexpected call to parse_pi()')
match = piclose.search(rawdata, i+2)
if not match:
return -1
j = match.start(0)
self.handle_pi(rawdata[i+2: j])
j = match.end(0)
return j-i

def get_starttag_text(self):
return self.__starttag_text

# Internal -- handle starttag, return length or -1 if not terminated
def parse_starttag(self, i):
self.__starttag_text = None
start_pos = i
rawdata = self.rawdata
if shorttagopen.match(rawdata, i):
# SGML shorthand: <tag/data/ == <tag>data</tag>
# XXX Can data contain &... (entity or char refs)?
# XXX Can data contain < or > (tag characters)?
# XXX Can there be whitespace before the first /?
match = shorttag.match(rawdata, i)
if not match:
return -1
tag, data = match.group(1, 2)
self.__starttag_text = '<%s/' % tag
tag = tag.lower()
k = match.end(0)
self.finish_shorttag(tag, data)
self.__starttag_text = rawdata[start_pos:match.end(1) + 1]
return k
# XXX The following should skip matching quotes (' or ")
# As a shortcut way to exit, this isn't so bad, but shouldn't
# be used to locate the actual end of the start tag since the
# < or > characters may be embedded in an attribute value.
match = endbracket.search(rawdata, i+1)
if not match:
return -1
j = match.start(0)
# Now parse the data between i+1 and j into a tag and attrs
attrs = []
if rawdata[i:i+2] == '<>':
# SGML shorthand: <> == <last open tag seen>
k = j
tag = self.lasttag
else:
match = tagfind.match(rawdata, i+1)
if not match:
self.error('unexpected call to parse_starttag')
k = match.end(0)
tag = rawdata[i+1:k].lower()
self.lasttag = tag
while k < j:
match = attrfind.match(rawdata, k)
if not match: break
attrname, rest, attrvalue = match.group(1, 2, 3)
if not rest:
attrvalue = attrname
else:
if (attrvalue[:1] == "'" == attrvalue[-1:] or
attrvalue[:1] == '"' == attrvalue[-1:]):
# strip quotes
attrvalue = attrvalue[1:-1]
attrvalue = self.entity_or_charref.sub(
self._convert_ref, attrvalue)
attrs.append((attrname.lower(), attrvalue))
k = match.end(0)
if rawdata[j] == '>':
j = j+1
self.__starttag_text = rawdata[start_pos:j]
self.finish_starttag(tag, attrs)
return j

# Internal -- convert entity or character reference
def _convert_ref(self, match):
if match.group(2):
return self.convert_charref(match.group(2)) or \
'&#%s%s' % match.groups()[1:]
elif match.group(3):
return self.convert_entityref(match.group(1)) or \
'&%s;' % match.group(1)
else:
return '&%s' % match.group(1)

# Internal -- parse endtag
def parse_endtag(self, i):
rawdata = self.rawdata
match = endbracket.search(rawdata, i+1)
if not match:
return -1
j = match.start(0)
tag = rawdata[i+2:j].strip().lower()
if rawdata[j] == '>':
j = j+1
self.finish_endtag(tag)
return j

# Internal -- finish parsing of <tag/data/ (same as <tag>data</tag>)
def finish_shorttag(self, tag, data):
self.finish_starttag(tag, [])
self.handle_data(data)
self.finish_endtag(tag)

# Internal -- finish processing of start tag
# Return -1 for unknown tag, 0 for open-only tag, 1 for balanced tag
def finish_starttag(self, tag, attrs):
try:
method = getattr(self, 'start_' + tag)
except AttributeError:
try:
method = getattr(self, 'do_' + tag)
except AttributeError:
self.unknown_starttag(tag, attrs)
return -1
else:
self.handle_starttag(tag, method, attrs)
return 0
else:
self.stack.append(tag)
self.handle_starttag(tag, method, attrs)
return 1

# Internal -- finish processing of end tag
def finish_endtag(self, tag):
if not tag:
found = len(self.stack) - 1
if found < 0:
self.unknown_endtag(tag)
return
else:
if tag not in self.stack:
try:
method = getattr(self, 'end_' + tag)
except AttributeError:
self.unknown_endtag(tag)
else:
self.report_unbalanced(tag)
return
found = len(self.stack)
for i in range(found):
if self.stack[i] == tag: found = i
while len(self.stack) > found:
tag = self.stack[-1]
try:
method = getattr(self, 'end_' + tag)
except AttributeError:
method = None
if method:
self.handle_endtag(tag, method)
else:
self.unknown_endtag(tag)
del self.stack[-1]

# Overridable -- handle start tag
def handle_starttag(self, tag, method, attrs):
method(attrs)

# Overridable -- handle end tag
def handle_endtag(self, tag, method):
method()

# Example -- report an unbalanced </...> tag.
def report_unbalanced(self, tag):
if self.verbose:
print('*** Unbalanced </' + tag + '>')
print('*** Stack:', self.stack)

def convert_charref(self, name):
"""Convert character reference, may be overridden."""
try:
n = int(name)
except ValueError:
return
if not 0 <= n <= 127:
return
return self.convert_codepoint(n)

def convert_codepoint(self, codepoint):
return chr(codepoint)

def handle_charref(self, name):
"""Handle character reference, no need to override."""
replacement = self.convert_charref(name)
if replacement is None:
self.unknown_charref(name)
else:
self.handle_data(replacement)

# Definition of entities -- derived classes may override
entitydefs = \
{'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}

def convert_entityref(self, name):
"""Convert entity references.

As an alternative to overriding this method; one can tailor the
results by setting up the self.entitydefs mapping appropriately.
"""
table = self.entitydefs
if name in table:
return table[name]
else:
return

def handle_entityref(self, name):
"""Handle entity references, no need to override."""
replacement = self.convert_entityref(name)
if replacement is None:
self.unknown_entityref(name)
else:
self.handle_data(replacement)

# Example -- handle data, should be overridden
def handle_data(self, data):
pass

# Example -- handle comment, could be overridden
def handle_comment(self, data):
pass

# Example -- handle declaration, could be overridden
def handle_decl(self, decl):
pass

# Example -- handle processing instruction, could be overridden
def handle_pi(self, data):
pass

# To be overridden -- handlers for unknown objects
def unknown_starttag(self, tag, attrs): pass
def unknown_endtag(self, tag): pass
def unknown_charref(self, ref): pass
def unknown_entityref(self, ref): pass


class TestSGMLParser(SGMLParser):

def __init__(self, verbose=0):
self.testdata = ""
SGMLParser.__init__(self, verbose)

def handle_data(self, data):
self.testdata = self.testdata + data
if len(repr(self.testdata)) >= 70:
self.flush()

def flush(self):
data = self.testdata
if data:
self.testdata = ""
print('data:', repr(data))

def handle_comment(self, data):
self.flush()
r = repr(data)
if len(r) > 68:
r = r[:32] + '...' + r[-32:]
print('comment:', r)

def unknown_starttag(self, tag, attrs):
self.flush()
if not attrs:
print('start tag: <' + tag + '>')
else:
print('start tag: <' + tag, end=' ')
for name, value in attrs:
print(name + '=' + '"' + value + '"', end=' ')
print('>')

def unknown_endtag(self, tag):
self.flush()
print('end tag: </' + tag + '>')

def unknown_entityref(self, ref):
self.flush()
print('*** unknown entity ref: &' + ref + ';')

def unknown_charref(self, ref):
self.flush()
print('*** unknown char ref: &#' + ref + ';')

def unknown_decl(self, data):
self.flush()
print('*** unknown decl: [' + data + ']')

def close(self):
SGMLParser.close(self)
self.flush()


def test(args = None):
import sys

if args is None:
args = sys.argv[1:]

if args and args[0] == '-s':
args = args[1:]
klass = SGMLParser
else:
klass = TestSGMLParser

if args:
file = args[0]
else:
file = 'test.html'

if file == '-':
f = sys.stdin
else:
try:
f = open(file, 'r')
except IOError as msg:
print(file, ":", msg)
sys.exit(1)

data = f.read()
if f is not sys.stdin:
f.close()

x = klass()
for c in data:
x.feed(c)
x.close()


if __name__ == '__main__':
test()
python3-module-sgmllib-1.0.0/test_sgmllib.py000064400000000000000000000354371174046571600211150ustar00rootroot00000000000000import pprint
import re
import unittest
import sgmllib

class EventCollector(sgmllib.SGMLParser):

def __init__(self):
self.events = []
self.append = self.events.append
sgmllib.SGMLParser.__init__(self)

def get_events(self):
# Normalize the list of events so that buffer artefacts don't
# separate runs of contiguous characters.
L = []
prevtype = None
for event in self.events:
type = event[0]
if type == prevtype == "data":
L[-1] = ("data", L[-1][1] + event[1])
else:
L.append(event)
prevtype = type
self.events = L
return L

# structure markup

def unknown_starttag(self, tag, attrs):
self.append(("starttag", tag, attrs))

def unknown_endtag(self, tag):
self.append(("endtag", tag))

# all other markup

def handle_comment(self, data):
self.append(("comment", data))

def handle_charref(self, data):
self.append(("charref", data))

def handle_data(self, data):
self.append(("data", data))

def handle_decl(self, decl):
self.append(("decl", decl))

def handle_entityref(self, data):
self.append(("entityref", data))

def handle_pi(self, data):
self.append(("pi", data))

def unknown_decl(self, decl):
self.append(("unknown decl", decl))


class CDATAEventCollector(EventCollector):
def start_cdata(self, attrs):
self.append(("starttag", "cdata", attrs))
self.setliteral()


class HTMLEntityCollector(EventCollector):

entity_or_charref = re.compile('(?:&([a-zA-Z][-.a-zA-Z0-9]*)'
'|&#(x[0-9a-zA-Z]+|[0-9]+))(;?)')

def convert_charref(self, name):
self.append(("charref", "convert", name))
if name[0] != "x":
return EventCollector.convert_charref(self, name)

def convert_codepoint(self, codepoint):
self.append(("codepoint", "convert", codepoint))
EventCollector.convert_codepoint(self, codepoint)

def convert_entityref(self, name):
self.append(("entityref", "convert", name))
return EventCollector.convert_entityref(self, name)

# These to record that they were called, then pass the call along
# to the default implementation so that it's actions can be
# recorded.

def handle_charref(self, data):
self.append(("charref", data))
sgmllib.SGMLParser.handle_charref(self, data)

def handle_entityref(self, data):
self.append(("entityref", data))
sgmllib.SGMLParser.handle_entityref(self, data)


class SGMLParserTestCase(unittest.TestCase):

collector = EventCollector

def get_events(self, source):
parser = self.collector()
try:
for s in source:
parser.feed(s)
parser.close()
except:
#self.events = parser.events
raise
return parser.get_events()

def check_events(self, source, expected_events):
try:
events = self.get_events(source)
except:
#import sys
#print >>sys.stderr, pprint.pformat(self.events)
raise
if events != expected_events:
self.fail("received events did not match expected events\n"
"Expected:\n" + pprint.pformat(expected_events) +
"\nReceived:\n" + pprint.pformat(events))

def check_parse_error(self, source):
parser = EventCollector()
try:
parser.feed(source)
parser.close()
except sgmllib.SGMLParseError:
pass
else:
self.fail("expected SGMLParseError for %r\nReceived:\n%s"
% (source, pprint.pformat(parser.get_events())))

def test_doctype_decl_internal(self):
inside = """\
DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'
SYSTEM 'http://www.w3.org/TR/html401/strict.dtd' [
<!ELEMENT html - O EMPTY>
<!ATTLIST html
version CDATA #IMPLIED
profile CDATA 'DublinCore'>
<!NOTATION datatype SYSTEM 'http://xml.python.org/notations/python-module'>
<!ENTITY myEntity 'internal parsed entity'>
<!ENTITY anEntity SYSTEM 'http://xml.python.org/entities/something.xml'>
<!ENTITY % paramEntity 'name|name|name'>
%paramEntity;
<!-- comment -->
]"""
self.check_events(["<!%s>" % inside], [
("decl", inside),
])

def test_doctype_decl_external(self):
inside = "DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'"
self.check_events("<!%s>" % inside, [
("decl", inside),
])

def test_underscore_in_attrname(self):
# SF bug #436621
"""Make sure attribute names with underscores are accepted"""
self.check_events("<a has_under _under>", [
("starttag", "a", [("has_under", "has_under"),
("_under", "_under")]),
])

def test_underscore_in_tagname(self):
# SF bug #436621
"""Make sure tag names with underscores are accepted"""
self.check_events("<has_under></has_under>", [
("starttag", "has_under", []),
("endtag", "has_under"),
])

def test_quotes_in_unquoted_attrs(self):
# SF bug #436621
"""Be sure quotes in unquoted attributes are made part of the value"""
self.check_events("<a href=foo'bar\"baz>", [
("starttag", "a", [("href", "foo'bar\"baz")]),
])

def test_xhtml_empty_tag(self):
"""Handling of XHTML-style empty start tags"""
self.check_events("<br />text<i></i>", [
("starttag", "br", []),
("data", "text"),
("starttag", "i", []),
("endtag", "i"),
])

def test_processing_instruction_only(self):
self.check_events("<?processing instruction>", [
("pi", "processing instruction"),
])

def test_bad_nesting(self):
self.check_events("<a><b></a></b>", [
("starttag", "a", []),
("starttag", "b", []),
("endtag", "a"),
("endtag", "b"),
])

def test_bare_ampersands(self):
self.check_events("this text & contains & ampersands &", [
("data", "this text & contains & ampersands &"),
])

def test_bare_pointy_brackets(self):
self.check_events("this < text > contains < bare>pointy< brackets", [
("data", "this < text > contains < bare>pointy< brackets"),
])

def test_attr_syntax(self):
output = [
("starttag", "a", [("b", "v"), ("c", "v"), ("d", "v"), ("e", "e")])
]
self.check_events("""<a b='v' c="v" d=v e>""", output)
self.check_events("""<a b = 'v' c = "v" d = v e>""", output)
self.check_events("""<a\nb\n=\n'v'\nc\n=\n"v"\nd\n=\nv\ne>""", output)
self.check_events("""<a\tb\t=\t'v'\tc\t=\t"v"\td\t=\tv\te>""", output)

def test_attr_values(self):
self.check_events("""<a b='xxx\n\txxx' c="yyy\t\nyyy" d='\txyz\n'>""",
[("starttag", "a", [("b", "xxx\n\txxx"),
("c", "yyy\t\nyyy"),
("d", "\txyz\n")])
])
self.check_events("""<a b='' c="">""", [
("starttag", "a", [("b", ""), ("c", "")]),
])
# URL construction stuff from RFC 1808:
safe = "$-_.+"
extra = "!*'(),"
reserved = ";/?:@&="
url = "http://example.com:8080/path/to/file?%s%s%s" % (
safe, extra, reserved)
self.check_events("""<e a=%s>""" % url, [
("starttag", "e", [("a", url)]),
])
# Regression test for SF patch #669683.
self.check_events("<e a=rgb(1,2,3)>", [
("starttag", "e", [("a", "rgb(1,2,3)")]),
])

def test_attr_values_entities(self):
"""Substitution of entities and charrefs in attribute values"""
# SF bug #1452246
self.check_events("""<a b=&lt; c=&lt;&gt; d=&lt-&gt; e='&lt; '
f="&xxx;" g='&#32;&#33;' h='&#500;'
i='x?a=b&c=d;'
j='&amp;#42;' k='&#38;#42;'>""",
[("starttag", "a", [("b", "<"),
("c", "<>"),
("d", "&lt->"),
("e", "< "),
("f", "&xxx;"),
("g", " !"),
("h", "&#500;"),
("i", "x?a=b&c=d;"),
("j", "&#42;"),
("k", "&#42;"),
])])

def test_convert_overrides(self):
# This checks that the character and entity reference
# conversion helpers are called at the documented times. No
# attempt is made to really change what the parser accepts.
#
self.collector = HTMLEntityCollector
self.check_events(('<a title="&ldquo;test&#x201d;">foo</a>'
'&foobar;&#42;'), [
('entityref', 'convert', 'ldquo'),
('charref', 'convert', 'x201d'),
('starttag', 'a', [('title', '&ldquo;test&#x201d;')]),
('data', 'foo'),
('endtag', 'a'),
('entityref', 'foobar'),
('entityref', 'convert', 'foobar'),
('charref', '42'),
('charref', 'convert', '42'),
('codepoint', 'convert', 42),
])

def test_attr_funky_names(self):
self.check_events("""<a a.b='v' c:d=v e-f=v>""", [
("starttag", "a", [("a.b", "v"), ("c:d", "v"), ("e-f", "v")]),
])

def test_attr_value_ip6_url(self):
# http://www.python.org/sf/853506
self.check_events(("<a href='http://[1080::8:800:200C:417A]/'>"
"<a href=http://[1080::8:800:200C:417A]/>"), [
("starttag", "a", [("href", "http://[1080::8:800:200C:417A]/")]),
("starttag", "a", [("href", "http://[1080::8:800:200C:417A]/")]),
])

def test_weird_starttags(self):
self.check_events("<a<a>", [
("starttag", "a", []),
("starttag", "a", []),
])
self.check_events("</a<a>", [
("endtag", "a"),
("starttag", "a", []),
])

def test_declaration_junk_chars(self):
self.check_parse_error("<!DOCTYPE foo $ >")

def test_get_starttag_text(self):
s = """<foobar \n one="1"\ttwo=2 >"""
self.check_events(s, [
("starttag", "foobar", [("one", "1"), ("two", "2")]),
])

def test_cdata_content(self):
s = ("<cdata> <!-- not a comment --> &not-an-entity-ref; </cdata>"
"<notcdata> <!-- comment --> </notcdata>")
self.collector = CDATAEventCollector
self.check_events(s, [
("starttag", "cdata", []),
("data", " <!-- not a comment --> &not-an-entity-ref; "),
("endtag", "cdata"),
("starttag", "notcdata", []),
("data", " "),
("comment", " comment "),
("data", " "),
("endtag", "notcdata"),
])
s = """<cdata> <not a='start tag'> </cdata>"""
self.check_events(s, [
("starttag", "cdata", []),
("data", " <not a='start tag'> "),
("endtag", "cdata"),
])

def test_illegal_declarations(self):
s = 'abc<!spacer type="block" height="25">def'
self.check_events(s, [
("data", "abc"),
("unknown decl", 'spacer type="block" height="25"'),
("data", "def"),
])

def test_enumerated_attr_type(self):
s = "<!DOCTYPE doc [<!ATTLIST doc attr (a | b) >]>"
self.check_events(s, [
('decl', 'DOCTYPE doc [<!ATTLIST doc attr (a | b) >]'),
])

def test_only_decode_ascii(self):
# SF bug #1651995, make sure non-ascii character references are not decoded
s = '<signs exclamation="&#33" copyright="&#169" quoteleft="&#8216;">'
self.check_events(s, [
('starttag', 'signs',
[('exclamation', '!'), ('copyright', '&#169'),
('quoteleft', '&#8216;')]),
])

# XXX These tests have been disabled by prefixing their names with
# an underscore. The first two exercise outstanding bugs in the
# sgmllib module, and the third exhibits questionable behavior
# that needs to be carefully considered before changing it.

def _test_starttag_end_boundary(self):
self.check_events("<a b='<'>", [("starttag", "a", [("b", "<")])])
self.check_events("<a b='>'>", [("starttag", "a", [("b", ">")])])

def _test_buffer_artefacts(self):
output = [("starttag", "a", [("b", "<")])]
self.check_events(["<a b='<'>"], output)
self.check_events(["<a ", "b='<'>"], output)
self.check_events(["<a b", "='<'>"], output)
self.check_events(["<a b=", "'<'>"], output)
self.check_events(["<a b='<", "'>"], output)
self.check_events(["<a b='<'", ">"], output)

output = [("starttag", "a", [("b", ">")])]
self.check_events(["<a b='>'>"], output)
self.check_events(["<a ", "b='>'>"], output)
self.check_events(["<a b", "='>'>"], output)
self.check_events(["<a b=", "'>'>"], output)
self.check_events(["<a b='>", "'>"], output)
self.check_events(["<a b='>'", ">"], output)

output = [("comment", "abc")]
self.check_events(["", "<!--abc-->"], output)
self.check_events(["<", "!--abc-->"], output)
self.check_events(["<!", "--abc-->"], output)
self.check_events(["<!-", "-abc-->"], output)
self.check_events(["<!--", "abc-->"], output)
self.check_events(["<!--a", "bc-->"], output)
self.check_events(["<!--ab", "c-->"], output)
self.check_events(["<!--abc", "-->"], output)
self.check_events(["<!--abc-", "->"], output)
self.check_events(["<!--abc--", ">"], output)
self.check_events(["<!--abc-->", ""], output)

def _test_starttag_junk_chars(self):
self.check_parse_error("<")
self.check_parse_error("<>")
self.check_parse_error("</$>")
self.check_parse_error("</")
self.check_parse_error("</a")
self.check_parse_error("<$")
self.check_parse_error("<$>")
self.check_parse_error("<!")
self.check_parse_error("<a $>")
self.check_parse_error("<a")
self.check_parse_error("<a foo='bar'")
self.check_parse_error("<a foo='bar")
self.check_parse_error("<a foo='>'")
self.check_parse_error("<a foo='>")
self.check_parse_error("<a foo=>")


if __name__ == "__main__":
unittest.main()
 
design & coding: Vladimir Lettiev aka crux © 2004-2005, Andrew Avramenko aka liks © 2007-2008
current maintainer: Michael Shigorin