Since you’re using the html5lib parser, you have access to the line number, provided you’re on BeautifulSoup version 4.8.1 or higher, as described in the docs:
"The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line) […]"
In your example you can easily access this information:
from bs4 import BeautifulSoup
html = """<!DOCTYPE html>
<html>
<body>
<p align="left">
<b><font face="Times New Roman" size="5" color="red">Some text</font></b>
</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html5lib")
for elem in soup.find_all('font'):
    print(elem.sourceline, elem.sourcepos, elem.string)
This will output 5 60 Some text, where the first number is the line number and the second is the position within that line.
If an error is possible, e.g. getting a NoneType, you should handle it before it is raised. So instead of doing this:
target = elem.findParent().findParent()
you can first check whether the first findParent() call returned a result, and only then make the second call, e.g.:
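# Remember the last element known to exist, so an error can be located.
err_line, err_source, err_str = elem.sourceline, elem.sourcepos, elem.string
target = elem.findParent()
if target is not None:
    # The first parent exists: record it before climbing one level further.
    err_line, err_source, err_str = target.sourceline, target.sourcepos, target.string
    target = target.findParent()
if target is None:
    print(f"Error near line {err_line} ({err_source}). Last good text: {err_str}")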
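If you need to climb more than two levels, the same check-before-climbing idea can be wrapped in a small helper. The following is only a sketch: the name climb and its return convention are made up for illustration and are not part of BeautifulSoup. It assumes nothing about the nodes beyond the parent, sourceline and sourcepos attributes that a Tag exposes, and reads them with getattr so that a missing attribute shows up as None instead of raising.

```python
def climb(elem, levels):
    """Walk up `levels` parents from `elem`, stopping early if one is missing.

    Hypothetical helper (not a BeautifulSoup API). Returns (ancestor, None)
    on success, or (None, message) naming the last node that was reached.
    """
    current = elem
    for step in range(levels):
        parent = getattr(current, "parent", None)
        if parent is None:
            # Report where the last existing node was found in the source.
            line = getattr(current, "sourceline", None)
            pos = getattr(current, "sourcepos", None)
            message = (
                f"No parent at step {step + 1}; "
                f"last good node near line {line} (position {pos})"
            )
            return None, message
        current = parent
    return current, None


# Possible usage with a Tag from the loop above:
#   target, error = climb(elem, 2)
#   if error:
#       print(error)
```

Returning the message instead of printing it leaves it up to the caller whether to log it, raise an exception, or skip the element.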