Solution 1 :

Since you’re using the html5lib parser, you have access to the line number, provided you’re on BeautifulSoup 4.8.1 or higher, as described in the docs:

The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line) […]

In your example you can easily access this information:

from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <body>
      <p align="left">
        <b><font face="Times New Roman" size="5" color="red">Some text</font></b> 
      </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html5lib")

for elem in soup.find_all('font'):
    print(elem.sourceline, elem.sourcepos, elem.string)

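This will output 5 60 Some text, where the first number is the line number and the second is the position within that line. According to the docs, html5lib reports the position of the final greater-than sign of the start tag, while html.parser reports the position of the initial less-than sign.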
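Since the goal is to inspect the markup around a failing element, it can help to print the raw source line as well as its number. The following is only a sketch: the helper source_line_of is a hypothetical name, and it assumes you still have the markup available as a string (if you read the file in binary mode, decode it first). It uses html.parser rather than html5lib purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <body>
      <p align="left">
        <b><font face="Times New Roman" size="5" color="red">Some text</font></b>
      </p>
  </body>
</html>
"""


def source_line_of(tag, markup):
    """Return the raw markup line a tag was parsed from, or None if unknown.

    Hypothetical helper, not part of BeautifulSoup.
    """
    if tag.sourceline is None:
        return None
    # sourceline is 1-based, list indices are 0-based.
    return markup.splitlines()[tag.sourceline - 1]


# Assumption: html.parser instead of html5lib, to avoid the extra dependency.
soup = BeautifulSoup(html, "html.parser")
for elem in soup.find_all("font"):
    print(elem.sourceline, repr(source_line_of(elem, html)))
```

Printing the repr of the line makes stray whitespace visible, which is handy when comparing a failing element against one that works.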

If a call can return None, check for that before using the result. So instead of doing this:

target = elem.findParent().findParent()

you can first check whether the first findParent() call returned a result, and only then make the second call, e.g.:

# Read the position from elem before walking up: target itself may be None.
err_line, err_source, err_str = elem.sourceline, elem.sourcepos, elem.string
target = elem.findParent()
if target is not None:
    target = target.findParent()
else:
    print(f"Error near line {err_line} ({err_source}). Last good text: {err_str}")
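Putting the check together with the loop from the question could look roughly like the sketch below. Several things here are assumptions rather than anything taken from your file: the function name remove_duplicate_paragraphs, the sample markup, the use of html.parser, and the decision to collect the targets first and remove them after the loop with extract() instead of calling decompose() inside it. The last point is a guess at the cause: decompose() destroys the whole subtree, so a font that is visited later but lived inside an already removed element may no longer have a parent.

```python
from bs4 import BeautifulSoup


def remove_duplicate_paragraphs(markup, parser="html.parser"):
    """Drop the grandparent of every repeated <font>; report those without one.

    Hypothetical helper sketching the idea, not the asker's final code.
    """
    soup = BeautifulSoup(markup, parser)
    seen, targets, problems = [], [], []
    for elem in soup.find_all("font"):
        if elem not in seen:
            seen.append(elem)
            continue
        parent = elem.parent
        grandparent = parent.parent if parent is not None else None
        if grandparent is None:
            # Nothing to remove two levels up: remember where this happened.
            problems.append((elem.sourceline, elem.sourcepos, elem.get_text()))
        else:
            targets.append(grandparent)
    # Remove only after the loop, so nothing visited in the loop is already gone.
    for target in targets:
        target.extract()
    return soup, problems


# Made-up sample: two identical paragraphs plus a bare duplicate <font>.
sample = """<p><b><font>Some text</font></b></p>
<p><b><font>Some text</font></b></p>
<font>Some text</font>
"""
soup, problems = remove_duplicate_paragraphs(sample)
for line, pos, text in problems:
    print(f"No grandparent near line {line} (pos {pos}): {text!r}")
```

Note that with html5lib every element ends up inside html and body, so a bare font would have body as its parent and html as its grandparent, and removing two levels up would take out the whole document. If that matches what you see, a stricter condition such as requiring the grandparent to be a p tag may be the safer choice.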

Problem :

I have the following code that removes duplicate paragraphs from an HTML file.

from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target =elem.findParent().findParent()
        target.decompose()
print(soup.html)

It almost works, but for some elements I get this error:

AttributeError: 'NoneType' object has no attribute 'findParent'

Is there a way to print the line number within the HTML file where the error happens, so I can check the markup at that point?

The structure of the elements for which the code works is like this:

<!DOCTYPE html>
<html>
  <body>
      <p align="left">
        <b><font face="Times New Roman" size="5" color="red">Some text</font></b> 
      </p>
  </body>
</html>

But since the file is fairly large, I haven’t identified the structure of the elements where the code gets stuck.

Comments

Comment posted by AMC

I would recommend using a context manager to handle file objects.
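A minimal sketch of what that could look like, assuming a small wrapper function (load_soup is a made-up name) and a throwaway file standing in for Input.html:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path, parser="html.parser"):
    """Parse an HTML file; hypothetical wrapper around open() and BeautifulSoup."""
    # The with-block closes the file even if parsing raises.
    with open(path, "rb") as fp:
        return BeautifulSoup(fp, parser)


# Throwaway file standing in for Input.html.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Input.html"
    sample.write_text("<p><b><font>Some text</font></b></p>", encoding="utf-8")
    soup = load_soup(sample)
    first_font_text = soup.find("font").get_text()

print(first_font_text)
```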

Comment posted by Ger Cas

Thanks for your answer. I’ve tried your code. The first part prints the sourceline, sourcepos and string correctly. The second snippet, which should show the line of the HTML file where the error happens, doesn’t print anything. Playing around with your code and my code (the code in the original post), I see that the error appears when the code contains the line

Comment posted by colidyre

I guess that your
