Solution 1:

You can try

soup.find_all(lambda tag: tag.name == 'p' and not tag.attrs)

Reference: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)
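Here is a minimal sketch of that filter applied to the markup from the question, assuming bs4 with the built-in html.parser. The html string, the extra <b> tag, and the is_bare_p name are made up for illustration. The filter tests the tag name explicitly, so other attribute-free tags such as <b> are skipped.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the <b> tag is added for illustration.
html = """
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
<b>bold text without attributes</b>
"""

soup = BeautifulSoup(html, "html.parser")


def is_bare_p(tag):
    # Keep only <p> tags whose attribute dict is empty.
    return tag.name == "p" and not tag.attrs


bare_paragraphs = soup.find_all(is_bare_p)
texts = [p.get_text() for p in bare_paragraphs]
print(texts)
```

Passing a named function instead of a lambda behaves the same way; it is only easier to read and reuse.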

Problem:

I just got into web scraping and I’m using BeautifulSoup, but I only want to extract the contents of plain “p” tags. I want to ignore any “p” tag that has additional attributes such as class, style, etc.

Example:

<p>what I want to extract</p>

<p class="copy">what I do not want to extract from HTML page</p>

So far I can only extract all the “p” tags, using this code:

from bs4 import BeautifulSoup as BS
import requests

URL = input("Enter url to scrape: ")
content = requests.get(URL)
soup = BS(content.text, 'html.parser')
content_p = soup.find_all('p')
print(content_p)

Comments

Comment posted by jason

Thank you so much! Is there a way to remove all unwanted encodings like ‘\u2060’ and ‘\xa0’?

Comment posted by Deepak

You can use something like this: badString = "FooBar
Baz" BeautifulSoup(badString)
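One possible way to strip such characters from extracted text, sketched with the standard library only. This assumes the characters meant above are the no-break space (U+00A0) and the word joiner (U+2060); the raw string and the cleanup_table name are hypothetical.

```python
# Hypothetical sample text containing a no-break space and a word joiner.
raw = "Foo\xa0Bar\u2060Baz"

# Map U+00A0 to a regular space and delete U+2060 entirely.
cleanup_table = str.maketrans({"\xa0": " ", "\u2060": None})

cleaned = raw.translate(cleanup_table)
print(cleaned)  # Foo BarBaz
```

The same table can be applied to each string returned by get_text() after parsing.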
