You can try matching on the tag name and requiring that the tag has no attributes:
soup.find_all(lambda tag: tag.name == 'p' and not tag.attrs)
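For what it's worth, here is a small self-contained sketch of the attribute filter run against a hard-coded snippet instead of a live page. The sample HTML and the helper name is_bare_p are made up for illustration. It checks the tag name explicitly, because a test on the length of the name alone would also let through other one-letter tags such as <a> or <b>.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page.
html = """
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
<a>a bare one-letter tag that is not a paragraph</a>
"""

soup = BeautifulSoup(html, "html.parser")


def is_bare_p(tag):
    # True only for <p> tags that carry no attributes at all
    # (tag.attrs is an empty dict when nothing is set).
    return tag.name == "p" and not tag.attrs


bare_paragraphs = soup.find_all(is_bare_p)
texts = [p.get_text() for p in bare_paragraphs]
print(texts)
```

Passing a function to find_all makes BeautifulSoup call it once per tag and keep the tags for which it returns True, so any other condition on name or attributes can be added in the same place.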
I just got into web scraping and I'm using BeautifulSoup, but I only want to extract the contents of plain "p" tags. In other words, I want to ignore any "p" tag that carries additional attributes (class, style, etc.).
Example:
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
So far I can only extract all of the "p" tags, with this code:
from bs4 import BeautifulSoup as BS
import requests
URL = input("Enter url to scrape: ")
content = requests.get(URL)
soup = BS(content.text, 'html.parser')
content_p = soup.find_all('p')
print(content_p)
Thank you so much! Is there a way to remove unwanted characters like '\u2060' and '\xa0' from the extracted text?
You can use something like this:
badString = "FooBar\nBaz"
BeautifulSoup(badString)
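Another option, as a rough sketch only: strip the characters from the text after extraction with plain string methods. The sample string and the names REPLACEMENTS and clean_text are assumptions made up for this example, and the table covers only the two characters mentioned above, so extend it for whatever else turns up in your pages.

```python
# Hypothetical scraped text containing a non-breaking space (\xa0)
# and a word joiner (\u2060).
raw = "Foo\xa0Bar\u2060Baz"

# Map each unwanted character to its replacement:
# the non-breaking space becomes a normal space, the word joiner is dropped.
REPLACEMENTS = str.maketrans({"\xa0": " ", "\u2060": ""})


def clean_text(text):
    # str.translate applies every mapping in a single pass over the string.
    return text.translate(REPLACEMENTS)


cleaned = clean_text(raw)
print(repr(cleaned))
```

With the snippet from the question, this could be applied to each paragraph's text, for example clean_text(p.get_text()) for every p in content_p.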