BeautifulSoup has a special function get_text() with a separator option, which is placed between the text of different children. By default it uses an empty string as the separator, so you get BulldogFrench, but you can use a space instead. If your strings contain spaces that you want to keep, you can use a unique character like | as the separator and later call split("|").
from bs4 import BeautifulSoup as BS
text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''
soup = BS(text, 'html.parser')
all_items = soup.find_all('td')
for item in all_items:
    text = item.get_text(separator='|')
    print('before:', text)
    text = text.split('|')[0]
    print('after:', text)
    print('---')
Result:
before: Bulldog1
after: Bulldog1
---
before: Bulldog2|French
after: Bulldog2
---
BTW: get_text() also has the option strip=True, which strips whitespace from each piece of text before the pieces are joined into one string. It can be useful when there is a lot of whitespace between elements.
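A minimal sketch of strip=True combined with a separator. The padded HTML snippet is made up for illustration; it shows how the extra whitespace survives without strip and disappears with it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace around each piece of text
html = '<td class="searchResultsDogBreed">  Bulldog2 <br/>  French  </td>'

cell = BeautifulSoup(html, 'html.parser').find('td')

# Without strip the padding is kept in the joined string
raw = cell.get_text(separator='|')

# With strip=True every piece is stripped before joining
clean = cell.get_text(separator='|', strip=True)

# The first piece can then be taken without a manual .strip()
first = clean.split('|')[0]

print(repr(raw))
print(repr(clean))
print(first)
```

With strip=True there is no need to clean up each piece after split("|").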
You can also use .children to create a list of all sub-elements and take only the first element.
from bs4 import BeautifulSoup as BS
text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''
soup = BS(text, 'html.parser')
all_items = soup.find_all('td')
for item in all_items:
    elements = list(item.children)
    print(' All:', elements)
    print('First:', elements[0])
    print('---')
Result:
All: ['Bulldog1']
First: Bulldog1
---
All: ['Bulldog2', <br/>, 'French']
First: Bulldog2
---
BTW: to get only the text elements, filter the children by type:
elements = [x for x in item.children if isinstance(x, str)]
Result:
All: ['Bulldog1']
All: ['Bulldog2', 'French']
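A related option, shown here only as a sketch, is the stripped_strings generator: it yields every text piece with surrounding whitespace removed, so next() with a default picks the first one and also copes with an empty td. The variable names and the extra empty cell are assumptions for the example.

```python
from bs4 import BeautifulSoup

html = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BeautifulSoup(html, 'html.parser')

firsts = []
for item in soup.find_all('td'):
    # stripped_strings yields the text pieces inside the cell;
    # the default value of next() covers cells without any text
    firsts.append(next(item.stripped_strings, None))

print(firsts)
```

Note that stripped_strings also walks into nested tags, while the isinstance() filter on .children looks only at direct children.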
EDIT: Instead of list(item.children) you can use
elements = item.contents
You can also try item.next, but it may return the next td (or the "\n" between cells) if the current td is empty.
from bs4 import BeautifulSoup as BS
text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''
soup = BS(text, 'html.parser')
all_items = soup.find_all('td')
for item in all_items:
    print(' item:', item)
    print('children:', list(item.children))
    print('contents:', item.contents)
    print(' next:', item.next)
    print(' 2x next:', item.next.next)
    print(' 3x next:', item.next.next.next)
    #elements = list(item.children)
    elements = item.contents
    #elements = [x for x in item.children if isinstance(x, str)]
    print(' All:', elements)
    if elements:
        print(' First:', elements[0])
    else:
        print(' First:')
    print('---')
Result:
item: <td class="searchResultsDogBreed">Bulldog1</td>
children: ['Bulldog1']
contents: ['Bulldog1']
next: Bulldog1
2x next:
3x next: <td class="searchResultsDogBreed"></td>
All: ['Bulldog1']
First: Bulldog1
---
item: <td class="searchResultsDogBreed"></td>
children: []
contents: []
next:
2x next: <td class="searchResultsDogBreed">Bulldog2<br/>French</td>
3x next: Bulldog2
All: []
First:
---
item: <td class="searchResultsDogBreed">Bulldog2<br/>French</td>
children: ['Bulldog2', <br/>, 'French']
contents: ['Bulldog2', <br/>, 'French']
next: Bulldog2
2x next: <br/>
3x next: French
All: ['Bulldog2', <br/>, 'French']
First: Bulldog2
---
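To wrap up the comparison above, here is a small sketch of a helper that stays inside the cell, unlike item.next. The name first_text and its default argument are hypothetical, not part of BeautifulSoup; it relies only on .contents and NavigableString.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def first_text(tag, default=''):
    """Return the first direct text child of tag, or default if there is none."""
    for child in tag.contents:
        # Text nodes are NavigableString objects; tags such as <br/> are skipped
        if isinstance(child, NavigableString):
            return str(child)
    return default


html = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BeautifulSoup(html, 'html.parser')
names = [first_text(item) for item in soup.find_all('td')]
print(names)
```

Because the loop only inspects item.contents, an empty td gives the default value instead of text from the following cell.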