Solution 1 :

BeautifulSoup has a get_text() method with a separator option, which is placed between the text of different children. By default the separator is an empty string, so you get BulldogFrench, but you can use a space instead. If the strings themselves contain spaces that you want to keep, use a unique character such as | as the separator and call split("|") on the result afterwards.

from bs4 import BeautifulSoup as BS

text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BS(text, 'html.parser')

all_items = soup.find_all('td')
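for item in all_items:
    # join the text of all children, marking the boundaries with '|'
    text = item.get_text(separator='|')
    print('before:', text)
    # keep only the part before the first separator
    text = text.split('|')[0]
    print(' after:', text)
    print('---')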

Result:

before: Bulldog1
 after: Bulldog1
---
before: Bulldog2|French
 after: Bulldog2
---

BTW: get_text() also has a strip=True option, which strips whitespace from each piece of text before the pieces are joined into one string. It can be useful when there is a lot of whitespace between elements.
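
For the case in the question, where the goal is "Bulldog French" rather than "BulldogFrench", the two options can be combined. This is only a minimal sketch: the padded sample markup and the names html and breeds are assumptions made for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Sample markup with padding whitespace, similar to what the question describes.
html = '''
<table><tr>
<td class="searchResultsDogBreed">   Bulldog   </td>
<td class="searchResultsDogBreed">   Bulldog<br/>French   </td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')

breeds = []
for td in soup.find_all('td', class_='searchResultsDogBreed'):
    # strip=True trims each text fragment, separator=' ' joins the fragments
    breeds.append(td.get_text(separator=' ', strip=True))

print(breeds)
```

With this sample input the list should contain 'Bulldog' and 'Bulldog French', so the second word can be split off later if only the first one matters.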


You can also use .children to build a list of all child elements and take only the first one:

from bs4 import BeautifulSoup as BS

text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BS(text, 'html.parser')

all_items = soup.find_all('td')
for item in all_items:
    elements = list(item.children)
    print('  All:', elements)
    print('First:', elements[0])
    print('---')

Result:

  All: ['Bulldog1']
First: Bulldog1
---
  All: ['Bulldog2', <br/>, 'French']
First: Bulldog2
---

BTW: to get only the text elements (this works because the strings BeautifulSoup returns are subclasses of str):

elements = [x for x in item.children if isinstance(x, str)]

Result:

All: ['Bulldog1']
All: ['Bulldog2', 'French']
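
If the cells contain padding whitespace, as in the question, the strings returned by .children keep it. One possible alternative, which is not part of the answer above, is the .stripped_strings generator: it yields only the non-empty text fragments, already trimmed. Note that it also descends into nested tags, not just direct children. In the sketch below, first_text is a hypothetical helper name and the padded markup is an assumption.

```python
from bs4 import BeautifulSoup


def first_text(tag, default=''):
    """Return the first non-empty, stripped text fragment inside tag."""
    # next() with a default avoids StopIteration when the tag has no text
    return next(tag.stripped_strings, default)


html = '''
<td class="searchResultsDogBreed">  Bulldog1  </td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">  Bulldog2<br/>French  </td>
'''

soup = BeautifulSoup(html, 'html.parser')
firsts = [first_text(td) for td in soup.find_all('td')]
print(firsts)
```

An empty td gives the default value here instead of raising an error.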

EDIT: Instead of list(item.children) you can use .contents, which is already a list:

elements = item.contents

You can also try item.next, but if the current td is empty it may return the next td (or the "\n" between them) instead.

from bs4 import BeautifulSoup as BS

text = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BS(text, 'html.parser')
all_items = soup.find_all('td')

for item in all_items:
    print('    item:', item)
    print('children:', list(item.children))
    print('contents:', item.contents)
    print('    next:', item.next)
    print(' 2x next:', item.next.next)
    print(' 3x next:', item.next.next.next)
    #elements = list(item.children)
    elements = item.contents
    #elements = [x for x in item.children if isinstance(x, str)]
    print('     All:', elements)
    if elements:
        print('   First:', elements[0])
    else:
        print('   First:')
    print('---')

Result:

    item: <td class="searchResultsDogBreed">Bulldog1</td>
children: ['Bulldog1']
contents: ['Bulldog1']
    next: Bulldog1
 2x next: 

 3x next: <td class="searchResultsDogBreed"></td>
     All: ['Bulldog1']
   First: Bulldog1
---
    item: <td class="searchResultsDogBreed"></td>
children: []
contents: []
    next: 

 2x next: <td class="searchResultsDogBreed">Bulldog2<br/>French</td>
 3x next: Bulldog2
     All: []
   First:
---
    item: <td class="searchResultsDogBreed">Bulldog2<br/>French</td>
children: ['Bulldog2', <br/>, 'French']
contents: ['Bulldog2', <br/>, 'French']
    next: Bulldog2
 2x next: <br/>
 3x next: French
     All: ['Bulldog2', <br/>, 'French']
   First: Bulldog2
---
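
The output for the empty td shows why .next is fragile: it walks straight out of the cell. One way to guard against that is to check that the returned element's parent is still the td. This is a sketch only. It uses .next_element, which is the documented name for the same navigation step, and first_inside is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def first_inside(td):
    """Return td.next_element only if it is still inside td, else None."""
    nxt = td.next_element
    # For an empty td, next_element is whatever follows the tag in the
    # document, so its parent is not the td itself.
    if nxt is not None and nxt.parent is td:
        return nxt
    return None


html = '''
<td class="searchResultsDogBreed">Bulldog1</td>
<td class="searchResultsDogBreed"></td>
<td class="searchResultsDogBreed">Bulldog2<br/>French</td>
'''

soup = BeautifulSoup(html, 'html.parser')
results = [first_inside(td) for td in soup.find_all('td')]
print(results)
```

For the empty cell this returns None rather than the newline or the following td.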

Problem :

Ok, here I go with my first question.
I am trying to parse some content from a website with BeautifulSoup. The content I want to grab is in a td tag, but sometimes it comes as two lines (the markup includes a line break) and other times it does not.

Example for a bulldog:

sometimes <td class="searchResultsDogBreed">Bulldog</td>
other times <td class="searchResultsDogBreed">Bulldog<br/>French</td>

When I use the following to make a list of the dog breeds:

list_dogbreed = []
for db in soup.body.find_all('td', class_="searchResultsDogBreed"):
    list_dogbreed.append(db.text.strip())

it returns some of the results as BulldogFrench, as expected, since it strips all the spaces. I want either to ignore the French and keep only Bulldog, since I only care whether it is a bulldog or not, or at least to get the output as "Bulldog French" so that I can separate the two words.

I have to strip spaces somehow because the actual output without strip() is something like

"                               BulldogFrench      "

Thanks for the help!

Comments

Comment posted by furas

first
