Solution 1 :

You can use BeautifulSoup4 (bs4), a Python library for pulling data out of HTML and XML files, in combination with regular expressions (RegEx). Here I used Python's re module for the regex part.

Something like this is what you want (source): soup.find_all(class_=re.compile("itle")) returns every element whose class attribute contains "itle", such as an element with class="title".

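For your case the pattern is simply "arrowTo": soup.find_all(class_=re.compile("arrowTo")). Avoid "arrowTo*": in a regex the * applies only to the preceding "o", so that pattern means "arrowT" followed by any number of "o" characters.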
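To see how the pattern behaves, here is a small check that uses only the re module. BeautifulSoup tests a compiled pattern against each class name with search(), so a match anywhere in the name counts. The first three class names come from the question; "arrowTip-abc" is a made-up name added only to show where the starred pattern differs.

```python
import re

# Class names from the question, plus one hypothetical name ("arrowTip-abc").
samples = [
    "arrow-F-uE7IX8",
    "arrowToStrongBuy-1ydGKDOo",
    "arrowStrongBuyShudder-3xsGK8k5",
    "arrowTip-abc",
]

plain = re.compile("arrowTo")
starred = re.compile("arrowTo*")  # "arrowT" followed by zero or more "o"

# search() succeeds if the pattern occurs anywhere in the string.
plain_hits = [s for s in samples if plain.search(s)]
starred_hits = [s for s in samples if starred.search(s)]

print(plain_hits)
print(starred_hits)
```

The plain pattern keeps only the arrowTo... class, while the starred one also lets the made-up "arrowTip-abc" through.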

Your final code should look something like:

import re
from bs4 import BeautifulSoup

# I assume result is the HTML document you got from the requests library.
# The first parameter is the HTML document (a string).
soup = BeautifulSoup(result, 'html.parser')
myArrowToList = soup.find_all(class_=re.compile("arrowTo"))

If you wanted "arrowToStrongBuy" just use that in the regex input to the find_all function.
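If you would rather store whichever label the arrow points to, a capture group can pull it out of the class name. This is only a sketch: rating_from_classes and ARROW_RE are names I made up, and it assumes the label is the run of letters between "arrowTo" and the hyphenated suffix, as in the class names from the question.

```python
import re

# Hypothetical helper, not part of bs4 or requests-html.
# Assumes class names look like "arrowToStrongBuy-1ydGKDOo".
ARROW_RE = re.compile(r"^arrowTo([A-Za-z]+)")


def rating_from_classes(classes):
    """Return the label after 'arrowTo' from a list of class names, or None."""
    for name in classes:
        match = ARROW_RE.match(name)
        if match:
            return match.group(1)
    return None


# Class attribute values copied from the question, split into class names.
strong_buy = "arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5".split()
buy = "arrow-F-uE7IX8 arrowToBuy-1R7d8UMJ arrowBuyShudder-3GMCnG5u".split()

print(rating_from_classes(strong_buy))
print(rating_from_classes(buy))
```

With BeautifulSoup, tag["class"] already gives a list of class names, so each element returned by find_all can be passed in as rating_from_classes(tag["class"]).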


Problem :

I’m trying to scrape data from a widget using Python and the requests-html library.

The value I want is in a gauge with an arrow pointing to one of five possible results.
Each label on the gauge is the same on all pages of the website. The problem is that I cannot use a CSS selector on the gauge labels to extract the text; I need to extract the value from the arrow itself, since it points to a label. The arrow has no text, so if I use a CSS selector I get None as a response.

Each arrow has a unique class name.

<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5">

<div class="arrow-F-uE7IX8 arrowToBuy-1R7d8UMJ arrowBuyShudder-3GMCnG5u">

<div class="arrow-F-uE7IX8 arrowToStrongSell-3UWimXJs arrowStrongSellShudder-2UJhm0_C"> 

What can I do to ensure I get the correct value? I’m not sure how to check whether the class contains arrowTo{foo} and store the result in a variable.

import pyppdf.patch_pyppeteer
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()

async def get_page():
    code = 'NASDAQ-MDB'
    r = await asession.get(f'{code}/')
    await r.html.arender(wait=3)
    return r

results =
for result in results:

    arrow_class_placeholder = "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]"
    arrow_class_name = result.html.xpath(arrow_class_placeholder,first=True)

    if arrow_class_name == "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]":
        print('not strong buy')


Comment posted by tomoc4

When I try your solution I get an error

Comment posted by Aleka

check the encoding of the input document?

Comment posted by tomoc4

I’ve just realised that the XPaths aren’t being found in the response results, so requests-html isn’t rendering the page properly.

