Solution 1 :


Never mind, I was wrong: the session should already be handling all cookies. It seems the website rejects requests from clients that do not send a browser User-Agent, so just do this:

def login_tokyo(s):
    r = s.get('')
    str_number = re.findall("<span[^>]+(.*?)</span>", r.text)[0]
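The snippet above does not show where the User-Agent is actually set. A minimal sketch of one way to do it, assuming the `requests` library from the question: the helper name `make_session` and the exact User-Agent string are illustrative, not taken from the original answer.

```python
import requests

# Assumed value: any current browser User-Agent string should do.
BROWSER_UA = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/68.0 Safari/537.36')

def make_session():
    """Return a Session that sends a browser-like User-Agent on every request."""
    s = requests.Session()
    # Headers set on the session are merged into every request made through it.
    s.headers.update({'User-Agent': BROWSER_UA})
    return s
```

Passing the session returned by `make_session()` to `login_tokyo` means every `s.get` and `s.post` call carries the header, so it does not need to be repeated per request.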

Original Answer:

You are not handling the PHPSESSID cookie. Many websites use it to track logins server-side. Try doing this:

def login_tokyo(s):
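    # The HTTP request header is named 'Cookie'; the dict must also be passed to the request
    header={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.','Cookie':'PHPSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
    r = s.get('', headers=header)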
    str_number = re.findall("<span[^>]+(.*?)</span>", r.text)[0]

You can also use cookielib (http.cookiejar in Python 3) if you do not want to handle cookies manually, though most servers do not check whether the session ID they receive is already in their database and will accept any random session ID.
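As a rough sketch of the cookie-jar approach, assuming Python 3 (where `cookielib` is `http.cookiejar`) and only the standard library: the helper name `make_cookie_opener` is hypothetical. Note that `requests.Session` already keeps its own cookie jar, so this is only needed when working with `urllib` directly.

```python
import http.cookiejar
import urllib.request

def make_cookie_opener(user_agent):
    """Build a urllib opener that stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor saves Set-Cookie values into the jar and
    # attaches matching cookies to later requests made via this opener.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-Agent', user_agent)]
    return opener, jar
```

After a call such as `opener.open(url)`, any PHPSESSID cookie the server sets would be found by iterating over `jar`.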

Problem :

I have been trying to get through a site where the first page asks for a simple math calculation before letting you through to the main page. That part is solved. However, when I try to obtain something else, I get the following:


I learned this method a while ago, but I am only now trying it for the first time:

import re
import requests

def login_tokyo(s):
    r = s.get('')
    str_number = re.findall("<span[^>]+(.*?)</span>", r.text)[0]
    numbers = re.findall('[0-9]+', str_number)
    captcha = int(numbers[0]) + int(numbers[1])
    payload = {'captcha': captcha}
    r = s.post('', data=payload)
    check_text = re.findall('<b>(.*?)</b>', r.text)[0]
    payload1 = {'Param': 0, 'Value': 5797164, 'imo': '', 'callsign': '', 'name': '', 'compimo': 5797164,
                'compname': '', 'From': '01.06.2020', 'Till': '31.08.2020', 'authority': 0, 'flag': 0, 'class': 0,
                'ro': 0, 'type': 0, 'result': 0, 'insptype': -1, 'sort1': 0, 'sort2': 'DESC', 'sort3': 0,
                'sort4': 'DESC'}
    r = s.post('', data=payload1)
    perf_tm = re.findall("<p class=[^>]+(.*?)</p>", r.text)

if __name__ == '__main__':
    with requests.Session() as s:
        login_tokyo(s)

print(check_text) tells me that I am on the main page, but then… nothing. From this specific request, I expected print(perf_tm) to give me Medium.
Appreciate all help!
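The captcha step described above can be checked offline, without touching the site. A small sketch using the same two regular expressions from the question: the `sample` markup and the function name `solve_captcha` are made up for illustration, and the real page may look different.

```python
import re

def solve_captcha(html):
    """Sum the first two integers found inside the first <span ...> element."""
    # Same patterns as in login_tokyo; the span must carry at least one attribute.
    str_number = re.findall("<span[^>]+(.*?)</span>", html)[0]
    numbers = re.findall('[0-9]+', str_number)
    return int(numbers[0]) + int(numbers[1])

# Hypothetical markup standing in for the first page.
sample = '<p>Solve: <span class="captcha">12 + 7 =</span></p>'
print(solve_captcha(sample))  # prints 19
```

Testing the parsing separately like this helps confirm that a later failure comes from the request itself (headers, cookies) rather than from the regex.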


Comment posted by Zaraki Kenpachi

What information do you need to scrape?

Comment posted by Anunay

I love the math question captcha on the website, they are just as effective as spell.

Comment posted by Eduardo

I am trying to scrape the company performance @ZarakiKenpachi

Comment posted by Anunay

btw I would highly recommend using postman/insomnia to manually test requests and checking which headers are required and which are not.
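Along the same lines, the headers a session is about to send can be inspected from Python without making a request. A minimal sketch, assuming the `requests` library: the helper name `outgoing_headers` and the `https://example.com/` URL are placeholders, not part of the original code.

```python
import requests

def outgoing_headers(session, url):
    """Return the headers the session would send for a GET to url, without sending it."""
    # prepare_request merges the session's default headers into the request.
    prepared = session.prepare_request(requests.Request('GET', url))
    return dict(prepared.headers)

with requests.Session() as s:
    # By default this is a 'python-requests/<version>' string, not a browser one.
    print(outgoing_headers(s, 'https://example.com/')['User-Agent'])
```

Comparing this output with the headers shown in the browser's network tab makes it easier to see which ones the server might be checking.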

Comment posted by

also if you are lazy like me, just copy the request from network tab in your browser as a cURL request and paste it here:

Comment posted by Eduardo

This worked like a charm. Thank you for identifying the issue. I will certainly learn more about it to understand it better.