Solution 1 :

The problem is that some of the items are loaded after the page loads, as you can see from the HTML attribute data-lload="comp", whereas the 12 items you crawled successfully have data-lload="false". BeautifulSoup parses the HTML exactly as it appears in the page source, and the source contains only 12 fully rendered items, so the rest are probably loaded some other way (maybe via AJAX). In this case, though, I found that the items are actually delivered in the source as JSON, in a script tag at the bottom of the page, so you don’t need to scrape the HTML elements at all. You can read the JSON directly as follows:

import pandas as pd
from bs4 import BeautifulSoup
import requests
import json

source = requests.get('https://www.sephora.com/shop/perfume')
soup = BeautifulSoup(source.content, 'html.parser')

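# The product data is embedded as JSON in the script tag with id "linkJSON"
scriptContent = soup.find(id="linkJSON").text

catalog = json.loads(scriptContent)

# At the time of writing, the product list sits in the fourth entry of the catalog
products = catalog[3]['props']['products']

extracted = []
for p in products:
    extracted.append({'brand': p['brandName'], 'displayName': p['displayName'], 'price': p['currentSku']['listPrice']})

print(extracted)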
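The hard-coded index in catalog[3] will break if the page changes the order of its entries. Below is a minimal sketch of a more defensive lookup, assuming the catalog is a list of objects and that exactly one of them carries props.products, as in the code above. The helper names find_products and extract_rows and the sample data are made up for illustration; they are not taken from the Sephora page.

```python
import json


def find_products(catalog):
    """Return the first 'products' list found under any entry's 'props'."""
    for entry in catalog:
        if not isinstance(entry, dict):
            continue
        props = entry.get("props")
        if isinstance(props, dict) and isinstance(props.get("products"), list):
            return props["products"]
    return []


def extract_rows(products):
    """Keep brand, name and price, using None where a key is missing."""
    rows = []
    for p in products:
        sku = p.get("currentSku") or {}
        rows.append({
            "brand": p.get("brandName"),
            "displayName": p.get("displayName"),
            "price": sku.get("listPrice"),
        })
    return rows


# Made-up sample shaped like the structure the code above assumes
sample = json.loads("""
[
  {"class": "Header"},
  {"props": {"title": "Perfume"}},
  {"props": {"products": [
    {"brandName": "Brand A", "displayName": "Scent One",
     "currentSku": {"listPrice": "$50.00"}},
    {"brandName": "Brand B", "displayName": "Scent Two"}
  ]}}
]
""")

rows = extract_rows(find_products(sample))
print(rows)
```

With the real page you would pass the catalog parsed from the linkJSON script tag instead of sample. Using .get() means a product without a price yields None instead of raising KeyError.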

Problem :

So I am trying to scrape the name, brand, and price of perfumes on the Sephora website, but I noticed that only the first 12 out of 60 perfumes show up (there are 60 perfumes on one page). I printed the length of “item_container” and it shows that there are 60 of them, but starting from the 12th item, markup with a different structure starts to appear. I have checked the HTML structure and I can’t understand why my code is not working for the rest of them.
I have also tried changing the class to a more specific one, for example from:

perfume_containers = soup.find_all('div', class_="css-12egk0t")

to

perfume_containers = soup.find_all('div', class_="css-ix8km1")

but it either gives me the same result or returns nothing.
HTML code of item that is not showing up

HTML code of item that works

Here is my code. I am only showing the part where I extract the brand because the whole thing would be too long. Please send some help! Thank you!!

import pandas as pd
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.sephora.com/shop/perfume')
soup = BeautifulSoup(source.content, 'html.parser')
perfume_containers = soup.find_all('div', class_="css-12egk0t")
brands = []
for container in perfume_containers:
    # The brand
    brand = container.find('span', class_='css-ktoumz')
    try:
        brands.append(brand.text)
    except AttributeError:
        # find() returned None: this container has no brand span
        continue

Comments

Comment posted by DJ-coding

Hi Damzaky! Thank you so much for helping! So is that the only way it would work? Is it possible to do it by scraping the data?

Comment posted by Damzaky

Well, what this does is still actually called “scraping”, but you don’t need to parse every HTML element since the data is already inside a script tag. The content of that script tag is JSON, so it needs an extra parsing step. So what this code does is: scrape -> HTML parse -> JSON parse. Maybe there are other ways, but this one is already fairly simple.
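To make the HTML parse -> JSON parse steps concrete, here is a small offline sketch. It skips the network request by using an inline, made-up HTML string, and it swaps BeautifulSoup for the standard library's html.parser so it runs without any installs. The class name ScriptById and the sample markup are assumptions for illustration only; with BeautifulSoup the same step is the soup.find(id="linkJSON").text call from the answer.

```python
import json
from html.parser import HTMLParser


class ScriptById(HTMLParser):
    """Collect the text inside the <script> tag that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# Stand-in for the downloaded page: one rendered item, all items in the JSON
html = """
<html><body>
<div class="item">Scent One</div>
<script id="linkJSON" type="text/json">[{"props": {"products": [{"displayName": "Scent One"}, {"displayName": "Scent Two"}]}}]</script>
</body></html>
"""

# Step 1: HTML parse, only to pull out the script tag's text
parser = ScriptById("linkJSON")
parser.feed(html)
script_text = "".join(parser.chunks)

# Step 2: JSON parse, which gives ordinary lists and dicts
catalog = json.loads(script_text)
names = [p["displayName"] for p in catalog[0]["props"]["products"]]
print(names)
```

The rendered HTML in the sample contains only one item, but the JSON lists both, which mirrors the 12-versus-60 situation in the question.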
