The content of this page is built inside your browser by JavaScript, so you need a tool with JavaScript support to retrieve it.
Using HtmlUnit, I got the page like this:
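import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

String url = "https://open.sap.com/courses";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)) {
    // Keep going when a script on the page throws an error.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = webClient.getPage(url);
    // Wait for background JavaScript jobs scheduled to start within the next 10 seconds.
    webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
    System.out.println("-------------------------------");
    System.out.println(page.asText());
    System.out.println("-------------------------------");
}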
HtmlUnit has a rich API for working with the page object: searching for controls or content, clicking controls, or extracting the text from parts of the page.
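Once the rendered page is available, the course links can be collected from its anchors. The following is only a minimal sketch in plain Java: it assumes the href values have already been read from the page (for example with HtmlUnit's `page.getAnchors()` and `getHrefAttribute()`), and it assumes that course pages live under `/courses/<name>`, which I have not verified for this site. The class name `CourseLinks` and the method `courseLinks` are made up for the illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CourseLinks {

    // Assumed path prefix of a single course page; adjust to the real site layout.
    private static final String COURSE_PREFIX = "/courses/";

    // Resolves raw href values against the page URL, keeps only links whose
    // path starts with the assumed course prefix, and removes duplicates
    // while preserving the order in which the links were found.
    static Set<String> courseLinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue;
            }
            URI resolved;
            try {
                resolved = base.resolve(href.trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            // getPath() is null for opaque URIs such as "mailto:..." or "javascript:...".
            String path = resolved.getPath();
            if (path != null && path.startsWith(COURSE_PREFIX)
                    && path.length() > COURSE_PREFIX.length()) {
                result.add(resolved.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical href values, only for demonstration.
        List<String> hrefs = Arrays.asList(
                "/courses/java1",
                "/courses",
                "https://open.sap.com/courses/java1",
                "/about",
                "javascript:void(0)");
        System.out.println(courseLinks("https://open.sap.com/courses", hrefs));
    }
}
```

Each returned URL can then be loaded with the same WebClient to read the course summary. Reusing one WebClient for all course pages avoids starting a new browser instance per page.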
When I press CTRL+U on the website, the source shown there also contains more than what I get from jsoup. The site I am using is Open SAP -> https://open.sap.com/courses
I have tried setting timeout and maxBodySize with Jsoup.connect.
Right now my code looks like this:
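import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

private static String getHtml(String location) throws IOException {
    URL url = new URL(location);
    URLConnection conn = url.openConnection();
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
        String input;
        while ((input = in.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            builder.append(input).append('\n');
        }
    }
    return builder.toString();
}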
document = Jsoup.parse(getHtml(URL));
But the same HTML is still returned. It is possible with Selenium, but that is a bit slow, so is there any other way to achieve this?
My aim is to find the links of the courses and then load each of them to get its course summary, which will be too slow with Selenium.
Please suggest what can be done here.
You don’t have to download the HTML. Jsoup can do this for you, just use:
Thanks for the reply, but I mentioned in the question that I have also tried using Jsoup.connect and it didn’t work.
I think your solution will work too, but I implemented the desired result using Selenium in headless mode to scrape the information I required. It's slow, but it works. Do you think the above solution will be faster at fetching results?