Solution 1 :

Most of the page's content is rendered by JavaScript that starts asynchronously, so you have to wait a bit before reading the content.

String url = "https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.classification.exact%22%3A%22Biology%20(General)%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22match_all%22%3A%7B%7D%7D%7D%7D%2C%22sort%22%3A%5B%7B%22created_date%22%3A%7B%22order%22%3A%22desc%22%7D%7D%5D%7D";

try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
    // js is enabled by default
    webClient.waitForBackgroundJavaScriptStartingBefore(1_000);

    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(10_000);

    System.out.println(page.asText());
}

This works here with version 2.42.0-SNAPSHOT, and should also work with the 2.41.0 release.
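A fixed 10-second wait can be longer than needed on a fast connection and too short on a slow one. One possible variant, sketched here under assumptions, is to poll in short steps until the content you expect has shown up or a deadline passes. The class `Poll` and its method `until` are hypothetical helper names for this sketch, not part of HtmlUnit, and the sketch uses only the Java standard library.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of HtmlUnit): retries a condition until it
// holds or the timeout has elapsed.
final class Poll {
    private Poll() {
    }

    /**
     * Evaluates the condition repeatedly, sleeping intervalMillis between
     * attempts. Returns true as soon as the condition holds, or false when
     * timeoutMillis has elapsed or the thread is interrupted.
     */
    static boolean until(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Compare by subtraction so the check stays correct if nanoTime wraps.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With the code above, the condition could call webClient.waitForBackgroundJavaScript(500) and then check whether page.asText() contains some text you expect in the search results, for example inside Poll.until(..., 10_000, 100). Which text to look for depends on the page, so treat that part as an assumption to adapt.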

Problem :

I’m trying to scrape this page with HtmlUnit. The XML output says “You are currently browsing with JavaScript turned off which means you can’t use our search functionality.” I’ve been able to scrape it before using Python and Selenium, but I haven’t been able to get it to work with Java and

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class WebScraper {
    public static void main(String[] args) throws MalformedURLException, IOException {
        try {
            WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setJavaScriptEnabled(true);
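            // Note: this runs before any page is loaded, so there is no background JavaScript to wait for yet.
            webClient.waitForBackgroundJavaScript(2000);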
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            HtmlPage page = webClient.getPage("https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.classification.exact%22%3A%22Biology%20(General)%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22match_all%22%3A%7B%7D%7D%7D%7D%2C%22sort%22%3A%5B%7B%22created_date%22%3A%7B%22order%22%3A%22desc%22%7D%7D%5D%7D");
            System.out.println(page.asXml());

            webClient.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }

    }
}

Comments

Comment posted by Thorbjørn Ravn Andersen

You understand the technical difference between the two?

Comment posted by AllAmericanBreakfast

No, I’ll look into it. If you have any insights to share, I’d appreciate it.

Comment posted by Thorbjørn Ravn Andersen

Selenium drives a full browser running alongside your test code (which has JavaScript). HtmlUnit is a simple browser implemented in Java which does not have JavaScript.

Comment posted by AllAmericanBreakfast

Thanks very much! I switched to Selenium and it’s working now.

Comment posted by level of support

@ThorbjørnRavnAndersen – htmlunit