Most of the page's content is rendered by JavaScript that starts asynchronously, so you have to wait a bit before the content is available.
String url = "https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.classification.exact%22%3A%22Biology%20(General)%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22match_all%22%3A%7B%7D%7D%7D%7D%2C%22sort%22%3A%5B%7B%22created_date%22%3A%7B%22order%22%3A%22desc%22%7D%7D%5D%7D";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
    // JavaScript is enabled by default
    webClient.waitForBackgroundJavaScriptStartingBefore(1_000);
    HtmlPage page = webClient.getPage(url);
    // give the background JavaScript up to 10 seconds to finish rendering
    webClient.waitForBackgroundJavaScript(10_000);
    System.out.println(page.asText());
}
This works here with version 2.42.0-SNAPSHOT and should also work with the 2.41.0 release.
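If a fixed 10 second wait feels too slow or too fragile, one alternative is to poll until the content you need has actually shown up. The following is only a minimal sketch in plain Java: `PollingWait` and `waitUntil` are hypothetical names made up for illustration, not part of HtmlUnit, and the condition you pass in is up to you.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of HtmlUnit): re-checks a condition until it
// holds or a timeout expires.
final class PollingWait {

    private PollingWait() {
    }

    /**
     * Evaluates the condition every intervalMillis milliseconds.
     * Returns true as soon as the condition holds, or false if timeoutMillis
     * elapses first or the waiting thread is interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition for the demo: becomes true after roughly 200 ms.
        final long start = System.currentTimeMillis();
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2_000, 50);
        System.out.println("ready = " + ready);
    }
}
```

With HtmlUnit, the condition could for example call `webClient.waitForBackgroundJavaScript(500)` and then check whether `page.asText()` contains some text you expect in the rendered results. Which text to look for is an assumption you would have to adapt to the page.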
I’m trying to scrape this page with HtmlUnit. The XML output says “You are currently browsing with JavaScript turned off which means you can’t use our search functionality.” I was able to scrape it before using Python and Selenium, but I haven’t been able to get it to work with Java and
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class WebScraper {
    public static void main(String[] args) throws MalformedURLException, IOException {
        try {
            WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.waitForBackgroundJavaScript(2000);
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            HtmlPage page = webClient.getPage("https://doaj.org/search?source=%7B%22query%22%3A%7B%22filtered%22%3A%7B%22filter%22%3A%7B%22bool%22%3A%7B%22must%22%3A%5B%7B%22term%22%3A%7B%22index.classification.exact%22%3A%22Biology%20(General)%22%7D%7D%5D%7D%7D%2C%22query%22%3A%7B%22match_all%22%3A%7B%7D%7D%7D%7D%2C%22sort%22%3A%5B%7B%22created_date%22%3A%7B%22order%22%3A%22desc%22%7D%7D%5D%7D");
            System.out.println(page.asXml());
            webClient.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
No, I’ll look into it. If you have any insights to share, I’d appreciate it.
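Selenium drives a full browser running alongside your test code, so you get that browser's complete JavaScript support. HtmlUnit is a headless browser implemented in Java. It does have its own JavaScript engine, but that engine is less complete than a real browser's, so script-heavy pages may not render the same way.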
Thanks very much! I switched to Selenium and it’s working now.