Solution 1 :

It looks like you need to either provide a DTD that declares the entity, or replace the entity name auml with its hex or decimal character reference, i.e. &#xE4; or &#228; respectively. See A.2. Entity Sets and HTML 4 Entity Names.

The html content would look like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
        <!ENTITY auml "&#228;">
]>
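
As a rough sketch of the DTD approach, the snippet below prepends an internal DTD subset to the markup and parses it with the JDK's own DocumentBuilder, just to show that the entity then resolves. The class name DtdPrependDemo, the helper parseWithEntities, and the extra ouml/uuml declarations are illustrative assumptions, and it presumes the markup is well-formed XHTML that does not already carry its own XML declaration or DOCTYPE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class DtdPrependDemo {

    // Assumed prolog: declares only the entities this example needs.
    static final String PROLOG =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
          + "<!DOCTYPE html [\n"
          + "  <!ENTITY auml \"&#228;\">\n"
          + "  <!ENTITY ouml \"&#246;\">\n"
          + "  <!ENTITY uuml \"&#252;\">\n"
          + "]>\n";

    // Hypothetical helper: prepends the prolog and parses the result as XML.
    static Document parseWithEntities(String xhtmlBody) {
        String xml = PROLOG + xhtmlBody;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithEntities("<html><body><p>k&auml;se</p></body></html>");
        // The entity reference is expanded by the parser, so the text contains the umlaut.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The resulting org.w3c.dom.Document could then be handed to the PDF builder in the same way as any other W3C document.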

Alternatively, you can scan the HTML string and replace each entity name with its decimal or hex character reference, or simply prepend the DTD to your HTML string before passing it to the PDF builder.
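
The replacement option could be sketched roughly as follows. The class EntityReplacer and its lookup table are hypothetical and cover only a handful of German characters; a real table would need every entity name your HTML can contain. Names that are not in the table (including XML's predefined ones such as &amp;) are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityReplacer {

    // Illustrative subset only; the HTML 4 entity tables list the rest.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
            "auml", 228, "ouml", 246, "uuml", 252,
            "Auml", 196, "Ouml", 214, "Uuml", 220,
            "szlig", 223);

    // Matches named references such as &auml; but not numeric ones such as &#228;.
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Rewrites known named entities as decimal character references.
    static String replaceNamedEntities(String html) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement = (codePoint == null)
                    ? m.group()                 // unknown name: keep as-is
                    : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNamedEntities("<p>k&auml;se</p>")); // <p>k&#228;se</p>
    }
}
```

Numeric character references need no DTD, so the rewritten string should get past the "referenced, but not declared" error, provided the rest of the markup is well-formed.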


You might also want to give the jsoup library a try. It parses the HTML, and its W3CDom helper converts the result into an org.w3c.dom.Document, e.g.

Document jsoupDoc = Jsoup.parse(html); // org.jsoup.nodes.Document
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

You can then pass the w3cDoc to the PDF builder like so:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(w3cDoc, "file://localhost/");

Problem :

I’m using openhtmltopdf to transform HTML to PDF. Currently I’m getting an exception if the HTML contains German characters such as ä, ö, ü.

  PdfRendererBuilder builder = new PdfRendererBuilder();

org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 31; The entity "auml" was referenced, but not declared.

Here is my HTML:

      <meta charset="UTF-8" />

The exported word is “käse” (cheese).


I tried using an entity resolver, like this:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;

    ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes("UTF-8"));
    org.w3c.dom.Document doc = builder.parse(input);

} catch (Exception e) {

but I’m still getting the same exception at “parse”.


