It looks like you either need to provide a DTD or replace the entity name auml with its corresponding hex or decimal value, i.e. &#xE4; or &#228; respectively. See A.2. Entity Sets and HTML 4 Entity Names.
The html content would look like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
<!ENTITY auml "ä">
]>
<html>
<head>
</head>
<body>
k&auml;se
</body>
</html>
Alternatively, you can run through the HTML string and replace the entity names with their corresponding decimal or hex values, or simply prepend the DTD to your HTML string before passing it to the PDF builder.
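A minimal sketch of the first alternative (replacing entity names with numeric character references before the string reaches the parser). The class name EntityFixer, the method toNumericEntities and the small entity table are hypothetical and only cover a handful of German characters; a real table would need the full HTML 4 entity list. Unknown names, including the predefined XML ones such as &amp;, are left untouched. It assumes Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityFixer {
    // Hypothetical, deliberately tiny table: entity name -> Unicode code point.
    // Extend it with further HTML 4 entity names as needed.
    private static final Map<String, Integer> ENTITIES = Map.of(
            "auml", 0xE4, "ouml", 0xF6, "uuml", 0xFC,
            "Auml", 0xC4, "Ouml", 0xD6, "Uuml", 0xDC,
            "szlig", 0xDF);

    // Matches named references such as &auml; but not numeric ones such as &#228;
    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String toNumericEntities(String html) {
        Matcher m = NAMED.matcher(html);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            Integer codePoint = ENTITIES.get(m.group(1));
            // Names missing from the table (e.g. &amp;) are copied through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("<body>k&auml;se &amp; brot</body>"));
    }
}
```

The converted string can then be handed to builder.withHtmlContent(...) as before, since numeric character references need no DTD.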
Update
You might want to give the jsoup library a try. It parses the HTML and can provide you with an org.w3c.dom.Document, e.g.
Document jsoupDoc = Jsoup.parse(html); // org.jsoup.nodes.Document
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);
You can then pass the w3cDoc to the PDF builder like so:
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(w3cDoc, "file://localhost/");
I’m using openhtmltopdf to transform HTML to PDF. Currently I’m getting an exception if the HTML contains German characters such as ä, ö, ü.
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.useFastMode();
builder.withHtmlContent(html,"file://localhost/");
builder.toStream(out);
builder.run();
org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 31; The entity "auml" was referenced, but not declared.
Here is my HTML:
<html>
<head>
<meta charset="UTF-8" />
</head>
<body>
k&auml;se
</body>
</html>
The exported word is “käse” (cheese).
UPDATE
I have tried with an entity resolver, in this way:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
try {
    builder = factory.newDocumentBuilder();
    ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes("UTF-8"));
    builder.setEntityResolver(FSEntityResolver.instance());
    org.w3c.dom.Document doc = builder.parse(input);
} catch (Exception e) {
    logger.error(e.getMessage(), e);
}
but I’m still getting the same exception at “parse”.
Your answer goes in the right direction, thx. I’m pretty sure I can do it programmatically, instead of declaring the DTD in the html. I have tried using an entity resolver (I have updated my question), still not working, but I think I’m closer…