Charset in Jsoup

2018-06-28 02:07:21

I am using the Jsoup library.

After running the following code:

Document doc = new Document(language);

File input = new File("filePath" + "filename.html");
PrintWriter writer = new PrintWriter(input, "UTF-8");

String contentType = "<%@ page contentType=\"text/html; charset=UTF-8\" %>";
doc.appendText(contentType);

writer.write(doc.toString());
writer.flush();
writer.close();

the output HTML file contains the following line of text:

&lt;%@ page contentType=&quot;text/html; charset=UTF-8&quot; %&gt;

instead of

<%@ page contentType="text/html; charset=UTF-8" %>

What could be the problem?

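These are escape sequences that keep the browser from treating the text as HTML tags. This is not a problem: when you open the page in a browser, it will be rendered correctly.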
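To illustrate the mapping this answer describes, here is a minimal sketch. The `escape` and `unescape` helpers are hypothetical and written only for this example; they are not Jsoup's own implementation, which handles many more entities and may escape fewer characters in text nodes.

```java
final class EscapeDemo {
    // Hypothetical helper: replaces the characters that have special meaning in HTML.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Hypothetical helper: reverses escape(), roughly what a browser does when it renders text.
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" rather than "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String directive = "<%@ page contentType=\"text/html; charset=UTF-8\" %>";
        String escaped = escape(directive);
        System.out.println(escaped);           // the form seen in the output file
        System.out.println(unescape(escaped)); // the form a reader sees in the browser
    }
}
```

The escaped string is what ends up in the file, and the unescaped string is what is displayed. Note that this only concerns display: the escaped text is shown as literal characters, not interpreted as a directive.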

There are a few problems here:

Document doc = new Document(language);

Don't do this. Use Jsoup.parse(...) instead.

<%@ page contentType="text/html; charset=UTF-8" %>

This is not HTML, and it may not be parsed correctly.
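Since the directive is not HTML, one option is to keep it out of the document tree and write it to the file as plain text ahead of the serialized markup. The following is only a sketch under that assumption: `JspWriterSketch`, `withDirective`, `writeJsp`, and the sample markup are hypothetical names, and in the asker's code the `html` argument would be whatever the Jsoup document serializes to (for example `doc.outerHtml()`).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class JspWriterSketch {
    static final String DIRECTIVE = "<%@ page contentType=\"text/html; charset=UTF-8\" %>";

    // Puts the directive on its own line in front of the already serialized HTML.
    static String withDirective(String html) {
        return DIRECTIVE + "\n" + html;
    }

    // Writes the combined text as UTF-8, so the directive never passes through HTML escaping.
    static void writeJsp(Path target, String html) throws IOException {
        Files.write(target, withDirective(html).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("page", ".jsp");
        writeJsp(target, "<html><body><p>Hello</p></body></html>");
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```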

Now, to your question. You should use something like

Document document = Jsoup.parse(new ByteArrayInputStream(myHtmlString.getBytes(StandardCharsets.UTF_8)), "UTF-8", BaseUrl);
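The charset name passed to `Jsoup.parse` tells it how to decode the bytes in the stream, so it generally needs to match the encoding that produced those bytes. The sketch below uses only the JDK to show what a mismatch does to a non-ASCII character; `CharsetMismatchDemo` and `decode` are hypothetical names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Encodes the text as UTF-8, then decodes the resulting bytes with the given charset.
    static String decode(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e
        // Matching charsets: the text round-trips unchanged.
        System.out.println(decode(text, StandardCharsets.UTF_8));
        // Mismatched charsets: the two UTF-8 bytes of the accented letter become two separate characters.
        System.out.println(decode(text, StandardCharsets.ISO_8859_1));
    }
}
```

For pure ASCII content the two charsets give the same result, which is why a mismatch can go unnoticed until accented or non-Latin text appears.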

Check this, along with the outputSettings you will probably need.
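As a rough sketch of how parsing and output settings fit together, assuming the jsoup library is on the classpath: `JsoupCharsetSketch`, `parseUtf8`, the sample markup, and the `http://example.com/` base URI are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class JsoupCharsetSketch {
    // Parses the markup from UTF-8 bytes and asks Jsoup to serialize with UTF-8 as well.
    static Document parseUtf8(String html, String baseUri) {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        try {
            Document doc = Jsoup.parse(new ByteArrayInputStream(bytes), "UTF-8", baseUri);
            // Output charset used when the document is turned back into a string.
            doc.outputSettings().charset("UTF-8");
            return doc;
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseUtf8(
                "<html><head><title>Caf\u00e9</title></head><body></body></html>",
                "http://example.com/");
        System.out.println(doc.title());
        System.out.println(doc.outerHtml());
    }
}
```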
