How to get date strings from the content of a PDF with Apache Solr
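Hello everyone, I'm new to Apache Solr. I have a PDF that contains date information, for example "bla bla bla 2012-11-23 11:11:12 bla bla ...", and I want to get all the dates from the content.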
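To make the goal concrete, here is a minimal sketch of the result I am after, written in plain Python outside of Solr. It assumes the PDF text is already available as a string; `find_dates` and `TIMESTAMP_RE` are hypothetical names for illustration, not part of Solr.

```python
import re
from datetime import datetime

# Matches timestamps shaped like 2012-11-23 11:11:12 (the format in my PDF).
TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")


def find_dates(text):
    """Return every well-formed timestamp found in text, in order of appearance."""
    found = []
    for match in TIMESTAMP_RE.findall(text):
        try:
            # Reject look-alikes such as 2012-13-45 99:99:99.
            datetime.strptime(match, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        found.append(match)
    return found


sample = "bla bla bla 2012-11-23 11:11:12 bla bla 2012-13-45 99:99:99 bla"
print(find_dates(sample))
```

The regular expression only finds candidates; the `strptime` call then filters out strings that have the right shape but are not real dates.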
I read some of the documentation (http://wiki.apache.org/solr/ExtractingRequestHandler) and added date.formats to the /update/extract handler:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
    <str>yyyy-MM-dd'T'HH:mm:ss'Z'</str>
    <str>yyyy-MM-dd'T'HH:mm:ss</str>
    <str>yyyy-MM-dd</str>
    <str>yyyy-MM-dd hh:mm:ss</str>
    <str>yyyy-MM-dd HH:mm:ss</str>
  </lst>
</requestHandler>
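A note on the last two patterns: the pattern letters follow Java's SimpleDateFormat notation, where `hh` is the hour on a 12-hour clock and `HH` is the hour on a 24-hour clock. The sketch below shows the same distinction with Python's `strptime`, using `%I` and `%H` as rough stand-ins. It is an analogy only and does not go through Solr; Python's parser is strict, whereas a Java parser may be configured to be lenient, so the exact accept/reject behavior can differ. `parses` is a hypothetical helper.

```python
from datetime import datetime

# Rough Python equivalents of two of the patterns above (an analogy, not Solr code):
#   yyyy-MM-dd hh:mm:ss  ->  %Y-%m-%d %I:%M:%S  (12-hour clock)
#   yyyy-MM-dd HH:mm:ss  ->  %Y-%m-%d %H:%M:%S  (24-hour clock)
TWELVE_HOUR = "%Y-%m-%d %I:%M:%S"
TWENTY_FOUR_HOUR = "%Y-%m-%d %H:%M:%S"


def parses(value, fmt):
    """Return True if value can be parsed with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


morning = "2012-11-23 11:11:12"
evening = "2012-11-23 23:11:12"

print(parses(morning, TWELVE_HOUR), parses(morning, TWENTY_FOUR_HOUR))
print(parses(evening, TWELVE_HOUR), parses(evening, TWENTY_FOUR_HOUR))
```

The timestamp from my PDF (11:11:12) fits both patterns, but a time such as 23:11:12 only fits the 24-hour one under strict parsing.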
I add the PDF as follows:
curl "http://localhost:8983/solr/update/extract?literal.id=sql.txt&uprefix=attr_&fmap.content=attr_content&commit=true&stream.file=/home/example/example.pdf"
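Since the quoting in that command is easy to get wrong, the same request URL can also be assembled programmatically so that every parameter ends up in a single, properly encoded query string. This is a sketch that only builds the URL and does not send it; the host, port, and parameter values are the ones from the command above, and `build_extract_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base endpoint taken from the curl command above.
BASE_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, pdf_path):
    """Build the /update/extract URL with all parameters in one encoded query string."""
    params = {
        "literal.id": doc_id,
        "uprefix": "attr_",
        "fmap.content": "attr_content",
        "commit": "true",
        "stream.file": pdf_path,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_extract_url("sql.txt", "/home/example/example.pdf")
print(url)
```

`urlencode` percent-encodes the slashes in the file path, which keeps the path from being confused with the rest of the URL.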
Is there anything else I need to take care of regarding the dates? And the content?
Thanks