How to get date strings from the content of a PDF with Apache Solr

2018-06-23 22:05:28

Hi all, I am new to Apache Solr. I have a PDF that contains date information, for example "bla bla bla 2012-11-23 11:11:12 bla bla ...", and I want to get all the dates from the content.

I read some documentation (http://wiki.apache.org/solr/ExtractingRequestHandler) and added date.formats to /update/extract:

 <requestHandler name="/update/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <!-- All the main content goes into "text"... if you need to return
       the extracted text or do highlighting, use a stored field. -->
  <str name="fmap.content">text</str>
  <str name="lowernames">true</str>
  <str name="uprefix">ignored_</str>

  <!-- capture link hrefs but ignore div attributes -->
  <str name="captureAttr">true</str>
  <str name="fmap.a">links</str>
  <str name="fmap.div">ignored_</str>
</lst>
<lst name="date.formats">
  <str>yyyy-MM-dd</str>
  <str>yyyy-MM-dd'T'HH:mm:ss'Z'</str>
  <str>yyyy-MM-dd'T'HH:mm:ss</str>
  <str>yyyy-MM-dd</str>
  <str>yyyy-MM-dd hh:mm:ss</str>
  <str>yyyy-MM-dd HH:mm:ss</str>
</lst>
</requestHandler>

I am adding the PDF like this:

curl "http://localhost:8983/solr/update/extract?literal.id=sql.txt&uprefix=attr_&fmap.content=attr_content&commit=true&stream.file=/home/example/example.pdf"

But the result contains nothing about the dates, or about the content.

Thanks
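
A note on the goal above: as far as I can tell, date.formats only tells the extracting handler how to parse whole values of date-typed fields (such as document metadata), and it does not scan the body text for dates. If that is right, the dates have to be pulled out of the extracted text separately. The following is a minimal Python sketch of that step, not anything Solr provides. It assumes the extracted content is already available as a plain string (for example, read back from a stored field such as attr_content); the names DATE_RE, FORMATS and extract_dates are made up for this illustration.

```python
import re
from datetime import datetime

# Matches "yyyy-MM-dd", optionally followed by " HH:mm:ss" or "THH:mm:ss".
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}(?:[ T]\d{2}:\d{2}:\d{2})?\b")

# strptime equivalents of the patterns used in the question (hypothetical choice).
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d")


def extract_dates(text):
    """Return a datetime for every date-like substring that is a valid date."""
    found = []
    for match in DATE_RE.finditer(text):
        candidate = match.group(0)
        for fmt in FORMATS:
            try:
                found.append(datetime.strptime(candidate, fmt))
                break
            except ValueError:
                # Wrong format for this candidate, or an impossible date.
                continue
    return found


# Sample text standing in for the extracted PDF content.
sample = "bla bla bla 2012-11-23 11:11:12 bla bla 2013-02-30 bla 2014-01-05"
for found_date in extract_dates(sample):
    print(found_date.isoformat())
```

The regular expression only finds candidates; datetime.strptime then rejects strings that look like dates but are not (2013-02-30 in the sample). The parsed values could be sent back to Solr in a multi-valued date field if they need to be searchable.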
