Quartz Scheduler: Maintain processed files list in the event job metadata

We use the Quartz scheduler in our application to scan a particular folder for new files and, when one arrives, kick off the associated workflow in the application to process it. For this we created a custom listener object, associated with a job and a trigger, that polls the file location every 5 minutes.

The requirement is to process only new files arriving in that folder while ignoring files that have already been processed. We also don't want the folder to accumulate a large number of files (otherwise it will slow down the folder scanning), so at the end of the workflow we delete the source file.

To implement this, we decided to maintain the list of processed files in the job metadata. At each poll we fetch the list of processed files from the job metadata, compare it against the current list of files in the folder, and if a file has not yet been processed, we kick off the associated process flow.
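The per-poll comparison described above can be sketched as follows (a minimal sketch with hypothetical names; in the real job the `processed` set would be read from, and written back to, the Quartz `JobDataMap`):

```java
import java.util.*;
import java.util.stream.*;

public class NewFileDetector {

    // Returns the files present in the current folder snapshot that are
    // not in the already-processed set, preserving snapshot order.
    public static List<String> findNewFiles(Collection<String> currentSnapshot,
                                            Set<String> processed) {
        return currentSnapshot.stream()
                .filter(name -> !processed.contains(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> processed = new HashSet<>(Arrays.asList("a.csv", "b.csv"));
        List<String> snapshot = Arrays.asList("a.csv", "b.csv", "c.csv");
        System.out.println(findNewFiles(snapshot, processed)); // prints [c.csv]
    }
}
```

Each file returned by `findNewFiles` would trigger a workflow and then be appended to the processed set that is persisted back into the job metadata.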

The problem with the above approach is that over the years (and depending on the number of files received per day, which can reach 100K), the job metadata that persists the list of processed files grows very large. It started giving us data truncation errors (while persisting the job metadata in the Quartz tables) and slowness.

To address this problem, we decided to refresh the list of processed files in the job metadata with the current snapshot of the folder. Since we delete each processed file from the folder at the end of its workflow, the list stays within limits. But we then started processing duplicate files when a file arrived with the same name the next day.

What is the best approach for implementing this requirement while ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed-files list in an external database instead of in the job metadata? I am looking for the recommended approach for implementing this solution. Thanks!
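One common way to make the external-database idea robust against name reuse is to key processed files by name plus a content hash rather than by name alone. Here is a minimal sketch of that idea (all names hypothetical); in a real system the in-memory `Set` below would be a database table with a unique constraint on `(file_name, content_sha256)`, and the insert would be the deduplication check:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ProcessedFileRegistry {
    // Stand-in for a DB table keyed on (file name, SHA-256 of content).
    private final Set<String> seen = new HashSet<>();

    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // Returns true if this (name, content) pair was not seen before
    // and is now recorded; false if it is a duplicate.
    public boolean markProcessed(String fileName, byte[] content) {
        return seen.add(fileName + ":" + sha256Hex(content));
    }
}
```

With this scheme a file reusing yesterday's name is skipped only if its content is also identical, so legitimate new deliveries under the same name are still processed.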


We had a similar requirement recently with our scheduler. If you are on Linux, why not use a solution such as inotify? Other operating systems have their own ways to monitor file-system events.
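On the JVM, the portable counterpart to inotify is `java.nio.file.WatchService` (backed by inotify on Linux), so the polling job can be replaced with event-driven dispatch. A minimal sketch, assuming the watched folder is passed as an argument:

```java
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class FolderWatcher {

    // Waits up to timeoutSeconds for creation events and returns the
    // (relative) paths of newly created files; empty list on timeout.
    public static List<Path> pollOnce(WatchService watcher, long timeoutSeconds)
            throws InterruptedException {
        List<Path> created = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) return created;
        for (WatchEvent<?> event : key.pollEvents()) {
            created.add((Path) event.context());
        }
        key.reset(); // re-arm the key for further events
        return created;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            for (Path file : pollOnce(watcher, 60)) {
                System.out.println("New file: " + dir.resolve(file));
                // kick off the associated workflow here
            }
        }
    }
}
```

This avoids rescanning the folder entirely, though you would still want a dedup record (and a catch-up scan on startup) since events raised while the application is down are lost.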

Our solution was to trigger file processing on each creation event and then, every x days, remove the older entries (similar to Walen's DB suggestion). That way the list does not inflate too much, and duplicate files can be handled as their own specific case.
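The periodic cleanup described above can be sketched like this (names hypothetical): record when each file was processed, and on each cleanup pass drop entries older than the retention window, so the list stays bounded regardless of daily volume:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

public class ProcessedListPruner {

    // Removes every entry whose processed-at timestamp is older than
    // the retention window, measured from `now`.
    public static void prune(Map<String, Instant> processedAt,
                             Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        processedAt.values().removeIf(t -> t.isBefore(cutoff));
    }
}
```

The retention window then defines how long a reused file name is treated as a duplicate; anything older is eligible for reprocessing, which matches the "every x days" cleanup described above.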

(Sorry I do not have the rights to comment yet.)
