Design patterns for aggregating heterogeneous tabular data
I'm working on some C++ code that integrates information from several dozen CSV files. They all contain time-stamped record data I want to extract, but the representation is somewhat different in each file. The differences between representations go beyond different column orderings and column names — for example, what's one row with multiple columns in one file may be multiple rows in a different file.
So I need some custom handling for each file to put together a unified data structure that includes the necessary information from all the files. My question is whether there's a preferred code pattern to keep the complexity manageable and the code elegant — or whether there's a good case study I should examine to see how this sort of complexity has been handled in the past.
(I realize something like this might be easier in a scripting language like Perl, but the project is in C++ for now. Also, my question is more about whether there's a code pattern to deal with this, so the answer doesn't have to be too language-specific.)
There are several phrases in your question that stand out to me: "custom handling for each file", "representation is somewhat different", and "complexity manageable". Since you'll need a different parsing algorithm depending on the format of each CSV file, and (from what I can tell) you want to loosely couple your parsing mechanism from the rest of the code, I would recommend the strategy pattern.
The strategy pattern will decouple the parsing mechanism from the users of the data contained in the CSV file. The users of the data have no interest in what format the CSV file is in; they are only interested in the information within that file, which makes the strategy pattern an excellent choice. If there are similarities between your parsing mechanisms, you can combine the template method pattern with the strategy pattern to reduce duplication and take advantage of inheritance.
Once you are using the strategy pattern, you can extract strategy creation into a factory method or abstract factory as you see fit, further decoupling clients from the choice of parsing method.
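To make this concrete, here is a minimal sketch of a strategy interface plus a factory function. All names (`Record`, `ParsingStrategy`, the two layout parsers) are illustrative assumptions, not part of your existing code, and the parse bodies are placeholders for the real per-file logic:

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical unified record type -- the fields are illustrative only.
struct Record {
    std::string timestamp;
    std::string key;
    double value;
};

// Strategy interface: one parser per CSV layout.
class ParsingStrategy {
public:
    virtual ~ParsingStrategy() = default;
    virtual std::vector<Record> parse(const std::string& csvText) const = 0;
};

// One concrete strategy per file layout (bodies are placeholders).
class WideFormatParser : public ParsingStrategy {
public:
    std::vector<Record> parse(const std::string& csvText) const override {
        // ... split one multi-column row into several Records ...
        return {};
    }
};

class LongFormatParser : public ParsingStrategy {
public:
    std::vector<Record> parse(const std::string& csvText) const override {
        // ... each row already maps to one Record ...
        return {};
    }
};

// Factory method: clients ask for a parser by layout name and never
// depend on the concrete parser classes.
std::unique_ptr<ParsingStrategy> makeParser(const std::string& layout) {
    if (layout == "wide") return std::make_unique<WideFormatParser>();
    if (layout == "long") return std::make_unique<LongFormatParser>();
    throw std::runtime_error("unknown layout: " + layout);
}
```

The code that consumes records only ever sees `std::vector<Record>`; adding support for a new file layout means writing one new strategy class and one new line in the factory.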
I am not quite sure what you want to do with the different files. If the idea is to use them like database tables, and you have keys with attached information scattered across multiple files, you might want to look at something like MapReduce: build up partial information from each file first, then aggregate the information sharing the same key in a second step.
As for data structures, it depends on the layout of your files. I would probably have a dedicated reader for each file type, each storing its output in a dedicated data structure representing the information in that file. You could attach a key to each piece of information and use a reduce operation to merge all the information fragments sharing the same key, aggregating them into a proxy structure.
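The reduce step described above can be sketched as follows. The `Fragment` type and the string-to-string field map are assumptions for illustration; in practice each reader would emit whatever typed fragments suit its file:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical fragment emitted by each file-specific reader.
struct Fragment {
    std::string key;                            // join key shared across files
    std::map<std::string, std::string> fields;  // whatever this file knew
};

// "Reduce" step: merge all fragments sharing a key into one proxy record.
std::map<std::string, std::map<std::string, std::string>>
mergeByKey(const std::vector<Fragment>& fragments) {
    std::map<std::string, std::map<std::string, std::string>> merged;
    for (const auto& f : fragments) {
        // Later fragments overwrite duplicate field names; adjust the
        // conflict policy to suit your data.
        for (const auto& [name, value] : f.fields)
            merged[f.key][name] = value;
    }
    return merged;
}
```

Each reader stays simple (it only knows its own file), and all cross-file complexity is concentrated in this one merge function.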
On the other hand, if the idea is to build identical objects from different serialization methods (i.e. the different files are independent but represent the same type of data with different layouts), without knowing in advance which serialization method was used, I am afraid the only solution left is to brute-force the deserialization. You can have a set of readers, one per input type, and try each in turn: if one fails to parse the file, the next one takes over, and so on, until you either find the appropriate reader or discover that this is a new file format. I don't think there is any named pattern covering this.
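That brute-force "try each reader until one succeeds" loop might look like the sketch below. The `Reader` interface, `HeaderReader`, and `Record` are hypothetical names; the key idea is that each reader signals failure with `std::nullopt` rather than an exception, so trying the next one is cheap:

```cpp
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Record {
    std::string timestamp;
    std::string payload;
};

// Each reader returns records on success, or nullopt if the input does
// not match its expected format.
class Reader {
public:
    virtual ~Reader() = default;
    virtual std::optional<std::vector<Record>>
    tryParse(const std::string& text) const = 0;
};

// Toy concrete reader that recognizes its format by the header line.
class HeaderReader : public Reader {
public:
    explicit HeaderReader(std::string header) : header_(std::move(header)) {}
    std::optional<std::vector<Record>>
    tryParse(const std::string& text) const override {
        if (text.rfind(header_, 0) != 0) return std::nullopt;
        return std::vector<Record>{};  // real row parsing elided
    }
private:
    std::string header_;
};

// Try every known reader in turn; the first success wins.
std::optional<std::vector<Record>>
parseUnknownFormat(const std::string& text,
                   const std::vector<std::unique_ptr<Reader>>& readers) {
    for (const auto& r : readers)
        if (auto result = r->tryParse(text))
            return result;
    return std::nullopt;  // genuinely new file format
}
```

In a real implementation you would want each reader's format check to be as cheap as possible (e.g. inspect only the header line) so that failed attempts don't re-scan the whole file.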