Aggregate functions, Lack of Types, Entropy in Query problems on NoSQL DBs

While we believe that NoSQL Databases have come to fill a number of gaps which are challenging on the side of RDBMS, i have had several challenges over time with NoSQL DBs in the area of their query eco-system.

Couchbase for example, like its mother CouchDB have had major improvements in reading data using views, lists, Key lookups, map reduce, etc Couchbase has even moved to create an SQL-like query engine for their huge 2.X verson. MongoDB has also made serious improvements and complex queries are possible on it and many other NoSQL DB developments going on out there.

Most NoSQL DBs can perform Complex queries based on LOGICAL and COMPARISON OPERATORS eg AND , OR , == etc However, aggergation and performing complex relations on data are a problem on my part. For example, in CouchDB and/or Couchbase, Views span only a single DB. It is not possible to write a view which will aggregate data from two or more databases.

Let me now get to the problem. Functions (whether aggregate or not): AVG , SUM , ROUND , TRUNC , MAX , MIN , etc The lack of data types makes it impossible to efficiently work with Date and Times hence the lack of Date and time functions eg TO_DATE , SYSDATE (for system date/time), ADD_MONTHSs , DATE BETWEEN , DATE/TIME format Conversion etc It is true, that many will say that , they lack Schemas, types and stuff, but, i have found myself not running away from the need for atleast any one of the functions listed up there. For example because NoSQL DBs have no Date/Time data type, it is hard to perform queries based on those, because you might want to analyse trends based on time. Also, others have tried to use UNIX/EPOC Time stamps and stuff to solve this but it aint a single size fits all solution.

Map Reduce can be used to attain aggregation to a certain (small) degree, but the overhead has been realised to be great. However, the lack of GROUP BY functionality makes it a straineous solution to filter through what ou want. Look at the query below:

SELECT 
doc.field1, doc.field3, SUM(doc.field2 + doc.field4)
FROM 
couchdb.my_database 
GROUP BY doc.field1, doc.field3 
HAVING SUM(doc.field2 + doc.field4) > 20000;

This is not very easy to attain on CouchDB or Couchbase. i am not sure if its possible on MongoDB. I wish it were possible out of the box. This has made it difficult to use NoSQL as a Data warehouse or OLTP/OLAP solution. I found that, each time a complex analysis needs to be made, one needs to do it in the middle ware by paging through different datasets. Now, most experienced Guys (eg CLOUDANT ) have tweaked LUCENE to perform complex queries, but because it was initially meant for indexing and text search, it has not solved the lack of FUNCTIONS and DATA AGGREGATION on most NoSQL DBs.

Because of lack of FUNCTIONS , most NoSQL DBs have the NULL data type but lack the option of converting NULL Objects to something else, like it is in some RDBMS. For example in Oracle, i could: NVL(COLUMN,0) in order to include all the rows while performing say an AVG calculation on a given column (since say, by default the null columns will not be counted/included in the query processing).

To fully understand the problem, CouchDB views for example operate within the scope of a doc like this below:

function(doc){
    // if statements, logical operators, comparison operators
    // e.t.c here. until you do am emit of that doc
    // if it satisfies the conditions set
    // emit(null, doc) OR emit(doc.x,[doc.y, doc.z]) e.t.c.
    // you can only emit javascript data types anyways
    emit(doc.field1,doc)
}
The docs which satisfy the filters, are let through and go onto the next stage or to a reduce function. Imagine a doc structure like this below:

{
    x: '',
    y: '',
    z: {
        p: '',
        n: N    // integer or number data type        
    },
    date: 'DD/MON/YYYY'  // date format
}

Now, lets imagine the possibility of this kind of query:

function(){
    var average = select AVG(doc.z.n) from couchdb.my_database;
    var Result = select doc.x,doc.y from couchdb.my_database where 
            doc.z.n > average and doc.y = 'some string' and 
            doc.date between '01-JUN-2012' and '03-AUG-2012';
    emit(Result);
}

OR if this query were possible:

function(){
    var latest = select MAX(doc.date) from couchdb.my_database;
    var Result  = select 
                    doc.x,doc.z.p,MONTHS_BETWEEN(doc.date,latest) as "Months_interval" 
                    from couchdb.my_database where doc.y like '%john%' 
                    order by doc.z.p;
    emit(Result);
}
Qn 1: Which NoSQL Database solution has attained to a great degree, the query capability being talked about in the details above ? what key features make it stand out ?

Qn 2: Is the lack of a Schema, or the characteristic of being Key-Value a reason for the lack of FUNCTIONS in Querying these Databases ? What is the reason for the lack of Aggregate functionality in most NoSQL DBs ?

Qn 3: If the query ability above is possible in any of the NoSQL DBs, show how the last two (2) query problems above can be attained using the existing NoSQL infrastracture (consider any NoSQL technology of your choice)


MongoDB has something called Aggregation Framework and it works pretty well. I would say that almost every SQL Aggregation query could be carried out with this framework. Here you have some examples of "conversion" from SQL to Aggregation Framework.

Anyway MongoDB is a document oriented database and not key-value like CouchDB, so I don't know if it fits your requirements.

链接地址: http://www.djcxy.com/p/66412.html

上一篇: 哪个NoSQL数据库最适合追加审计日志用例?

下一篇: NoSQL DB上的查询问题中的聚集函数,缺少类型,熵