BQ SQL solution for comparing rows based on variance

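I am trying to compare retail item price data in BigQuery (roughly 2-3B rows, depending on the time period and retailer); the goal is to identify meaningful price differences. For example, $1.99 vs $2.00 is not meaningful, but $1.99 vs $2.50 is. Meaningful is defined as a 2% difference between prices.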

An example data set for one item looks like this:

ITEM       Price($)  Meaningful (This is the column I'm trying to flag) 
Apple      $1.99     Y (lowest price would always be flagged)
Apple      $2.00     N ($1.99 v $2.00)
Apple      $2.01     N ($1.99 v $2.01)  Still using $1.99 for comparison
Apple      $2.50     Y ($1.99 v $2.50)  Still using $1.99 for comparison
Apple      $2.56     Y ($2.50 v $2.56)  Now using $2.50 as new comp. price
Apple      $2.62     Y ($2.56 v $2.62)  Now using $2.56 as new comp. price

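I was hoping to solve this with SQL window functions alone (LEAD, LAG, PARTITION BY, etc.), comparing the current row's price with the next row. However, that breaks down as soon as I hit a non-meaningful price, because I always want to compare the next value with the most recent meaningful price (see the $2.50 row in the example above, which is compared with $1.99 rather than with the $2.00 and $2.01 on the preceding rows).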
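The difference between the two comparison rules can be sketched in plain JavaScript. This is only an illustration: the function names and the `drifting` price list are hypothetical (not from the data set), and the 2% threshold is assumed to be a strict greater-than test on the relative difference.

```javascript
// Assumption: "meaningful" means the relative increase is strictly above 2%.
const THRESHOLD = 0.02;

// Naive rule: compare each price with the immediately preceding row,
// which is what a LAG-style window function provides.
function flagAgainstPreviousRow(prices) {
  return prices.map((price, i) =>
    i === 0 || (price - prices[i - 1]) / prices[i - 1] > THRESHOLD ? "Y" : "N"
  );
}

// Desired rule: compare each price with the last price that was flagged "Y".
function flagAgainstLastMeaningful(prices) {
  let baseline = null; // most recent meaningful price
  return prices.map((price) => {
    if (baseline === null || (price - baseline) / baseline > THRESHOLD) {
      baseline = price;
      return "Y";
    }
    return "N";
  });
}

// Hypothetical prices, sorted ascending: every step is under 2%,
// but the cumulative drift from 1.99 eventually exceeds 2%.
const drifting = [1.99, 2.00, 2.01, 2.02, 2.03, 2.04];
const naiveFlags = flagAgainstPreviousRow(drifting);
const baselineFlags = flagAgainstLastMeaningful(drifting);

console.log(naiveFlags.join(" "));    // Y N N N N N
console.log(baselineFlags.join(" ")); // Y N N N Y N
```

With these prices the previous-row rule never flags anything after the first row, while the running-baseline rule flags $2.03 because it is more than 2% above $1.99. The second rule depends on state carried from earlier rows, which is why a single LEAD or LAG expression does not capture it.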

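My questions:

  • Can this be solved in BigQuery with SQL alone? (For example, what creative SQL logic am I overlooking, such as bucketing based on the size of the difference?)
  • Since I cannot use stored procedures in BQ, what procedural options do I have? Python / DataFrames in GCP Datalab? BQ UDFs?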

  • The following is for BigQuery Standard SQL:

    #standardSQL
    CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
    RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
    LANGUAGE js AS """
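      var result = [];
      var last = 0;   // most recent price flagged as meaningful
      var flag = '';
      for (var i = 0; i < prices.length; i++) {
        if (i == 0) {
          // prices arrive sorted ascending, so the first is the lowest: always flag it
          last = prices[i];
          flag = 'Y';
        } else {
          // compare against the last meaningful price, not the previous row
          if ((prices[i] - last) / last > 0.02) {
            last = prices[i];
            flag = 'Y';
          } else {flag = 'N';}
        }
        var rec = {};   // plain object, returned as STRUCT<price, flag>
        rec.price = prices[i];
        rec.flag = flag;
        result.push(rec);
      }
      return result;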
    """;
    SELECT item, rec.* 
    FROM (
      SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
      FROM `yourTable`
      GROUP BY item
    ), UNNEST(x(prices) ) AS rec
    -- ORDER BY item, price  
    

    You can test / play with it using the dummy data from your question, as below:

    #standardSQL
    CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
    RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
    LANGUAGE js AS """
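      var result = [];
      var last = 0;   // most recent price flagged as meaningful
      var flag = '';
      for (var i = 0; i < prices.length; i++) {
        if (i == 0) {
          // prices arrive sorted ascending, so the first is the lowest: always flag it
          last = prices[i];
          flag = 'Y';
        } else {
          // compare against the last meaningful price, not the previous row
          if ((prices[i] - last) / last > 0.02) {
            last = prices[i];
            flag = 'Y';
          } else {flag = 'N';}
        }
        var rec = {};   // plain object, returned as STRUCT<price, flag>
        rec.price = prices[i];
        rec.flag = flag;
        result.push(rec);
      }
      return result;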
    """;
    WITH `yourTable` AS (
      SELECT 'Apple' AS item, 1.99 AS price UNION ALL
      SELECT 'Apple', 2.00 UNION ALL
      SELECT 'Apple', 2.01 UNION ALL
      SELECT 'Apple', 2.50 UNION ALL
      SELECT 'Apple', 2.56 UNION ALL
      SELECT 'Apple', 2.62 
    )
    SELECT item, rec.* 
    FROM (
      SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
      FROM `yourTable`
      GROUP BY item
    ), UNNEST(x(prices) ) AS rec
    ORDER BY item, price    
    

    The result is as follows:

    item    price   flag     
    ----    -----   ----
    Apple   1.99    Y    
    Apple   2.0     N    
    Apple   2.01    N    
    Apple   2.5     Y    
    Apple   2.56    Y    
    Apple   2.62    Y    
    