BQ SQL solution solution for comparing rows based on variance

I'm trying to compare scraped retail item price data in BigQuery (~2-3B rows depending on the time period and retailers included); with the intent to identify meaningful price differences. For example $1.99 vs $2.00 isn't meaningful, but $1.99 vs $2.50 is meaningful. Meaningful is quantified as a 2% difference between prices.

Example dataset for one item looks like this:

ITEM       Price($)  Meaningful (This is the column I'm trying to flag) 
Apple      $1.99     Y (lowest price would always be flagged)
Apple      $2.00     N ($1.99 v $2.00)
Apple      $2.01     N ($1.99 v $2.01)  Still using $1.99 for comparison
Apple      $2.50     Y ($1.99 v $2.50)  Still using $1.99 for comparison
Apple      $2.56     Y ($2.50 v $2.56)  Now using $2.50 as new comp. price
Apple      $2.62     Y ($2.55 v $2.62)  Now using $2.56 as new comp. price

I was hoping to solve the problem just using SQL Window functions (lead, lag, partition over, etc..) comparing the current row's price to the next following row. However, that doesn't work correctly when I get to a non-meaningful price because I always want the next value to be compared to the most recent meaningful price (see $2.50 row example above that's compared to $2.00 and NOT $2.01 in the prior row)

My Questions:

  • Is it possible to solve this with SQL alone in BigQuery? (eg What creative SQL logic solution am I overlooking, like bucketing based on the variance amounts?)
  • What programmatic options do I have since I can't use stored procedures with BQ? Python/Dataframes in GCP Datalab? BQ UDFs?

  • Below is for BigQuery Standard SQL

    #standardSQL
    CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
    RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
    LANGUAGE js AS """
      var result = [];
      var last = 0;
      var flag = '';
      for (i = 0; i < prices.length; i++){
        if (i == 0) {
          last = prices[i];
          flag = 'Y'
        } else {
          if ((prices[i] - last)/last > 0.02) {
            last = prices[i];
            flag = 'Y'
          } else {flag = 'N'}
        }
        var rec = [];
        rec.price = prices[i];
        rec.flag = flag;
        result.push(rec); 
      } 
      return result;
    """;
    SELECT item, rec.* 
    FROM (
      SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
      FROM `yourTable`
      GROUP BY item
    ), UNNEST(x(prices) ) AS rec
    -- ORDER BY item, price  
    

    You can play with / test it with below dummy data from your question

    #standardSQL
    CREATE TEMPORARY FUNCTION x(prices ARRAY<FLOAT64>)
    RETURNS ARRAY<STRUCT<price FLOAT64, flag STRING>>
    LANGUAGE js AS """
      var result = [];
      var last = 0;
      var flag = '';
      for (i = 0; i < prices.length; i++){
        if (i == 0) {
          last = prices[i];
          flag = 'Y'
        } else {
          if ((prices[i] - last)/last > 0.02) {
            last = prices[i];
            flag = 'Y'
          } else {flag = 'N'}
        }
        var rec = [];
        rec.price = prices[i];
        rec.flag = flag;
        result.push(rec); 
      } 
      return result;
    """;
    WITH `yourTable` AS (
      SELECT 'Apple' AS item, 1.99 AS price UNION ALL
      SELECT 'Apple', 2.00 UNION ALL
      SELECT 'Apple', 2.01 UNION ALL
      SELECT 'Apple', 2.50 UNION ALL
      SELECT 'Apple', 2.56 UNION ALL
      SELECT 'Apple', 2.62 
    )
    SELECT item, rec.* 
    FROM (
      SELECT item, ARRAY_AGG(price ORDER BY price) AS prices
      FROM `yourTable`
      GROUP BY item
    ), UNNEST(x(prices) ) AS rec
    ORDER BY item, price    
    

    Result is as below

    item    price   flag     
    ----    -----   ----
    Apple   1.99    Y    
    Apple   2.0     N    
    Apple   2.01    N    
    Apple   2.5     Y    
    Apple   2.56    Y    
    Apple   2.62    Y    
    
    链接地址: http://www.djcxy.com/p/62074.html

    上一篇: 在Android中设置壁纸

    下一篇: BQ SQL解决方案,用于根据方差比较行