Partially denormalising unicorn observations

2018-06-03 06:30:27

There are a number of researchers observing the world's last remaining unicorns, on Easter Island1. Each day the researchers record which unicorn they sighted, the date of the sighting, the number of babies each unicorn has and whether they were drunk when the sighting took place. These are individually uploaded to a central location, which then spits out a flat file to me of all new observations each day.

I have a table that looks like this to contain the information:

create table unicorn_observations (
     observer_id number not null
   , unicorn_id number not null
   , created date not null -- date the record was inserted into the database
   , lastseen date not null -- date the record was last seen
   , observation_date date not null
   , no_of_babies number not null
   , drunk varchar2(1) not null
   , constraint pk_uo primary key ( observer_id, unicorn_id, created )
   , constraint chk_uo_babies check ( no_of_babies >= 0 )
   , constraint chk_uo_drunk check ( drunk in ('y','n') )
     );

The table is separately unique on observer_id , unicorn_id and observation_date or lastseen .

Sometimes the Cobold [sic] managing the output of data gets it slightly wrong and re-outputs the same data twice. In this situation I update the lastseen instead of creating a new record. I only do this in situations where every column is the same

Unfortunately, the researchers aren't fully aware of the third normal form. Each month they upload the previous months observations for a few unicorns, even if no new observations have been made. They do this with a new observation_date , which means a new record gets inserted into the table.

I have a separate created and lastseen for full traceability as the researchers sometimes submit some observations late. These are created by the database and are not part of the submitted information.

Here is some sample data (with partially changed column names in order to make it fit without a scroll bar).

+--------+--------+-----------+-----------+-----------+---------+-------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK |
+--------+--------+-----------+-----------+-----------+---------+-------+
|      1 |      1 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      2 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      3 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 |      10 | n     |
|      1 |      6 | 10-NOV-11 | 10-NOV-11 | 07-NOV-11 |       0 | n     |
|      1 |      1 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      2 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      3 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     |
|      1 |      6 | 17-NOV-11 | 17-NOV-11 | 17-NOV-11 |       0 | n     |
|      1 |      6 | 01-DEC-11 | 01-DEC-11 | 01-DEC-11 |       0 | n     |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |
|      1 |      6 | 01-FEB-12 | 01-FEB-12 | 01-FEB-12 |       0 | n     |
|      1 |      6 | 01-MAR-12 | 01-MAR-12 | 01-MAR-12 |       0 | n     |
|      1 |      6 | 01-APR-12 | 01-APR-12 | 01-APR-12 |       0 | n     |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |
|      1 |      6 | 01-MAY-12 | 01-MAY-12 | 01-MAY-12 |       0 | n     |
+--------+--------+-----------+-----------+-----------+---------+-------+

I would like to partially denormalise these observations so that if a new record is received with the same observer_id , unicorn_id , no_of_babies and drunk (the payload) but with a newer observation_date I update a new column in the table, last_observation_date , instead of inserting a new record. I would still update the lastseen in this situation.

I need to do this as I have a number of complicated unicorn related queries that join to this table; the researchers upload old observations with new dates about 10m times a month and I receive approximately 9m genuinely new records a month. I've been running for a year and already have 225m unicorn observations. As I only need to know the last observation date for each payload combination I would rather massively reduce the size of the table and save myself a lot of time full-scanning it.

This means that the table would become:

create table unicorn_observations (
     observer_id number not null
   , unicorn_id number not null
   , created date not null -- date the record was inserted into the database
   , lastseen date not null -- date the record was last seen
   , observation_date date not null
   , no_of_babies number not null
   , drunk varchar2(1) not null
   , last_observation_date date
   , constraint pk_uo primary key ( observer_id, unicorn_id, created )
   , constraint chk_uo_babies check ( no_of_babies >= 0 )
   , constraint chk_uo_drunk check ( drunk in ('y','n') )
     );

and the data stored in the table would look like the below; it doesn't matter whether last_observation_date is null or not if the observation has only been "seen" once. I do not need help in loading the data, only in partially denormalising the current table to look like this.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 |       0 | n     | 01-DEC-11   |
|      1 |      1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |             |
|      1 |      6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 |       0 | n     | 01-MAY-12   |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     |             |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

The obvious answer

select observer_id as obs_id
     , unicorn_id as uni_id
     , min(created) as created
     , max(lastseen) as lastseen
     , min(observation_date) as obs_date
     , no_of_babies as "#BABIES"
     , drunk
     , max(observation_date) as last_obs_date
  from unicorn_observations
 group by observer_id
        , unicorn_id
        , no_of_babies
        , drunk

doesn't work as it ignores the single observation of 3 unicorn babies for unicorn 6 on the 1st January 2012; this in turn means that the lastseen for the record created on the 10th November is incorrect.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 |      10 | n     | 31-OCT-11   |
|      1 |      6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 |       0 | n     | 01-MAY-12   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     | 01-JAN-12   |
|      1 |      1 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
|      1 |      2 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
|      1 |      3 | 19-APR-12 | 19-APR-12 | 19-APR-12 |       7 | y     | 19-APR-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

I do not currently see a way of doing this without some procedural logic, ie a loop. I would much rather avoid a loop in this situation as I would have to full-scan a 225m row table 260 times (number of distinct created dates). Even using lag() and lead() would need to be recursive as there is an indeterminate amount of observations per unicorn.

Is there a way of creating this data-set in a single SQL statement?

The table specification and sample data is also in a SQL Fiddle.

Attempted better explanation:

The problem is maintaining when something was true. On 01-Jan-2012 unicorn 6 had 3 babies.

Looking at just unicorn 6 in the "table" created by the GROUP BY; if I try to find the number of babies on the 1st of January I will get two records returned, which is a contradiction.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 |       0 | n     | 01-MAY-12   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     | 01-JAN-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

However, I would want only one row, as in the second table. Here, for any point in time there is at most one "correct" value because the two periods of time where unicorn 6 had 0 babies have been separated into two rows by the day when it had 3.

+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
| OBS_ID | UNI_ID |  CREATED  | LASTSEEN  | OBS_DATE  | #BABIES | DRUNK | LAST_OBS_DT |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+
|      1 |      6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 |       0 | n     | 01-DEC-11   |
|      1 |      6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 |       3 | n     |             |
|      1 |      6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 |       0 | n     | 01-MAY-12   |
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+

1. grazing around the moai

Based on what I think you're trying to do, largely on your update regarding the specific issues with unicorn 6, I think this gets the result you want. It doesn't need recursive lead and lag , but does need two levels.

select *
from (
    select observer_id, unicorn_id,
        case when first_obs_dt is null then created
            else lag(created) over (order by rn) end as created,
        case when last_obs_dt is null then lastseen
            else lead(lastseen) over (order by rn) end as lastseen,
        case when first_obs_dt is null then observation_date
            else lag(observation_date) over (order by rn)
            end as observation_date,
        no_of_babies,
        drunk,
        case when last_obs_dt is null then observation_date
            else null end as last_obs_dt
    from (
        select observer_id, unicorn_id, created, lastseen, 
            observation_date, no_of_babies, drunk,
            case when lag_no_babies != no_of_babies or lag_drunk != drunk
                or lag_obs_dt is null then null
                else lag_obs_dt end as first_obs_dt,
            case when lead_no_babies != no_of_babies or lead_drunk != drunk
                or lead_obs_dt is null then null
                else lead_obs_dt end as last_obs_dt,
            rownum rn
        from (
            select observer_id, unicorn_id, created, lastseen,
                observation_date, no_of_babies, drunk,
                lag(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                            drunk
                        order by observation_date) lag_obs_dt,
                lag(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lag_no_babies,
                lag(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lag_drunk,
                lead(observation_date)
                    over (partition by observer_id, unicorn_id, no_of_babies,
                        drunk
                    order by observation_date) lead_obs_dt,
                lead(no_of_babies)
                    over (partition by observer_id, unicorn_id, drunk
                        order by observation_date) lead_no_babies,
                lead(drunk)
                    over (partition by observer_id, unicorn_id, no_of_babies
                        order by observation_date) lead_drunk
            from unicorn_observations
            order by 1,2,5
        )
    )
    where first_obs_dt is null or last_obs_dt is null
)
where last_obs_dt is not null
order by 1,2,3,4;

Which gives:

OBSERVER_ID UNICORN_ID CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
----------- ---------- --------- --------- --------- ------------ - ---------
          1          1 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          1 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          2 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          2 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          3 17-NOV-11 01-NOV-11 09-APR-11           10 n 31-OCT-11
          1          3 19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12
          1          6 10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
          1          6 01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
          1          6 01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

9 rows selected.

It's got the three records for unicorn 6, but the lastseen and observation_date for the third are the opposite way round to your sample, so I'm not sure if I'm still not understanding that. I've assumed that you want to keep the earliest observation_date and latest lastseen within each grouping, on the grounds that it seems to be what would happen when adding new records, but I'm not sure...

So, the innermost query get the raw data from the table and gets a lead and lag for the observation_date and the no_of_babies and drunk columns using slightly different partitions. The order by is so a rownum can be used later, obtained in the next step and used for ordering in the one after that. Just for unicorn 6 for brevity:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
10-NOV-11 10-NOV-11 07-NOV-11            0 n                           17-NOV-11              0 n
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11             0 n 01-DEC-11              0 n
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11             0 n 01-FEB-12              3 n
01-JAN-12 01-JAN-12 01-JAN-12            3 n                       0                          0
01-FEB-12 01-FEB-12 01-FEB-12            0 n 01-DEC-11             3 n 01-MAR-12              0 n
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12             0 n 01-APR-12              0 n
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12             0 n 01-MAY-12              0 n
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12             0 n

The next level blanks out the lead and lag values for observation_date if either the num_of_babies or drunk value has changed - you only specifically referred to splitting on the baby count, but I assume you want to split on sobriety too. After this, anything that has null for either first_obs_date or last_obs_date is the start or end of a mini-range.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
10-NOV-11 10-NOV-11 07-NOV-11            0 n           17-NOV-11          1
17-NOV-11 17-NOV-11 17-NOV-11            0 n 07-NOV-11 01-DEC-11          2
01-DEC-11 01-DEC-11 01-DEC-11            0 n 17-NOV-11                    3
01-JAN-12 01-JAN-12 01-JAN-12            3 n                              4
01-FEB-12 01-FEB-12 01-FEB-12            0 n           01-MAR-12          5
01-MAR-12 01-MAR-12 01-MAR-12            0 n 01-FEB-12 01-APR-12          6
01-APR-12 01-APR-12 01-APR-12            0 n 01-MAR-12 01-MAY-12          7
01-MAY-12 01-MAY-12 01-MAY-12            0 n 01-APR-12                    8

Anything that isn't the start or end of a mini-range can now be ignored, as the values are either same as or are superseded by those before or after. This deals with the indeterminate number of observations problem - it doesn't matter how many you ignore at this point. So the next level eliminates those intermediate values by filtering rows where both first_obs_dt and last_obs_dt are non-null. Within that filtered set there's a second layer of lead and lag to get the first or last value for each date - and that's the bit I'm not sure is right as it doesn't match one of your samples.

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
10-NOV-11 01-DEC-11 07-NOV-11            0 n
10-NOV-11 01-DEC-11 07-NOV-11            0 n 01-DEC-11
01-JAN-12 01-JAN-12 01-JAN-12            3 n 01-JAN-12
01-FEB-12 01-MAY-12 01-FEB-12            0 n
01-FEB-12 01-MAY-12 01-FEB-12            0 n 01-MAY-12

Finally the remaining rows that don't have a last_obs_dt are filtered out.

Now I'll wait to see which bit(s) I've misunderstood... *8-)

Following correction to lead and lag ordering, the same info for each stage for unicorn 1:

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
17-NOV-11 17-NOV-11 09-APR-11           10 n                           31-OCT-11             10 n
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11            10 n
19-APR-12 19-APR-12 19-APR-12            7 y

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_         RN
--------- --------- --------- ------------ - --------- --------- ----------
17-NOV-11 17-NOV-11 09-APR-11           10 n           31-OCT-11          1
01-NOV-11 01-NOV-11 31-OCT-11           10 n 09-APR-11                    2
19-APR-12 19-APR-12 19-APR-12            7 y                              3

CREATED   LASTSEEN  OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
17-NOV-11 17-NOV-11 09-APR-11           10 n 09-APR-11
19-APR-12 19-APR-12 19-APR-12            7 y 19-APR-12

I'm not sure what shoudl happen with the preserved observation_date and lastseen when the original data was entered out of sequence like this, or what you'll do in that situation with new records added in the future.

尝试这个。

with cte as
(
    select v.*,  ROW_NUMBER() over (partition by grp, unicorn_id order by grp, unicorn_id) rn
    from
    (
        select u.*, 
            ROW_NUMBER() over (partition by unicorn_id order by no_of_babies, drunk, created )
            -ROW_NUMBER() over (partition by unicorn_id order by created) as grp
        from unicorn_observations u
    ) v
) 
    select 
        observer_id, cte.unicorn_id, mincreated,maxlastseen,minobsdate,no_of_babies,drunk,maxobsdate
    from cte 
        inner join 
        (    
            select 
                unicorn_id, grp, 
                min(created) as mincreated,
                max(lastseen) as maxlastseen, 
                min(observation_date) as minobsdate,
                max(observation_date) as maxobsdate
            from cte 
            group by unicorn_id, grp
        ) v
        on cte.grp = v.grp
        and cte.unicorn_id = v.unicorn_id
    where rn=1  
    order by created;

This type of problem can be solved by first creating some flags in a subquery, then using them.

with obs_flags as (
    select 
       observer_id as obs_id
     , unicorn_id as uni_id
     , case when lag(observation_date) over (
           partition by unicorn_id, no_of_babies, drunk
           order by unicorn_id, observation_date
       ) is null then 1 else 0 end as group_start
     , case when lead(observation_date) over (
           partition by unicorn_id, no_of_babies,drunk
           order by unicorn_id, observation_date
       ) is null then 1 else 0 end as group_end
     , observation_date
     , no_of_babies
     , drunk
     , lastseen
     , created
  from unicorn_observations
)
select obs_start.obs_id
     , obs_start.uni_id
     , obs_start.created
     , obs_end.lastseen as lastseen
     , obs_start.observation_date
     , obs_start.no_of_babies as "#BABIES"
     , obs_start.drunk
     , obs_end.observation_date as last_obs_date
  from obs_flags obs_start
  join obs_flags obs_end on 
      obs_start.group_start = 1 and
      obs_end.group_end = 1 and
      obs_start.uni_id = obs_end.uni_id and
      obs_start.no_of_babies = obs_end.no_of_babies and
      obs_start.drunk = obs_end.drunk and
      obs_start.observation_date <= obs_end.observation_date and
      --Only join with the first end point we find:
      not exists (
          select * from obs_flags f where
              obs_start.uni_id = f.uni_id and
              obs_start.no_of_babies = f.no_of_babies and
              obs_start.drunk = f.drunk and
              f.group_end = 1 and
              f.observation_date < obs_end.observation_date and
              f.observation_date >= obs_start.observation_date
      );

This is a complex problem; I may have not quite met your requirements (or there could be a typo in there. I don't have Oracle to test it). However, it should give you an idea of how it can be done.

Basically, you first find all start and end records of the periods you are interested in. Then you join each start record to the next end record within the same grouping.

Update: my original code didn't check that the end came after the start. I fixed that.

Update2: as Ben pointed out, the not exists clause will be slow here. An alternative that has helped me speed things up in the past is to do this in two steps: first find all potential pairings, then separately select only the correct pairings out of that.

In this case, in a temporary table or subquery join each obs_start to every potentially correct obs_end .

Then, out of these pairings, select the one that has the earliest obs_end for each obs_start .

链接地址: http://www.djcxy.com/p/11240.html

上一篇: 我如何有效地使用Alex的启动代码功能？

下一篇: 部分去归一化独角兽观测