Partially denormalising unicorn observations
There are a number of researchers observing the world's last remaining unicorns, on Easter Island1. Each day the researchers record which unicorn they sighted, the date of the sighting, the number of babies each unicorn has and whether they were drunk when the sighting took place. These are individually uploaded to a central location, which then spits out a flat file to me of all new observations each day.
I have a table that looks like this to contain the information:
create table unicorn_observations (
observer_id number not null
, unicorn_id number not null
, created date not null -- date the record was inserted into the database
, lastseen date not null -- date the record was last seen
, observation_date date not null
, no_of_babies number not null
, drunk varchar2(1) not null
, constraint pk_uo primary key ( observer_id, unicorn_id, created )
, constraint chk_uo_babies check ( no_of_babies >= 0 )
, constraint chk_uo_drunk check ( drunk in ('y','n') )
);
The table is separately unique on observer_id
, unicorn_id
and observation_date
or lastseen
.
Sometimes the Cobold [sic] managing the output of data gets it slightly wrong and re-outputs the same data twice. In this situation I update the lastseen
instead of creating a new record. I only do this in situations where every column is the same
Unfortunately, the researchers aren't fully aware of the third normal form. Each month they upload the previous months observations for a few unicorns, even if no new observations have been made. They do this with a new observation_date
, which means a new record gets inserted into the table.
I have a separate created
and lastseen
for full traceability as the researchers sometimes submit some observations late. These are created by the database and are not part of the submitted information.
Here is some sample data (with partially changed column names in order to make it fit without a scroll bar).
+--------+--------+-----------+-----------+-----------+---------+-------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | +--------+--------+-----------+-----------+-----------+---------+-------+ | 1 | 1 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 2 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 3 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n | | 1 | 6 | 10-NOV-11 | 10-NOV-11 | 07-NOV-11 | 0 | n | | 1 | 1 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 2 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 3 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | | 1 | 6 | 17-NOV-11 | 17-NOV-11 | 17-NOV-11 | 0 | n | | 1 | 6 | 01-DEC-11 | 01-DEC-11 | 01-DEC-11 | 0 | n | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | 1 | 6 | 01-FEB-12 | 01-FEB-12 | 01-FEB-12 | 0 | n | | 1 | 6 | 01-MAR-12 | 01-MAR-12 | 01-MAR-12 | 0 | n | | 1 | 6 | 01-APR-12 | 01-APR-12 | 01-APR-12 | 0 | n | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | 1 | 6 | 01-MAY-12 | 01-MAY-12 | 01-MAY-12 | 0 | n | +--------+--------+-----------+-----------+-----------+---------+-------+
I would like to partially denormalise these observations so that if a new record is received with the same observer_id
, unicorn_id
, no_of_babies
and drunk
(the payload) but with a newer observation_date
I update a new column in the table, last_observation_date
, instead of inserting a new record. I would still update the lastseen
in this situation.
I need to do this as I have a number of complicated unicorn related queries that join to this table; the researchers upload old observations with new dates about 10m times a month and I receive approximately 9m genuinely new records a month. I've been running for a year and already have 225m unicorn observations. As I only need to know the last observation date for each payload combination I would rather massively reduce the size of the table and save myself a lot of time full-scanning it.
This means that the table would become:
create table unicorn_observations (
observer_id number not null
, unicorn_id number not null
, created date not null -- date the record was inserted into the database
, lastseen date not null -- date the record was last seen
, observation_date date not null
, no_of_babies number not null
, drunk varchar2(1) not null
, last_observation_date date
, constraint pk_uo primary key ( observer_id, unicorn_id, created )
, constraint chk_uo_babies check ( no_of_babies >= 0 )
, constraint chk_uo_drunk check ( drunk in ('y','n') )
);
and the data stored in the table would look like the below; it doesn't matter whether last_observation_date
is null or not if the observation has only been "seen" once. I do not need help in loading the data, only in partially denormalising the current table to look like this.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 | | 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | | 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
The obvious answer
select observer_id as obs_id
, unicorn_id as uni_id
, min(created) as created
, max(lastseen) as lastseen
, min(observation_date) as obs_date
, no_of_babies as "#BABIES"
, drunk
, max(observation_date) as last_obs_date
from unicorn_observations
group by observer_id
, unicorn_id
, no_of_babies
, drunk
doesn't work as it ignores the single observation of 3 unicorn babies for unicorn 6 on the 1st January 2012; this in turn means that the lastseen
for the record created on the 10th November is incorrect.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 | | 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 | | 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | | 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | | 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
I do not currently see a way of doing this without some procedural logic, ie a loop. I would much rather avoid a loop in this situation as I would have to full-scan a 225m row table 260 times (number of distinct created
dates). Even using lag()
and lead()
would need to be recursive as there is an indeterminate amount of observations per unicorn.
Is there a way of creating this data-set in a single SQL statement?
The table specification and sample data is also in a SQL Fiddle.
Attempted better explanation:
The problem is maintaining when something was true. On 01-Jan-2012 unicorn 6 had 3 babies.
Looking at just unicorn 6 in the "table" created by the GROUP BY; if I try to find the number of babies on the 1st of January I will get two records returned, which is a contradiction.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
However, I would want only one row, as in the second table. Here, for any point in time there is at most one "correct" value because the two periods of time where unicorn 6 had 0 babies have been separated into two rows by the day when it had 3.
+--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+ | 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 | | 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | | | 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 | +--------+--------+-----------+-----------+-----------+---------+-------+-------------+
1. grazing around the moai
Based on what I think you're trying to do, largely on your update regarding the specific issues with unicorn 6, I think this gets the result you want. It doesn't need recursive lead
and lag
, but does need two levels.
select *
from (
select observer_id, unicorn_id,
case when first_obs_dt is null then created
else lag(created) over (order by rn) end as created,
case when last_obs_dt is null then lastseen
else lead(lastseen) over (order by rn) end as lastseen,
case when first_obs_dt is null then observation_date
else lag(observation_date) over (order by rn)
end as observation_date,
no_of_babies,
drunk,
case when last_obs_dt is null then observation_date
else null end as last_obs_dt
from (
select observer_id, unicorn_id, created, lastseen,
observation_date, no_of_babies, drunk,
case when lag_no_babies != no_of_babies or lag_drunk != drunk
or lag_obs_dt is null then null
else lag_obs_dt end as first_obs_dt,
case when lead_no_babies != no_of_babies or lead_drunk != drunk
or lead_obs_dt is null then null
else lead_obs_dt end as last_obs_dt,
rownum rn
from (
select observer_id, unicorn_id, created, lastseen,
observation_date, no_of_babies, drunk,
lag(observation_date)
over (partition by observer_id, unicorn_id, no_of_babies,
drunk
order by observation_date) lag_obs_dt,
lag(no_of_babies)
over (partition by observer_id, unicorn_id, drunk
order by observation_date) lag_no_babies,
lag(drunk)
over (partition by observer_id, unicorn_id, no_of_babies
order by observation_date) lag_drunk,
lead(observation_date)
over (partition by observer_id, unicorn_id, no_of_babies,
drunk
order by observation_date) lead_obs_dt,
lead(no_of_babies)
over (partition by observer_id, unicorn_id, drunk
order by observation_date) lead_no_babies,
lead(drunk)
over (partition by observer_id, unicorn_id, no_of_babies
order by observation_date) lead_drunk
from unicorn_observations
order by 1,2,5
)
)
where first_obs_dt is null or last_obs_dt is null
)
where last_obs_dt is not null
order by 1,2,3,4;
Which gives:
OBSERVER_ID UNICORN_ID CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_
----------- ---------- --------- --------- --------- ------------ - ---------
1 1 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11
1 1 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12
1 2 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11
1 2 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12
1 3 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11
1 3 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12
1 6 10-NOV-11 01-DEC-11 07-NOV-11 0 n 01-DEC-11
1 6 01-JAN-12 01-JAN-12 01-JAN-12 3 n 01-JAN-12
1 6 01-FEB-12 01-MAY-12 01-FEB-12 0 n 01-MAY-12
9 rows selected.
It's got the three records for unicorn 6, but the lastseen
and observation_date
for the third are the opposite way round to your sample, so I'm not sure if I'm still not understanding that. I've assumed that you want to keep the earliest observation_date
and latest lastseen
within each grouping, on the grounds that it seems to be what would happen when adding new records, but I'm not sure...
So, the innermost query get the raw data from the table and gets a lead
and lag
for the observation_date
and the no_of_babies
and drunk
columns using slightly different partitions. The order by
is so a rownum
can be used later, obtained in the next step and used for ordering in the one after that. Just for unicorn 6 for brevity:
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
10-NOV-11 10-NOV-11 07-NOV-11 0 n 17-NOV-11 0 n
17-NOV-11 17-NOV-11 17-NOV-11 0 n 07-NOV-11 0 n 01-DEC-11 0 n
01-DEC-11 01-DEC-11 01-DEC-11 0 n 17-NOV-11 0 n 01-FEB-12 3 n
01-JAN-12 01-JAN-12 01-JAN-12 3 n 0 0
01-FEB-12 01-FEB-12 01-FEB-12 0 n 01-DEC-11 3 n 01-MAR-12 0 n
01-MAR-12 01-MAR-12 01-MAR-12 0 n 01-FEB-12 0 n 01-APR-12 0 n
01-APR-12 01-APR-12 01-APR-12 0 n 01-MAR-12 0 n 01-MAY-12 0 n
01-MAY-12 01-MAY-12 01-MAY-12 0 n 01-APR-12 0 n
The next level blanks out the lead
and lag
values for observation_date
if either the num_of_babies
or drunk
value has changed - you only specifically referred to splitting on the baby count, but I assume you want to split on sobriety too. After this, anything that has null
for either first_obs_date
or last_obs_date
is the start or end of a mini-range.
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_ RN
--------- --------- --------- ------------ - --------- --------- ----------
10-NOV-11 10-NOV-11 07-NOV-11 0 n 17-NOV-11 1
17-NOV-11 17-NOV-11 17-NOV-11 0 n 07-NOV-11 01-DEC-11 2
01-DEC-11 01-DEC-11 01-DEC-11 0 n 17-NOV-11 3
01-JAN-12 01-JAN-12 01-JAN-12 3 n 4
01-FEB-12 01-FEB-12 01-FEB-12 0 n 01-MAR-12 5
01-MAR-12 01-MAR-12 01-MAR-12 0 n 01-FEB-12 01-APR-12 6
01-APR-12 01-APR-12 01-APR-12 0 n 01-MAR-12 01-MAY-12 7
01-MAY-12 01-MAY-12 01-MAY-12 0 n 01-APR-12 8
Anything that isn't the start or end of a mini-range can now be ignored, as the values are either same as or are superseded by those before or after. This deals with the indeterminate number of observations problem - it doesn't matter how many you ignore at this point. So the next level eliminates those intermediate values by filtering rows where both first_obs_dt
and last_obs_dt
are non-null. Within that filtered set there's a second layer of lead
and lag
to get the first or last value for each date - and that's the bit I'm not sure is right as it doesn't match one of your samples.
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
10-NOV-11 01-DEC-11 07-NOV-11 0 n
10-NOV-11 01-DEC-11 07-NOV-11 0 n 01-DEC-11
01-JAN-12 01-JAN-12 01-JAN-12 3 n 01-JAN-12
01-FEB-12 01-MAY-12 01-FEB-12 0 n
01-FEB-12 01-MAY-12 01-FEB-12 0 n 01-MAY-12
Finally the remaining rows that don't have a last_obs_dt
are filtered out.
Now I'll wait to see which bit(s) I've misunderstood... *8-)
Following correction to lead
and lag
ordering, the same info for each stage for unicorn 1:
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L
--------- --------- --------- ------------ - --------- ------------- - --------- -------------- -
17-NOV-11 17-NOV-11 09-APR-11 10 n 31-OCT-11 10 n
01-NOV-11 01-NOV-11 31-OCT-11 10 n 09-APR-11 10 n
19-APR-12 19-APR-12 19-APR-12 7 y
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_ RN
--------- --------- --------- ------------ - --------- --------- ----------
17-NOV-11 17-NOV-11 09-APR-11 10 n 31-OCT-11 1
01-NOV-11 01-NOV-11 31-OCT-11 10 n 09-APR-11 2
19-APR-12 19-APR-12 19-APR-12 7 y 3
CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_
--------- --------- --------- ------------ - ---------
17-NOV-11 17-NOV-11 09-APR-11 10 n 09-APR-11
19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12
I'm not sure what shoudl happen with the preserved observation_date
and lastseen
when the original data was entered out of sequence like this, or what you'll do in that situation with new records added in the future.
尝试这个。
with cte as
(
select v.*, ROW_NUMBER() over (partition by grp, unicorn_id order by grp, unicorn_id) rn
from
(
select u.*,
ROW_NUMBER() over (partition by unicorn_id order by no_of_babies, drunk, created )
-ROW_NUMBER() over (partition by unicorn_id order by created) as grp
from unicorn_observations u
) v
)
select
observer_id, cte.unicorn_id, mincreated,maxlastseen,minobsdate,no_of_babies,drunk,maxobsdate
from cte
inner join
(
select
unicorn_id, grp,
min(created) as mincreated,
max(lastseen) as maxlastseen,
min(observation_date) as minobsdate,
max(observation_date) as maxobsdate
from cte
group by unicorn_id, grp
) v
on cte.grp = v.grp
and cte.unicorn_id = v.unicorn_id
where rn=1
order by created;
This type of problem can be solved by first creating some flags in a subquery, then using them.
with obs_flags as (
select
observer_id as obs_id
, unicorn_id as uni_id
, case when lag(observation_date) over (
partition by unicorn_id, no_of_babies, drunk
order by unicorn_id, observation_date
) is null then 1 else 0 end as group_start
, case when lead(observation_date) over (
partition by unicorn_id, no_of_babies,drunk
order by unicorn_id, observation_date
) is null then 1 else 0 end as group_end
, observation_date
, no_of_babies
, drunk
, lastseen
, created
from unicorn_observations
)
select obs_start.obs_id
, obs_start.uni_id
, obs_start.created
, obs_end.lastseen as lastseen
, obs_start.observation_date
, obs_start.no_of_babies as "#BABIES"
, obs_start.drunk
, obs_end.observation_date as last_obs_date
from obs_flags obs_start
join obs_flags obs_end on
obs_start.group_start = 1 and
obs_end.group_end = 1 and
obs_start.uni_id = obs_end.uni_id and
obs_start.no_of_babies = obs_end.no_of_babies and
obs_start.drunk = obs_end.drunk and
obs_start.observation_date <= obs_end.observation_date and
--Only join with the first end point we find:
not exists (
select * from obs_flags f where
obs_start.uni_id = f.uni_id and
obs_start.no_of_babies = f.no_of_babies and
obs_start.drunk = f.drunk and
f.group_end = 1 and
f.observation_date < obs_end.observation_date and
f.observation_date >= obs_start.observation_date
);
This is a complex problem; I may have not quite met your requirements (or there could be a typo in there. I don't have Oracle to test it). However, it should give you an idea of how it can be done.
Basically, you first find all start and end records of the periods you are interested in. Then you join each start record to the next end record within the same grouping.
Update: my original code didn't check that the end came after the start. I fixed that.
Update2: as Ben pointed out, the not exists
clause will be slow here. An alternative that has helped me speed things up in the past is to do this in two steps: first find all potential pairings, then separately select only the correct pairings out of that.
In this case, in a temporary table or subquery join each obs_start
to every potentially correct obs_end
.
Then, out of these pairings, select the one that has the earliest obs_end
for each obs_start
.
上一篇: 我如何有效地使用Alex的启动代码功能?
下一篇: 部分去归一化独角兽观测