Thursday, June 18, 2009

You don't really need that index (or that SQL)

I was given this simple looking piece of code to review.


select audit_rec_updt_dts, mod(rownm,4) as rownm, pol_seq_nbr
from
(select a.AUDIT_REC_UPDT_DTS, a.POL_SEQ_NBR, rownum as rownm
from TR_POLICY_REGISTER a, TB_TRANSACTION b
where a.POL_STAT_CD = 'REPORTED' and a.JACKET_SOURCE_CD = 'LAT'and
a.AUDIT_REC_UPDT_DTS > b.EXTRACT_START_DTS and
a.AUDIT_REC_UPDT_DTS < b.EXTRACT_END_DTS and
b.TRANS_SYS_ID in (select max(trans_sys_id) from tb_transaction)) dt
where mod(rownm,4) = 0;


I was also given 3 other almost identical statements except that the where clause changed the 0 to 1, 2, and 3. Danger Will Robinson! (Yes, that reference dates me. How about you? Remember "Lost in Space"?) It looks like somebody is thinking the result set is so big they need to break it up into chunks. More on this later...

Anyway, in order to get the query to complete in a reasonable time (my test without the index didn't come back within 30 minutes so I killed it), a "temporary" index was being created just before this statement was executed. By "temporary", I mean that they created the index just prior to running the SQL and then dropped it immediately after the SQL completed. Creating the index (parallel nologging) took several minutes to complete.

Here's execution plan for the query with the index in place:


-----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
-----------------------------------------------------------------------------------------------------------------------
|* 1 | VIEW | | 1 | 995 | 2627 |00:00:00.11 | 9815 |
| 2 | COUNT | | 1 | | 10508 |00:00:00.08 | 9815 |
| 3 | NESTED LOOPS | | 1 | 995 | 10508 |00:00:00.07 | 9815 |
| 4 | TABLE ACCESS BY INDEX ROWID | TB_TRANSACTION | 1 | 1 | 1 |00:00:00.01 | 3 |
|* 5 | INDEX UNIQUE SCAN | TB_TRANSACTION_PK | 1 | 1 | 1 |00:00:00.01 | 2 |
| 6 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 1 |
| 7 | INDEX FULL SCAN (MIN/MAX)| TB_TRANSACTION_PK | 1 | 289 | 1 |00:00:00.01 | 1 |
| 8 | TABLE ACCESS BY INDEX ROWID | TR_POLICY_REGISTER | 1 | 995 | 10508 |00:00:00.07 | 9812 |
|* 9 | INDEX RANGE SCAN | TR_POLICY_REGISTER_DMIDX1 | 1 | 2 | 10508 |00:00:00.01 | 73 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(MOD("ROWNM",4)=0)
5 - access("B"."TRANS_SYS_ID"=)
9 - access("A"."JACKET_SOURCE_CD"='LAT' AND "A"."POL_STAT_CD"='REPORTED' AND
"A"."SYS_NC00029$">SYS_OP_DESCEND("B"."EXTRACT_END_DTS") AND
"A"."SYS_NC00029$" filter((SYS_OP_UNDESCEND("A"."SYS_NC00029$")>"B"."EXTRACT_START_DTS" AND
SYS_OP_UNDESCEND("A"."SYS_NC00029$")<"B"."EXTRACT_END_DTS"))


Looks good, right? The query executed in less than 1 second (.11 secs) and returned 2627 rows. Remember the predicate? There were 4 queries. Each one retrieved 1/4 of the data (2627 x 4 = 10508). That's a pretty small result set. Why run the query 4 times when once will do? And, it's not doing what the developer thought. I think the developer thought it would only access 1/4 of the rows each time and that somehow that would be faster than accessing all of them at once. Nope. Even though running this query 4 times in a row is still a fairly insignificant amount of resources used, it's not necessary. Each execution is getting all the rows and then throwing out 75% of them. Why not just run it once?

The next question is why is an index created just before and dropped just after this? Does having it be permanent cause a problem somewhere? Plus, there's already a persistent index that would/should work although it does not include all the columns used in the temp index. But, obviously something went wrong with the query that it didn't use that index properly or they wouldn't have created the temp index and attempted to break up the query into 4 parts, right?

OK. So, how about just rewriting the query to see if the current index can be used efficiently enough to meet their needs? I tried this:


select a.AUDIT_REC_UPDT_DTS, a.POL_SEQ_NBR
from TR_POLICY_REGISTER a
where a.POL_STAT_CD = 'REPORTED'
and a.JACKET_SOURCE_CD = 'LAT'
and a.AUDIT_REC_UPDT_DTS > (select EXTRACT_START_DTS from tb_transaction where trans_sys_id = (select max(trans_sys_id) from tb_transaction))
and a.AUDIT_REC_UPDT_DTS < (select EXTRACT_END_DTS from tb_transaction where trans_sys_id = (select max(trans_sys_id) from tb_transaction))



And here's the execution plan:


------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------------------------------
|* 1 | TABLE ACCESS BY INDEX ROWID | TR_POLICY_REGISTER | 1 | 995 | 10508 |00:00:00.07 | 14259 |
|* 2 | INDEX RANGE SCAN | TR_POLICY_REGISTER_N6 | 1 | 269K| 23576 |00:00:00.01 | 138 |
| 3 | TABLE ACCESS BY INDEX ROWID | TB_TRANSACTION | 1 | 1 | 1 |00:00:00.01 | 3 |
|* 4 | INDEX UNIQUE SCAN | TB_TRANSACTION_PK | 1 | 1 | 1 |00:00:00.01 | 2 |
| 5 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 1 |
| 6 | INDEX FULL SCAN (MIN/MAX)| TB_TRANSACTION_PK | 1 | 289 | 1 |00:00:00.01 | 1 |
| 7 | TABLE ACCESS BY INDEX ROWID | TB_TRANSACTION | 1 | 1 | 1 |00:00:00.01 | 3 |
|* 8 | INDEX UNIQUE SCAN | TB_TRANSACTION_PK | 1 | 1 | 1 |00:00:00.01 | 2 |
| 9 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 1 |
| 10 | INDEX FULL SCAN (MIN/MAX)| TB_TRANSACTION_PK | 1 | 289 | 1 |00:00:00.01 | 1 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(("A"."JACKET_SOURCE_CD"='LAT' AND "A"."POL_STAT_CD"='REPORTED'))
2 - access("A"."AUDIT_REC_UPDT_DTS"> AND "A"."AUDIT_REC_UPDT_DTS"<)
4 - access("TRANS_SYS_ID"=)
8 - access("TRANS_SYS_ID"=)


I can live with that.

It took me a lot longer to type this than to rewrite the SQL and test it! Now, the query doesn't need to be "chunked up" nor does the index need to be temporarily created. Maybe there's another way to do it but this was my first thought and it worked as desired, so all is well!

Good deal...

6 comments:

Rob van Wijk said...

Hi Karen,

Nice post about the unfortunately not-so-rare urge to split a long running query into several parts.

Another very small improvement can be made by rewriting the two "subquery predicates" to this form:

and a.audit_rec_updt_dts > ( select max(extract_start_dts) keep (dense_rank last order by id) from tb_transaction )

It won't save you very much time, but your plan will likely look even more simple.

Regards,
Rob.

Karen said...

Good one Rob, thanks!

Tom Gaines said...

Karen -

Long time fan, first time blog commenter... :)

I like Rob's comment, and I'll have to "keep" that construct
in mind. There are always multiple ways of doing SQL, and my first
thought went here:

with my_factor as
(select extract_start_dts,
extract_end_dts
from tb_transaction
where trans_sys_id = (select max(trans_sys_id) from tb_transaction)
)
--rest of query using my_factor here--
/

The important idea, of course, is to keep the amount of work to a minimum
by reducing your two operations to one. I realize that the difference is
trivial, but it's good to keep in mind for general purposes.

Back to the original set of four separate queries for a moment. I can
imagine all sorts of havoc being created depending upon the circumstances.
With four separate queries, does the
process behind the original code guarantee that no changes whatsoever can
occur to the underlying tb_transaction and tr_policy_register tables during
the period when these four queries are executed? Without
that guarantee, you can never be absolutely certain that the work is
correct at a certain point in time.

On a similar note, the use of rownum without an order by should raise all sorts
of eyebrows. As Tom Kyte points out repeatedly, Oracle will not guarantee the
order of results delivered by a query. Even without changes to the underlying
tables over the course of a few seconds or minutes, it's probably possible to
have the 2627 rows of one "sub-query" overlap with the 2627 rows of another.
Changes to the table/index statistics after one "sub-query" but before the
next could probably do that, I'd bet.

That's all. Good work. I find it hard to believe that this approach of breaking
up larger sets into slightly more manageable one is common, but I see all sorts
of bad habits in my day-to-day work. Perhaps it does occur!

Tom Gaines

Narendra said...

Karen/Rob,

Interesting.
But I am not sure how is
a.AUDIT_REC_UPDT_DTS > (select EXTRACT_START_DTS from tb_transaction where trans_sys_id = (select max(trans_sys_id) from tb_transaction))
and a.AUDIT_REC_UPDT_DTS < (select EXTRACT_END_DTS from tb_transaction where trans_sys_id = (select max(trans_sys_id) from tb_transaction))

equivalent to
a.audit_rec_updt_dts > ( select max(extract_start_dts) keep (dense_rank last order by id) from tb_transaction )

Why can EXTRACT_END_DTS be left out?

Rob van Wijk said...

Hi Narendra,

You are right, EXTRACT_END_DTS cannot be left out. I meant that you need two of those lines, I just wrote one of them.

Regards,
Rob.

Joel Garry said...

Check out the Robotman.

(I used to hang out with Fred when we were both a lot younger and worked at the same company).

word: nomout