Saturday, May 16, 2009

Once again: why guess when you can know?

I recently was asked to review a bit of SQL that was performing poorly. By poorly I mean that a single execution of the query may take hours to complete (that's pretty poor). The SQL was quite simple:



SELECT ph.cust_bus_nm
, ph.clup_fmt1
, ph.batch_period_month_yr_dt
, ph.batch_sys_id
, ph.our_prem_amt
, ph.liab_amt
, ph.code
, ph.pol_effect_dt
, ph.pol_nbr
, ph.agent_prem_amt
, ph.uw_sys_id
, ph.cost_cntr_sys_id
, ph.state_cd
, ph.loc_rep_sys_id
, ph.cust_sys_id
, ph.clup_sys_id
, ph.item_type
, NVL(ph.pol_type_cd, p.pol_type_cd) AS pol_type_cd
, ph.acct_period_month_yr_dt
, ph.hist_rev_ind
, ph.validation_dts
FROM tr_pol_h_endr_h_v ph
INNER JOIN tr_policy p ON ph.pol_sys_id = p.pol_sys_id
WHERE ph.liab_amt >= 2500000
AND TRUNC(ph.acct_period_month_yr_dt) BETWEEN TO_DATE ('200902', 'YYYYMM') AND TO_DATE ('200902', 'YYYYMM')
AND DECODE (0
, 0, ph.uw_sys_id
, 0) = ph.uw_sys_id
AND DECODE (0
, 0, ph.cost_cntr_sys_id
, 0) = ph.cost_cntr_sys_id
AND DECODE ('ALL'
, 'ALL', ph.state_cd
, 'ALL') = ph.state_cd
AND DECODE (0
, 0, NVL (ph.loc_rep_sys_id, 0)
, 0) = NVL (ph.loc_rep_sys_id, 0)
AND DECODE (0
, 0, ph.cust_sys_id
, 0) = ph.cust_sys_id
AND DECODE (0
, 0, ph.clup_sys_id
, 0) = ph.clup_sys_id
/



Note that the DECODEs look a bit odd but are there to accommodate how Crystal Reports has to handle (or doesn't handle well) variables. That wasn't really part of the problem.

The query used a view (tr_pol_h_endr_h_v) so I started by reviewing the view source. In the view source I found a few things that were suspect.
1) The view source included a join between tr_policy_history (alias pol_hist) and tr_policy (note that these tables are both 70 million+ rows each and are growing in size). This is the table being joined to the view in the offending SQL. As a matter of fact, the view source join was identical to the join in the SQL (INNER JOIN tr_policy p ON (pol_hist.pol_sys_id = p.pol_sys_id).
2) The view source defined the acct_period_month_yr_dt column as TRUNC(pol_hist.acct_period_month_yr_dt). There was a function-based index on TRUNC(pol_hist.acct_period_month_yr_dt).
3) The execution plan indicated a full scan using a filter on acct_period_month_yr_dt was occurring on the pol_hist table even though the FBI was in place.

Here's where I question whether this was an oversight or just a lack of understanding.

First, since the view source accesses the tr_policy table, why not perform the NVL check on pol_hist.pol_type_cd in the view source? Why join back to tr_policy again in the SQL that uses the view to make the NVL check?

Next, the view source uses TRUNC(pol_hist.acct_period_month_yr_dt) to create the column acct_period_month_yr_dt. But, in the SQL statement, the coder used TRUNC again. Why? By using it again, the effect is like writing TRUNC(TRUNC(acct_period_month_yr_dt). Since the function-based index is based on the expression using a single TRUNC() function, the index is ignored for use since the expressions don't match.

So, what to do?

Change the view source to execute NVL(pol_hist.pol_type_cd, p.pol_type_cd) AS pol_type_cd. Then, change the SQL as follows:



SELECT ph.cust_bus_nm
, ph.clup_fmt1
, ph.batch_period_month_yr_dt
, ph.batch_sys_id
, ph.our_prem_amt
, ph.liab_amt
, ph.code
, ph.pol_effect_dt
, ph.pol_nbr
, ph.agent_prem_amt
, ph.uw_sys_id
, ph.cost_cntr_sys_id
, ph.state_cd
, ph.loc_rep_sys_id
, ph.cust_sys_id
, ph.clup_sys_id
, ph.item_type
, ph.pol_type_cd
, ph.acct_period_month_yr_dt
, ph.hist_rev_ind
, ph.validation_dts
FROM tr_pol_h_endr_h_v ph
WHERE ph.liab_amt >= 2500000
AND ph.acct_period_month_yr_dt BETWEEN TO_DATE ('200902', 'YYYYMM') AND TO_DATE ('200902', 'YYYYMM')
AND DECODE (0
, 0, ph.uw_sys_id
, 0) = ph.uw_sys_id
AND DECODE (0
, 0, ph.cost_cntr_sys_id
, 0) = ph.cost_cntr_sys_id
AND DECODE ('ALL'
, 'ALL', ph.state_cd
, 'ALL') = ph.state_cd
AND DECODE (0
, 0, NVL (ph.loc_rep_sys_id, 0)
, 0) = NVL (ph.loc_rep_sys_id, 0)
AND DECODE (0
, 0, ph.cust_sys_id
, 0) = ph.cust_sys_id
AND DECODE (0
, 0, ph.clup_sys_id
, 0) = ph.clup_sys_id
/



I also added the liab_amt column to the FBI so that both conditions could be satisfied/matched in the index to eliminate a TABLE ACCESS to filter on liab_amt.

The result was execution times that fell into the 4-6 second range. From hours to seconds. Not bad.

But, I really wonder how something so simple was overlooked for so long. Apparently, this query had been on the top of the "bad SQL" list for ages and multiple attempts had been made to optimize it. But, within just a few minutes, I noted the issues and was able to recommend the fix. I don't think the fix was that hard but it seemed to be a stumper for some time. Why?

I think it may have been both a bit of oversight and a bit of lack of understanding. The oversight was in accessing a table (tr_policy) twice. Mantra => Never do anything more than absolutely necessary.

It may have also been oversight that the FBI wasn't getting used. But for that one, I'm going to point the finger at lack of understanding. The expression in the index definition must match the expression in the SQL. I don't think the coder realized that when the view source used TRUNC, that was all that was needed. And, I think that the reason nobody caught it was that nobody was looking at an actual execution plan for the query. They may have either only been looking at an explain plan (which may have shown an index use???) or they didn't look at a plan at all and just assumed the index was in use (this is what I think is most likely).

So, the moral of the story once again is: why guess when you can know? I haven't included the tests I conducted here for the sake of brevity (well OK... this post is not short, but it could've been a lot longer!), but with the test data and actual execution plan information in front of me, this one seemed quite easy. Guessing at the problem and possible solutions had allowed an easy fix to be overlooked for months. With good data, the problem was quite obvious and the solution easy to implement.

Repeat after me: why guess when you can know?!

3 comments:

Joel Garry said...

This OTN thread was a bit of an eye-opener for me, for FBI statistics.

word: preco

Karen said...

Joel - Good stuff. I've been hit with FBI statistics issues in the past so this wasn't totally unfamiliar. But, the thread is a nice one for things to consider with FBIs and the hidden columns behind them. Thanks for the link!

Unknown said...

Seems like I have seen this kind of thing before. I have pushed for the following solution:

Assuming the trunc of the day is what is required then simply do the truncation work at insert/update time by means of a trigger. Seems like this would avoid the trunc function in the query. This would avoid the TRUNC(TRUNC(date)) situation. Assuming the developer checks what is going on :)

It has been my experience in most business situations this is what is required.