Wednesday, July 16, 2008

Analytics to the rescue

I see a lot of SQL. Actually, I see a lot of bad SQL. I guess that's because my job is mainly to help people with performance problems and the majority of the time, the problems are rooted in one or more inefficient SQL statements. Yesterday I was doing a demo of our Profiler software and showed one case that I see over and over, so I thought I'd share it here.

I have a profile for a 4 hour and 44 minute trace of an application process that was taking too long. The top 3 events in the profile (accounting for 82% of the total overall response time) were:
CPU service, FETCH calls  8,357.40s (48.9%)
db file scattered read 3,075.95s (18.0%)
db file sequential read 2,577.67s (15.1%)
The SQL statement that topped the list taking 12,695.90s (74.3% of the total response time) was:
select utrsotp_desc
from utrsotp, ucbsvco
where ucbsvco_sotp_code = utrsotp_code
and ucbsvco_dispatch_date =
(select max(ucbsvco_dispatch_date)
from ucbsvco
where ucbsvco_cust_code = :b1
and ucbsvco_prem_code = :b2
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null)
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null;
Given the fact that this one statement accounted for 74% of the overall response time, with a quick look at the rest of the profile, you could see that this one query was responsible for almost all the response time consumed by the top 3 events in the profile. So, the cool thing about this profile is that it was easy to see that by fixing one SQL statement, we'd basically get rid of our dominant problem.

Now, a bit more info. This statement was executed 1,332 times during the almost 5 hour period captured by the trace. The query returned only 587 rows and did a total of 141,084,499 block accesses (LIOs). That means that on average it took almost 160,000 LIOs for each execution. Ummm...that's not good. It should be using something like 20 LIOs or less per row (based on a maximum tolerance of no more than 10 LIOs per joined table per row).

It kinda seems obvious that there's a big problem with this query doesn't it? So, why had this process been running for so long without someone catching it? I mean, if we could fix this one statement, we could cause as much as a 74% reduction in response time. That means the 4 hr and 44 minute process could drop down to about 1 hour and 15 minutes.

But the problem was in the way the code was being reviewed on the test instance. First, (mistake 1) only EXPLAIN PLAN was used to review the anticipated plan for the SQL and (mistake 2) the SQL was executed to make sure the response time (for a single execution), seemed reasonable. Another thing worth noting is (mistake 3) that the test instance had only about 10% of the volume of data as production, so virtually anything that was executed took only a second or so due to the lack of volume. So, the response time never was really noted as an issue because it always ran fast. The EXPLAIN PLAN didn't really rasie any eyebrows either because it looked like this:

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 38 | 1768 |
| 1 | NESTED LOOPS | | 1 | 38 | 1768 |
| 2 | TABLE ACCESS FULL | UCBSVCO | 1 | 12 | 1767 |
| 3 | SORT AGGREGATE | | 1 | 20 | |
| 4 | TABLE ACCESS BY INDEX ROWID| UCBSVCO | 1 | 20 | 1 |
| 5 | INDEX RANGE SCAN | UCBSVCO_CUST_INDEX | 8 | | 1 |
| 6 | TABLE ACCESS BY INDEX ROWID | UTRSOTP | 1 | 26 | 1 |
| 7 | INDEX UNIQUE SCAN | UTRSOTP_KEY_INDEX | 1 | | |
--------------------------------------------------------------------------------------
Now, the TABLE ACCESS FULL step might have raised a few eyebrows, but since they saw Rows = 1 at that step, they figured it shouldn't be a problem. I mean, hey, if there's only going to be one row pulled from the table, they thought, how bad could it be? (Remember that they were seeing sub-second response times in test.)

But, once I was able to point out the combined effect of executing that SQL over 1300 times in their production, full volume environment, the problem was (all of a sudden) easy to see. Something needed to be done to make this SQL statement more efficient.

As I mentioned when I first started this post, I see SQL statements similar to this one over and over again. They are very typical in that lots of apps use tables that store both historical and current info in them. For instance, imagine that if you wanted to get the latest price from a product table that stored each price in a separate product row with an effective date column value used to indicate the date the price was put into effect. In order to do that, you'd have to make sure you retrieved the row which had the MAX effective date for that product. The way the query is most commonly formulated to do this is with this WHERE clause:
WHERE price_eff_dt = (SELECT MAX(price_eff_dt) FROM product_tab WHERE product_id = outer.product_id)
This is exactly what was going on in the problem query in this process.

Analytics to the rescue! By using an analytic MAX function instead, Oracle is able to make a single pass over the data to get the answer it needs. Here's the query rewritten to use an analytic MAX function instead:
select utrsotp_desc
from utrsotp,
(select ucbsvco_sotp_code sotp_cd, ucbsvco_dispatch_date dispatch_dt,
max(ucbsvco_dispatch_date) over (partition by ucbsvco_cust_code,
ucbsvco_prem_code) max_dispatch_dt
from ucbsvco
where ucbsvco_cust_code = :b1
and ucbsvco_prem_code = :b2
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null) svco
where svco.dispatch_dt = svco.max_dispatch_dt
and svco.sotp_cd = utrsotp_code;
The plan for this query was:

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 48 | 4 |
| 1 | NESTED LOOPS | | 1 | 48 | 4 |
|* 2 | VIEW | | 1 | 22 | 3 |
| 3 | WINDOW BUFFER | | 1 | 25 | 3 |
|* 4 | TABLE ACCESS BY INDEX ROWID| UCBSVCO | 1 | 25 | 1 |
|* 5 | INDEX RANGE SCAN | UCBSVCO_CUST_INDEX | 8 | | 1 |
| 6 | TABLE ACCESS BY INDEX ROWID | UTRSOTP | 1 | 26 | 1 |
|* 7 | INDEX UNIQUE SCAN | UTRSOTP_KEY_INDEX | 1 | | |
--------------------------------------------------------------------------------------

Tests of the two queries on a full volume data source (for a single set of bind values expected to return 1 row), had the following results:
Original query
160,693 LIOs
20.16s

New query
15 LIOs
.18s

That's a 99%+ reduction in LIOs and response time. Not bad, huh? :)

Simple change, massive difference. I didn't get to stay around long enough on the client site to see the fix implemented in production, but the expected benefit would mean, as mentioned above, that the response time should drop down from nearly 5 hours to just over 1 hour. Nice.

So, I'm always glad to see examples like this one pop up when I'm helping a customer with performance issues. The analytic function is a nice alternative to the traditional way of writing these types of queries that works extremely well and provides the performance needed for high numbers of executions in a high data volume environment.

11 comments:

Gary Myers said...

"Now, the TABLE ACCESS FULL step might have raised a few eyebrows, but since they saw Rows = 1 at that step, they figured it should be a problem."
Aarg. The rows shows the rows returned, not processed. Full scanning a million row table to return (an estimated) one row still involves full scanning a million rows. Both the Rows and Cost values need to be thought about.

Karen said...

Gary,

Yep...my point exactly! But, unfortunately, many people just don't *see* the full scan (and all the work associated with it)...all they see is the estimated end result. Bleh...

Tony said...

I just used analytics to solve a very similar problem last week. There is a query that is used all over the place in my application...dozens and dozens of times. It is inherited and every person before me has just accepted it and propagated it. I finally got around to really look at the query. A quick introduction of analytics just like you did here solved a lot of problems. Queries that were taking 3 minutes, now take two seconds, queries that were taking 9 seconds are now sub one second. It is awesome. This is one of the best tools we can use in tuning SQL I think.

Asif Momen said...

This post is an eye-opener for all the developers on how to write an efficient SQL.

Analytic Functions have been around since Oracle 8i but are rarely used proficiently.

Even I keep coming across queries which can really be benefited by using them.

Bundit said...

Karen,

Is there any reason to avoid using analytic function due to performance issue ?

tylermuth said...

"Oracle is able to make a single pass over the data to get the answer it needs". I think you should make that line bold and about 4x larger than the rest of the text. The key difference is that in the original example, the subquery in the predicate is executed once for EVERY ROW ("Row by Row = Slow by Slow" - Tom Kyte). I see this over and over, especially when people write custom reports against eBusiness Suite or PeopleSoft.

The more I work with data models that store historical data, the more I wish they would flag the "current record" whenever they to an insert or update. This would be trivial to implement, greatly simplify most of the queries, and likely yield massive performance improvements in both the applications and any adhoc reports written on them.

freekdhooge said...

tylermuth,

If I look at the explain plan, I don't think the subquery is done for each record investigated.
Also, in the subquery there is no predicate linking it back to the top query.

Karen,

Do you have the explain plans with the predicates? Maybe they give a clearer picture?

regards

Freek

Craig Martin said...

A couple quick questions...

Are the columns in the partition by necessary? It seems like it isn't adding anything since you are filtering on those columns.

Also, I use analytic functions frequently, however I am not sure I would have ever thought to this problem in this way. Would it provide an advantage over something like (sorry in advance if code doesn't come through formatted well):

with
max_ucbsvco as
(
    select sotp_cd
    from
    (
        select
            ucbsvco_sotp_code sotp_cd
        from
            ucbsvco
        where
            ucbsvco_cust_code = :b1
            and ucbsvco_prem_code = :b2
            and ucbsvco_stus_code = 'C'
            and ucbsvco_dispatch_ind is not null
            and ucbsvco_dispatch_date is not null
        order by
            ucbsvco_dispatch_date desc
    )
    where rownum = 1
)
select
    utrsotp_desc
from
    utrsotp
where
    utrsotp_code = (
        select
            sotp_cd
        from
            max_ucbsvco
    )

Joel Garry said...

Bundit:

If you are not on the latest patch set, see the list of bug fixes and look at the analytic sql section (ie, metalink Note:401436.1)

But in general, helping Oracle to know that it can reduce the passes through the data is a good thing.

word: rddgn
2nd word: cgrcgi

Joel Garry said...

Oh yeah, see Note:6917874.8 sometimes they break stuff fixing other stuff.

word: envmg

robert said...

Like Freek I also have the impression that there's something missing. The filter is only in the max query but not in the main query.

Karen, did you consider a solution with GROUP BY? Something along the lines of

select utrsotp_desc
from utrsotp, ucbsvco
join (select max(ucbsvco_dispatch_date) max_date
from ucbsvco
where ucbsvco_cust_code = :b1
and ucbsvco_prem_code = :b2
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null) m
where ucbsvco_sotp_code = utrsotp_code
and ucbsvco_dispatch_date = m.max_date
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null

Would it yield similar improvements?