I see a lot of SQL. Actually, I see a lot of
bad SQL. I guess that's because my job is mainly to help people with performance problems and the majority of the time, the problems are rooted in one or more inefficient SQL statements. Yesterday I was doing a demo of our
Profiler software and showed one case that I see over and over, so I thought I'd share it here.
I have a profile for a 4 hour and 44 minute trace of an application process that was taking too long. The top 3 events in the profile (accounting for 82% of the total overall response time) were:
CPU service, FETCH calls     8,357.40s   (48.9%)
db file scattered read       3,075.95s   (18.0%)
db file sequential read      2,577.67s   (15.1%)
The SQL statement that topped the list taking 12,695.90s (74.3% of the total response time) was:
select utrsotp_desc
from utrsotp, ucbsvco
where ucbsvco_sotp_code = utrsotp_code
and ucbsvco_dispatch_date =
(select max(ucbsvco_dispatch_date)
from ucbsvco
where ucbsvco_cust_code = :b1
and ucbsvco_prem_code = :b2
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null)
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null;
Since this one statement accounted for 74% of the overall response time, a quick look at the rest of the profile made it clear that this single query was responsible for almost all the response time consumed by the top 3 events. That's the cool thing about this profile: it was easy to see that by fixing one SQL statement, we'd basically get rid of our dominant problem.
Now, a bit more info. This statement was executed 1,332 times during the almost 5-hour period captured by the trace. The query returned only 587 rows and did a total of 141,084,499 block accesses (LIOs). That works out to more than 100,000 LIOs per execution. Ummm...that's not good. It should be using something like 20 LIOs or less per row (based on a maximum tolerance of no more than 10 LIOs per joined table per row, and this query joins two tables).
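As a side note, you don't need a full trace to spot a ratio like that: 141,084,499 gets over 1,332 executions is roughly 106,000 LIOs per execution, against a target of around 20. A quick sanity check against V$SQL does the same per-execution arithmetic. This is just a sketch; the &sql_id substitution variable is a placeholder for whatever statement you're checking:

-- Rough per-execution LIO check from the shared pool (sketch).
-- BUFFER_GETS is the cumulative LIO count for all executions of the cursor.
select sql_id,
       executions,
       buffer_gets,
       rows_processed,
       round(buffer_gets / nullif(executions, 0))     lio_per_exec,
       round(buffer_gets / nullif(rows_processed, 0)) lio_per_row
from   v$sql
where  sql_id = '&sql_id';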
It kinda seems obvious that there's a big problem with this query, doesn't it? So, why had this process been running for so long without someone catching it? I mean, if we could fix this one statement, we could get as much as a 74% reduction in response time. That means the 4 hour and 44 minute process could drop down to about 1 hour and 15 minutes.
But the problem was in the way the code was being reviewed on the test instance. First, (mistake 1) only EXPLAIN PLAN was used to review the anticipated plan for the SQL, and (mistake 2) the SQL was executed only to confirm that the response time for a single execution seemed reasonable. Another thing worth noting is (mistake 3) that the test instance had only about 10% of the data volume of production, so virtually anything that was executed completed in a second or so simply because there wasn't much data. So the response time was never flagged as an issue because the statement always ran fast. The EXPLAIN PLAN didn't raise any eyebrows either, because it looked like this:
--------------------------------------------------------------------------------------
| Id  | Operation                        | Name               | Rows  | Bytes | Cost  |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                    |     1 |    38 |  1768 |
|   1 |  NESTED LOOPS                    |                    |     1 |    38 |  1768 |
|   2 |   TABLE ACCESS FULL              | UCBSVCO            |     1 |    12 |  1767 |
|   3 |    SORT AGGREGATE                |                    |     1 |    20 |       |
|   4 |     TABLE ACCESS BY INDEX ROWID  | UCBSVCO            |     1 |    20 |     1 |
|   5 |      INDEX RANGE SCAN            | UCBSVCO_CUST_INDEX |     8 |       |     1 |
|   6 |   TABLE ACCESS BY INDEX ROWID    | UTRSOTP            |     1 |    26 |     1 |
|   7 |    INDEX UNIQUE SCAN             | UTRSOTP_KEY_INDEX  |     1 |       |       |
--------------------------------------------------------------------------------------
Now, the TABLE ACCESS FULL step might have raised a few eyebrows, but since they saw Rows = 1 at that step, they figured it shouldn't be a problem. I mean, hey, if there's only going to be one row pulled from the table, they thought, how bad could it be? (Remember that they were seeing sub-second response times in test.)
But once I pointed out the combined effect of executing that SQL over 1,300 times in their full-volume production environment, the problem was (all of a sudden) easy to see. Something needed to be done to make this SQL statement more efficient.
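For what it's worth, mistakes 1 and 2 are fairly easy to avoid, even on a small test instance, if you look at actual row-source statistics instead of just the estimated plan. One way (a sketch, assuming your version has DBMS_XPLAN.DISPLAY_CURSOR) is to execute the statement with the GATHER_PLAN_STATISTICS hint and then display the actuals; the A-Rows and Buffers columns make a 100,000-plus LIO execution hard to miss, even when the wall-clock time looks fine:

-- Run the statement once with row-source statistics collection enabled
-- (the hint avoids changing STATISTICS_LEVEL for the whole session).
select /*+ gather_plan_statistics */ utrsotp_desc
from utrsotp, ucbsvco
where ucbsvco_sotp_code = utrsotp_code
and ucbsvco_dispatch_date =
    (select max(ucbsvco_dispatch_date)
     from ucbsvco
     where ucbsvco_cust_code = :b1
     and ucbsvco_prem_code = :b2
     and ucbsvco_stus_code = 'C'
     and ucbsvco_dispatch_ind is not null)
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null;

-- Then show the estimates (E-Rows) next to the actuals (A-Rows, Buffers)
-- for the last statement executed in this session.
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));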
As I mentioned when I first started this post, I see SQL statements similar to this one over and over again. They're typical of applications that store both historical and current information in the same table. For instance, imagine you want to get the latest price from a product table that stores each price in a separate row, with an effective-date column indicating when that price went into effect. To do that, you have to retrieve the row with the MAX effective date for that product. The most common way to formulate such a query is with a WHERE clause like this:
WHERE price_eff_dt = (SELECT MAX(price_eff_dt) FROM product_tab WHERE product_id = outer.product_id)
This is exactly what was going on in the problem query in this process.
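Spelled out as a complete statement against that imaginary product table (a sketch; product_tab, product_id, price, and price_eff_dt are made-up names), the traditional form of the lookup looks something like this:

-- Traditional "latest row" lookup: the correlated MAX subquery visits
-- product_tab a second time for every outer row it evaluates.
select p.product_id, p.price
from product_tab p
where p.price_eff_dt =
    (select max(p2.price_eff_dt)
     from product_tab p2
     where p2.product_id = p.product_id);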
Analytics to the rescue! By using an analytic MAX function, Oracle is able to make a single pass over the data to get the answer it needs. Here's the problem query rewritten to use an analytic MAX:
select utrsotp_desc
from utrsotp,
(select ucbsvco_sotp_code sotp_cd, ucbsvco_dispatch_date dispatch_dt,
max(ucbsvco_dispatch_date) over (partition by ucbsvco_cust_code,
ucbsvco_prem_code) max_dispatch_dt
from ucbsvco
where ucbsvco_cust_code = :b1
and ucbsvco_prem_code = :b2
and ucbsvco_stus_code = 'C'
and ucbsvco_dispatch_ind is not null) svco
where svco.dispatch_dt = svco.max_dispatch_dt
and svco.sotp_cd = utrsotp_code;
The plan for this query was:
--------------------------------------------------------------------------------------
| Id  | Operation                        | Name               | Rows  | Bytes | Cost  |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                    |     1 |    48 |     4 |
|   1 |  NESTED LOOPS                    |                    |     1 |    48 |     4 |
|*  2 |   VIEW                           |                    |     1 |    22 |     3 |
|   3 |    WINDOW BUFFER                 |                    |     1 |    25 |     3 |
|*  4 |     TABLE ACCESS BY INDEX ROWID  | UCBSVCO            |     1 |    25 |     1 |
|*  5 |      INDEX RANGE SCAN            | UCBSVCO_CUST_INDEX |     8 |       |     1 |
|   6 |   TABLE ACCESS BY INDEX ROWID    | UTRSOTP            |     1 |    26 |     1 |
|*  7 |    INDEX UNIQUE SCAN             | UTRSOTP_KEY_INDEX  |     1 |       |       |
--------------------------------------------------------------------------------------
Tests of the two queries on a full volume data source (for a single set of bind values expected to return 1 row), had the following results:
Original query    160,693 LIOs    20.16s
New query              15 LIOs     0.18s
That's a 99%+ reduction in LIOs and response time. Not bad, huh? :)
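If you want to run that kind of before-and-after comparison yourself, one simple way (a SQL*Plus sketch; the bind datatypes and values are placeholders, and autotrace assumes you have the privileges it needs) is to turn on timing and autotrace statistics and execute each version with the same bind values. The "consistent gets" figure in the statistics output is the LIO count:

-- SQL*Plus settings for a quick before/after comparison.
set timing on
set autotrace traceonly statistics

-- Bind values are placeholders; supply a representative customer
-- and premise code for your data.
variable b1 varchar2(30)
variable b2 varchar2(30)
exec :b1 := '...'
exec :b2 := '...'

-- Run the original query, note "consistent gets" and the elapsed time,
-- then run the rewritten query with the same binds and compare.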
Simple change, massive difference. I didn't get to stay on the client site long enough to see the fix implemented in production, but, as mentioned above, the expected benefit is that the process's response time should drop from nearly 5 hours to just over 1 hour. Nice.
So, I'm always glad to see examples like this one pop up when I'm helping a customer with performance issues. The analytic function is a nice alternative to the traditional way of writing these types of queries: it works extremely well and delivers the performance needed when a statement is executed many times against full production data volumes.
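And to close the loop on the product-price example from earlier, the same rewrite applies there too. Here's a sketch (again with the made-up product_tab names) that can serve as a template the next time one of these shows up:

-- Analytic form of the "latest price" lookup: one pass over product_tab,
-- computing each product's MAX effective date alongside every row and
-- keeping only the rows that match it.
select product_id, price
from (select product_id,
             price,
             price_eff_dt,
             max(price_eff_dt) over (partition by product_id) max_eff_dt
      from product_tab)
where price_eff_dt = max_eff_dt;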