Friday, August 15, 2008

Strange parsing behavior

My friend Jan is cool. He sends me lots of neat SQL performance issues he's working on so I canto take a look at. I love it when I get to see performance problems get fixed and have the before/after data to show the differences. Like I said, Jan is cool and sends me lots of good stuff like this.

What he sent me yesterday was interesting because I'd never seen it before. The query that was having problems was a UNION of 30 or so different SELECTs. Many of the SELECTs were similar except for the WHERE clause. Here's what Jan said in his email to me:

[Jan]
Now the fun thing is to follow it during execution. This seems to be happening:

-- parse whole query => v$sql shows sql_id child_number 0
-- start parallel slaves for
-- parallel slaves do work
-- when done, parallel slaves for this part of query wait: 'PX Deq: Execution Msg'
...
-- start new set of parallel slaves for
-- slaves wait 'cursor: pin S wait on X'
-- Query coordinator shows wait: 'PX Deq: Parse Reply' => v$sql shows sql_id child_number 1
-- parallel slaves do work
-- when done, parallel slaves for this part of query wait: 'PX Deq: Execution Msg'
...
-- start new set of parallel slaves for
-- slaves wait 'cursor: pin S wait on X'
-- Query coordinator shows wait: 'PX Deq: Parse Reply' => v$sql shows sql_id child_number 2
-- parallel slaves do work
-- when done, parallel slaves for this part of query wait: 'PX Deq: Execution Msg'
etc, etc

By the time the whole query is done we have child_number 0..30, and about 360 sessions because of all the parallel slaves!

Weirdest part is the 'cursor: pin S wait on X' plus 'PX Deq: Parse Reply' waits, they take > 8 seconds every time! Why would it parse over and over again???

When I start the query a second time, it's way faster, since those parsing waits don't happen (if the plan's still there):
First time: Selected 109 records in 712.485 Secs
Second time: Selected 109 records in 142.829 Secs
[/Jan]

The two things about this that interested me where the cursor: pin S wait on X events and the parsing waits. I had just seen the 'pin S' timed event the day before while reviewing a trace file with another customer. I find it funny how I'll often see 2 or 3 of the same kind of error in a short time frame. I guess it really is true that things happen in multiples, huh? Anyway, the 'pin S' events are related to mutex pinning issues. Admittedly, I didn't know much about that until the issue I researched the day before. Here's what I learned:

The timed event for 'cursor: pin S wait on X' (I'm just going to call this pinS for short from now on) is an event that started in occurring in v10 when Oracle added mutex pins into the scheme in addition to latches. Basically, a mutex is kind of like a latch, but just a simpler, faster version. When you get the pinS wait, it's caused by a session holding an exclusive mutex pin on a cursor for which another session wishes to get a shared mutex pin on the same cursor object. From the looks of things with your big UNION query and all the parallel slave processes, that makes sense. The p1-p3 values give you some info:

P1 Hash value of cursor
P2 Mutex value (top 2 bytes contains SID holding mutex in exclusive mode, and bottom two bytes usually hold the value 0)
P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps

and there are a few dynamic system views that provide more information about this timed event: V$MUTEX_SLEEP and V$MUTEX_SLEEP_HISTORY. You just have to check those views while the problem is occurring to get the additional info.


There's not a whole lot of data on this floating around out there, but I found some interesting info from Tanel Poder and Jonathan Lewis that helped.

Anyway, the second issue was the PX Deq: Parse Reply waits. I was wondering why all the parsing was going on and guessing that if the parsing issue was resolved, that would also reduce or eliminate the pinS waits as well.

I wrote back to Jan about the small bit I knew and told him I thought that rewriting the query to combine several of the separate SELECTs together would likely help the problem. At a quick glance, there were 18 of the SELECTs that shared the same FROM clause and join conditions.

Today I did a bit more research to check out the 'PX Deq: Parse Reply' and came across a great write up at Doug Burn's blog about Parallel-specific wait events (the whole article starts here). Good stuff...check it out.

Then, just a few minutes ago, Jan sent me the update on how he was progressing.

[Jan]
After some more digging this is where we get multiple children (simplified case, I'm sure you can simulate it):
 
select /*+ test */ appln_jrnl_id, sum(payment_count) from fihub.tbl_voucher_acct_line d
where appln_jrnl_id = 'AP_PAYMENT'
and fiscal_year_num = 2008
and acct_period_cd = 5
group by appln_jrnl_id
union
select /*+ test */ 'AP_PAYMENT', sum(payment_count) from fihub.tbl_voucher_acct_line d
where appln_jrnl_id = 'AP_PAYMENT'
and fiscal_year_num = 2008
and acct_period_cd = 5;

BTW, this table is partitioned, so it's using parallelism.

The original query had mostly group by's. But there where a few SUMs without group by's. That's what is causing the trouble. v$sql_shared_cursor says it's because of a PQ_SLAVE_MISMATCH. When running serially it doesn't happen. After rewriting the troubling parts to do a group by, it's back to normal. After a total rewrite, the query is much faster.
[/Jan]

Again, I love stuff like this. I thought it would be an interesting case to share.

Thanks Jan!

4 comments:

Erbsock said...

Try setting this:
_kks_use_mutex_pin=false

We hit this problem too.

Karen said...

Yes, this hidden parameter was known, but it wasn't an option to use in this case. Fortunately, the query rewrite did the trick.

Tom said...

ah, sounds like they are trying to create some kind of report format. I'd find out how they were getting that data. If SQL Plus, they can do breaks and calculations in SQL Plus to make it easier. Some of those queries are just crazy.

Jan is a good guy! : ) We had him onsite back in November for a few weeks and then he flew up to Toronto with me to go work with our Application Vendor. Very talented guy and he was able to help us out a great deal.

DomBrooks said...

I'm running into significant issues with the pinS wait when using parallel on 11g.

Just thought I'd post a comment as a follow-up to Erbsock's that you can no longer turn mutexes off using this parameter in 11g.