Wednesday, May 27, 2009

The Right Stuff

Now this is seriously cool. I was sent a link to an article called "They Write the Right Stuff." It's about the software that powers the US space program and the team of developers that write it. Here's a paragraph from the article:

But how much work the software does is not what makes it remarkable. What makes it remarkable is how well the software works. This software never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats : the last three versions of the program -- each 420,000 lines long-had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors.


The article goes on to discuss what the team does to ensure the software is this good and why it's so absolutely critical that it is flawless. The answer to "why" is fairly obvious. Think of a multi-ton spacecraft lifting off and shooting into space. Think of a spacecraft moving at 17,500 miles per hour and needing to adjust course precisely to hit a re-entry point. Think of the crew sitting inside. Wow! It gives you a whole new perspective on what it could mean if you have bugs in your code.

I agree with the article that software development seems to be in the "cave man" phase of evolution. I see a lot (and I mean a lot) of buggy code and bad SQL. And the prevailing attitude is usually "log a defect...we'll prioritize it and get it fixed when we can". What if the code that controls the space shuttle was treated that way?

I get it that having bugs in the software that runs our businesses doesn't mean life or death. I also understand that often the speed with which new features in software are delivered is more critical than delivering that feature "bug free" or, for that matter, delivering it knowing it will perform well! But knowing that the ability to create code of this quality is possible makes me wish (even more so than I already do) that the software development process was approached with greater care by everyone. Not just by the developers of code that must be bug free or people could die.

What if the code you wrote would mean someone died if it had a bug? What would you do differently? While I know it may be a poor analogy, this makes me think of the "2 minute offense" in football. The team has been playing poorly until the last 2 minutes of the game. Then, they go into this hurry up offense and really push themselves to score. My question has always been why don't they just play all out the rest of the game?

It seems to me to be similar to how software development is approached. The job gets done, often not very well, but everybody goes into 2 minute mode when the code goes into production and people start screaming about bugs and performance. If the screams are loud enough, things get fixed. It just seems that it could be done so much better to start with.

I understand the arguments for why it doesn't happen "the right way" first. But after reading this article, it makes me marvel at how cool it would be to be part of a team that looks at their code as being something that MUST NOT have bugs...ever. And, they create the processes that support achieving perfect code. I'd think those folks really like what they do and take great pride in what they develop knowing that their diligence and good work keep people alive and keep our space program running.

Cool. Very, very cool.

Saturday, May 16, 2009

Once again: why guess when you can know?

I recently was asked to review a bit of SQL that was performing poorly. By poorly I mean that a single execution of the query may take hours to complete (that's pretty poor). The SQL was quite simple:



SELECT ph.cust_bus_nm
, ph.clup_fmt1
, ph.batch_period_month_yr_dt
, ph.batch_sys_id
, ph.our_prem_amt
, ph.liab_amt
, ph.code
, ph.pol_effect_dt
, ph.pol_nbr
, ph.agent_prem_amt
, ph.uw_sys_id
, ph.cost_cntr_sys_id
, ph.state_cd
, ph.loc_rep_sys_id
, ph.cust_sys_id
, ph.clup_sys_id
, ph.item_type
, NVL(ph.pol_type_cd, p.pol_type_cd) AS pol_type_cd
, ph.acct_period_month_yr_dt
, ph.hist_rev_ind
, ph.validation_dts
FROM tr_pol_h_endr_h_v ph
INNER JOIN tr_policy p ON ph.pol_sys_id = p.pol_sys_id
WHERE ph.liab_amt >= 2500000
AND TRUNC(ph.acct_period_month_yr_dt) BETWEEN TO_DATE ('200902', 'YYYYMM') AND TO_DATE ('200902', 'YYYYMM')
AND DECODE (0
, 0, ph.uw_sys_id
, 0) = ph.uw_sys_id
AND DECODE (0
, 0, ph.cost_cntr_sys_id
, 0) = ph.cost_cntr_sys_id
AND DECODE ('ALL'
, 'ALL', ph.state_cd
, 'ALL') = ph.state_cd
AND DECODE (0
, 0, NVL (ph.loc_rep_sys_id, 0)
, 0) = NVL (ph.loc_rep_sys_id, 0)
AND DECODE (0
, 0, ph.cust_sys_id
, 0) = ph.cust_sys_id
AND DECODE (0
, 0, ph.clup_sys_id
, 0) = ph.clup_sys_id
/



Note that the DECODEs look a bit odd but are there to accommodate how Crystal Reports has to handle (or doesn't handle well) variables. That wasn't really part of the problem.

The query used a view (tr_pol_h_endr_h_v) so I started by reviewing the view source. In the view source I found a few things that were suspect.
1) The view source included a join between tr_policy_history (alias pol_hist) and tr_policy (note that these tables are both 70 million+ rows each and are growing in size). This is the table being joined to the view in the offending SQL. As a matter of fact, the view source join was identical to the join in the SQL (INNER JOIN tr_policy p ON (pol_hist.pol_sys_id = p.pol_sys_id).
2) The view source defined the acct_period_month_yr_dt column as TRUNC(pol_hist.acct_period_month_yr_dt). There was a function-based index on TRUNC(pol_hist.acct_period_month_yr_dt).
3) The execution plan indicated a full scan using a filter on acct_period_month_yr_dt was occurring on the pol_hist table even though the FBI was in place.

Here's where I question whether this was an oversight or just a lack of understanding.

First, since the view source accesses the tr_policy table, why not perform the NVL check on pol_hist.pol_type_cd in the view source? Why join back to tr_policy again in the SQL that uses the view to make the NVL check?

Next, the view source uses TRUNC(pol_hist.acct_period_month_yr_dt) to create the column acct_period_month_yr_dt. But, in the SQL statement, the coder used TRUNC again. Why? By using it again, the effect is like writing TRUNC(TRUNC(acct_period_month_yr_dt). Since the function-based index is based on the expression using a single TRUNC() function, the index is ignored for use since the expressions don't match.

So, what to do?

Change the view source to execute NVL(pol_hist.pol_type_cd, p.pol_type_cd) AS pol_type_cd. Then, change the SQL as follows:



SELECT ph.cust_bus_nm
, ph.clup_fmt1
, ph.batch_period_month_yr_dt
, ph.batch_sys_id
, ph.our_prem_amt
, ph.liab_amt
, ph.code
, ph.pol_effect_dt
, ph.pol_nbr
, ph.agent_prem_amt
, ph.uw_sys_id
, ph.cost_cntr_sys_id
, ph.state_cd
, ph.loc_rep_sys_id
, ph.cust_sys_id
, ph.clup_sys_id
, ph.item_type
, ph.pol_type_cd
, ph.acct_period_month_yr_dt
, ph.hist_rev_ind
, ph.validation_dts
FROM tr_pol_h_endr_h_v ph
WHERE ph.liab_amt >= 2500000
AND ph.acct_period_month_yr_dt BETWEEN TO_DATE ('200902', 'YYYYMM') AND TO_DATE ('200902', 'YYYYMM')
AND DECODE (0
, 0, ph.uw_sys_id
, 0) = ph.uw_sys_id
AND DECODE (0
, 0, ph.cost_cntr_sys_id
, 0) = ph.cost_cntr_sys_id
AND DECODE ('ALL'
, 'ALL', ph.state_cd
, 'ALL') = ph.state_cd
AND DECODE (0
, 0, NVL (ph.loc_rep_sys_id, 0)
, 0) = NVL (ph.loc_rep_sys_id, 0)
AND DECODE (0
, 0, ph.cust_sys_id
, 0) = ph.cust_sys_id
AND DECODE (0
, 0, ph.clup_sys_id
, 0) = ph.clup_sys_id
/



I also added the liab_amt column to the FBI so that both conditions could be satisfied/matched in the index to eliminate a TABLE ACCESS to filter on liab_amt.

The result was execution times that fell into the 4-6 second range. From hours to seconds. Not bad.

But, I really wonder how something so simple was overlooked for so long. Apparently, this query had been on the top of the "bad SQL" list for ages and multiple attempts had been made to optimize it. But, within just a few minutes, I noted the issues and was able to recommend the fix. I don't think the fix was that hard but it seemed to be a stumper for some time. Why?

I think it may have been both a bit of oversight and a bit of lack of understanding. The oversight was in accessing a table (tr_policy) twice. Mantra => Never do anything more than absolutely necessary.

It may have also been oversight that the FBI wasn't getting used. But for that one, I'm going to point the finger at lack of understanding. The expression in the index definition must match the expression in the SQL. I don't think the coder realized that when the view source used TRUNC, that was all that was needed. And, I think that the reason nobody caught it was that nobody was looking at an actual execution plan for the query. They may have either only been looking at an explain plan (which may have shown an index use???) or they didn't look at a plan at all and just assumed the index was in use (this is what I think is most likely).

So, the moral of the story once again is: why guess when you can know? I haven't included the tests I conducted here for the sake of brevity (well OK... this post is not short, but it could've been a lot longer!), but with the test data and actual execution plan information in front of me, this one seemed quite easy. Guessing at the problem and possible solutions had allowed an easy fix to be overlooked for months. With good data, the problem was quite obvious and the solution easy to implement.

Repeat after me: why guess when you can know?!

Friday, May 8, 2009

Wizard

I love magic tricks and have even recently included a couple of card tricks in my "Managing Statistics" presentation. So, as I walked around Downtown Disney yesterday, I found this hat and couldn't resist trying it on!



I may still go back and buy it...and maybe a wand too!

Wednesday, May 6, 2009

Strange road sign

I flew into the Sanford airport in Orlando this week for the Collaborate conference. It's about 20 miles from the main Orlando airport. As I drove to my hotel, I passed a road sign unlike any I've ever seen before. It was located just before I crossed a bridge and it read:


I had visions of hoards of mosquitos or killer bees attacking my moving vehicle so I made sure the windows were rolled up tight, turned the vent to recirculate and sped up until I felt I would be out of the danger zone.

Only in Florida.... :)