Wednesday, February 6, 2013

Understanding Oracle Optimizer Statistics Webinar Q&A

There were quite a few questions asked during the webinar yesterday that I didn't have time to answer online. So, here's the list of questions from the webinar transcripts with my answers. In some cases, I combined several similar questions and gave a single answer. Hope it's helpful!

1. Can you give your thoughts on bind peeking and how this relates to statistics?
Bind-peeking occurs during the hard parse of a SQL statement that includes a bind variable. At the time of the parse, the optimizer will "peek" at the contents of the bind variable and use that value in its cardinality calculation. The issue that can occur happens when the data is skewed (i.e. there are some frequently occurring values and some infrequently occurring values). If a histogram is present on that column and a literal is used in the SQL - instead of a bind variable - the optimizer is able to determine the cardinality exactly for that value because each query that uses a different literal value would require a new hard parse. Each parse could result in a different plan selection depending on the estimates derived for that specific value. But, the problem is that since the bind is present, there will be only one hard parse of the statement and thus whichever bind variable value is present during the initial parse, that's the plan that is in place for that specific SQL. Until that plan gets aged out of the library cache, it will continue to be used. So, if another bind variable value is used later and that value would be better serviced by a different plan operation, it would still use the original plan. This has changed in later versions of Oracle with the advent of features such as adaptive cursor sharing and cardinality feedback. But, the answer to your question is that bind peeking relates to statistics by making it a bit harder for the optimizer to create plans that are optimal for different bind values. Otherwise, how the optimizer uses the statistics is identical to a query that doesn't use binds.

2. How important are the system statistics (no workload statistics and workload statistics)?
Do you recommend to use workload statistics? When to use workload stats?
What about system statistics do you recommend to collect them?

Workload statistics are intended to help the optimizer understand workload characterizations that change over the course of the day (or whatever time period that is specific to you). For example, if your database is heavy OLTP between 8am and 6pm and heavy batch/reporting between 6pm and 8am, you could let the optimizer know about these different types of workload by implementing workload stats. The way it would work is that you would collect/capture workload stats twice: once during the daytime OLTP hours and once during the evening batch hours. The captured stats indicate how cpu, IO and throughput looks during that time period and could show the optimizer that the daytime hours are full of numerous fast-running queries whereas the evening hours are consumed by long-running queries. This information can help the optimizer choose plans more appropriate to each workload (for example, more index scans during the day and more full scans during the evening). Noworkload stats are in place by default and use fairly innocuous values for these stats. If you've never used workload stats, I advise caution before attempting to implement them. Different workload stats can/will effect the optimizer's plan choices and it may do so in ways that effect more plans than you expect (or want). So, while workload stats can provide the optimizer additional information from which to develop plans, make sure you are prepared to do in-depth and thorough testing before implementing them.

3. Do the order of predicates in the where clause influence in any way the order the optimizer will use them, i.e. if I put the filter for my largest table first, would it take it first?
The order of predicates used to make a difference in the long ago prior to the advent of cost-based optimization. But today's optimizer, using stats, will calculate which predicates will provide the best filtering and execute those first. So, the only reason to write your predicates in any order is so that you can verify what you expect with what the optimizer ends up choosing in the plan. If you think the order should be different than the optimizer chooses, you can quickly compare your written order with the steps in the plan and see where things are "off". Then, you can verify where the optimizer went wrong by reviewing the statistics and comparing the actual vs estimated values the optimizer made.

4. So are you stuck with no histogram to account for skew if you use bind variables?
No, you're not stuck. You just have to realize that skew and bind variables don't always play nicely together. As I mentioned in the earlier answer on bind peeking, today's optimizer is much better at handling binds and data skew. If, however, you find that a particular SQL's behavior is unpredictable and you have to have stability, you may have to consider writing the SQL specifically to accommodate certain cases of skew. For example, you may write two SQL statements and use IF/THEN logic in your code to execute the correct SQL based on the bind value to be used. It requires extra code and knowing about specific corner cases where skew is a problem, but when it's something that is important enough, that may be your best option.

5. Are terms "selectivity" and "density" used interchangeably?
The difference is that density refers to the computed answer to the expression 1/number-of-distinct-column-values. Each column has its own density. Selectivity is typically thought of as the combined densities of multiple columns (or predicates). Such that if you had a WHERE clause of WHERE gender = 'F' and district = 12, the density for gender would be 1/2, or .50, and the density for district would be 1/12, or .083. The predicate selectivity would be .50 x .083 = .0415. But, if you only have a single predicate, then you could use the terms interchangeably but I prefer to keep the two terms separate for clarity.

6. As a general default, what value would you recommend for the "method_opt" parameter of dbms_stats.gather_schema_stats, and would your recommendation be different for Oracle 11.2 vs. 9.2?
In v11, particularly 11.2, I'd recommend using the default method_opt of FOR ALL COLUMNS SIZE AUTO. There have been numerous improvements in how stats are collected and I think the defaults provide "close to perfect" results in most cases. I'd have to say that I didn't start using the defaults until this latest version. For pre-11 Oracle versions, I stuck with FOR ALL COLUMNS SIZE 1 and then did separate collections for tables/columns that I knew would benefit from histograms. The bottom-line is that there really is no "one size fits all" way to collect stats. Your data and your SQL have specific nuances that only you can know. But, if you're running the latest version of Oracle, I'd start with the defaults and modify/adjust from there to meet your specific needs.

7. Which version did Extended Statistics come out in?
11g release 1

8. Is there a way to see the transformed statement?
The transformed statement isn't emitted anywhere. The closest you can get is to use an optimizer trace, 10053, and review that information to see which transformations were considered and selected.

9. While SQL remains the definitive language of DBMS, increasingly, Java in the database through JDBC or other server side programming have increased with Oracle and DB2; are DBAs doing well with the complexity of integrating SQL, PL/SQL, JAVA and JEE in delivering effective DBMS performance and in ensuring enterprise application security?
The biggest issue I see is that there are usually multiple groups that each have expertise in the different disciplines. Java developers have certain biases and PL/SQL developers have theirs. Sometimes these biases, or preferences for how to do something, cause the final product to suffer. I think the key is that all groups, regardless of the tool they are using, must remember that the database is the core (and common) element. Understanding how the database does what it does (i.e. how it executes SQL) is critical. Then, each tool can be utilized to exploit its strengths while making sure to support what the database needs and can do best.

10. After using the hint with 2 predicates why is Optimizer estimating 100 rows? I am assuming the last row in stats is showing the optimizer estimated rows the SQL will return
When using a dynamic_sampling hint at level 4, the optimizer will be able to consider relationships/dependencies between columns it previously considered independently of one another. Since the two columns used in the example were identical (i.e. they contained identical data and thus either used alone would return the same answer), by default, they would be considered independently of one another and would cause the selectivity to be too low. However, when the hint is applied and the relationship between the columns is known, the optimizer computes the selectivity properly so that it is the same as if only a single predicate were used.

11. What would be the difference between letting the out of the box gather stats vs just creating the gather stats on the specific schemas that is used by the application?
Out of the box has multiple meanings. It can refer to the default settings and the default scheduled maintenance job where stale stats are gathered. In either case, I think using default settings (like method_opt FOR ALL COLUMNS SIZE AUTO) are a good place to start. But, as I already mentioned in a previous answer, I think each site needs to adjust stats collection parameters to properly handle their own unique needs. As for the default scheduled maintenance job that collects stale stats, I'm a little wary of that one. I prefer to control when stats are collected so the default job makes me feel less in control of things. So, once again I'll say that I think the defaults (all of them) are there because they are intended to suit the needs of most databases, most of the time. Only you can determine if the defaults "as is" work in your situation and if not, then you must adjust accordingly.

12. How often should the data dictionary stats be gathered?

I'll answer your question with another question: how often and by how much does the data dictionary in your database change significantly enough to require updated stats? You have to know the answer to that question before you can decide the best collection strategy for dictionary stats in your environment. I will say that if you are upgrading versions or doing any significant patching that effects data dictionary content, collecting dictionary stats should be done after that effort.

13. Does RULE hint prevent dynamic sampling?
Yes, if the RULE hint is used, dynamic sampling will not be used. The reason is because dynamic sampling is used by the cost-based optimizer to gather statistics that are used in the development of the plan. When the rule-based optimizer is used, then these statistics are irrelevant and will not be collected.

14. Does dynamic sampling occur every sql run even within the same session?
Dynamic sampling occurs during the hard parse of any SQL statement that either doesn't have stats on an object used in the SQL or if a dynamic_sampling hint exists or if the optimizer_dynamic_sampling instance parameter setting is high enough to "kick in". But, once a single SQL has been parsed and the plan chosen and loaded into the library cache, it will be used until the plan is aged out.

15. What do you do for stats on a highly changing table?
If the changes to the table cause the number of rows to increase enough or the distribution of values to change enough so that skewed values shifts within columns, then you need to gather stats frequently enough to allow the optimizer to adjust plans based on the updated stats so that you get (hopefully) the best, optimally performing plan. However, if the changes to number of rows and distribution of values doesn't really effect the plans the optimizer should choose, then I wouldn't collect as frequently and instead collect on a regular schedule that suits your needs (daily, weekly, bi-weekly, etc). In the end, the reason to collect stats is that the plans the optimizer is choosing using the current stats aren't adequate. If you need new plans to be derived, then you'd collect new stats. If you don't want plans to change, then don't collect.

16. You have asked dynamic sampling = 4 . Is there particular reason why 4?
Level 4 is the level at which dynamic sampling will consider complex predicates (an AND or OR operator between multiple predicates on the same table).

17. When would you want to lock stats?  Also if you have a transaction table that is cleared out each minute, would you want to prevent stats from being generated on that table?
I'd want to lock stats on a table if
1) the table is static and never or rarely changes
2) if the table is emptied and reloaded frequently but always contains basically the same amount and type of data
3) I don't want changes to stats on this table to cause plan changes
4) I want to delete all stats on a table and prevent any new stats from being collected

If I have a frequently cleared and reloaded table, I'd either want to collect a set of stats at a time when the contents of the table are representative of what is typically queried and then lock them. Or, I'd want to delete the stats completely and lock them so that future stats collections are not allowed but instead remain empty thus allowing dynamic sampling to kick in based on the optimizer_dynamic_sampling parameter setting (which I'd want to set to at least 2 and most likely 4 for that table).


vr1507 said...

Thanks for posting the QA too. Is the recording going to be available? That will be very helpful to me.

Karen said...

As soon as Embarcadero makes it available, I will post it here on the blog. Thanks!