Tuesday, January 5, 2010

2000 columns - take 2

Thanks to everyone who commented on yesterday's post. After Randolf's comment about the 1,000-column limit, I realized that so far I've not attempted to create the full table in one go. Instead, I've been building 20-ish separate tables that contain the needed raw data plus the computed/aggregated columns. This speaks to Noons' comment about having separate tables that gather and group up the data, and that can be loaded in multiple streams.

The SAS guys want to be able to access a single row that represents all the information they need for one type of data (for example, a shipment). So far, as I build the separate tables, I've been using a couple of views that combine several of the tables. The SAS guys then use the view to build a SAS dataset and merge multiple datasets together.
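The views look something like this (a simplified sketch; all the table and column names here are invented):

    -- One view per subject area, stitching a few of the separate
    -- tables together on their shared key for SAS to read
    CREATE OR REPLACE VIEW shipment_v AS
    SELECT s.shipment_id,
           s.ship_date,
           c.carrier_name,
           t.total_weight
    FROM   shipments     s,
           ship_carriers c,
           ship_totals   t
    WHERE  c.shipment_id = s.shipment_id
    AND    t.shipment_id = s.shipment_id;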

Yesterday's post was prompted by the prospect of having to build a single table in Oracle that does what they're currently doing via SAS. And to Joel's point, SAS is certainly not Oracle! One of the original goals was to ultimately have a table built and maintained in Oracle and only accessed by SAS, not multiple objects in Oracle that have to be merged into SAS datasets.

I've been so busy and focused on getting the data formulated/aggregated, and doing so as efficiently as possible, that I hadn't even attempted to build out a single table containing all the columns from all the separate tables. When I started thinking about it yesterday, I got all freaked out about doing it, and that's what prompted my post. I hadn't even thought about the Oracle-imposed limit on the number of columns! I've been told more than once that the team had delivered 1000s of columns before, so I just hadn't given it much thought. But given there is an Oracle limit, I'm guessing they've never delivered a final Oracle table like they want to do this time... so it's either been done in SAS previously, or the Oracle tables didn't exceed the 1,000-column limit.

So, I've got lots to figure out over the next few weeks. This just may be more fun than I can stand! :)

7 comments:

Noons said...

Let us know the outcome, Karen.
It's an interesting design problem, which way to go. And it'll be good to know the final result and why.

Connor McDonald said...

I'm not near a server currently so I cannot check this, but if memory serves, there are some interesting side-effects with lots of columns in terms of block usage.

(I think) you can only have 255 columns per row piece, so a row with more than that will be split into multiple row pieces, either across multiple blocks or maybe as multiple row pieces within a single block.

(I stress this may all be fiction, but it's just something that's been tucked away in the brain since many versions ago...)
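A quick way to test it when you're back at a server might be something like this (table name invented; I'm assuming 300 columns is enough to cross the boundary):

    -- Build a 300-column table dynamically, populate one full row,
    -- then see whether fetching it shows continued-row activity
    DECLARE
      v_sql VARCHAR2(32767) := 'create table wide_test (c1 number';
    BEGIN
      FOR i IN 2 .. 300 LOOP
        v_sql := v_sql || ', c' || i || ' number';
      END LOOP;
      EXECUTE IMMEDIATE v_sql || ')';
      -- populate every column of a single row the same way
      v_sql := 'insert into wide_test values (1';
      FOR i IN 2 .. 300 LOOP
        v_sql := v_sql || ', ' || i;
      END LOOP;
      EXECUTE IMMEDIATE v_sql || ')';
      COMMIT;
    END;
    /

    -- touch the last column, then compare this statistic
    -- before and after to see if extra row pieces were visited
    SELECT c300 FROM wide_test;

    SELECT n.name, s.value
    FROM   v$mystat s, v$statname n
    WHERE  n.statistic# = s.statistic#
    AND    n.name = 'table fetch continued row';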

Cheers
Connor

Taral said...

Karen,

Not sure if this is related, but we are also using SAS with Oracle. What we do is treat the report as an ETL kind of thing and process everything in Oracle, with all the complex logic (now, this is not the SAS part of the logic, it's the data massaging for SAS). We also use partitioned tables and other features. What happened with SAS in our case was that it was taking too much time processing and going to and fro over the network. So, once the data is ready, from SAS we just issue "select * from table" (or can use a partition too). That way there's no such blocking and you will get the result fast.
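E.g. (table and partition names invented):

    -- SAS reads the finished table, or just one partition of it
    SELECT * FROM final_report;
    SELECT * FROM final_report PARTITION (p_2010_01);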

Anonymous said...

Karen,

You may not be able to create a table or view with more than 1,000 columns, but you are allowed to write a query that selects far more columns. (I know that you can exceed 2,000 columns in the select list in 10.2.0.3, but I don't remember what the limit is.)

So you can set up a (small) set of tables with one-to-one links, and some canned SQL at the SAS layer to join them. You may need to construct ordered data sets and force NL joins across the board to get this to work as efficiently as possible.
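For example (table names and hints purely illustrative):

    -- Two sub-tables sharing the same PK; the combined select list can
    -- exceed the 1,000-column table limit. The hints force an ordered
    -- nested-loop join, and the ORDER BY gives SAS a sorted data set.
    SELECT /*+ leading(p1) use_nl(p2) */
           p1.*, p2.*
    FROM   ship_part1 p1,
           ship_part2 p2
    WHERE  p2.shipment_id = p1.shipment_id
    ORDER  BY p1.shipment_id;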

If SAS is running on a different server (and that's what I've usually seen) then you'll need to configure SQL*Net to use the largest possible SDU (32KB), and may need to check the tcp transmit and receive buffers (I can't remember their sqlnet.ora names) to ensure that they are at least as big as the SDU.
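From memory the relevant sqlnet.ora entries look something like this (treat the names and values as a sketch and check the docs; they go on both the client and the server side):

    # maximum session data unit -- 32KB
    DEFAULT_SDU_SIZE = 32767
    # TCP buffers -- at least as large as the SDU
    RECV_BUF_SIZE = 65536
    SEND_BUF_SIZE = 65536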

Karen said...

Jonathan, I'd pretty much decided to do exactly what you propose. The SAS guys briefly freaked out when they thought they'd have to merge multiple unsorted datasets on the SAS side (apparently doing that in SAS is a loathsome process), but when I assured them I'd give them what they want without any real effort on their part, they calmed down and agreed that sounded like the best approach.

I appreciate the note on SDU size and the tcp transmit and receive buffers as I'm sure I'd have hit a snag there and spent hours figuring that out. Thanks!

Taral, like you mention, the bulk of the processing/aggregating of the data is being done in Oracle, but due to the massive number of columns I can't provide a single source table that includes everything for the SAS folks to "select *" from. As I mentioned above in my comment back to Jonathan, I'll be using multiple smaller source tables and creating queries that link them, which the SAS guys will use to produce the final datasets they need.

Connor, Just for my own edification I plan to do some testing with the 255 column boundary and see what happens. That research will likely come after I've gotten all this other stuff working, but I've got it on my "to do" list and will post results when I finally get it done.

Thanks again everyone!

Unknown said...

Seems like you could use multiple tables with the same PK, populated with a multi-table insert, then place a view on top.
What is the approach you used?
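By multi-table insert I mean something like this (all names invented):

    -- one pass over the staging data; each narrower table
    -- gets the shared PK plus its own slice of the columns
    INSERT ALL
      INTO ship_part1 (shipment_id, col_a, col_b) VALUES (shipment_id, col_a, col_b)
      INTO ship_part2 (shipment_id, col_c, col_d) VALUES (shipment_id, col_c, col_d)
    SELECT shipment_id, col_a, col_b, col_c, col_d
    FROM   shipment_staging;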

DukeGanote said...

I did something similar for a reporting issue: we needed high throughput for a report sent monthly to certain business partners, and had to ramp up from a few hundred reports to thousands.

Basically, I designed a table to provide exactly what was needed for the end-report. No computation by the reporting tool (Cognos): it does a "3 P" process:
1. Pluck the values from the table
2. Put them in the report cell
3. Push out the PDF report.

http://it.toolbox.com/blogs/data-ruminations/the-absolutely-stunningly-beautiful-table-25899

One team member hates the ETL because there's a lot of repetition in the SQL: summarizing the data by month-to-date, year-to-date, rolling-12-month, etc.
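The repetitive bit looks roughly like this (names invented), with one CASE-gated aggregate per time window:

    -- one pass over the detail; each window gated by its own CASE
    SELECT partner_id,
           SUM(CASE WHEN txn_date >= TRUNC(SYSDATE, 'MM')   THEN amount END) AS mtd_amount,
           SUM(CASE WHEN txn_date >= TRUNC(SYSDATE, 'YYYY') THEN amount END) AS ytd_amount,
           SUM(CASE WHEN txn_date >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -12) THEN amount END) AS rolling_12_amount
    FROM   partner_transactions
    GROUP  BY partner_id;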

Our table "only" has 105 columns, compared to yours, but I think the principle is the same.

And it screams :)

http://it.toolbox.com/blogs/data-ruminations/tuning-for-fun-28641