Monday, January 4, 2010

2000 columns

How many columns does the largest table you've ever worked with contain? The current project I'm working on has one table with almost 2000 columns (and more are likely to be added!). This is the most highly denormalized design I've ever encountered, and there's something about it that makes the performance optimizer in me cringe. But the statisticians who have to munch and crunch this data in SAS tell me this format best suits their needs (based on similar designs used successfully in previous projects).

I think I'm really more concerned about the work that has to be done to populate these columns, since most of them contain aggregations or formulations of one sort or another. So perhaps it's not so much the number of columns that's niggling at me as everything that must occur to produce the values contained in a single row (it's a lot).
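To give a sense of the kind of work involved, here's a minimal sketch with hypothetical table and column names (the real table repeats this pattern hundreds of times over, with far more involved formulations):

    -- Hypothetical names; each wide-table column is some aggregation or
    -- derivation computed from detail rows.
    INSERT INTO analysis_wide
      (subject_id, visit_cnt, total_cost, avg_cost_2009, max_lab_value)
    SELECT s.subject_id,
           COUNT(*)                                           AS visit_cnt,
           SUM(s.cost)                                        AS total_cost,
           AVG(CASE WHEN s.visit_year = 2009 THEN s.cost END) AS avg_cost_2009,
           MAX(s.lab_value)                                   AS max_lab_value
      FROM source_detail s
     GROUP BY s.subject_id;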

What's your experience? Did the number of columns help, hinder or make no difference in the design and performance of the application that used such wide tables? This phase of the project is a proof of concept so it'll be a while before I get hard data on how well (or not) this design really works, but it has certainly made me curious about what other folks have experienced.

Let me hear from you if you have any comments or insights!

7 comments:

Kerry Osborne said...

wow

What's the average row length, and does a significant percentage of the rows span more than one block? I'm guessing you're not using an 8K block size.

Do most of the queries return a large number of the columns? I have seen a system where a table was split into a set of often-queried columns and a set of rarely-queried columns. So to get the whole row you had to do a join, and the ER diagram had a one-to-one relationship. Sort of a do-it-yourself column-oriented database, back in the day.
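A rough sketch of that kind of vertical split, with hypothetical names (frequently used columns in one table, rarely used ones in another, joined one-to-one on the key):

    -- Hypothetical names: the "hot" columns stay in a narrow table.
    CREATE TABLE orders_hot (
      order_id   NUMBER PRIMARY KEY,
      status     VARCHAR2(10),
      order_date DATE
    );

    -- Rarely queried columns live in a second table, one row per order.
    CREATE TABLE orders_cold (
      order_id NUMBER PRIMARY KEY REFERENCES orders_hot (order_id),
      notes    VARCHAR2(4000),
      raw_xml  CLOB
    );

    -- Getting the "whole row" requires the one-to-one join described above.
    SELECT h.order_id, h.status, h.order_date, c.notes
      FROM orders_hot h
      JOIN orders_cold c ON c.order_id = h.order_id;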

Kerry

SydOracle said...

I once had a reporting table that had a column for each hour for the past five or six weeks. I think we pegged it back to 1000 columns.

It was basically used for comparing a figure for one time with the one from the previous hours or the same hour across the previous week/month.

It had a relatively small number of rows (hundreds) and got thrown away and rebuilt every night. It was never a problem. Just odd.

The front-end/UI picked up the entire record and decided which columns to work with, so the database wasn't buried with different queries for different column sets.

Noons said...

Let's look at it this way:

- is the CPU work needed to populate those 2000 columns any less if you normalize the design and spread them across more than one entity (table)?

By this I mean: you still need to calculate all those 2000 columns, don't you? Regardless of whether they're in one table or not.

As such, if there is a lot of CPU mileage needed to create the values to populate those columns with, spreading them across more than one table achieves nothing.

By itself.

Now, the reverse of the coin:

- if, to populate those values, you need to access a lot of other data in addition to doing the calculations, then it might make sense to group the calculation data with its result columns into separate tables, so that you can have multiple streams of computation going at the same time without an I/O impact as well.

Then there is the third factor:

- the 2000 columns to be fed to SAS: do they all go in a row-by-row operation? Or is it a case of one column in row 1, another column in row 20882938, the next column in row 232, etc., etc. You get my drift.


All of these would have to be weighed into a final solution, and of course it's all contingent on the real needs of the processing being carried out.

As such, rather than apply a general rule of "2000 columns? Oh noes!", I'd favour analyzing the whole thing first and seeing where the real application bottleneck is, potential or actual.
Then, and only then, can a final custom and optimal design be arrived at.

Randolf said...

Not really addressing your question, but is this an Oracle table?

I thought there was a limit of 1,000 columns per table in Oracle. Has this been lifted in recent releases?

Having more than 255 columns means that a single row will be stored in multiple row pieces even if it fits within a single block; furthermore, compression (at least the compression available up to 10g) is not possible with such a table.
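One way to look for that effect, as a sketch with a hypothetical table name (note that CHAIN_CNT is populated by ANALYZE, not by DBMS_STATS):

    -- Hypothetical table name; ANALYZE fills CHAIN_CNT in USER_TABLES.
    ANALYZE TABLE analysis_wide COMPUTE STATISTICS;

    SELECT table_name, num_rows, avg_row_len, chain_cnt
      FROM user_tables
     WHERE table_name = 'ANALYSIS_WIDE';

    -- Fetches that had to follow a row piece into another block show up in
    -- this session statistic while you query the trailing columns:
    SELECT n.name, m.value
      FROM v$mystat m
      JOIN v$statname n ON n.statistic# = m.statistic#
     WHERE n.name = 'table fetch continued row';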

Randolf

Joel Garry said...

Randolf, SAS is not Oracle. There are a couple of ways to use Oracle with SAS, one of which allows pass-through of SQL statements, to do the analytic processing in Oracle rather than in the more feature-rich SAS. Of course, anything you do in Oracle is going to be limited by Oracle, causing people like Karen to be mortified when she sees what they do :-)

I was talking to a statistician on a train...

Think about what denormalization is (when it is done right, from "our" viewpoint): pre-calculating aggregates out of a normalized design in order to move the performance hit from the selection/sorting to a time when it is not critical.
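A minimal way to express that trade-off in Oracle terms, with hypothetical tables (one possibility, not necessarily what the designers did), is a materialized view that pays the aggregation cost at refresh time instead of at query time:

    -- Hypothetical normalized source table SALES; the GROUP BY cost is paid
    -- when the MV is refreshed, not when the report runs.
    CREATE MATERIALIZED VIEW sales_by_region_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT region_id,
           COUNT(*)    AS order_cnt,
           SUM(amount) AS total_amount
      FROM sales
     GROUP BY region_id;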

Is that what the designers are doing in your case, Karen? Keep in mind, this may be implicit in their previous successful projects, whether or not they realized it. And it may or may not implicitly propagate to what you are seeing now. And it may be sensitive to how the SAS datasets are physically implemented, not migrating nicely to Oracle (if that is what you are doing).

Of course, I'm on the internet, so may have no knowledge whatsoever of what I may speak of. In the distant past, I could make large datasets process much faster by taking them out of the db, but these days...


robert said...

Gary, I don't know the details, but since you talked about a few hundred rows: couldn't you pull that off with analytics?
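A minimal sketch of that idea, assuming a hypothetical table with one row per meter per hour (rather than one column per hour), using analytic functions for the hour-over-hour and week-over-week comparisons:

    -- Hypothetical HOURLY_READINGS table; LAG(..., 168) only lines up with
    -- "same hour last week" if there is exactly one row for every hour.
    SELECT meter_id,
           sample_hour,
           reading,
           LAG(reading, 1)   OVER (PARTITION BY meter_id
                                   ORDER BY sample_hour) AS prev_hour,
           LAG(reading, 168) OVER (PARTITION BY meter_id
                                   ORDER BY sample_hour) AS same_hour_last_week
      FROM hourly_readings
     ORDER BY meter_id, sample_hour;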

pd said...

Even in 11g there is a 1,000-column limit, which I hope will be raised.