Data Proximity for Data Warehousing Performance

Jan 9, 2019

Data warehousing DBAs usually work with huge tables, often containing hundreds of millions or even billions of rows. Occasionally, these DBAs need to perform maintenance that requires recreating the entire huge table or some of its partitions. For example, the DBA might need to drop an unnecessary column, or simply want to defragment the table or some of its partitions (i.e., reduce the number of allocated extents).
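
For instance, the in-place alternatives look like this (the table, column, and partition names here are hypothetical):

-- Drop an unneeded column in place (this can run for hours on a huge table):
ALTER TABLE ORDERS DROP COLUMN OBSOLETE_FLAG;

-- Rebuild one partition in place to reduce its allocated extents
-- (local indexes on the moved partition become UNUSABLE and need a rebuild):
ALTER TABLE ORDERS MOVE PARTITION ORDERS_2018_Q4
  STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0);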

In either case, data warehousing tables, and even their individual partitions, are often simply too big to unload and reload. For example, exporting and then importing a 700-million-row table may require two to three days of run-time, whereas a “CREATE TABLE AS SELECT” (a.k.a. CTAS) takes just a few hours, assuming sufficient space is available.
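
For context, the conventional unload-and-reload path with Oracle Data Pump looks roughly like this (the credentials, directory, and file names are illustrative):

expdp scott/tiger DIRECTORY=dpump_dir DUMPFILE=orders.dmp LOGFILE=orders_exp.log TABLES=ORDERS
impdp scott/tiger DIRECTORY=dpump_dir DUMPFILE=orders.dmp LOGFILE=orders_imp.log TABLES=ORDERS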

A run-time efficient version of an Oracle database CTAS command to copy a huge table is as follows:

CREATE TABLE COPY_OF_ORDERS
  STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0)  -- large, uniform extents
  PCTFREE 5 PCTUSED 90                            -- pack blocks densely
  NOLOGGING PARALLEL (DEGREE 16)                  -- minimal redo, 16-way parallel
AS
SELECT *
FROM ORDERS;  -- note: "ORDER" itself is an Oracle reserved word, hence the plural name
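
One caveat on the NOLOGGING choice: the copy generates minimal redo, so the new table cannot be rebuilt from archived logs if media recovery is ever needed; a backup taken after the copy completes is prudent.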

Thus, CTAS becomes the data warehousing DBA’s greatest ally: it permits the DBA to copy massive amounts of data easily and quickly. However, CTAS can also be a DBA’s worst enemy if they’re not careful. While the CTAS example above is run-time efficient for the copy operation itself, what about its effect on queries against the copied table?

Our test environment was a 700-million-row table on a 64-bit, 16-processor UNIX box with an EMC RAID-1 disk array. The CTAS command copied the table in less than one hour. However, queries against the copied table ran nearly four times as long as those same queries issued against the original table. What happened?

The problem is that the CTAS has negated the original table’s proximity of data. More simply stated, the CTAS has spread related rows of data more uniformly across many more blocks and files. This can be an elusive problem to find: look again at the CTAS example above. It appears harmless enough, and it’s hard to see that it is the source of the problem.

Most data warehousing tables have their data loaded at regular intervals (e.g. nightly). As a result, related rows of data naturally end up in close proximity, often within the same or adjacent data blocks inside relatively few data files. Queries retrieving those densely packed rows therefore perform a relatively small amount of physical I/O.

The query to test a table’s data proximity for Oracle is as follows:

SELECT
    COUNT(*)       "FILES-BLOCKS",         -- distinct file/block combinations touched
    AVG(row_count) "AVERAGE ROW DENSITY"   -- related rows per block, on average
FROM (
    SELECT COUNT(DISTINCT DBMS_ROWID.ROWID_ROW_NUMBER(rowid)) row_count
    FROM &table_name
    WHERE &where_condition_for_returning_related_rows
    GROUP BY DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) || '-' ||
             DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid)
);
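
For example, to measure how densely packed one customer’s related rows are (ORDERS and CUSTOMER_ID are hypothetical names), the SQL*Plus substitution variables might be set as:

DEFINE table_name = ORDERS
DEFINE where_condition_for_returning_related_rows = 'CUSTOMER_ID = 12345'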

The average row density for our original table was 90. In other words, each data block held approximately 90 related rows. However, the density for our new table was only 24, meaning that each block held only 24 related rows. Hence, the related rows were spread across nearly four times as many data blocks. Since Oracle does its physical I/O at the block level, nearly four times as much physical I/O occurred. That’s why our queries ran four times as long.
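
To check the arithmetic: at 90 rows per block, 700 million rows span roughly 700,000,000 / 90 ≈ 7.8 million blocks; at 24 rows per block they span roughly 700,000,000 / 24 ≈ 29.2 million blocks. The ratio, 29.2 / 7.8 ≈ 3.75, matches the observed four-fold increase in query run-time.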

The corrected CTAS, which keeps the related rows together, is as follows:

CREATE TABLE ORDERED_COPY_OF_ORDERS
  STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0)
  PCTFREE 5 PCTUSED 90
  NOLOGGING PARALLEL (DEGREE 16)
AS
SELECT *
FROM ORDERS
GROUP BY &all_the_column_names_in_the_table;  -- the grouping sort physically clusters related rows

The average row density for this newest table was 104, which was even better than the original table’s density of 90. Note that the CTAS uses a GROUP BY clause because an ORDER BY clause was illegal in a CTAS at the time (more recent Oracle releases do accept an ORDER BY in a CTAS; a sketch follows below). Although the table-copy run-time increased to nearly four hours, that was an acceptable one-time cost versus the 2-3 days of an export/import, and queries against this newest table ran within 5-10 minutes.
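
On a release that does accept it, the clustering can be achieved more directly; here is a sketch, where SORTED_COPY_OF_ORDERS and &columns_that_define_related_rows are illustrative names for the new table and for whatever columns your queries use to identify related rows:

CREATE TABLE SORTED_COPY_OF_ORDERS
  NOLOGGING PARALLEL (DEGREE 16)
AS
SELECT *
FROM ORDERS
ORDER BY &columns_that_define_related_rows;  -- the sort writes related rows into adjacent blocks

Unlike the GROUP BY form, the ORDER BY form also preserves exact duplicate rows, since grouping by every column collapses them.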

As you can see, data warehousing DBAs have some very challenging issues to deal with. And often the problems and their solutions are not readily apparent. But that’s the real fun in working with terabytes of data.