Parallelization of processes in data warehouses

Processing large amounts of data is typical of data warehouse environments. Depending on the available hardware resources, sooner or later a point is reached
where a job can no longer be processed on a single processor or can no longer be run as a single process. The reasons for this are:

  • Time requirements demand the use of multiple processors.
  • System resources (memory, disk space, temporary tablespace, rollback segments, ...) are limited.
  • Recurrent errors require the process to be repeated.
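
All three reasons point toward splitting one large job into smaller units of work. The following is a minimal Python sketch of that idea, assuming the job can be partitioned by a numeric key range. The names split_range, process_chunk and run_job are hypothetical, the "work" is a placeholder sum, and a thread pool merely stands in for whatever worker processes a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(lo, hi, n_chunks):
    """Split the half-open key range [lo, hi) into at most n_chunks contiguous pieces."""
    size, rest = divmod(hi - lo, n_chunks)
    chunks, start = [], lo
    for i in range(n_chunks):
        # The first `rest` chunks get one extra key so the sizes differ by at most 1.
        end = start + size + (1 if i < rest else 0)
        if end > start:
            chunks.append((start, end))
        start = end
    return chunks


def process_chunk(chunk):
    """Placeholder for the real work (assumption): sum the keys in the chunk."""
    start, end = chunk
    return sum(range(start, end))


def run_job(lo, hi, n_chunks, workers, max_attempts=3):
    """Process all chunks concurrently; a failed chunk is retried on its own."""
    def with_retry(chunk):
        for attempt in range(1, max_attempts + 1):
            try:
                return process_chunk(chunk)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this chunk after the last attempt

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(with_retry, split_range(lo, hi, n_chunks)))


# Each chunk needs only a bounded share of resources and can be repeated alone.
total = sum(run_job(0, 1000, n_chunks=8, workers=4))
```

Because each chunk is an independent unit, a recurring error only forces the repetition of the affected chunk, and the resources needed at any one time are bounded by the chunk size and the number of workers.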

Parallelization by RDBMS parallel processing

Modern database systems are capable of parallel query processing. Queries, and sometimes also changes to large amounts of data, can be parallelized within the database server and use multiple processors concurrently. The advantages of this solution are:

  • Little or no development effort is needed
  • This kind of parallelization produces only a small overhead

Disadvantages are:

  • Control over the degree of parallelization is very limited
  • Changing the number of processes executed in parallel at runtime is generally impossible
  • In case of an error, all work done so far is lost
  • The required database system resources (temporary tablespace, rollback segments, ...) must be dimensioned appropriately for the entire operation
  • Resource usage is often not deterministic, which can lead to problems in systems with a strong need for resource control
  • The influence of the parallelization on the rest of the system is very unpredictable and cannot be planned
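
The all-or-nothing behavior behind the third disadvantage can be illustrated with a small Python sketch. It uses an in-memory SQLite database purely as a stand-in (SQLite does not parallelize queries, and the table name target is hypothetical); the point shown is only that an operation running as one transaction loses all of its work when a single row fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val INTEGER NOT NULL)")

# 999 valid rows followed by one row that violates the NOT NULL constraint.
rows = [(i, i * 2) for i in range(999)] + [(999, None)]

try:
    with conn:  # the whole load runs as a single transaction
        conn.executemany("INSERT INTO target (id, val) VALUES (?, ?)", rows)
except sqlite3.IntegrityError:
    pass  # leaving the `with` block on an error rolled the transaction back

# None of the 999 valid rows survived the failure of the last one.
remaining = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

After the failure the table is empty again, so the entire load has to be repeated, and the undo information for all rows had to be kept until the very end.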

RDBMS parallel processing is therefore mainly suited to accelerating operations by using multiple processors. If system resources aren’t abundantly present the
