Tuesday, January 15, 2008

Oracle and RAID Technologies Best Practices

RAID Technologies

In 1987, Patterson, Gibson, and Katz at the University of California, Berkeley published a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)". The paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array that yields performance exceeding that of a Single Large Expensive Drive (SLED), while appearing to the computer as a single logical storage unit or drive.

The Mean Time Between Failures (MTBF) of the array is roughly equal to the MTBF of an individual drive divided by the number of drives in the array. Because of this, the MTBF of an unprotected array would be too low for many application requirements. Disk arrays can, however, be made fault-tolerant by redundantly storing information in various ways.

RAID is an acronym for Redundant Array of Independent (originally Inexpensive) Disks. A RAID system consists of an enclosure containing a number of disks, connected to each other and to one or more computers by a fast interconnect and presented as one or more volumes. Six basic levels of RAID are commonly defined: RAID-0 simply stripes data over several disks, and RAID-1 is a mirrored set of two or more disks. The only other widely used level is RAID-5. Other RAID levels exist, but they tend to be vendor-specific, and there is no generally accepted standard for the features they include.

Third-party vendors supply RAID systems for most of the popular UNIX platforms and for Windows NT, and hardware vendors often provide their own RAID options.

Conceptually, RAID is the use of two or more physical disks to create one logical disk, where the physical disks operate in tandem to provide greater size and more bandwidth. RAID has become an indispensable part of the I/O fabric of any system today and is the foundation for storage technologies supported by many mass-storage vendors. The use of RAID technology has redefined the design methods used for building storage systems that support Oracle databases.

The 3 main concepts in RAID

When you talk about RAID, three terms are important and relevant:


• Striping

• Mirroring

• Parity



1- What Is Striping?

Striping is the process of breaking data down into pieces and distributing them across the multiple disks that support a logical volume – "Divide, Conquer & Rule". This usually results in a logical volume that is larger and has greater I/O bandwidth than a single disk. It is based purely on the linear power of incrementally adding disks to a volume to increase the size and I/O bandwidth of the logical volume. The increase in bandwidth is a result of how read/write operations are performed on a striped volume.

Imagine that you are in a grocery store, along with about two hundred of your closest friends and neighbors, all shopping for the week's groceries. Now consider what it is like when you get to the checkout area and find that only one checkout line is open. That poor clerk can only deal with a limited number of customers per hour, and the line grows progressively longer. The same is true of your I/O subsystem: a given disk can process a specific number of I/O operations per second, and anything beyond that starts to queue up. Now think about how it feels when you get to the front of the store and find that all 20 lines are open: you find your way to the shortest line and you're headed out the door in no time.

Striping has a similar effect on your I/O subsystem. By creating a single volume from pieces of data on several disks, we increase the capacity to handle I/O requests in a linear fashion, by combining each disk's I/O bandwidth. When multiple I/O requests for a file on a striped volume are processed, they can be serviced by multiple drives in the volume, because the requests are subdivided across several disks. In this way all drives in the striped volume can engage and service multiple I/O requests more efficiently. This "cohesive and independent" functioning of all the drives in a logical volume applies to both read and write operations. Note that striping by itself does not reduce the response time of an individual I/O request; it does, however, provide predictable response times and better overall throughput by balancing I/O requests across multiple drives in the striped volume.
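As a rough illustration of how a striped volume maps logical addresses onto its members, the short Python sketch below shows consecutive chunks rotating across the drives. The 4-disk volume and 64 KB stripe unit are hypothetical numbers chosen only for the example, not recommendations.

    # Minimal sketch: map a logical byte offset on a striped volume to
    # (disk index, offset within that disk). Figures are illustrative only.
    STRIPE_UNIT = 64 * 1024      # bytes per chunk on one disk (assumed)
    NUM_DISKS   = 4              # width of the striped volume (assumed)

    def locate(logical_offset):
        chunk       = logical_offset // STRIPE_UNIT   # which chunk of the file
        disk        = chunk % NUM_DISKS               # chunks rotate across disks
        stripe_row  = chunk // NUM_DISKS              # how far down each disk
        disk_offset = stripe_row * STRIPE_UNIT + (logical_offset % STRIPE_UNIT)
        return disk, disk_offset

    # A 256 KB request touches all four disks, so all four can work in parallel.
    for off in range(0, 256 * 1024, STRIPE_UNIT):
        print(off, locate(off))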


The figure below depicts a 4-way striped volume (v1) with 4 disks (1-4). A given stripe of data (Data1) in a file on v1 is split/striped across the 4 disks into 4 pieces (Data11-Data14).

Disk1    Disk2    Disk3    Disk4
Data11   Data12   Data13   Data14
Data21   Data22   Data23   Data24



2- What Is Mirroring?

Mirroring is the process of writing the same data simultaneously to another "member" of the same volume. Mirroring provides protection for data by writing exactly the same information to every member of the volume. Additionally, mirroring can enhance read performance, because a read request can be serviced from either member of the volume. If you have ever made a photocopy of a document before mailing the original, you have mirrored data. One common myth about mirroring is that writes take "twice as long"; in many performance measurements and benchmarks, the overhead has been observed to be around 15-20%.
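A toy Python sketch of the two behaviors just described, with invented names (members, pending_io): every write goes to all mirror members, while a read can be satisfied by whichever member is currently least busy. It is illustrative only, not how any real volume manager is implemented.

    # Illustrative only: a 2-way mirror as two Python dicts.
    members    = [dict(), dict()]      # block -> data, one dict per mirror member
    pending_io = [0, 0]                # crude "busy" counter per member

    def mirrored_write(block, data):
        for m in members:              # the same data goes to every member
            m[block] = data

    def mirrored_read(block):
        disk = pending_io.index(min(pending_io))   # pick the least busy member
        pending_io[disk] += 1
        try:
            return members[disk][block]
        finally:
            pending_io[disk] -= 1

    mirrored_write(7, b"redo data")
    print(mirrored_read(7))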
The figure below illustrates a 4-way striped, mirrored volume (v1) with 8 disks (1-8). A given stripe of data (Data1) in a file on v1 is split/striped across disks (1-4) and then mirrored across disks (5-8). Disks (1-4) and (5-8) are called "mirror members" of the volume v1.

Disk1    Disk2    Disk3    Disk4    Disk5    Disk6    Disk7    Disk8
Data11   Data12   Data13   Data14   Data11   Data12   Data13   Data14
Data21   Data22   Data23   Data24   Data21   Data22   Data23   Data24



3- What Is Parity?

Parity is the term for error checking. Some levels of RAID perform calculations when reading and writing data; the calculations are primarily done on write operations. However, if one or more disks in a volume are unavailable, then, depending on the RAID level, even read operations require parity calculations to rebuild the pieces on the failed disks. Parity is used to validate each stripe written to a striped volume. Parity is implemented on those RAID levels that do not use mirroring.

Parity algorithms include Error Correction Code (ECC) capabilities, which calculate parity for a given 'stripe or chunk' of data within a RAID volume. The size of a chunk is operating-system (OS) and hardware specific. The codes generated by the parity algorithm are used to recreate data in the event of disk failure(s): because the algorithm can reverse the parity calculation, it can rebuild data lost as a result of disk failures. It is just like solving a math problem where you know the answer (checksum) and one part of the question, e.g. 2 + X = 5, what is X? Of course, X = 3.
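The "2 + X = 5" idea can be shown with the XOR-style parity that many RAID implementations use. The snippet below is a simplified sketch (a single byte-wise XOR over four equal-size data chunks; real controllers work on whole stripe units), not any vendor's actual algorithm.

    # Simplified parity sketch: parity = XOR of all data chunks.
    # If any single chunk is lost, it can be rebuilt from the survivors + parity.
    def xor_chunks(chunks):
        out = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                out[i] ^= b
        return bytes(out)

    data   = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # Data11..Data14
    parity = xor_chunks(data)                       # written to the parity disk

    # Disk 3 fails: rebuild its chunk from the remaining chunks plus parity.
    survivors = [data[0], data[1], data[3], parity]
    rebuilt   = xor_chunks(survivors)
    assert rebuilt == data[2]
    print(rebuilt)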

The figure below depicts a 4-way striped RAID 3 volume with parity (v1) with 5 disks (1-5). A given stripe of data (Data1) in a file on v1 is split/striped across disks (1-4), and the parity for Data1 is stored on disk 5. Other RAID levels store parity differently; those are covered in the following sections.

Disk1    Disk2    Disk3    Disk4    Disk5
Data11   Data12   Data13   Data14   Parity1
Data21   Data22   Data23   Data24   Parity2

Possible approaches to RAID

· Hardware RAID

The hardware-based system manages the RAID subsystem independently from the host and presents to the host only a single disk per RAID array. This way the host does not have to be aware of the RAID subsystem(s).

o The controller-based hardware solution. DPT's SCSI controllers are a good example of a controller-based RAID solution. The intelligent controller manages the RAID subsystem independently from the host. The advantage over an external SCSI-to-SCSI RAID subsystem is that the controller is able to span the RAID subsystem over multiple SCSI channels, thereby removing the limiting factor of external RAID solutions: the transfer rate over a single SCSI bus.

o The external hardware solution (SCSI-to-SCSI RAID). An external RAID box moves all RAID-handling "intelligence" into a controller that sits in the external disk subsystem. The whole subsystem is connected to the host via a normal SCSI controller and appears to the host as a single disk or as multiple disks. This solution has drawbacks compared to the controller-based solution: the single SCSI channel used creates a bottleneck. Newer technologies such as Fibre Channel can ease this problem, especially if they allow multiple channels to be trunked into a Storage Area Network. Four SCSI drives can already completely flood a parallel SCSI bus, since the average transfer size is around 4 KB and the command transfer overhead (which even in Ultra SCSI is still done asynchronously) takes most of the bus time.

· Software RAID

o The MD driver in the Linux kernel is an example of a RAID solution that is completely hardware-independent. The Linux MD driver currently supports RAID levels 0, 1, 4, and 5, plus linear mode.

o Under Solaris you have Solstice DiskSuite and Veritas Volume Manager, which offer RAID-0, 1, and 5.

o Adaptec's AAA-RAID controllers are another example: they have no RAID functionality whatsoever on the controller and depend on external drivers to provide all RAID functionality. They are basically multiple AHA2940 controllers integrated on one card; Linux detects them as AHA2940s and treats them accordingly.
Every OS needs its own special driver for this type of RAID solution, which is error-prone and not very portable.

· Hardware vs. Software RAID


Just like any other application, software-based arrays occupy host system memory, consume CPU cycles and are operating system dependent. By contending with other applications that are running concurrently for host CPU cycles and memory, software-based arrays degrade overall server performance. Also, unlike hardware-based arrays, the performance of a software-based array is directly dependent on server CPU performance and load.

Except for the array functionality, hardware-based RAID schemes have very little in common with software-based implementations. Since the host CPU can execute user applications while the array adapter's processor simultaneously executes the array functions, the result is true hardware multi-tasking. Hardware arrays also do not occupy any host system memory, nor are they operating system dependent.

Hardware arrays are also highly fault tolerant. Since the array logic is based in hardware, software is NOT required to boot. Some software arrays, however, will fail to boot if the boot drive in the array fails. For example, an array implemented in software can only be functional when the array software has been read from the disks and is memory-resident. What happens if the server can't load the array software because the disk that contains the fault tolerant software has failed? Software-based implementations commonly require a separate boot drive, which is NOT included in the array.

Types of RAID available

RAID-0:

RAID-0 offers pure disk striping. The striping allows a large file to be spread across multiple disks/controllers, providing concurrent access to data because all the controllers are working in parallel. It does not provide either data redundancy or parity protection. In fact, RAID-0 is the only RAID level focusing solely on performance. Some vendors, such as EMC, do not consider level 0 as true RAID and do not offer solutions based on it. Pure RAID-0 significantly lowers MTBF, since it is highly prone to downtime. If any disk in the array (across which Oracle files are striped) fails, the database goes down.

In RAID 0, the data transfer from the host system is split up by the array controller, and spread across the disk drives into chunks or 'stripes'. The size of the stripe is defined by the stripe size.

RAID 0 provides a high I/O rate for small data writes. For example, with an array of 4 disks and a stripe size of 1K, if the host system needs to write 1K worth of data, the controller can direct it to a single disk (based on some algorithm). If, however, the host system needs to write 4K of data, the controller must wait for all 4 disks to become available.

Note: RAID 0 does not provide any data protection.

RAID0 has the following attributes:


- Better read performance

- Better write performance

- Inexpensive

- Not fault-tolerant

- Storage equivalent to sum of physical drive storage in the array

- Readily available from most vendors

RAID-1:

With RAID-1, all data is written onto two independent disks (a "disk pair") for complete data protection and redundancy. RAID-1 is also referred to as disk mirroring or disk shadowing. Data is written simultaneously to both disks to ensure that writes are almost as fast as to a single disk. During reads, the disk that is the least busy is utilized. RAID-1 is the most secure and reliable of all levels due to full 100-percent redundancy. However, the main disadvantage from a performance perspective is that every write has to be duplicated. Nevertheless, read performance is enhanced, as the read can come from either disk. RAID-1 demands a significant monetary investment to duplicate each disk; however, it provides a very high Mean time between failures (MTBF). Combining RAID levels 0 and 1 (RAID-0+1) allows data to be striped across an array, in addition to mirroring each disk in the array.

RAID 1 is also referred to as 'disk mirroring'. Data on one disk is duplicated onto another disk. Should one disk fail, the controller directs all I/O to the surviving half of the mirror.

All writes go to two disks, so the total disk space must be twice as large as the capacity made available to the host for user data. Reads can be serviced by either disk.

RAID 1 provides high data availability, at the cost of having to duplicate the number of disk drives.

RAID1 has the following attributes:


- Better read performance

- Similar write performance

- Expensive

- Fault-tolerant

- Storage equivalent to 1/2 the sum of the physical drive storage in the mirrored set.

- Readily available from most vendors

RAID-0 & RAID-1:

If RAID-0 (striping) is combined with RAID-1 (mirroring), the configuration provides resilience, but at the cost of doubling the number of disk drives in the configuration. Some RAID-1 software implementations offer a further benefit: requested data is always returned from the least busy device.

This can account for a further increase in read performance of over 85% compared to the striped, non-mirrored configuration.

Write performance, on the other hand, has to go to both halves of the mirror. If the second mirror piece is on a second controller (as would normally be recommended for controller resilience), this degradation can be as low as 4 percent.

RAID10 has the following attributes:


- Better read performance

- Better write performance

- Expensive

- Fault-tolerant

- Storage is 1/2 of the sum of the physical drives' storage

- Currently available from only a few vendors (at the time of this writing)

RAID-3:

In a RAID 3 configuration, a single drive is dedicated to storing error-correction (parity) data, and information is striped across the remaining drives. RAID-3 dramatically reduces the level of concurrency that the disk subsystem can support (I/Os per second) compared to a software-mirrored solution. The worst case for a system using RAID-3 is an OLTP environment, where the number of rapid transactions is high and response time is critical.

To put it simply, if the environment is mainly read-only (e.g. decision support), RAID-3 provides disk redundancy with slightly improved read performance, but at the cost of write performance. Unfortunately, even decision-support databases still do a significant amount of disk writing, since complex joins, unique searches, and so on still do temporary work, and that involves disk writes.

RAID-5:

Instead of total disk mirroring, RAID-5 computes and writes parity for every write operation. The parity avoids the cost of fully duplicating the disk drives as in RAID-1. If a disk fails, parity is used to reconstruct data without system loss. Both data and parity are spread across all the disks in the array, thus reducing disk bottleneck problems. Read performance is improved, but every write has to incur the additional overhead of reading old data and old parity, computing new parity, writing new parity, and then writing the actual data, with the last two operations happening while two disk drives are simultaneously locked. This overhead is notorious as the RAID-5 write penalty, and it can make writes significantly slower. Also, if a disk fails in a RAID-5 configuration, the I/O penalty incurred during the disk rebuild is extremely high. Read-intensive applications (DSS, data warehousing) can use RAID-5 without major real-time performance degradation (the write penalty would still be incurred during batch load operations in DSS applications). In terms of storage, however, parity constitutes only about a 20-percent overhead (in a five-disk group), compared to the 100-percent overhead in RAID-1 and 0+1. Initially, when RAID-5 technology was introduced, it was labeled as the cost-effective panacea for combining high availability and performance. Gradually, users realized the truth, and until a couple of years ago RAID-5 was regarded as the villain in most OLTP shops. Many sites contemplated getting rid of RAID-5 and started looking at alternative solutions. RAID 0+1 gained prominence as the best OLTP solution for people who could afford it. Over the last two years, RAID-5 has been making a comeback, either as hardware-based RAID-5 or as enhanced RAID-7 or RAID-S implementations. Nevertheless, RAID-5 evokes bad memories for too many OLTP database architects.
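The read-modify-write sequence behind the write penalty can simply be counted. The sketch below is illustrative Python only, assuming the small-write path where a single data chunk in a stripe changes; it tallies the four physical I/Os behind one logical RAID-5 write and contrasts them with a mirrored write.

    # RAID-5 small-write penalty, counted as physical I/Os (illustrative).
    def raid5_small_write():
        ios = []
        ios.append("read old data chunk")      # 1
        ios.append("read old parity chunk")    # 2
        # new_parity = old_parity XOR old_data XOR new_data (computed in memory)
        ios.append("write new parity chunk")   # 3
        ios.append("write new data chunk")     # 4
        return ios

    def raid1_small_write():
        return ["write to member 1", "write to member 2"]

    print(len(raid5_small_write()), "physical I/Os per RAID-5 small write")
    print(len(raid1_small_write()), "physical I/Os per RAID-1 write")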

The main feature of RAID-5 is prevention of data loss. If a disk is lost because of a head crash, for example, the contents of that disk can be reconstituted using the information stored on the other disks in the array. In RAID-5, redundancy is provided by error-correcting codes (ECCs), with parity information (used to check data integrity) stored alongside the data and striped across several physical disks. (The intervening RAID levels between 1 and 5 work in a similar way, but with differences in how the ECCs are stored.)

Depending on the application, performance may be better or worse. The basic principle of RAID-5 is that files are not stored on a single disk but are divided into sections, which are stored on a number of different disk drives. This means that the effective number of disk spindles is increased, which makes reads faster. However, the involvement of more disks and the more complex nature of a write operation mean that writes will be slower. So applications where the majority of transactions are reads are likely to give better response times, whereas write-intensive applications may show worse performance.

Only hardware-based striping should be used on Windows NT. Software striping, from Disk Administrator, gives very poor performance.

RAID-1 is indicated for systems where complete redundancy of data is considered essential and disk space is not an issue. RAID-1 may not be practical if disk space is not plentiful. On a system where uptime must be maximized, Oracle recommends mirroring at least the control files, and preferably the redo log files. RAID-5 is indicated in situations where avoiding downtime due to disk problems is important or when better read performance is needed and mirroring is not in use.

RAID5 has the following attributes:

- Data is striped across multiple physical disks, and parity data is striped across storage equivalent to one disk.

- Better read performance

- Poorer write performance

- Inexpensive

- Fault-tolerant

- Usable storage equivalent to (N - 1) times the capacity of a single drive, where N is the number of physical drives in the array (see the capacity sketch after this list)

- Readily available from most vendors
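To put concrete numbers on the capacity figures listed for RAID 0, 1, 10, and 5 above, the small sketch below computes usable space from drive count and drive size. The 6 drives of 146 GB each are hypothetical values chosen only for illustration.

    # Usable capacity per RAID level (illustrative arithmetic only).
    drives   = 6            # number of physical drives (assumed)
    drive_gb = 146          # capacity of each drive in GB (assumed)

    raid0  = drives * drive_gb              # no redundancy
    raid1  = drives * drive_gb / 2          # mirrored pairs
    raid10 = drives * drive_gb / 2          # striped mirrors
    raid5  = (drives - 1) * drive_gb        # one drive's worth holds parity

    for name, gb in [("RAID 0", raid0), ("RAID 1", raid1),
                     ("RAID 10", raid10), ("RAID 5", raid5)]:
        print(f"{name:8s} usable ~ {gb:6.0f} GB of {drives * drive_gb} GB raw")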

RAID-S:

RAID S is EMC's implementation of RAID-5. However, it differs from pure RAID-5 in two main aspects:

(1) It stripes the parity, but it does not stripe the data.

(2) It incorporates an asynchronous hardware environment with a write cache.

This cache is primarily a mechanism to defer writes, so that the overhead of calculating and writing parity information can be handled by the system while it is relatively less busy (and less likely to exasperate the user!). Many users of RAID-S assume that, since RAID-S is supposedly an enhanced version of RAID-5, data striping is automatic, and then wonder why they are experiencing I/O bottlenecks in spite of all that presumed striping. It is vital to remember that in RAID-S, striping of data is not automatic and has to be done manually via third-party disk-management software.

RAID-7:

RAID-7 also implements a cache, controlled by a sophisticated built-in real-time operating system. Here, however, data is striped and parity is not. Instead, parity is held on one or more dedicated drives. RAID-7 is a patented architecture of Storage Computer Corporation.

The Structure of HP-Auto-RAID

As discussed above, RAID 1 provides good write performance but requires twice the disk space. RAID 5 addresses the disk-space issue (i.e. lower cost) but penalizes small writes.

The goal of Auto-RAID is 'to provide RAID 1 performance at RAID 5 cost'. To achieve this, Auto-RAID is based on a hierarchical approach: at the top of the hierarchy is a RAM cache, below that is RAID 1 disk storage, and then RAID 5 disk storage.

All I/O (both read and write) performed by the host goes via the cache. RAM provides very fast I/O but is expensive to buy.

RAID-1 storage in Auto-RAID deviates from traditional RAID 1, in that it is both mirrored and striped. This provides greater throughput than traditional RAID 1.

Movement of Data in Auto-RAID

Data that has been recently accessed tends to live in the upper layers of Auto-RAID, whilst data that has not been recently accessed gets 'aged down' the hierarchy. The idea is that recently accessed data is likely to be accessed again and should be kept in the upper layers of the hierarchy. The last modify time is used as the criterion for aging data down, because read performance for RAID 1 and RAID 5 is comparable.

The unit of data movement is not constant (varies from block level to stripe sizes and 'chunks'); the algorithm to calculate this is not discussed in this document.

 
                              +---------------+  
                              |               |
                              |     HOST      |
                              |               |
                              +---------------+  
                                      ^
                                      |
                                      |
                                      |
                                      |
    perf                              |
      ^                               v
      |                       +---------------+                       |
      |                       |               |                       |
      |                       |      RAM      |                       |
      |                       |               |                       |
      |              +----------------------------------+             |
      |              |                                  |             |
      |              |              RAID 1              |             |
      |              |                                  |             |
      |     +---------------------------------------------------+     |
      |     |                                                   |     |
      |     |                       RAID 5                      |     |
      |     |                                                   |     | cost
      |     +---------------------------------------------------+     v $/Mbyte
 
                                   Figure 
 

The figure above shows the Auto-RAID hierarchy and the cost/performance relationship. The cost is based on the price per MB, relative to storage on a single disk. For example, relative to a single disk, the cost of RAID 1 storage is 2 (since two disks are needed instead of one).

Auto-RAID Reads

When the host requests to read some data, that data is transferred from disk storage (either RAID 1 or RAID 5) to RAM cache. The data does not migrate between the two RAID levels.

Auto-RAID Writes

When the host writes data to Auto-RAID, the data always passes through the RAM cache. After this, the data needs to be transferred from cache to disk storage, in one of several ways:

a) If the data already resides in RAID 1, it will remain there whilst being updated from the cache.

b) If the data resides in RAID 5, it will probably be moved to RAID 1 (see c. and d. for reasons why the data may not move to RAID 1) as it is being updated from the cache. This is the basis of how Auto-RAID provides RAID 1-type performance.

c) If the data resides in RAID 5, and the updates would cause a high amount of sequential i/o, then the data is written directly to RAID 5, without the need to migrate to RAID 1. This is because the data can be written across all of the stripes at once, removing the need to first read the parity information. This gets around the RAID 5 write penalty.

d) If there is a very high rate of random writes to RAID 5 resident data, then the performance overhead of moving all this data to RAID 1 would be excessive. Consequently, all writes would be done directly to RAID 5. This would give no performance improvement over traditional RAID 5.

e) As the need for b. arises, less recently used data is aged down from RAID 1 to RAID 5 using a 'least recently used' algorithm.

Sizing Auto-RAID

In order to size Auto-RAID correctly, it is vital that the 'working set' size is known.

The working set is the amount of distinct data that users are likely to modify in a given time period (typically 24 hours). The first time data is modified, it adds to the working set. Subsequent modifications of the same data do not increase the size of the working set. Read data does not contribute to the working set.

The performance of Auto-RAID is governed by how closely the working set size matches the RAID 1 storage size.
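A back-of-the-envelope comparison of the two can be sketched as below. All input numbers (block size, distinct blocks modified per day, RAID 1 tier size) are assumptions for illustration; real values would have to come from your own monitoring.

    # Rough Auto-RAID sizing check (illustrative assumptions throughout).
    block_kb              = 8          # database block size (assumed)
    distinct_blocks_daily = 1_500_000  # distinct blocks modified per 24 h (assumed)
    raid1_capacity_gb     = 20         # size of the RAID 1 tier (assumed)

    working_set_gb = distinct_blocks_daily * block_kb / 1024 / 1024
    print(f"working set ~ {working_set_gb:.1f} GB, RAID 1 tier = {raid1_capacity_gb} GB")

    if working_set_gb <= raid1_capacity_gb:
        print("expect RAID 1-like write performance")
    else:
        print("expect some writes to go straight to RAID 5 (worst case)")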

To understand how working set size affects performance, consider the three cases below:

  1. working set size equals RAID 1 storage size
  2. working set size is less than RAID 1 storage size, but contents are changing
  3. working set size is greater than RAID 1 storage size

a. Working set size equals RAID 1 storage size

The host writes data to the cache. At some point, a combination of free space and time criteria will cause the data to be written to RAID 1 storage. Since the working set fits in RAID 1, the data will not be written directly to RAID 5. Additionally, data will not need to be moved from RAID 5 to RAID 1.

The performance seen by the host will be similar to that of traditional RAID 1.

Note that the non-working set data (if there is any) will remain on RAID 5 storage.

b. Working set size is less than RAID 1 storage size, but contents are changing

Consider the case where the working set on two consecutive days is of the same size (and fits into RAID 1 storage), but has different data in it from day to day.

In this case, on the second of the two days, the following will happen:

  1. the host will write the data to the cache
  2. when Auto-RAID moves this data from cache to disk, the corresponding data will be moved from RAID 5 to RAID 1 as it is being updated
  3. the previous day's working set will be aged down from RAID 1 to RAID 5

In this case, there will be an initial overhead to account for movement of data between the two RAID levels. Once this has stabilised, the performance seen by the host will be similar to that of traditional RAID 1.

c. Working set size is greater than RAID 1 storage size

In this case, Auto-RAID needs to decide whether it can afford to constantly migrate data between the two RAID levels. Excessive movement of data would cause a large performance hit. To avoid this, Auto-RAID may decide not to move data from RAID 5 to RAID 1, and instead write directly to RAID 5. Auto-RAID reserves a minimum of 10% of capacity for RAID 1; in the worst case, 90% of the writes would go directly to RAID 5.

This is the worst-case performance scenario. Write performance for an application can degrade from RAID 1 levels to RAID 5 levels if the working set is too large.

If the working set is larger than RAID 1 storage capacity, Auto-RAID has the facility to dynamically add disks to RAID 1.

Should Oracle use Auto-RAID?

To answer this question, the DBA would need to consider the factors that would affect the performance of Auto-RAID.

In terms of Oracle, the working set may be measured as the data that gets modified in one working day.

The DBA would need to know how much data is likely to get modified in this time period. In the extreme case, all of the data in all of the datafiles may potentially get modified. In this case, for best performance, the RAID 1 storage area would need to be as large as the whole database.

In reality, it is unlikely that all of the data would need to be modified. In this case, the DBA would need to appreciate the characteristics of the application running on Oracle. The DBA would need to take a best guess at the amount of data likely to be modified in one day, and set the size of RAID 1 storage adequately.

After moving to Auto-RAID, the performance would need to be monitored. If the initial sizing proves to be incorrect, then the RAID 1 storage area can be increased dynamically by adding extra disk drives.

In any case, it is important to note that the performance of Auto-RAID is heavily dependent on how well the working set fits into RAID 1 storage.

Benchmark tests have shown that OLTP applications are better suited to Auto-RAID than DSS applications. This is because OLTP applications write less data than DSS in 'one write'. As discussed above, this decreases the chance of:

a. migrating data between RAID 5 and RAID 1

b. writing data directly to RAID 5

Pros and Cons of Implementing RAID technology

There are benefits and disadvantages to using RAID, and those depend on the RAID level under consideration and the specific system in question.

In general, RAID level 1 is most useful for systems where complete redundancy of data is a must and disk space is not an issue. For large datafiles or systems with less disk space, this RAID level may not be feasible. Writes under this level of RAID are no faster and no slower than 'usual'.

For all other levels of RAID, writes will tend to be slower and reads will be faster than under 'normal' file systems. Writes are slower the more frequently ECCs are calculated and the more complex those ECCs are. Depending on the ratio of reads to writes in your system, I/O speed may see a net increase or a net decrease. RAID can also improve performance by distributing I/O: since the RAID controller spreads data over several physical drives, no single drive is overburdened.
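One way to reason about that "net increase or net decrease" is to weight physical I/Os by the read/write mix. The sketch below is a first-order model only, with assumed per-operation costs (1 physical I/O per read, 2 per mirrored write, 4 per RAID-5 small write) and no account of caches or queueing.

    # First-order model of physical I/Os per logical I/O for a given read fraction.
    def physical_ios_per_logical(read_fraction, write_cost):
        read_cost = 1.0                      # one physical read per logical read (assumed)
        return read_fraction * read_cost + (1 - read_fraction) * write_cost

    for mix in (0.9, 0.7, 0.5):
        raid1 = physical_ios_per_logical(mix, write_cost=2)   # mirrored write
        raid5 = physical_ios_per_logical(mix, write_cost=4)   # read-modify-write
        print(f"reads {mix:.0%}: RAID 1 ~ {raid1:.2f}, RAID 5 ~ {raid5:.2f} physical I/Os per logical I/O")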

The striping of data across physical drives has several consequences besides balancing I/O. One additional advantage is that logical files may be created that are larger than the maximum size usually supported by an operating system. There are disadvantages as well, however: striping means that it is no longer possible to locate a single datafile on a specific physical drive.

This may cause the loss of some application tuning capabilities. Also, in Oracle's case, it can cause database recovery to be more time-consuming. If a single physical disk in a RAID array needs recovery, all the disks which are part of that logical RAID device must be involved in the recovery.

One additional note is that the storage of ECCs may require up to 20% more disk space than storage of data alone, so there is some disk overhead involved with the use of RAID.

RAID and Oracle

The use of RAID is transparent to Oracle. All the features specific to RAID configurations are handled by the operating system and go on behind the scenes as far as Oracle is concerned. Different Oracle file types are suited differently to RAID devices. Datafiles and archive logs can be placed on RAID devices, since they are accessed randomly.

The database is sensitive to the read/write performance of the redo logs, which should be placed on RAID 1, RAID 0+1, or no RAID at all, since they are accessed sequentially and performance is enhanced when the disk drive head stays near the last write location.

Keep in mind that RAID 0+1 does add overhead due to the two physical I/Os per write. Mirroring of redo log files, however, is strongly recommended by Oracle. In terms of administration, RAID is far simpler than using Oracle techniques for data placement and striping.

Recommendations:

In general, RAID usually impacts write operations more than read operations. This is especially true where parity needs to be calculated (RAID 3, RAID 5, etc.). Online or archived redo log files can be put on RAID 1 devices; you should not use RAID 5 for them. 'TEMP' tablespace datafiles should also go on RAID 1 instead of RAID 5, because the streamed write performance of distributed parity (RAID 5) is not as good as that of simple mirroring (RAID 1).

ORACLE database files on RAID


Given the information regarding the advantages and disadvantages of various RAID configurations, how does this information apply to an ORACLE instance? The discussion below will provide information about how database files are used by an ORACLE instance under OLTP and DSS classifications of workload.


Note that the perspectives presented below are very sensitive to the number of users: if your organization has a 10-20 user OLTP system (and thus a low throughput requirement), then you may get very acceptable performance with all database files stored on RAID5 arrays. On the other hand, if your organization has a 100-user OLTP system (resulting in a higher throughput requirement), then a different RAID configuration may be absolutely necessary. An initial configuration can be outlined by estimating the number of transactions (based on the number of users), performing adjustments to encompass additional activity (such as hot backups, nightly batch jobs, etc.), and then performing the necessary mathematical calculations.
You definitely want to keep rollback segments, temp tablespaces, and redo logs off RAID5, since it is too slow for these write-intensive, sequentially accessed Oracle files. Redo logs should have their *own* dedicated drives.

OLTP (On-line transaction processing) workloads


OLTP workloads are characterized by multi-user concurrent INSERTs, UPDATEs, and DELETEs during normal working hours, plus possibly some mixture of batch jobs nightly. Large SELECTs may generate reports, but the reports will typically be "canned" reports rather than ad hoc queries. The focus, though, is on enabling update activity that completes within an acceptable response time. Ideally, each type of database file would be spread out over its own private disk subsystem, although grouping certain types of files together (when the number of disks, arrays, and controllers is less than ideal) may yield adequate performance. (Please see the article on instance tuning for information regarding groupings of database files in an OLTP system.)

Redo logs.

During update activity, redo logs are written to in a continuous and sequential manner and are not read under normal circumstances, so RAID5 would be the worst choice for performance. Oracle Corporation recommends placing redo logs on single non-RAIDed disk drives, under the assumption that this configuration provides the best overall performance for simple sequential writes. Redo logs should always be multiplexed at the ORACLE software level, so RAID1 provides few additional benefits. Since non-RAID and RAID0 configurations can vary with hardware from different vendors, the organization should contact their hardware vendor to determine whether non-RAIDed disks or RAID0 arrays will yield the best performance for continuous sequential writes. Note that even if redo logs are placed on RAID1 arrays, the redo logs should still be multiplexed at the ORACLE level. When the log writer process determines that it cannot be sure the contents of a particular redo log member are valid, it marks that member as "STALE" (visible in the V$LOGFILE view). If this redo log is the only copy, then it cannot be archived, which will cause a database halt (assuming the database is running in archivelog mode). If redo logs are multiplexed as recommended by Oracle Corporation, the archiver process will choose a copy of the redo log that is not marked "STALE", causing no interruption. If the redo logs are mirrored only at the hardware level, then both copies of the redo log are "STALE".
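If you want to confirm from the data dictionary that every redo log group really has more than one member, and spot any member that has been flagged, a sketch along the following lines could be used. It assumes the cx_Oracle driver and a suitably privileged account, and the connection string is a placeholder, not a real system.

    # Hedged sketch: list redo log members per group and any flagged status.
    import cx_Oracle

    conn = cx_Oracle.connect("system", "password", "dbhost/ORCL")   # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT group#, member, status FROM v$logfile ORDER BY group#")

    groups = {}
    for group, member, status in cur:
        groups.setdefault(group, []).append((member, status))

    for group, members in groups.items():
        if len(members) < 2:
            print(f"group {group} has only one member - not multiplexed")
        for member, status in members:
            if status:                      # e.g. STALE or INVALID
                print(f"group {group} member {member} flagged {status}")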

Archive logs

As redo logs are filled, archive logs are written to disk one whole file at a time (assuming, of course, that the database is running in archivelog mode), and are not read under normal circumstances. Any RAID or non-RAID configuration could be used, depending upon the performance requirements and size of the redo logs. For instance, if the redo logs are large, then they will become full and be archived less often. If an archive log is likely to be written no more than once per minute, then RAID5 may provide acceptable performance. If RAID5 proves too slow, then a different RAID configuration can be chosen, or the redo logs can simply be made larger. Numerous early ORACLE installations wrote redo logs directly to tape rather than disk, so reasonable sizing of redo logs can unquestionably minimize the write requirements enough to make RAID5 performance acceptable in small to medium volume installations. Note that a fault-tolerant configuration is advisable: if the archive log destination becomes unavailable, the database will halt.

Rollback Segments

As modifications are made to the database tables, undo information is written to the buffer cache in memory. Rollback segments are used to maintain commitment control and read consistency. Rollback segment data is periodically flushed to disk by checkpoints, and changes to the rollback segments are also recorded in the redo logs. However, a smaller amount of information is typically written to the rollback segments than to the redo logs, so the write rate is less stringent. A fault-tolerant configuration is advisable, since the database cannot operate without rollback segments, and recovery of common rollback segments will typically require an instance shutdown. If the transaction rate is reasonably small, RAID5 may provide adequate performance; if it does not, then RAID1 (or RAID10) should be considered.

User tables and indexes

As updates are performed, changes are stored in memory. Periodically, a checkpoint flushes the changes to disk. Checkpoints occur under two normal circumstances: a redo log switch occurred, or the time interval for a checkpoint expired. (There are a variety of other situations that trigger a checkpoint; please check the ORACLE documentation for more detail.) Like redo log switches and generation of archive logs, checkpoints can normally be configured so that they occur approximately once per minute. Recovery can be performed up to the most recent checkpoint, so the interval should not be too large for an OLTP system. If the volume of updated data written to disk at each checkpoint is reasonably small (i.e. the transaction rate is not extremely large), then RAID5 may provide acceptable performance. Additionally, analysis should be performed to determine the ratio of reads to writes. Recalling that RAID5 offers reasonably good read performance, if the percentage of reads is much larger than the percentage of writes (for instance, 80% to 20%), then RAID5 may offer acceptable performance for small, medium, and even some large installations. A fault-tolerant configuration is preferable to maximize availability (assuming availability is an objective of the organization), although only failures damaging datafiles for the SYSTEM tablespace (and active rollback segments) require the instance to be shut down. Disk failures damaging datafiles for non-SYSTEM tablespaces can be recovered with the instance online, meaning that only the applications using data in tablespaces impacted by the failure will be unavailable. With this in mind, RAID0 could be considered if RAID5 does not provide the necessary performance. If high availability and high performance on a medium to large system are explicit requirements, then RAID1 or RAID10 should be considered.

Temp segments

Sorts too large to be performed in memory are performed on disk, and sort data is written to disk in a block-oriented fashion. Sorts do not normally occur with INSERT/UPDATE/DELETE activity; rather, SELECTs with ORDER BY or GROUP BY clauses and aggregate functions (i.e. operational reports), index rebuilds, etc., will use TEMP segments, and only if the sort is too large to perform in memory. Temp segments are good candidates for non-RAIDed drives or RAID0 arrays. Fault tolerance is not critical: if a drive failure occurs and datafiles for a temp segment are lost, then the temp segment can either be recovered in the normal way (restore from tape and perform a tablespace recovery), or the temp segment can simply be dropped and re-created, since it stores no permanent data. Note that while a temp segment is unavailable, certain reports or index creations may not execute without errors, but update activity will typically not be impacted. With this in mind, RAID1 arrays are somewhat wasted on temp segments and should be used for more critical database files. RAID5 will provide adequate performance if the sort-area hit ratios are such that very few sorts are performed on disk rather than in memory.

Control files

Control files are critical to instance operation, as they contain the structural information for the database. Control files are updated periodically (at a checkpoint and at structural changes), but the data written to them is a very small quantity compared to other database files. Control files, like redo logs, should be multiplexed at the ORACLE software level onto different drives or arrays. Non-RAIDed drives or any RAID configuration would be acceptable for control files, although most organizations will typically distribute the multiple copies of the control files with the other database files, given that the read and write requirements are so minimal. For control files, maintaining multiple copies in different locations should be favored over any other concern.

Software and static files

The ORACLE software, configuration files, etc. are very good candidates for RAID5 arrays. This information is not constantly updated, so the RAID5 write penalty is of little concern. Fault tolerance is advisable: if the database software (or O/S software) becomes unavailable due to a disk failure, then the database instance will abort, and recovery will include restore or re-installation of the ORACLE software (and possibly operating system software) as well as restore and recovery of the database files. RAID5 provides the necessary fault tolerance to prevent this all-inclusive recovery, plus good read performance for dynamic loading and unloading of executable components at the operating system level.

DSS (Decision Support System) workloads


In comparison to OLTP systems, DSS or data warehousing systems are characterized primarily by SELECT activity during normal working hours, with batch INSERT, UPDATE, and DELETE activity run on a periodic basis (nightly, weekly, or monthly). There will typically be a large amount of variability in the number of rows accessed by any particular SELECT, and the queries will tend to be of a more ad hoc nature. The number of users will typically be smaller than on the adjoining OLTP systems (where the data originates). The focus is on enabling SELECT activity that completes within an acceptable response time, while ensuring that the batch update activity still has capacity to complete in its allowable time window. Note that there are now two areas of performance to be concerned about: periodic refreshes and ad hoc read activity. The general directive in this case should be to configure the database such that read-only activity performed by end users is as good as it can be without rendering refreshes incapable of completion. As with OLTP systems, each type of database file would ideally have its own private disk subsystem (disks, arrays, and controller channel), but with less than ideal resources certain groupings tend to work well for DSS systems. (Please see the article on instance tuning for information on these groupings.)

Redo logs

Redo logs are only written to while update activity is occurring. In a DSS-oriented system, a significant portion of the data entered interactively during the day may be loaded into the DSS database in only a few hours. Given this characteristic, redo logging may tend to be more of a bottleneck on the periodic refresh processes of a DSS database than on its adjoining OLTP systems. If nightly loads are taking longer than their allowance, then redo logging should be the first place to look. The same RAID/non-RAID suggestions that apply to redo logging in OLTP also apply to DSS systems. As with OLTP systems, redo logs should always be mirrored at the ORACLE software level, even if they are stored on fault-tolerant disk arrays.

Archive logs

Like redo logging, archive logs are only written out during update activity. If the archive log destination appears to be over-loaded with I/O requests, then consider changing the RAID configuration, or simply increase the size of the redo logs. Since there is a large volume of data being entered in a short period of time, it may be very reasonable to make the redo logs for the DSS or data warehouse much larger (10 or more times) than the redo logs used by the OLTP system. A reasonable rule of thumb is to target about one log switch per hour. With this objective met, then the disk configuration and fault-tolerance can be chosen based on the same rules used for OLTP systems.
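The "about one log switch per hour" rule of thumb translates into simple arithmetic: size each redo log at roughly an hour's worth of redo generation. The figures below are assumptions for illustration; the actual redo rate would come from your own measurements.

    # Redo log sizing for ~1 switch per hour (illustrative numbers only).
    redo_mb_per_hour_peak    = 900     # measured/estimated peak redo rate (assumed)
    target_switches_per_hour = 1

    redo_log_size_mb = redo_mb_per_hour_peak / target_switches_per_hour
    print(f"size each redo log at roughly {redo_log_size_mb:.0f} MB")

    # With, say, 64 MB logs the same load would switch far more often:
    print(f"64 MB logs would switch about {redo_mb_per_hour_peak / 64:.0f} times per hour")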

Rollback Segments

Again like redo logging, rollback segments will be highly utilized during the periodic refreshes, and virtually unused during the normal work hours. Use the same logic for determining RAID or non-RAID configurations on the DSS database that would be used for the OLTP systems.

User tables and indexes

Writes are done to the tablespaces containing data and indexes during periodic refreshes, but during normal work hours read activity on the tables and indexes will typically far exceed the update work performed on a refresh. A fault-tolerant RAID configuration is suggested to sustain availability. However, in most cases the business can still operate if the DSS system is unavailable for several hours due to a disk failure: information for strategic decisions may not be available, but orders can still be entered. If the DSS has high availability requirements, select a fault-tolerant disk configuration. If RAID5 arrays can sustain the periodic refresh updates, then RAID5 is typically a reasonably good choice due to its good read performance. As seen above, the read and write workload capacities can be adjusted by adding physical drives to the array.

Temp segments

In a decision support system or data warehouse, expect temp segment usage to be much greater than what would be found in a transaction system. Recalling that temp segments do not store any permanent data and are not absolutely necessary for recovery, RAID0 may be a good choice. Keep in mind, though, that the loss of a large temp segment due to drive failure may render the DSS unusable (unable to perform sorts to answer large queries) until the failed drives are replaced. If availability requirements are high, then a fault-tolerant solution should be selected, or at least considered. If the percentage of sorts on disk is low, then RAID5 may offer acceptable performance; if this percentage is high, RAID1 or RAID10 may be required.

Control files

As with OLTP systems, control files should always be mirrored at the ORACLE software level regardless of any fault-tolerant disk configurations. Since reads and writes to these files are minimal, any disk configuration should be acceptable. Most organizations will typically disperse control files onto different disk arrays and controller cards, along with other database files.

Software and static files

Like OLTP systems, these files should be placed on fault-tolerant disk configurations. Since very little write activity is present, these are again good candidates for RAID5.

Taking the above information into consideration, can an organization run an entire ORACLE database instance on a single RAID5 array? The answer is "yes". Will the organization get a good level of fault tolerance? Again, the answer is "yes". Will the organization get acceptable performance? The answer is "it depends". This dependency includes the type of workload, the number of users, the throughput requirements, and a whole host of other variables. If the organization has an extremely limited budget, then it can always start with a single RAID5 array, perform the necessary analysis to see where improvement is needed, and proceed to correct the deficiencies.
