Well today I have to install an Oracle RAC environment on AIX.
We use ASM for our shared disks between our servers. This is where our database (tablespaces), redo logs, spfiles, and archive logs reside. For the Oracle Cluster Ready Services you can get by with 1 OCR disk and 2 voting disks. For redundancy, we run with 2 OCR disks and 3 voting disks. The redundancy, raid, striping etc is done in our SAN. One thing you do need to know is that if you use ASM then you need your luns to be the same size for each diskgroup or well your raw devices to be the same size. Otherwise it can throw off your rebalancing algorithm. NOTE: in the latest 10.2.0.3
The basic requirements of installing Oracle on AIX /Unix still applies. You need to create the oracle user, the dba group and the oinstall group. This is in the Oracle Database install guide and also in the Oracle RAC install guide. It took our SA’s about 2 DAYS to build the boxes according to specs. Another thing I would highly advise is to get sudo for ALL just to make it easier. You will also find that it works well that you are in constant contact with your SA’s should you need something or find it missed. After the install is done, they can remove the sudo command.
Pre-Configuring the Servers
A couple of things you won’t find in the doc that we did is we set the io buffer cache to 2-5% of available RAM. We also had our Lun’s precarved out to 64GB Luns. Luns being the physical disk that is on the SAN. The SAN admin and AIX admin handle attaching these and making them shared. I will get into how you do that below. The thing is we did the same thing in Linux and we used fdisk to carve out the OCR and voting disks. In AIX you use a LVM. The problem with the LVM in AIX according to a couple of our AIX admins is that they can’t give both nodes access to the same disk at the same time. They would need to use something like HACMP which we don’t have a license for on this server. The answer? The admins are going to get 5 luns at 1GB a piece. They will be attached and each will have a major and minor spindle number. He can then use mknod to essentially create an alias and voila. I will also show what the cluster verification utility is and how I use it to verify things like user equivalence, node connectivity, etc… I will even show where we get an error an I ignore it. Also, depending on if your database does a lot of reading and writing, you will want fine striping like 128k. If you are heavy read and write intensive like a data warehouse then you want course striping.
My main reference point was the Oracle 10gR2 RAC - AIX install guide and the Oracle ASM guide.
System Requirements for all nodes:
AIX 5L version 5.3 maintenance Level 02 or later
Hardware
· Allocate swap space equivalent to 1.5 x physical memory
· 1GB of disk space in the /tmp directory
· 64 - bit kernel / processor is required
OS filesets
· bos.adt.base
· bos.adt.lib
· bos.adt.libm
· bos.perf.libperfstat
· bos.perf.perfstat
· bos.perf.proctools
· rsct.basic.rte
· rsct.compat.client.rte
· xlC.aix50.rte 7.0.0.4 or later
· clC.rte 7.0.0.1 or later
C/C++ Compiler
· IY65361
· IY65362
AIX PTF for APARs
· IY89080
· IY92037
· IY68989
· IY68874
· IY70031
· IY76140
JDK
· IY63533 - SDK 1.4 64-bit 1.4.2.1
· IY65305 - SDK 1.4 32-bit 1.4.2.2
· IY58350 - SDK 1.3.1.16 32-bit
Shell Limits and Network Tuning Parameters
· Ensure ssh2 functions on all nodes
· Soft File size = -1 (Unlimited)
· Soft CPU time = -1 (Unlimited)
· Soft Data segment = -1 (Unlimited)
· Soft Stack size = -1 (Unlimited)
· maxuproc = 65536
· ipqmaxlen = 512
· rfc1323 = 1
· sb_max = 2*655360
Set u-limit for Oracle account as follows:
· sudo -u root ulimit -a
o time(seconds) unlimited
o file(blocks) unlimited
o data(kbytes) unlimited
o stack(kbytes) unlimited
o memory(kbytes) unlimited
o coredump(blocks) unlimited
o nofiles(descriptors) unlimited
You should have unlimited for oracle and root user otherwise it will fail to spawn the process.
We also need to configure the network configuration. We have a public network and also a vip ip that goes on that public network. The SA only configures the static public network to your public interface but he puts both that and the VIP into the /etc/hosts file and even in DNS if you want it there (in DNS). Oracle’s CRS does the bonding of the VIP to the card. That way if a node fails it can bring the VIP up on another node in the cluster. You also need to configure your interconnects. For this we have 2 NIC’s on each node going through 2 separate switches so no single point of failure; however, if one of the interconnects fails that CRS uses, then it will fence and evict other nodes out of the cluster. So how did we get around this? We had the SA bond the two interconnects into one IP. We then used etherchannel to do load balancing and the speed of the interconnects is full 1000FD. Something else you can do is use jumbo frame sizes of 9000 MTU instead of the default of 1500 MTU. For this to work on Cisco switches you have to have a dedicated switch and not VLAN’s because it is a global setting on the switch and you don’t want to use jumbo frames for anything else (in most cases). Since RAC involves potentially feeding a lot of blocks across the interconnect, the jumbo frames remove a lot of the overhead and allow plenty of room for data. It also means there is less work to do to break up the blocks into small packets and reassemble since there are fewer parts.Also, you never ever ever ever want to use a cross over cable for your interconnect. If the cable breaks or if a card fails, which one do you know is bad? Also, if you want to scale, then how do you do it beyond two nodes? For a reference on this, look in metalink and also Google Mike Erwin and RAC. He has some discussions about it.
So this is what our network configuration looks like:
cat /etc/hosts
127.0.0.1 loopback localhost
#PUBLIC INTERFACE
10.10.10.1 racnode1 racnode1.domain
10.10.10.2 racnode2 racnode2.domain
#VIP ON PUBLIC - Assuming a subnet mask of 255.255.255.0
#Notice how it is on the same subnet
10.10.10.11 racnode1-vip racnode1-vip.domain
10.10.10.12 racnode2-vip racnode2-vip.domain
#PRIVATE INTERCONNECTS
#These are bonded - really 2 NICs on each node
192.168.1.1 racnode1-priv racnode1-priv.domain
192.168.1.2 racnode2-priv racnode2-priv.domain
Once this is all configured, you then need to setup your raw devices. We did not want to license HACMP to carve out a couple of raw disks from logical volumes from existing physical volumes or LUNS, so we had the SAN admins carve out the smallest LUNS they would give me which were 1GB. The AIX SA then made them available to me. The 5 disks were /dev/hdisk6 - /dev/hdisk10. 2 for the OCR and 3 for the voting disks. The OCR’s for AIX only need to be 256MB and the Voting only need to be 20MB a piece. This might seem like overkill but was cheaper to lose 3.5GB of disk than to license HACMP. There were also devices labeled /dev/rhdisk6 - /dev/rhdisk10 which is the raw character device. This is what Oracle actually accesses and the permissions on the 2 OCR disks need to be owned by root and the dba group. The 3 voting disks need to be owned by oracle and the dba group. I have seen these work with the oinstall group used as well but the Oracle doc says to use the dba group so I will. You also need to change permissions on the underlying raw device unless using multipathing software which creates a virtual name to address multiple paths to the same disk (multiple HBAs). To change the permissions issue the following command as root or use sudo.
chown root:dba /dev/rhdisk6
chmod 640 /dev/rhdisk6
Do the same thing for rhdisk7 or whichever your raw devices are. With the 3 voting disks as well as all your other ASM disks (if you are using ASM) then issue the following command as root or use sudo:
chown oracle:dba /dev/rhdisk8
chmod 660 /dev/rhdisk8
Do the same thing for the rest of the devices.
There are a couple of things to note about setting ownership on these “raw devices”.
First, find out where the logical character device is pointing. For example, ls -l /dev/rdsk and see where it points and then grant permission on that as well. If you are using multipathing software, then ignore. For example, if you are using emcpower path, then just grant ownership of emcpowera through emcpowerz which is the virtual name.
Secondly, you need to set the no-reservation flag to prevent locking on the hdisks and allow concurrent I/O.
For all multipath I/O or MPIO capable devices such as ESS, DS8000, and DS6000, use this command.
chdev -l hdisk1 -a reserve_policy=no_reserve.
For EMC Symmetrix, Clarion, Hitachi, IBM DS4000, and non-MPIO capable devices, do the following
chdev -l hdisk -a reserve_lock=no
Next, user equivalence needs to be setup. What user equivalence does is allows Oracle to login to other nodes in the cluster without needing a password. It does this via interactive logins. We also have an issue and motd fie in our etc directory which throws it off. To get around this you either remove those files or rename them or you can create a .hushlogin file in the Oracle user’s home directory.
To setup user equivalence, issue the following commands:
NOTE: The docs say to give them a password but I gave them no password. It might be a little less secure but in doing this I won’t have to issue the ssh-agent $SHELL command and then the ssh-add command to add the keys for each session.
/usr/bin/ssh-keygen -t rsa
/usr/bin/ssh-keygen -t dsa
You then create an authorized_keys files and add keys to it then copy it to all nodes in the RAC.
Example for a 2 node RAC cluster:
touch ~/.ssh/authorized_keys
ssh racnode1 cat /home/oracle/.ssh/id_rsa.pub >> authorized_keys
ssh racnode1 cat /home/oracle/.ssh/id_dsa.pub >> authorized_keys
ssh racnode2 cat /home/oracle/.ssh/id_rsa.pub >> authorized_keys
ssh racnode2 cat /home/oracle/.ssh/id_dsa.pub >> authorized_keys
scp ~/.ssh/authorized_keys racnode2:/home/oracle/.ssh/
Now we can verify if it works. When you run the command below, you should ONLY get a driectory output. If you get any banners or anything else it will fail.
ssh racnode2 ls
Now you need to verify things using Oracle’s cluster verification utility. It could fail if you have some private interfaces that are really public, like anything that starts with 10 for example. If you get this, just ignore it. You also want to verify connectivity to shared storage.
./runcluvfy.sh stage -pre crsinst -n litprxor01dm,litprxor11dm
./runcluvfy.sh comp ssa -n litprxor01dm,litprxor11dm -s
/dev/rhdisk8,/dev/rhdisk9,/dev/rhdisk10
Installation:
After this has been verified it will be time to start the install. The first thing you need to install is Oracle’s CRS. I would also recommend reading through the documentation first.
I ran the rootpre.sh that was in the rootpre folder on each node. I used scp -r to copy the directory over to each node and run it. The software only needs to be on one node though because it uses user equivalence to copy it over. If you have a slow interconnect then it could be a long install : ).
After running the rootpre, I then bring up the ./runInstaller on my local screen using X redirect since my AIX box is not running XWindows. I then proceed through the install and select both nodes. I answer a few questions including what I want to use for my interconnect. If your etc/hosts file is off then it will throw errors. Make sure they match up correctly. You will also need to tell it which interface is public (VIPs), which is private (interconnect), and which ones are not used. At the end it will ask you to run a series of scripts and it will tell you which node to run them on. Sudo and the comman through an error. I had to sudo su - root and then run them. I can’t recall the error but if I do I will update this post. It was about something missing because of the path of the Oracle user not root.
Also, you will need to run VIPCA if it errors. The reason it errors is because one of the programmers had it look for a private IP address and mark it as an invalid VIP candidate. It runs silently as part of the CRS install. Manually running it after the install fixes the problem.
After this, I installed the Oracle Database binaries. Again, I had the software on one node and it installed across all nodes. I told it not to create a database. One thing you have to decide if you use ASM is if you want it to share the same binaries as the database. If you do, then patching becomes much faster with less downtime. The drawback is if you want multiple versions of the DB software to use ASM. Example, Oracle 11g comes out and I want to put it on this server. I also want to use ASM, what happens if it requires a new version of ASM? 10g won’t work with that new version. It also takes up more space to have separate homes. As you can see, there are pros and cons to both sides. After installing it in the same home without creating any instances or databases I ran the 10.2.0.3 patch on the CRS home and the database home. You need a metalink account to get this.
Patching:
NOTE: I have included my path but if it is set in your $PATH variable then issuing crsctl should work.
You need to issue with root privileges the following command:
/u01/app/oracle/product/10.2.0/crs/crsctl stop crs
You will have to start it after the finish patching it with the following command:
/u01/app/oracle/product/10.2.0/crs/crsctl start crs
You can verify that CRS starts by issuing the following command:
crs_stat -t
You will have to do the same thing for the Oracle home. Once patched, you can then create a database. For now, and because of the extensity of this already post already, I would say use dbca. Before you create the database, you should run NETCA and setup your listeners on both nodes. This will then setup the correct local_listeners and remote_listeners values in your DB parameters as well as your TNSNAMES.ORA file and LISTENER.ORA file.
When you setup ASM using DBCA, you will need to use a pfile and not an spfile. Use the default location. It will then create it in your admin/+ASM/pfile directory but it also creates it in your $ORACLE_HOME/dbs directory. The dbs directory is where you will want to make changes to your files later. When I setup the ASM and discovered the disks, I chose external redundancy because the SAN has it built in with RAID 5 or RAID 10. It will also ask you for an admin password. Remember this because if you use Grid Control, you will need to know that password to monitor your ASM disks. If you forget it, you will need to create a new password file. Also, if you use the default of +ASM then the instance on the first node will be +ASM1 and on the second will be +ASM2. You can only log into ASM as sysdba. It is not a database, just an instance with some data dictionary views. Also, when I built my diskgroups, I made my flash recovery area bigger than my database. I multiplex my redo logs and control files here. You canalso decide if you want to multiplex your archive logs between them. I send my RMAN backups uncompressed to DG2 as well. You can see why this has to be bigger than the diskgroup I use for the DB. You can use lower tier storage here than you do for the database. Work with your SAN guys in setting that up.
When you create the database and have it use ASM you can select the diskgroups you want to use. In my case, I chose both. I named them DG1 and DG2. When I turned on archiving, I used DG2. For the multiplexing of redo logs I used +DG1 and +DG2. That is all I selected. For the Flash Recovery Area, I used +DG2. When I got into my redo logs, you will notice there is 4 by default. There is 2 for the first node and 2 for each additional node. If you look at them you will see the thread numbers are different. I went ahead and created a 5th and 6th group. 5th was thread 1 and 6th was thread 2. This gives me 3 redo logs for each node. When they archive, they will not overwrite eachother. I also made my redo logs 100MB in size. Oracle makes them way too small for my liking. If I can reduce the number of log switches, then I can reduce some overhead. When you are done, it will create your database on both nodes. After it finishes you can issue the crs_stat -t command again and you will see your services are started. You can also create other RAC databases doing the same thing.
A gotcha that I ran into was the load balancing. The parameter remote_listener is set to point to a tnsname in your tnsnames.ora. Whatever the host is, is what it will return when it tries to load balance. If it is not in DNS, then when it returns the connect string to go to the other node your local machine will not resolve it. You could make an entry in your etc/hosts file or you could just hardcode the VIP address and then it will always resolve. Up to you.
Friday, June 27, 2008
Subscribe to:
Posts (Atom)