Exadata & metric_iorm.pl

Our case where to gather information was actually related to two disks going OFFLINE at the same time. As I wasn’t sure if this is normal Exadata behavior I went and created Service Request for this.

This is what we saw one day:

4_1 2017-01-04T18:54:05+01:00 warning "Data hard disk entered confinement offline status. The LUN 0_3 changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline.
All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped. 
Status : WARNING - CONFINEDOFFLINE Manufacturer : HGST Model Number : XXXXN4.0T Size : 4.0T Serial Number : XXXXXX Firmware : A2D2 Slot Number : 3 Cell Disk : CD_03_exa1celadm03 Grid Disk : RECO_CD_03_exa1celadm03, 
DATA_CD_03_exa1celadm03, DBFS_DG_CD_03_exa1celadm03 Reason for confinement : service time is high relative to other disks with similar workloads"

4_2 2017-01-04T18:59:03+01:00 clear "Data hard disk status changed to normal. Status : NORMAL Manufacturer : HGST Model Number : XXXXN4.0T 
Size : 4.0TB Serial Number : XXXXXX Firmware : A2D2 Slot Number : 3 Cell Disk : CD_03_exa1celadm03 Grid Disk : DATA_CD_03_exa1celadm03, DBFS_DG_CD_03_exa1celadm03, RECO_CD_03_exa1celadm03"

After analysis Oracle came back and said this was expected behavior as the disk utilization went to 100% for a while. I still don’t understand why they go OFFLINE in that case though but perhaps there is some good logic for this.

While they said I could use cellcli to identify which database consumed most of I/O on this time using metrics DB_IO_UTIL_LG  and DB_IO_UTIL_SM they referenced also note “Tool for Gathering I/O Resource Manager Metrics: metric_iorm.pl (Doc ID 1337265.1)”.

I hadn’t seen this script earlier and while they say in the note that you can use Enterprise Manager to view metrics via browser for me it seems you can do your analysis bit easier when starting with metric_iorm.pl from given time range, identifying problematic database and then running AWR there to see what caused the I/O load.

There are few different ways of running the script. Either you run it without parameters so it will show current cell metrics or you can give time range similar you would give with cellcli and it will produce output with one minute intervals from the cell you run it in. Good to have detailed time of your issue!

Example:

./metric_iorm.pl where collectionTime > ‘2017-01-11T19:00:00+01:00’ and collectionTime < ‘2017-01-11T18:40:00+01:00’

Output will be similar to this (with all of your databases, in this example I picked only one):

Database: OBIEE
Utilization: Small=0% Large=0%
Flash Cache: IOPS=1985 Space allocated=846380MB
Disk Throughput: MBPS=3
Small I/O's: IOPS=3.717 Avg qtime=2.2ms
Large I/O's: IOPS=11.633 Avg qtime=1.2ms
 Consumer Group: _ORACLE_MEDPRIBG_GROUP_
 Utilization: Small=0% Large=0%
 Flash Cache: IOPS=1.400
 Disk Throughput: MBPS=0
 Small I/O's: IOPS=0.300 Avg qtime=1.9ms
 Large I/O's: IOPS=0.050 Avg qtime=8.0ms
 Consumer Group: _ORACLE_LOWPRIFG_GROUP_
 Utilization: Small=0% Large=0%
 Flash Cache: IOPS=0.000
 Disk Throughput: MBPS=0
 Small I/O's: IOPS=0.000 Avg qtime=0.0ms
 Large I/O's: IOPS=0.000 Avg qtime=0.0ms
 Consumer Group: OTHER_GROUPS
 Utilization: Small=0% Large=0%
 Flash Cache: IOPS=1983
 Disk Throughput: MBPS=3
 Small I/O's: IOPS=2.100 Avg qtime=2.9ms
 Large I/O's: IOPS=11.583 Avg qtime=1.2ms
 Consumer Group: _ORACLE_LOWPRIBG_GROUP_
 Utilization: Small=0% Large=0%
 Flash Cache: IOPS=0.500
 Disk Throughput: MBPS=0
 Small I/O's: IOPS=0.000 Avg qtime=0.0ms
 Large I/O's: IOPS=0.000 Avg qtime=0.0ms
 Consumer Group: _ORACLE_BACKGROUND_GROUP_
 Utilization: Small=0% Large=0%
 Flash Cache: IOPS=0.400
 Disk Throughput: MBPS=0
 Small I/O's: IOPS=1.317 Avg qtime=1.2ms
 Large I/O's: IOPS=0.000 Avg qtime=0.0ms

And in the end you will have summary:

CELL METRICS SUMMARY

Cell Total Utilization: Small=1% Large=61%
Cell Total Flash Cache: IOPS=2488.05 Space allocated=1210491MB
Cell Total Disk Throughput: MBPS=452.231
Cell Total Small I/O's: IOPS=108.935
Cell Total Large I/O's: IOPS=461.868

Cell Avg small read latency: 16.05 ms
Cell Avg small write latency: 0.06 ms
Cell Avg large read latency: 9.00 ms
Cell Avg large write latency: 0.11 ms

I guess in this situation I should have analyzed the disk utilization but metrics had been purged already from the time of the issue.

I’ll definitely keep this script for future issues. Seems useful!

Simo

Recent Posts

OCI CLI work – getting Cloud Guard risk levels via script

I recently got a requirement to get all Cloud Guard recipes and their rule risk…

5 days ago

Autonomous Database Audit Logs to Logging Service Part 1

I recently came across requirement to get OCI Oracle Autonomous Database audit logs to OCI…

2 months ago

Connecting to Autonomous Database Running on Google Cloud

Last time I showed how to provision Autonomous Database Serverless (ADB-S) on Google Cloud. This…

4 months ago

Can you believe it? Provisioning Autonomous Database in GCP!

I bet few years back folks didn't expect that by 2024 we would be able…

4 months ago

IP Address Insights with CLI

My previous post on IP Address Insights I mentioned it wasn't yet available with CLI…

8 months ago

Thoughts on Oracle Database@Azure

This will NOT be a technical walkthrough on Oracle Database@Azure but rather my opinions and…

8 months ago