Jun 17, 2014

Oracle Log File Sync Wait Event - a short analysis and troubleshoot methods

Oracle Log File Sync Wait Event 

The Oracle "log file sync" wait event is triggered when a user session issues a commit (or a rollback). The user session will signal or post the LGWR to write the log buffer to the redo log file. When the LGWR has finished writing, it will post the user session. The wait is entirely dependent on LGWR to write out the necessary redo blocks and send confirmation of its completion back to the user session. The wait time includes the writing of the log buffer and the post, and is sometimes called "commit latency".

The P1 parameter in <View:V$SESSION_WAIT> is defined as follows for the log file sync wait event:

Note: P1 = buffer#

All changes up to this buffer number (in the log buffer) must be flushed to disk and the writes confirmed to ensure that the transaction is committed and will be kept on an instance crash. The wait is for LGWR to flush up to this buffer#.

Before writing a batch of database blocks, DBWn finds the highest high redo block address that needs to be synced before the batch can be written. DBWn then takes the redo allocation latch to ensure that the required redo block address has already been written by LGWR, and if not, it posts LGWR and sleeps on a log file sync wait.


Root causes of ‘log file sync’ waits :

Root causes of ‘log file sync’, essentially boils down to few scenarios and following is not an exhaustive list, by any means!

1. LGWR is unable to complete writes fast enough for one of the following reasons:

(a) Disk I/O performance to log files is not good enough. Even though LGWR can use asynchronous I/O, redo log files are opened with DSYNC flag and buffers must be flushed to the disk (or at least, written to disk array cache in the case of SAN) before LGWR can mark commit as complete.

(b) LGWR is starving for CPU resource. If the server is very busy, then LGWR can starve for CPU too. This will lead to slower response from LGWR, increasing ‘log file sync’ waits. After all, these system calls and I/O calls must use CPU. In this case, ‘log file sync’ is a secondary symptom and resolving root cause for high CPU usage will reduce ‘log file sync’ waits.

(c) Due to memory starvation issues, LGWR can be paged out. This can lead to slower response from LGWR too.

(d) LGWR is unable to complete writes fast enough due to file system or unix buffer cache limitations.

2. LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for cpu or memory and I/O performance is decent enough. Still, if there are excessive commits, then LGWR has to perform many writes/semctl calls and this can increase ‘log file sync’ waits. This can also result in sharp increase in redo wastage’ statistics’.

3. IMU undo/redo threads. With Private strands, a process can generate few Megabytes of redo before committing. LGWR must write generated redo so far and processes must wait for ‘log file sync’ waits, even if redo generated from other processes is small enough.

4. LGWR is suffering from other database contention such as enqueue waits or latch contention. For example, we have seen LGWR freeze due to CF enqueue contention. This is a possible scenario however unlikely.

5. Various bugs. Oh, yes, there are bugs introducing unnecessary ‘log file sync’ waits.

Note : 
    CF enqueue :  The CF enqueue is a Control File enqueue and happens during parallel access 6to the control files.  The CF enqueue can be seen during any action that requires reading the control file, such as redo log archiving, redo log switches and begin backup commands.

Root cause analysis :

It is worthwhile to understand and identify root cause and resolve it.

1. First make sure, ‘log file sync’ event is indeed a major wait events. For example in the statspack report for 60 minutes below, ‘log file sync’ is indeed an issue. Why? Statspack is for 1800 seconds and there are 8 CPUs in the server. Approximately, available CPU seconds are 14,400 CPU seconds. There is just one database alone in this server and so, approximate CPU usage is 7034/14,400 : 50%

But, 27021 seconds were spent waiting. In average, 27021/3600=7.5 processes were waiting for ‘log file sync’ event. So, this is a major bottleneck for application scalability.

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                      % Total
Event                                Waits          Time (s)       Ela Time
--------------------------- ------------ --------   --------
log file sync                     1,350,499     27,021     50.04
db file sequential read      1,299,154     13,633     25.25
CPU time                       7,034             13.03
io done                           3,487,217      3,225      5.97
latch free                        115,471         1,325      2.452.

Identify and break down LGWR wait events. Query wait events for LGWR. In this instance LGWR sid is 3 (and usually it is).

Find which "sid" is showing this wait event / causing issue with commit / rolback.

SQL> select sid, event, time_waited, time_waited_micro
from v$session_event where sid=3 order by 3;


   SID EVENT                          TIME_WAITED       TIME_WAITED_MICRO
------ ------------------------------ -----------         -----------------
..
     3 control file sequential read        237848              2378480750
     3 enqueue                             417032                    4170323279
     3 control file parallel write         706539                7065393146
     3 log file parallel write             768628                  7686282956
     3 io done                           40822748                   4.0823E+11
     3 rdbms ipc message                208478598          2.0848E+12

When LGWR is waiting ( using semtimedop call) for posts from the user sessions, that wait time is accounted as ‘rdbms ipc message’ event. This event, normally, can be ignored. Next highest waited event is ‘io done’ event. After submitting async I/O requests, LGWR waits until the I/O calls complete, since LGWR writes are done synchronous writes. [ asynchronous and synchronous are not contradictory terms when comes to I/O! Google it and there is enormous information about this already]


Reducing Oracle waits / wait times:

If a SQL statement is encountering a significant amount of total time for this event, the average wait time should be examined. If the average wait time is low, but the number of waits is high, then the application might be committing after every row, rather than batching COMMITs. Oracle applications can reduce this wait by committing after "n" rows so there are fewer distinct COMMIT operations. Each commit has to be confirmed to make sure the relevant REDO is on disk. Although commits can be "piggybacked" by Oracle, reducing the overall number of commits by batching transactions can be very beneficial.

If the SQL statement is a SELECT statement, review the Oracle Auditing settings. If Auditing is enabled for SELECT statements, Oracle could be spending time writing and commit data to the AUDIT$ table.

If the average wait time is high, then examine the other log related waits for the session, to see where the session is spending most of its time. If a session continues to wait on the same buffer# then the SEQ# column of V$SESSION_WAIT should increment every second. If not then the local session has a problem with wait event timeouts. If the SEQ# column is incrementing then the blocking process is the LGWR process. Check to see what LGWR is waiting on as it may be stuck.


Some solution:

1) Reduce other I/O activity on the disks containing the redo logs, or use dedicated disks.
2) Try to reduce resource contention. Check the number of transactions (commits + rollbacks) each second, from V$SYSSTAT.
2) Alternate redo logs on different disks to minimize the effect of the archiver on the log writer.
3) Move the redo logs to faster disks or a faster I/O subsystem (for example, switch from RAID 5 to RAID 1).
4) Consider using raw devices (or simulated raw devices provided by disk vendors) to speed up the writes.
5) See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options in order to reduce the amount of redo being written.
6) See if any of the processing can use the COMMIT NOWAIT option (be sure to understand the semantics of this before using it).
7) Check the size of the log buffer as it may be so large that LGWR is writing too many blocks at one time.

Log file sync wait event: other considerations:

There may be a problem with LGWR's ability to flush redo out quickly enough if Oracle "log file sync" waits are significant for the entire system. The overall wait time for "log file sync" can be broken down into several components. If the system still shows high "log file sync" wait times after completing the general tuning tips above, break down the total Oracle wait time into the individual components. Then, tune those components that take up the largest amount of time.

The "log file sync" wait event may be broken down into the following components:

1. Wakeup LGWR if idle
2. LGWR gathers the redo to be written and issues the I/O
3. Wait time for the log write I/O to complete
4. LGWR I/O post processing
5. LGWR posting the foreground/user session that the write has completed
6. Foreground/user session wakeup

Tune the system based on the "log file sync" component with the most wait time. Steps 2 and 3 are accumulated in the "redo write time" statistic. (i.e. as found under STATISICS section of Statspack) Step 3 is the "log file parallel write" wait event. (See Metalink Note 34583.1:"log file parallel write") Steps 5 and 6 may become very significant as the system load increases. This is because even after the foreground has been posted it may take some time for the OS to schedule it to run.


Myth# User Action :

There are 3 main things you can do to help reduce waits on "log file sync":

1) Tune LGWR to get good throughput to disk.

  • Do not put redo logs on RAID 5.
  • Place log files on dedicated disks.
  • Consider putting log files on striped disks.


2) If there are lots of short duration transactions, see if it is possible to BATCH transactions together so there are fewer distinct COMMIT operations. Each commit has to have it confirmed that the relevant REDO is on disk. Although commits can be piggybacked by Oracle, reducing the overall number of commits by batching transactions can have a very beneficial effect.

3) Determine whether any activity can safely be done with NOLOGGING / UNRECOVERABLE options.

Thanks.....

1 comment:

  1. A Windows feature called superfetch preloads frequently used programmes into memory for quicker access. On some systems, nevertheless, it can result in performance concerns. To disable Superfetch in Windows, open Services, locate SysMain, right-click and choose Properties, change Startup type to Disabled, and then click Stop.

    ReplyDelete

Translate >>