How to create multiple SQL Monitor Reports from History

Sometimes I would sit in front of the screen until late into the night and watch a batch job go by. At regular intervals I stored SQL Monitor reports for later analysis.
With Oracle 12 the historical SQL Monitor became available.
I have written a small script that can generate SQL Monitor reports from history.
IMHO it’s the most useful script I’ve ever written.
It has no parameters; I prefer to edit my scripts.
The script of course still has room for improvement.
Maybe someone would like to publish a better version?

And, by the way, it’s written for SQL*Plus …


set newpage 0
set space 0
SET LONG 10000000 LONGCHUNKSIZE 10000000 pages 0 lines 500
set echo off
set feedback off
set verify off
set heading off
set trimspool on
set sqlp ''
spool temp.sql
WITH reps as (
	SELECT REPORT_ID,
	  KEY1 SQL_ID,
	  KEY2 SQL_EXEC_ID,
	  KEY3 SQL_EXEC_START,
	  T.PERIOD_START_TIME,
	  T.PERIOD_END_TIME,
	  TRUNC(86400*(T.PERIOD_END_TIME-T.PERIOD_START_TIME)) DURATION,
	  DECODE(TO_NUMBER(EXTRACTVALUE(XMLTYPE(REPORT_SUMMARY), '/report_repository_summary/sql/plsql_entry_object_id')), 
				NULL, EXTRACTVALUE(XMLTYPE(REPORT_SUMMARY), '/report_repository_summary/sql/module'),
			   (SELECT OBJECT_NAME FROM DBA_OBJECTS WHERE OBJECT_ID = TO_NUMBER(EXTRACTVALUE(XMLTYPE(REPORT_SUMMARY), '/report_repository_summary/sql/plsql_entry_object_id')))) CALLED_FROM ,
	  EXTRACTVALUE(XMLTYPE(REPORT_SUMMARY), '/report_repository_summary/sql/module') module
	FROM DBA_HIST_REPORTS t,
	  dba_hist_sqltext st
	WHERE T.PERIOD_START_TIME BETWEEN TO_DATE('01.07.2017 01:00:00','DD.MM.YYYY HH24:MI:SS') AND TO_DATE('03.07.2017 23:59:00','DD.MM.YYYY HH24:MI:SS')
	  AND 86400*(T.PERIOD_END_TIME-T.PERIOD_START_TIME) >= 1000 -- arbitrary threshold, long-running statements
	  AND ST.SQL_ID                                      = T.KEY1
	  AND COMMAND_TYPE                                  <> 47 -- no PL/SQL 
	  and EXTRACTVALUE(XMLTYPE(REPORT_SUMMARY), '/report_repository_summary/sql/module')!='DBMS_SCHEDULER'), -- No gather stats
commands as (
	SELECT REPORT_ID, 1 line, 'spool sqlmon_'||SQL_ID||'_'||CALLED_FROM||'_'||REPORT_ID||'.html'  command FROM REPS
	UNION ALL
	SELECT REPORT_ID,2, 'SELECT DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL(RID => '|| REPORT_ID||', TYPE => ''EM'')       FROM dual;' FROM REPS
	UNION ALL
	SELECT REPORT_ID,3, 'SPOOL OFF' FROM REPS
)
SELECT COMMAND
  from commands
order by REPORT_ID,  line;
spool off
@@temp
set echo on
set feedback on
set verify on
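
By the way, if you want to see what is in the repository before generating anything, you can browse DBA_HIST_REPORTS directly. A minimal sketch (the component_name filter is my assumption; adjust the date range as needed):

SELECT report_id,
       key1 sql_id,
       period_start_time,
       period_end_time
  FROM dba_hist_reports
 WHERE component_name = 'sqlmonitor'
   AND period_start_time > TO_DATE('01.07.2017','DD.MM.YYYY')
 ORDER BY period_start_time;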

Can a select statement create a transaction Lock?

Recently I had to research an “enq: TM contention” wait event. I queried active session history (ASH) and found the blocking session’s ID.

I queried ASH again and found what the blocking session was doing right at the time the other session was waiting on the “enq: TM contention” lock. At that time a long-running select statement was active in the blocking session.
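
For illustration, a query along these lines can be used to find blocker and victim in ASH (a sketch; Diagnostics Pack required, and depending on the version the event may be spelled “enq: TM - contention”):

SELECT sample_time,
       session_id,
       blocking_session,
       event,
       sql_id
  FROM v$active_session_history
 WHERE event LIKE 'enq: TM%'
 ORDER BY sample_time;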

How is that possible? Can a select create a lock? What must have happened? Would it actually help to tune the select?

For a further hint you might want to read Arup Nanda’s blog on transaction locks.

enq: TX row lock contention and enq:TM contention

Solution: Index Rebuild: Magic or Voodoo?

Of course my readers have solved the problem. I would like to thank Jure Bratina, Andrew Sayer and Martin Berger for their contributions.
In the following you will see the entire test case in commented form. IMHO most of it is self-explanatory.
The following little script represents the initial setup. The schema is, as I said, the SH example schema.

alter table sales add  (sparse varchar2(300)); 
update sales set sparse = rpad('sometext',300, '*');
commit;
create index sparse_idx on sales (sparse);

Let us now examine the size of the index segment:

select blocks from user_segments where segment_name ='SPARSE_IDX';

   BLOCKS
---------
    44032

select leaf_blocks from user_indexes where index_name ='SPARSE_IDX';

LEAF_BLOCKS
-----------
      41766

Here is an update similar to the one I found in the original database:

update sales set sparse=NULL;

918843 rows updated.

exec dbms_stats.gather_table_stats(user,'SALES');

And how does it affect the statistics and the segments?

select leaf_blocks from user_indexes where index_name ='SPARSE_IDX';

LEAF_BLOCKS
-----------
          0
select blocks from user_segments where segment_name ='SPARSE_IDX';

    BLOCKS
----------
     44032

The optimizer only checks the index statistics; it does not check the segment statistics. The optimizer therefore believes that the index is tiny.
All the non-null values of the column sparse must be in the index. Therefore it is optimal, from the optimizer’s perspective, to scan the supposedly small index.
However, the index segment still has its full size. An index rebuild is required to resolve the issue.
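
For completeness, a minimal sketch of the fix:

-- rebuild releases the empty leaf blocks; online keeps the table available
alter index sparse_idx rebuild online;

select blocks from user_segments where segment_name = 'SPARSE_IDX';

After the rebuild the segment should drop back to a size that matches the (now empty) index content.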

Index Rebuild: Magic or Voodoo?

I actually don’t like to write about today’s topic. However, I hope you will agree this story is worth telling.

Via Freelists somebody asked whether an index rebuild could occasionally be useful. There was a case at hand where an index rebuild seemed to be useful after a big delete. I sometimes get annoyed by the popular myth that regular index rebuilds are helpful. This is one of the more popular database myths, as branded by Mr. Index, Richard Foote.

Thus, I could not resist the opportunity to trash the myth I don’t like and answered that there is no such case. I ignored the warning whisper in my head telling me that things are never that easy.
Of course, the unavoidable happened: Jonathan Lewis corrected me, stating that there are corner cases where an index rebuild can be useful (e.g. after a big delete).

As if this were not enough, shortly thereafter I came across the little statement below. (I present a test case that makes use of the SH schema.)

select time_id from sales where sparse is not null;

The execution plan with runtime statistics looks like this:


-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name       | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |            |      1 |        |      0 |00:00:00.04 |   41770 |
|   1 |  TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| SALES      |      1 |      1 |      0 |00:00:00.04 |   41770 |
|*  2 |   INDEX FULL SCAN                          | SPARSE_IDX |      1 |      1 |      0 |00:00:00.04 |   41770 |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SPARSE" IS NOT NULL)

Why did the optimizer choose a full index scan? Actually a full table scan would be much better. The estimate (E-Rows) is correct. The statistics are current.

Can you guess what is missing from my test case? Hint: I researched the DML on this table and I found an update statement.

Solution: Parallel Activity just drops

As you remember, we had an outer hash join with dominant null values. Thus, the parallel process dealing with the null values got most of the work to do.
How do we deal with that? Let us back up a bit.
A null value is a special case. In our join we are comparing a null value against a primary key that is defined not null. The null values will certainly not generate a match in the join. Therefore any of the parallel slaves could do the compare, as long as it is guaranteed not to generate a match.
It is therefore beneficial to convert the null values into some other value, as long as we fulfill two prerequisites:

  1. we have to convert the null values in such a way that they get evenly distributed across all parallel slaves
  2. we must make sure that the converted values do not generate a match

It took me a long time to figure this out. First I tried something like negative values, but they still all went to the same slave.
The issue is that our newly generated key needs to be roughly in range with the primary key of the build table to be spread across all slaves.
I could have tried to figure out the range in a with clause reading the data dictionary, but I could not get myself to do something I considered uncool.
Instead I tried something else, which is honestly a bit of a hack, but it looks way better: I used the primary key of the outer table to generate a non-null, widely spread key.
I also took advantage of the fact that the IDs were integers and changed the join condition to:

FROM T_TAB1 T1
LEFT OUTER JOIN T_TAB2 T2
ON ( NVL(T2.X_ID, T2.T2_ID + 0.5) = T1.X_ID )

That is certainly somewhat tricky. The only thing I can say in its favor is that it kind of worked.

After I applied my solution, the activity tab looked like this:

Compare this to the previous picture. It is quite impressive.
Well, in version 12 the database should be able to deal with skewed join distributions. But it still does not work in all cases, as Randolf Geist states.
If someone does not like my solution, they might have a look at _fix_control 6808773, but I do not warrant either solution. 😉
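
Just to show the mechanics, such a fix control would be set like this (a sketch; I would test this carefully before using it anywhere near production):

alter session set "_fix_control" = '6808773:ON';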


Parallel Activity just drops

In one of my DWH POCs for Oracle’s Real World Performance Group we saw an unexpected drop in activity. The customer wanted to know why it happened.
This is how SQL Monitor’s activity tab looked:

I investigated carefully and found out that the reason was in both cases an outer join that looked like this:

FROM T_TAB1 T1
LEFT OUTER JOIN T_TAB2 T2
ON T2.X_ID    = T1.X_ID

I found that 90% of all values in T1.X_ID were null values. The distribution for the join was hash. Thus, the parallel slave dealing with the null values got most of the work and was for quite some time the only one running. Table T1 was the biggest table; broadcasting it was out of the question.
(You may also compare Randolf’s article on skew parallel distribution: Parallel Execution Skew – Demonstrating Skew).
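
If you want to see such skew for yourself, V$PQ_TQSTAT is helpful. Queried in the same session right after the parallel statement has finished, it shows how many rows each PX server processed per table queue (a sketch):

SELECT dfo_number, tq_id, server_type, process, num_rows
  FROM v$pq_tqstat
 ORDER BY dfo_number, tq_id, server_type, process;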

Can you now re-code the join condition such that the join is done by all parallel slaves? (This was version 11g. In version 12 the optimizer should take care of it.) Further assumption: table T2 has a one-column primary key called T2_id.


Solution: Is the migration database slower?

On OTN somebody once asked whether the optimizer takes the size of the buffer cache into account when creating the execution plan. Under the usual disclaimer (it could be incorporated into the cost calculation), I would say: no. How should the size of the cache be taken into account anyway? If I have a larger buffer cache, is it more likely that a segment X is in the cache, so should I favor an index access? (I am arguing here analogously to the parameter OPTIMIZER_INDEX_CACHING.) That is not good logic. The size of the buffer cache is not a safe indicator of whether a particular segment is actually in the cache or not. As we will see later, the size of the cache can even be an argument for the full table scan.

To create an execution plan, the selectivity estimate is the key factor.

However, how an existing execution plan is executed is another matter. Here the execution engine makes the final decision.

For example, a full table scan (FTS) is often executed using “direct path reads”. This means that inevitably a physical I/O will be done. (For the database it is a physical I/O even if the result comes from the file system cache.)

But if the execution engine detects that a table is almost fully represented in the buffer cache, it can switch the execution from “direct path reads” to “scattered reads”. The “scattered read”, in contrast to the “direct path read”, will benefit from the buffer cache.

Tanel Poder wrote a good, but somewhat older, explanation of this concept: Optimizer statistics-driven direct path read decision for full table scans.

In short, there are two prerequisites that must be fulfilled for the execution engine to switch to “scattered reads”:

  1. The buffer cache must be large enough to cache the segment which is the subject of the FTS
  2. The segment must actually be almost completely cached


The first point can be checked easily. The analysis showed the following: the table was about 25 GB in size, in production as well as in the migration environment.
The buffer cache was currently 55 GB in production. In the migration environment it was only 5 GB in size.
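
The current buffer cache size can be checked, for example, with a sketch like this:

SELECT component,
       ROUND(current_size/1024/1024/1024, 1) size_gb
  FROM v$sga_dynamic_components
 WHERE component = 'DEFAULT buffer cache';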
Now we can conclude with certainty: in production the table could be cached completely. Theoretically, the runtime engine can therefore initiate a scan in memory. In the migration environment this is certainly not possible.
Was the table now actually cached to almost 100% in production?
To clarify this question, the statement below, which determines the number of cached blocks of a table, can be used.


SELECT sum(num_buf)
FROM X$KCBOQH x, dba_objects o
WHERE x.obj#=o.object_id
AND object_name='my table'
;
If the number calculated this way is almost identical to the number of blocks of the table (according to my tests > 90%, but no guarantee), “scattered reads” will be used.
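
To put both numbers side by side, something like this sketch can be used ('MY_TABLE' is a placeholder; X$KCBOQH is an internal structure and normally only visible to SYS):

SELECT (SELECT SUM(x.num_buf)
          FROM x$kcboqh x, dba_objects o
         WHERE x.obj# = o.object_id
           AND o.object_name = 'MY_TABLE') cached_blocks,
       (SELECT s.blocks
          FROM dba_segments s
         WHERE s.segment_name = 'MY_TABLE') total_blocks
  FROM dual;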

The analysis of the production database revealed that the table was indeed for the most part in the buffer cache. I could also show that the rapid FTS in memory sometimes did not happen due to a lack of cached blocks. We then double-checked our result: we increased the buffer cache of the migration environment and loaded the table into the cache.
After that the FTS was fast in the migration environment as well.
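
How did we load the table into the cache? A sketch of one way to do it (note that "_serial_direct_read" is an undocumented parameter; I would not touch it without careful testing, and my_table is a placeholder):

-- force buffered reads for this session, then scan the table once
alter session set "_serial_direct_read" = never;
select /*+ full(t) */ count(*) from my_table t;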

Some remarks:

Frits Hoogland published a newer, deeper analysis of the subject:

Investigating the full table direct path / buffered decision

Because of several observations I think the Exadata storage server can apply similar algorithms in relation to Smart Scan: a Smart Scan can be converted into a scan of the storage server’s flash cache. However, I do not have enough evidence to be 100% certain.

Is the migration database slower?

My customer set up a database in a virtual environment to support a migration. Initial tests showed that the migration database was many times slower than the production database. The tests focused on a particular query.

Here is an excerpt from an AWR report of the migration database. You can see the relevant data for the SQL command that you should investigate in the list “SQL ordered by Elapsed Time”.

Elapsed Time (s)   Executions   Elapsed Time per Exec (s)   %Total   %CPU   %IO
199.24             1            199.24                      98.16    3.93   96.72

For comparison, the same data from the production database:

Elapsed Time (s)   Executions   Elapsed Time per Exec (s)   %Total   %CPU   %IO
11.02              1            11.02                       65.95    99.98  0.00

What makes you curious? What working hypothesis would you choose to start your investigation, and what would you check? Hint: the SQL statement we investigate is a count that scans only one table. The execution plan is identical in both cases; it is a full table scan in each case.

Solution: Where did the downgrades come from?

As Stefan already correctly pointed out, the downgrades originated from parallel sessions that Toad issued. Under certain circumstances the invoking client process has long since ceased to exist, but the database session stays. As a result, we came under increasing pressure as the number of these sessions slowly grew. Fewer and fewer new PX servers could be allocated, because the pool is not infinite.
As a countermeasure we activated a profile that kills inactive sessions after a certain time.
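
A sketch of such a profile (the names and the 60-minute limit are made up):

-- idle_time is in minutes; resource_limit must be true for it to be enforced
create profile px_idle_limit limit idle_time 60;
alter user toad_user profile px_idle_limit;
alter system set resource_limit = true;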
The second reason was that the parameter parallel_adaptive_multi_user was set to true. This ensured that only about half of the parallel server pool could be used for parallel processing. The setting was obviously absurd in a data warehouse. However, the local DBAs insisted on not changing the default. For the first time I was faced with the absurd idea that a default is something of a taboo. To top it off, they triumphantly set the parameter parallel_threads_per_cpu to its default of 2. However, the OS had already reported every thread as a CPU, so now twice as many CPUs were reported as really existed.
As for the downgrades, three meaningless parameter values happily neutralized each other. Why the Resource Manager was still running, however, I honestly do not get. 😉

Where did the downgrades come from?

In a data warehouse we faced an increasing number of downgrades of parallelism. It seemed to be getting worse by the day. Indeed, it turned out that the number of freely available parallel processes was constantly decreasing. A closer examination of the PX servers in use showed that many had Toad as their source.
How was it possible that the number of parallel sessions that originated from Toad grew constantly?
On the other hand, there was still a comfortable number of PX servers available. How was it possible that we were still seeing downgrades? What parameters should we check?
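
As a starting point, a sketch of the checks I would run (the parameter list follows from the solution above):

select name, value
  from v$parameter
 where name in ('parallel_max_servers', 'parallel_adaptive_multi_user',
                'parallel_threads_per_cpu', 'cpu_count');

select status, count(*)
  from v$px_process
 group by status;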