As you remember, we had an outer hash join with dominant null values. Thus, the parallel process dealing with the null values got most of the work to do.
How would we deal with it? Let's back up a bit.
A null value is a special case. In our join we are comparing a null value against a primary key that is defined as NOT NULL. The null values will therefore never generate a match. Thus, any of the parallel slaves could do the comparison, as long as it is guaranteed not to generate a match.
It is therefore beneficial to convert the null values into some other value, as long as we fulfill two prerequisites:
- we have to convert the null values in such a way that they get evenly distributed across all parallel slaves
- we must make sure that the converted values do not generate a match
It took me a long time to figure this out. First I tried something like negative values, but they still all went to the same slave.
The issue is that our newly generated key needs to be roughly in the range of the PK of the build table in order to be spread across all slaves.
I could have tried to figure out the range in a WITH clause by reading the data dictionary, but I could not bring myself to do something I considered uncool.
Instead I tried something else, which is honestly a bit of a hack, but it looks way better: I used the PK of the outer table to generate a not-null, widespread key.
I also took advantage of the fact that the IDs were integers and changed the join condition to:
FROM T_TAB1 T1
LEFT OUTER JOIN T_TAB2 T2
ON ( NVL(T2.X_ID, T2.T2_id + 0.5) = T1.X_ID )
That is certainly somewhat tricky. The only thing I can say in its favor is that it kind of worked.
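If you want to verify the spread yourself, one standard way (not something from the POC itself, just the usual check) is to look at the rows handled per PX process right after running the statement in the same session:

SELECT dfo_number, tq_id, server_type, process, num_rows
FROM v$pq_tqstat
ORDER BY dfo_number, tq_id, server_type, process;

With the NVL trick in place, NUM_ROWS should be spread roughly evenly across the consumer slaves instead of one slave holding almost all the rows.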
After I applied my solution, the activity tab looked like this:
Compare this to the previous picture. It is quite impressive.
Well, in version 12 the database should be able to deal with join distribution skew on its own. But it still does not work for all cases, as Randolf Geist states.
If someone does not like my solution, they might have a look at _fix_control 6808773, but I do not warrant either solution. 😉
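Purely for reference, and again without warranty: a fix control like that would typically be switched at session level like this (whether ON or OFF is the right value for your case is something you have to test yourself):

ALTER SESSION SET "_fix_control" = '6808773:ON';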
In one of my DWH POCs for Oracle's Real World Performance Group we saw an unexpected drop in activity. The customer wanted to know why it happened.
This is how SQL Monitor's activity tab looked:
I investigated carefully and found out that the reason was in both cases an outer join that looked like this:
FROM T_TAB1 T1
LEFT OUTER JOIN T_TAB2 T2
ON T2.X_ID = T1.X_ID
I found that 90% of all values in T1.X_ID were null. The distribution for the join was hash. Thus, the parallel slave dealing with the null values got most of the work and was for quite some time the only one running. Table T1 was the biggest table, so broadcasting it was out of the question.
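A quick way to confirm such a distribution (a generic check, not the exact query from the POC) is simply to count the nulls:

SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN X_ID IS NULL THEN 1 ELSE 0 END) AS null_rows
FROM T_TAB1;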
(You may also want to compare Randolf's article on skewed parallel distribution: Parallel Execution Skew – Demonstrating Skew.)
Can you now re-code the join condition such that the join is done by all parallel slaves? (This was version 11g. In version 12 the optimizer should take care of it.) Further assumption: table T2 has a one-column primary key called T2_id.
As Stefan already correctly pointed out, the downgrades originated from parallel sessions that Toad had issued. Under certain circumstances the invoking client process has long ceased to exist, but the database session stays. As a result, we came under increasing pressure as the number of these sessions slowly grew. Fewer and fewer new PX servers could be allocated, because the pool is not infinite.
As a countermeasure we activated a profile that kills inactive sessions after a certain time.
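In case you want to sketch such a profile yourself, the basic recipe looks like this (profile and user names are made up; IDLE_TIME is in minutes and is only enforced when resource_limit is set to true):

ALTER SYSTEM SET resource_limit = TRUE;
CREATE PROFILE dwh_clients LIMIT IDLE_TIME 30;
ALTER USER toad_user PROFILE dwh_clients;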
The second reason was that the parameter parallel_adaptive_multi_user was set to true. This ensured that only about half of the parallel server pool could effectively be used for parallel processing. The parameter was obviously absurd in a data warehouse. However, the local DBAs insisted on not changing the default. For the first time I was faced with the absurd idea that a default is something of a taboo. Instead, they triumphantly left the parameter parallel_threads_per_cpu at its default of 2. However, the OS had reported all threads as CPUs, so twice as many CPUs were reported as really existed, because each core showed up as two threads.
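Just to make it concrete, this is roughly what I would have considered sane for this warehouse (a sketch following the reasoning above; both parameters can be changed with ALTER SYSTEM):

ALTER SYSTEM SET parallel_adaptive_multi_user = FALSE;
ALTER SYSTEM SET parallel_threads_per_cpu = 1;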
As for the downgrades, three meaningless parameter values happily neutralized each other. Why, however, the Resource Manager was still running, I honestly do not get. 😉
In a data warehouse we faced an increasing number of downgrades of parallelism. It seemed to be getting worse by the day. Indeed, it turned out that the number of free parallel processes was constantly decreasing. A closer examination of the PX servers in use showed that many had Toad as their source.
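Such an examination can be done, for example, by grouping the allocated PX servers by the program of their query coordinator (a sketch based on v$px_session and v$session, not necessarily the exact query we used):

SELECT s.program, COUNT(*) AS px_servers
FROM v$px_session p
JOIN v$session s ON s.sid = p.qcsid AND s.serial# = p.qcserial#
WHERE p.sid <> p.qcsid  -- count only the slaves, not the query coordinator itself
GROUP BY s.program
ORDER BY px_servers DESC;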
How was it possible that the number of parallel sessions that originated from Toad grew constantly?
On the other hand, there was still a comfortable number of PX servers available. How was it possible that we were still seeing downgrades? Which parameters should we check?
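If you want to look into this yourself, two standard checks (again only sketches of the idea, not the exact queries we ran) are the PX server pool and the cumulative downgrade statistics:

-- PX servers in use vs. still available
SELECT status, COUNT(*) AS cnt
FROM v$px_process
GROUP BY status;

-- downgrade statistics since instance start
SELECT name, value
FROM v$sysstat
WHERE name LIKE 'Parallel operations%';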