Knowing SQL Server: Joins methods

The Nested Loops join uses one join input as the outer input and one as the inner input. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table. A Nested Loops join is particularly effective if the outer input is small and the inner input is preindexed and large. In many small transactions, such as those affecting only a small set of rows, index Nested Loops joins are superior to both Merge joins and Hash joins. In large queries however, Nested Loops joins are often not the optimal choice.

An index nested loops join performs better than a merge join or hash join if a small set of rows are involved. Whereas, if a large set of rows are involved the Nested Loops join might not be an optimal choice. Nested Loops support almost all join types except right and full outer joins, right semi-join and right anti-semi join.

Merge Join

The first thing that you need to know about a Merge join is that it requires both inputs to be sorted on join keys/merge columns (or both input tables have clustered indexes on the column that joins the tables) and it also requires at least one equijoin (equals to) expression/predicate.

Because the rows are pre-sorted, a Merge join immediately begins the matching process. It reads a row from one input and compares it with the row of another input. If the rows match, that matched row is considered in the result-set (then it reads the next row from the input table, does the same comparison/match and so on) or else the lesser of the two rows is ignored and the process continues this way until all rows have been processed..

A Merge join performs better when joining large input tables (pre-indexed / sorted) as the cost is the summation of rows in both input tables as opposed to the Nested Loops where it is a product of rows of both input tables. Sometimes the optimizer decides to use a Merge join when the input tables are not sorted and therefore it uses an explicit sort physical operator, but it might be slower than using an index (pre-sorted input table).

Hash Join

A Hash join is normally used when input tables are quite large and no adequate indexes exist on them. A Hash join is performed in two phases; the Build phase and the Probe phase and hence the hash join has two inputs i.e. build input and probe input. The smaller of the inputs is considered as the build input (to minimize the memory requirement to store hash table discussed later) and obviously the other one is the probe input.

During the build phase, joining keys of all the rows of the build table are scanned. Hashes are generated and placed in an in-memory hash table. Unlike the Merge join, it's blocking (no rows are returned) until this point.

During the probe phase, joining keys of each row of the probe table are scanned. Again hashes are generated (using the same hash function as above) and compared against the corresponding hash table for a match.

A Hash function requires significant amount of CPU cycles to generate hashes and memory resources to store the hash table. If there is memory pressure, some of the partitions of the hash table are swapped to tempdb and whenever there is a need (either to probe or to update the contents) it is brought back into the cache.

There is no “ideal join,” it all depends on the data being joined.

From a resource perspective, a nested loop join generally uses fewer resources, and seeing one generally is often an indicator of good overall performance. Usually best for smaller joins.

Merge joins are often used for moderate to large data sets, and are most efficient if the joined columns are pre-sorted, otherwise they have to be sorted before the merge join can occur.
Hash joins are often used when very large data sets are used. Hash joins parallelize and scale better than other joins and are good at minimizing response times for OLAP queries. Seeing one may be a hint that too much data is being returned, especially in OLTP applications.

Knowing SQL Server

Saturday, July 4, 2015

Joins methods

No comments:

Post a Comment

Blog Archive