[Skip Global Navigation]

SPSS Tech Tips

SPSS Direct Home

Understanding how conditions affect SQL left-outer joins

This is a complimentary tip available to SPSS Direct subscribers.

Question

I would like to perform a left-outer join (LOJ) on two tables in ShowCase® Query™. However, when I review the query, some records disappear. I have changed the join type to an inner join, but the results remain the same. Why am I losing records when doing an LOJ?

Answer

Joins are perhaps the most basic and critical operation performed when querying multiple files in a relational database. An incorrect join can prevent the correct initial set of records from being returned initially, making it very difficult (if not impossible) to produce the expected results. The fields used to join the files and the type of join operation are the two main factors to consider. Key fields for joining files are usually well established in most database environments. The join type frequently becomes the point of confusion and problems.

The basic inner join is the most common join type. One file joins directly to another based on matching values in specified key fields. The inner join is the simplest, most straightforward means of joining files. If no match is found, no record is returned. Only those records having the same key values in both files are included.

The next most common join type seems to be the left-outer join. The intent of the LOJ is to return all records in the primary, or “left,” file regardless of whether the secondary file has any records containing a matching value in the key field. When no matching records are present, the selected fields from the secondary file contain a NULL value. This is precisely how the LOJ functions when unaffected by other conditions or joins.

Errors can develop when other joins and conditions are included in the query. When other joins or conditions are placed on fields within the primary file of a LOJ, they will not necessarily prevent any or all NULL results from being returned. In these cases, elimination of NULL results from the “right” files is only coincidental to a condition removing records from the “left” file. On the other hand, conditions or joins placed on fields within a secondary file will almost always prevent any records with NULL values from being returned in the final results. This happens because records whose fields contain NULL values do not meet the criteria of the new condition or join.

As an example, consider a master file of sales personnel containing one record for each employee on the sales staff. There are ten records in this sample, which will be joined to a sales file that has one record for every sales order completed by each sales employee. While the employee master file has one record per employee, the sales file may contain any number of corresponding records for each employee (including no records for employees who have no sales). The files will be joined using the employee ID field. If an LOJ is used, there will be at least one record for every sales employee (a minimum of ten in this case). Those with completed sales will have one or more records where fields from the sales file contain data. An employee with no sales will have only one record, and all fields from the sales file contain NULL. Here is the SELECT statement and the sample output:

SELECT 
    SALESREP.EMPID,
    ORDERNO,
    CUSTID,
    ORDDATE,
    BILLED,
    PAID
FROM
    SJD.SALESREP SALESREP
    LEFT OUTER JOIN SJD.SALESORD SALESORD
    ON
    SALESREP.EMPID = SALESORD.EMPID
ORDER BY 
    1


Figure 1: A ShowCase Query results table from an LOJ with no additional conditions.
(Click to enlarge)

A condition is placed on the employee ID field in the “left” file that includes all employees. It does not eliminate any employee records or their corresponding sales records. All NULL records are still present. The results are the same as the previous query. Here is the SELECT statement and the results from this example:

SELECT
    SALESREP.EMPID,
    ORDERNO,
    CUSTID,
    ORDDATE,
    BILLED,
    PAID 
FROM
    SJD.SALESREP SALESREP
    LEFT OUTER JOIN SJD.SALESORD SALESORD
    ON 
    SALESREP.EMPID = SALESORD.EMPID WHERE
    SALESREP.EMPID IN('BAT','CAB','DAN','JAN','KAT',
	'SAP','SAS','STB','TRS','WAF')
ORDER BY 
    1


Figure 2: A ShowCase Query results table with conditions on left file. The conditions had no limiting effect on the results.
(Click to enlarge)

In the next example, the same SELECT is used except the condition is now applied to the employee ID field of the sales order file (the “right” file). Although all employees are included in the condition as before; all NULL results are now eliminated by simply moving the condition to the secondary file. Here is the SELECT statement and output for this query:

SELECT
    SALESREP.EMPID,
    ORDERNO,
    CUSTID,
    ORDDATE,
    BILLED,
    PAID
FROM
    SJD.SALESREP SALESREP
    LEFT OUTER JOIN SJD.SALESORD SALESORD
    ON
    SALESREP.EMPID = SALESORD.EMPID
WHERE
    SALESORD.EMPID IN('BAT','CAB','DAN','JAN','KAT',
	'SAP','SAS','STB','TRS','WAF')
ORDER BY
    1


Figure 3: A ShowCase Query results table with conditions on the right file. The records with NULL entries in Figures 1 and 2 no longer are included.
(Click to enlarge)

The removal of the NULL records in this case occurs because the condition applied to the “right” file of the join does not include the possibility of a NULL value. The appropriate records are eliminated. This interaction between added conditions and the initial result set, produced by the LOJ, is frequently misunderstood.

You can modify an added condition so that it allows NULL results to be returned. This requires that the condition be grouped together with an OR condition applied to the same field of the “right” file that specifies the IS NULL predicate operator.

The condition used in the previous example would appear as follows:

WHERE
    ( SALESORD.EMPID IN( 'BAT', 'CAB', 'DAN', 'JAN', 
	'KAT', 'SAP', 'SAS', 'STB', 'TRS', 'WAF' )
    OR SALESORD.EMPID IS NULL )

Adding this grouped OR condition prevents the original condition from removing all NULL results. This version of the condition produces the same results as seen in the first and second examples above.

There is one final important point to keep in mind. Conditions that can be added by other functions also play a role in determining whether NULL values are included. These functions can include ShowCase Suite Warehouse Manager™ security, JD Edwards business unit security, or registered third-party user exit programs, such as PentaSafe.

In summary, LOJs produce the initial result set they are designed to return. However, they do not act alone in a SELECT statement when determining the final query output. If you’d like more information on handling conditions added by other functions, please contact SPSS Technical Support at http://support.spss.com/.