
Gathering all rejects prior to killing a job
As an alternative to collecting incorrect rows up to the point where a job fails (Die on error), you may wish to capture all rejects from an input before killing a job.
This has the advantage of enabling support personnel to identify all problems with the source data in a single pass, rather than having to re-execute the job repeatedly, fixing one error (or set of errors) at a time.
Getting ready
Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached and the valid output is being sent to a temporary store (tHashOutput).
How to do it…
- Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.
- Add an onSubjobOk link from the tFileInputDelimited component to the tJava component.
- Add a flow from the tHashInput component to the tFileOutputDelimited component.
- Right-click the tJava component, select Trigger, and then Run if. Link the trigger to the tDie component. Click the if link, and add the following code:
  ((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0
- Right-click the tJava component, select Trigger, and then Run if. Link this trigger to the tHashInput component. Click the if link, and add the following code:
  ((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) == 0
The job should now look like the following:
- Drag the generic schema sc_cook_ch3_0010_genericCustomer onto both the tHashInput and tFileOutputDelimited components.
- Run the job. You should see that the tDie component is activated, because the file contained two errors.
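If you want to see the reject count that drives these Run if conditions, you can log it from the tJava component. The following is a minimal, hypothetical snippet; it assumes the reject file component is named tFileOutputDelimited_1, as in the conditions above:

    // Hypothetical tJava body: print how many rows were written to the
    // reject file. Talend populates <component>_NB_LINE in globalMap once
    // the subjob feeding the reject file has completed, which is why the
    // check happens after the onSubjobOk link fires.
    Integer rejectCount = (Integer) globalMap.get("tFileOutputDelimited_1_NB_LINE");
    System.out.println("Rejected rows: " + (rejectCount == null ? 0 : rejectCount));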
How it works…
In this exercise, we have created a validation stage prior to processing the data.
Valid rows are held in temporary storage (tHashOutput) and invalid rows are written to a reject file until all input rows have been processed.
The job then checks how many records were rejected (using the Run if link). In this instance, there are invalid rows, so the Run if link is triggered and the job is killed using tDie.
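For readers who prefer to see the control flow as code, here is a minimal plain-Java sketch of the same validate-all-then-commit pattern. All names are illustrative assumptions, not generated Talend code:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the pattern used by this job; the comments map
    // each part onto the components in the recipe.
    public class ValidateThenCommit {
        public static void main(String[] args) {
            List<String> source = List.of("ok,1", "ok,2", "bad", "also bad");
            List<String> valid = new ArrayList<>();   // plays the role of tHashOutput
            List<String> rejects = new ArrayList<>(); // plays the role of the reject file

            for (String row : source) {               // plays the role of tFileInputDelimited
                if (row.contains(",")) {              // trivial stand-in for real validation
                    valid.add(row);
                } else {
                    rejects.add(row);
                }
            }

            if (!rejects.isEmpty()) {                 // first Run if: reject count > 0
                System.err.println("Rejected rows: " + rejects);
                System.exit(1);                       // plays the role of tDie
            }
            valid.forEach(System.out::println);       // second Run if: write the target
        }
    }

The key design point is that nothing is written to the target until every input row has been validated, so a failure never leaves the target half-written.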
Tip
By ensuring that the data is correct before we start to process it into a target, we know that the data will be fit for writing to the target, and we thus avoid the need for rollback procedures.
The captured records can then be sent to the support team, giving them a record of all incorrect rows. These rows can be fixed in situ within the source file, and the job simply re-run from the beginning.
There's more…
This recipe is particularly important when rollback/correction of a failed job is complex, or where there may be a higher than expected number of errors in an input.
An example would be when there are multiple executions of a job that appends to a target file. If the job fails midway through, then rolling back involves identifying which records were appended to the file by the job before failure, removing them from the file, fixing the offending record, and then re-running. This runs the risk of a second error causing the same thing to happen again.
On the other hand, if the job does not die, but a subsection of the data is rejected, then the rejects must be manipulated into the target file via a second manual execution of the job.
So, this method enables us to be certain that our records will not fail to write due to incorrect data, and therefore saves our target from becoming corrupted.
See also
- The Validating against the schema recipe, in this chapter.