I'm only going by inspection here, but I don't think this actually works.
The mappers take two input streams and tag each record with its source (0 = WARC, 1 = dupe list), then emit to the reducer using a composite key consisting of the WARC ID plus the source tag.
The reducer counts how many values a key has and only outputs the document if the count is 1. That would work if the key were just the WARC ID, but because the key also includes the source tag, every key has exactly one value, so nothing is ever suppressed: the reducer will add the dummy dupe-list records to the output stream rather than deleting the duplicate documents.
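For concreteness, here's a minimal sketch of one way to fix it: key on the WARC ID alone and move the source tag into the value, so a duplicated document and its dupe-list marker land in the same reduce group. The class and tag encoding here are hypothetical, not the pipeline's actual classes:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch. Assumes the mappers emit the bare WARC ID as the key
// and prefix the value with the source tag ("0" + payload for a WARC record,
// "1" for a dupe-list entry) instead of folding the tag into the key.
public class DedupeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text warcId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        Text document = null;
        boolean isDupe = false;
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("1")) {
                // ID appeared in the dupe list
                isDupe = true;
            } else {
                // Real WARC record; copy it, since Hadoop reuses the
                // Text object across iterations.
                document = new Text(s.substring(1));
            }
        }
        // Emit only documents whose ID never appeared in the dupe list.
        if (!isDupe && document != null) {
            ctx.write(warcId, document);
        }
    }
}
```

The original count-equals-1 test would also work with this grouping, but checking the tag explicitly is more robust if an ID ever shows up more than once in either input.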
I've reorganized the data flow, collapsing phases 1, 2, 3.1, 3.2, 3.3, and 3.4 into a single job that generates both text-only WARCs and a list of duplicate document IDs, and I was converting Phase 4 into my new phase 2 when I ran across this.
If you're open to adopting my new workflow (when it's ready to be reviewed), this can probably be deferred, but if you're going to use the existing pipeline for a while, someone should have a look and see whether I'm confused or this is a real problem.