diff: Import new diff algorithms from libdiff
AbandonedPublic
Actions

Authored by thj on Oct 3 2022, 9:35 AM.

Details

Reviewers

bapt
pstef
jhb

Group Reviewers

Klara

Summary

Import and integrate modern diff algorithms from Game of Tree's libdiff.
This adds support for two new diff algorithms, Myers diff and Patience
diff.

These algorithms perform a different form of search compared to the
classic Stone algorithm and support escapes when worst case scenarios
are encountered.

Add the -A flag to allow selection of the algorithm, but default to
using the new Myers diff implementation.

The libdiff implementation currently only supports a subset of input and
output options supported by diff. When these options are used, but the
algorithm is not selected, automatically fallback to the classic Stone
algorithm until support for these modes can be added.

This review is published as a WIP, there are a couple of patches in the libdiff
import that still need to be upstreamed, but I think this is ready for feedback
for the integration of the library.

Test Plan

The new algorithms pass all but one subtest of the existing diff tests. This
one I have forced to stone, I think this is okay as it very recent
functionality.

Due to different diff algorithms generating different, but equivalent diffs we
have to verify functionality by running each an ensuring that applying the
created diff with patch creates the same file.

The following test file runs through commits passed on the command line and
checks that the diff remains valid between unpatched diff, gdiff and the
updated diff. I have used this for testing and it is happy with the most recent
1000 FreeBSD commits.

#!/bin/sh
#
# Test diff command by running through the commits in a git repository that is
# the current working directory. Test is performed by taking each commit and
# checking if it is equivalent (patched file is the same) as installed diff,
# gdiff and the git diff output. test-commit.sh uses $testdiffcmd and tries to
# only log minimal progress and errors.

testdiffcmd=/usr/obj/code/freebsd/worktrees/diff-got/amd64.amd64/usr.bin/diff/diff.full

#
#
# Takes commits to test on the command line
#	$ test-commit.sh 57f7a82fbb4299a255c2c20ec812258004a90631 7d98cc096b995ca3bfd85277ed009b0f872c3e1b ...
#
# failed diffs end up in test-commit.log
#
# To get big lists of commits to test we can use git log and xargs, Run through
# many commits as generated by git in 100 commit chunks increasing -P adds more
# jobs, but there seem to be concurrency issues:
# 	git log -n 1000 --pretty=format:"%H" | xargs -P1 -L 100 -t sh ./test-commit.sh unified
# will run through 1000 commits, in 100 commit chunks, with one thread
# 
# Testing files are named to try and allow concurrent access to the git repo
# for tests.

mode=$1
shift

case $mode in 
"classic")
	diffflags=""
	patchflags="-s -n"
	;;
"ed")
	diffflags="-e"
	patchflags="-e"
	;;
"unified")
	diffflags="-u"
	patchflags="-s -u"
	;;
"func")
	diffflags="-u -p"
	patchflags="-s -u"
	;;
*)
	echo "no valid mode specified"
	echo "usage: test-commit.sh [classic | ed | unified ] hash hash hash ..."
	exit
	;;
esac

commits=$@

if [ -z "$commits" ]
then
	echo no commits specified
	exit
fi

outdir=testdir
logfile=test-commit.log

patchcmd=gpatch

commitcount=0
filecount=0
skipcount=0

for commit in $commits
do
	commitcount=$((commitcount+1))
	prevcommit=$(git log -n 1 --pretty=format:"%H" $commit^1)
	files=$(git diff --name-only $prevcommit $commit)

	if [ -z $prevcommit ]
	then
		echo "no previous commits for $commit"
		continue
	fi
	printf "$commitcount:  $commit -> $prevcommit\n\t"

	for f in $files
	do
		printf "."
		filecount=$((filecount+1))
		filename=$(basename $f)

		afile=${outdir}/${filename}-${prevcommit}-a
		bfile=${outdir}/${filename}-${commit}-b

		gitdiff=${outdir}/gitdiff-${prevcommit}-${commit}
		gdiff=${outdir}/gdiff-${prevcommit}-${commit}
		bsddiff=${outdir}/bsddiff-${prevcommit}-${commit}
		testdiff=${outdir}/testdiff-${prevcommit}-${commit}

		gitpatch=${outdir}/gitpatch-${prevcommit}-${commit}
		gdiffpatch=${outdir}/gdiffpatch-${prevcommit}-${commit}
		bsdpatch=${outdir}/bsdpatch-${prevcommit}-${commit}
		testdiffpatch=${outdir}/testdiffpatch-${prevcommit}-${commit}

		git --no-pager show ${prevcommit}:${f} 2>/dev/null > $afile
		if [ $? -ne 0 ]
		then
			#echo "no a file skipping $f"
			skipcount=$((skipcount+1))
			continue
		fi

		git --no-pager show ${commit}:${f} 2>/dev/null > $bfile
		if [ $? -ne 0 ]
		then
			#echo "no b file skipping $f"
			skipcount=$((skipcount+1))
			continue
		fi

		cp ${afile} ${gitpatch}
		cp ${afile} ${gdiffpatch}
		cp ${afile} ${bsdpatch}

		git diff ${prevcommit}:${f} ${commit}:${f} > ${gitdiff}
		gdiff ${diffflags} $afile $bfile > ${gdiff}
		diff ${diffflags} $afile $bfile > ${bsddiff}

		# Check that the git patch creates b file
		$patchcmd -s -u ${gitpatch} ${gitdiff}

		if [ $? -gt 1 ]
		then
			echo "patch failed skipping $f"
			skipcount=$((skipcount+1))
			continue
		fi

		cmp ${bfile} ${gitpatch}

		# Check that the bsddiff (stone) patch creates b file
		$patchcmd ${patchflags} ${bsdpatch} ${bsddiff}
		cmp ${bfile} ${bsdpatch}

		if [ $? -ne 0 ]
		then
			# if the first test failed, this might be a binary. If
			# it is skip this file
			if ! file --mime $f | grep -q "text"
			then
				file --mime $f | grep -q "text"
				echo "binary file skipping $f"
				skipcount=$((skipcount+1))
				continue
			fi
		fi

		# Check that the gdiff patch creates b file
		$patchcmd ${patchflags} ${gdiffpatch} ${gdiff}
		cmp ${bfile} ${gdiffpatch}

#diffstat < ${gitdiff}
#wc -l ${afile} ${bfile}

		if [ ! -z $testdiffcmd ]
		then
#echo "running: $testdiffcmd ${diffflags} $afile $bfile > ${testdiff}"
			$testdiffcmd -A myers ${diffflags} $afile $bfile > ${testdiff}

			cp ${afile} ${testdiffpatch}

			$patchcmd ${patchflags} ${testdiffpatch} ${testdiff}
			cmp ${bfile} ${testdiffpatch}

			if [ $? -ne 0 ]
			then
				echo failed $prevcommit $commit >> ${logfile}
				echo $(tput bold) test diff failed $(tput sgr0)
				echo diff ${diffflags} ${afile} ${bfile}
				echo "processed $filecount files from $commitcount commits (skipped $skipcount)"
				exit 255
			fi
		fi
	done
	printf "\n"
done

#echo "Sucessfully diffed $filecount files from $commitcount commits (skipped $skipcount)"
#echo "$commits" | wc -l

#commitbacklog=100
#commits=$(git log -n ${commitbacklog} --pretty=format:"%H")
# rm log.out; time git log -n 1000 --pretty=format:"%H" | xargs -P4 -L 100 -t sh ./test-commit.sh

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 47762
Build 44649: arc lint + arc unit

Event Timeline

thj created this revision.Oct 3 2022, 9:35 AM

Herald added subscribers: jonathan, emaste, imp, bdrewery. · View Herald TranscriptOct 3 2022, 9:35 AM

thj requested review of this revision.Oct 3 2022, 9:35 AM

Harbormaster completed remote builds in B47679: Diff 111338.Oct 3 2022, 9:35 AM

thj edited the test plan for this revision. (Show Details)Oct 3 2022, 9:36 AM

When ready to commit is the intent that libdiff will go via contrib?

move libdiff into contrib

Harbormaster completed remote builds in B47762: Diff 111610.Oct 10 2022, 3:11 PM

emaste added inline comments.Oct 10 2022, 6:32 PM

usr.bin/diff/diffreg.c
253	What's this?

Would it be possible to omit the compat subdirectory from the import?

jamie_catflap.org added a subscriber: jamie_catflap.org.Mar 25 2023, 10:39 PM

guest-patmaddox added a subscriber: guest-patmaddox.Mar 26 2023, 6:59 PM

What's preventing us from landing this?

des added inline comments.Feb 16 2024, 4:31 PM

usr.bin/diff/diffreg_new.c
142	Our `diffreg()` expects `diffreg_new()` to return `D_SAME`, `D_DIFFER`, `D_BINARY` or `D_ERROR`. However in several places we give `rc` the return value from a library function (e.g. `diff_atomize_file`) which returns 0 or an error code, then `goto done` which eventually returns `rc` unmodified. This means that unless an error occurs we almost always return 0 from `diffreg_new()` regardless of whether the files differed. The net result is that `diff -s foo bar`, where `foo` and `bar` are not identical, will correctly show the differences between `foo` and `bar`, but will then incorrectly add `Files foo and bar are identical.`
216	`rc` needs to be `D_ERROR` here.
219	`rc` needs to be `D_ERROR` here.
260	`rc` needs to be `D_DIFFER` here.