In the most recent version 1.0.5 of Behold released yesterday, I’ve fixed and improved Behold’s ancestral loop checking. This was prompted by a question on the Genealogy SE Q&A site by Enno Borgsteede who asked How can I find loops in GEDCOM files?
This sort of a problem is really appealing to me. Prior to graduating in Computer Science, I had an undergraduate degree in Statistics, and one of my favorite courses was Combinatorics. (See how I solved one of my combinatoric problems at my question on Stack Overflow: What ever happened to APL?)
I had originally implemented loop checking as a simple recursive routine. Recursion is where an algorithm calls itself in simpler and simpler form and then backs out again to result in the solution. So in the case of Ancestral Loops, I would solve the problem for a person by seeing if there were any loops back to that person from either of that person’s parents. Each parent would be solved by their solving their parents, etc. until everyone was done.
I thought this was an easy problem, and once a person was encountered, you would simply mark them as “visited” and not follow them again. But I was wrong. My method was missing loops.
The answer provided by Tom Wetmore to Enno’s question said that Tom was using Tarjan’s algorithm to report on the loops. I hadn’t heard of the algorithm before and researched into it. Tarjan’s strongly connected components algorithm, published in 1972, was designed to partition a graph so that each component with at least one internal connection will contain at least one cycle. The term “cycle” and “loop” can be used interchangably.
There is a great explanation of Tarjan’s algorithm on Wikipedia and includes pseudocode and the wonderful little animation that I’ve inserted below.
The animation shows 4 connected components with 4 loops.
The algorithm has been proved to work and it is fast. But, ugh, the pseudocode is horribly ugly. Even with my background, I was going to have a tough time implementing that algorithm correctly.
I searched the web to see if there was an implementation of Tarjan’s in Delphi or Pascal, either of which I could have adapted easily. No such luck. I did find several implementations in other languages (Java, .NET and Python), but translation of those implementations to Pascal looked even more difficult than directly programming the algorithm. Also I also read somewhere (can’t find the source again) that one of the implementations was wrong. Due to the complexity of the algorithm, I shouldn’t necessarily trust that those other programmers implemented the algorithm without error.
A couple of days ago, also prompted by Enno’s question, Tamura Jones wrote about Finding Ancestral Loops. In that article he stated:
Loops detection requires some knowledge and understanding of graph theory and algorithms. Sadly, few software developers have the necessary background in theoretical computer science.
This is correct. These advanced mathematicians are good at defining and proving and expressing their algorithms, but not in a simplistic enough manner that most programmers can understand well enough.
During this fascinating research, I had thought Tom’s answer of Tarjan’s algorithm was the correct answer. But as I started to try to implement it, I realized that it wasn’t going to give me the individual loops I wanted to have. For example, see this diagram from the Wikipedia writeup on “Strongly connected component”:
The diagram shows 3 strongly connected components, but 4 loops, with c-d and d-h being two loops in one component.
Surely, I thought, there must be an algorithm to find such loops at the atomic level. And I was right. Donald B. Johnson presents the best algorithm in his 1975 paper: Finding All The Elementary Circuits of a Directed Graph. In fact, this algorithm was shown to be even faster than Tarjan’s. Unfortunately, the pseudocode is as hard to understand and the implementation looked to be even more complex.
By now this had taken all weekend. My family couldn’t pry me out of my room. I needed to finish this so I could open my door and see the real world again. But after about 10 false starts on Tarjan’s and then Johnson’s algorithm, I decided to go back to good old recursion. I really just needed a correction to my original implementation.
The help came from this Depth-First Search (DFS) writeup by Rashid Bin Muhammad, PhD. The algorithm was simple, and told me which nodes to properly mark as visited by partitioning them into White (undiscovered), Gray (discovered but not finished) and Black (finished). That appeared to do the trick.
I was now only worried about execution time. It was possible that for large trees with lots of loops, it might takes seconds, minutes, or even hours to find all the loops. But these worries were alleviated when my BetaBeast test GEDCOM file (graciously provided to me by Andy Hatchett) was handled, finding exactly 2,727 ancestral loops containing 1,522,771 people in a fraction of a second.
The best part of the implementation, AFAIAC, is the way I now display these loops in Behold. Here is how a few of the Ancestral Loops from BetaBeast are shown in Behold:
The report lists each loop, showing the order of the people, parents to children, the person’s name is hyperlinked to their information in the Everything Report, their birth and death years are shown if given, their assigned number in Behold is listed, and an indication if they were born before their parent or died before their parent was born. This will make it very easy to find and fix the problems.
So for me, that was fun. It was the sort of thing that kept me up each night until 2:30 in the morning, making me a basket case the next day.
I’m actually just as excited about all the other things I’ve been working on in Behold that are coming soon, including Life Events and Consistency Checking that I think will be a rather amazing implementation. But I’ll wait for the critics to judge it when it comes out.