On throwing away data

By Dylan Irion, 15th October 2019

Back at the end of May, my department held its annual student symposium, where I presented some very preliminary insights from this project. In previous blog posts, I spoke about some mark-recapture basics and the advantages that computer vision has brought to this project. Both elements form the background to the question I posed in this talk: what do we do with low-quality images?

Image by Dylan Irion

What do we do?

Historically? We throw it away!

Many people have spent countless hours and a great deal of money over the years to make this project a possibility, and as a statistician, it pains me to throw away anything. (Next up on A&E…Data Hoarders with your host Dylan Irion). We do this, however, to avoid generating multiple encounter histories for the same individual. For example, we might have two good-quality photos that we can match, resulting in an encounter history like 00101. Each poor-quality photo that we cannot match then gets an encounter history of its own, something like 01000 or 00010. What looks to us (and the model) like 3 individuals with 3 encounter histories is actually just one. This leads to over-estimation of abundance and under-estimation of survival rates.
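To make the example above concrete, here is a minimal Python sketch of how unmatched photos turn into extra encounter histories. The labels (`shark_A`, `unmatched_1`, `unmatched_2`), the function name and the five sampling occasions are invented for illustration; this is not the project's actual pipeline.

```python
from collections import defaultdict


def encounter_histories(sightings, n_occasions):
    """Build one 0/1 capture string per identity label.

    sightings: iterable of (label, occasion) pairs, occasions 0-indexed.
    """
    histories = defaultdict(lambda: ["0"] * n_occasions)
    for label, occasion in sightings:
        histories[label][occasion] = "1"
    return {label: "".join(bits) for label, bits in histories.items()}


# Hypothetical data: the same shark is photographed on occasions 1 to 4,
# but only the photos from occasions 2 and 4 are good enough to match.
as_recorded = [
    ("shark_A", 2),
    ("shark_A", 4),
    ("unmatched_1", 1),  # poor-quality photo, treated as a new individual
    ("unmatched_2", 3),  # poor-quality photo, treated as a new individual
]
truth = [("shark_A", occasion) for occasion in (1, 2, 3, 4)]

recorded_histories = encounter_histories(as_recorded, n_occasions=5)
true_histories = encounter_histories(truth, n_occasions=5)

print(recorded_histories)  # three histories: 00101, 01000, 00010
print(true_histories)      # one history: 01111
```

The model only ever sees the first dictionary, so it counts three animals, two of which are never seen again.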

Generally, we either use some measure of quality to exclude photos, or we exclude the first sighting of each individual, which eliminates the unmatched single “ghost” encounters. Both ensure that each individual has only 1 encounter history, at the cost of discarding lots of hard-earned data. Setting a quality threshold is difficult and risks confusing quality with distinctiveness, which will favour certain well-marked individuals. Censoring initial captures assumes that all ghost encounters are due to poor image quality, ruling out transient individuals that genuinely appear only once. Other approaches estimate a False Rejection Rate and incorporate this into the mark-recapture model.
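The first-capture censoring described above can be sketched in a few lines of Python. The function names and example histories are hypothetical, chosen only to show what the rule keeps and what it discards.

```python
def censor_first_capture(history):
    """Replace the first '1' in a 0/1 capture string with '0'."""
    first = history.find("1")
    if first == -1:
        return history  # never captured, nothing to censor
    return history[:first] + "0" + history[first + 1:]


def censor_all(histories):
    """Censor every history, then drop those with no captures left."""
    censored = [censor_first_capture(h) for h in histories]
    return [h for h in censored if "1" in h]


# One matched individual plus two single-sighting "ghost" histories.
with_ghosts = ["00101", "01000", "00010"]
after_censoring = censor_all(with_ghosts)
print(after_censoring)  # ['00001']

# A genuine transient, seen once in a good photo, disappears as well.
transient_removed = censor_all(["10000"])
print(transient_removed)  # []
```

The ghosts are gone, but so is the first sighting of the matched individual and every genuine one-off visitor.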

Consider the case

Image by Dylan Irion

A photo might be considered low quality, but there is still information present. Consider the first image of the post. Are they a match? Hard to say for certain. Now consider the image above. We can be a bit more certain that the image on the left doesn’t match any on the right. What I’m getting at is that we can start to estimate the probability that a pair of fins is a match (or isn’t). This is essentially what the software is doing anyway. We don’t even need to know the ‘true’ encounter histories. We estimate them statistically from the probability that each pair of fins matches.

There must be another way!

This type of approach is already being developed in the field of genetic mark-recapture and has also been used successfully in photo-id for situations where only one side of an individual is photographed – a bit like our sharks!

The problem? Of course, there’s a catch. With images from about 25,000 encounters, we’re talking about a massive pairwise matrix to process. For each of these encounters, we need to consider the probability that it matches each of the other 24,999 encounters! This is a huge computational cost, even with the state of modern computing. Perhaps we can reduce this load by considering only those encounters that cannot be matched by eye, or by reducing the [encounter X encounter] matrix to an [encounter X individual] matrix and considering only the probability that an encounter matches a known individual. But the real question remains: is it even worth it? How much will it improve our estimates?
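A quick back-of-the-envelope calculation in Python gives a sense of the scale. Only the 25,000 figure comes from the project; the counts of unmatched encounters and known individuals are placeholder assumptions, and the sketch assumes a match probability is symmetric, so each pair needs scoring only once.

```python
def pairwise_comparisons(n_encounters):
    """Number of distinct unordered pairs among n encounters."""
    return n_encounters * (n_encounters - 1) // 2


def encounter_by_individual_cells(n_unmatched, n_known):
    """Cells in a reduced [encounter X individual] matrix."""
    return n_unmatched * n_known


full_pairs = pairwise_comparisons(25_000)
print(full_pairs)  # 312487500

# Placeholder numbers, not project figures: 5,000 encounters that cannot
# be matched by eye, compared against 1,000 known individuals.
reduced_cells = encounter_by_individual_cells(5_000, 1_000)
print(reduced_cells)  # 5000000
```

Under those assumed numbers the reduced matrix is roughly 60 times smaller than the full pairwise comparison, although it can no longer link two unmatched encounters to each other.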

A statistician can dream (and experiment)!
