Is YOLO-Behaviour appropriate for my study system?
For any new computer vision method, it is important to figure out whether it is appropriate for your study system. While YOLO-Behaviour is powerful and robust, it has a few major shortcomings that make it less suitable for some study systems than others. Here is a list of considerations to think through before diving into a project and spending time annotating data:
Is the behaviour you want to detect visually distinguishable from a single image?
Do you have sufficient images per behavioural class to train the model?
Are the videos relatively consistent, without major noise?
Do you have (or can you prepare) a validation dataset to ensure the model is detecting events appropriately? (A minimal sanity-check sketch follows the next paragraph.)
If the answers to the questions above are all YES, then you should be good to go. If not, I expand on each point below, to help you figure out whether this method is appropriate for your study system.
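Regarding the last point on validation, here is a minimal sketch of one way to compare model output against a human-annotated validation set: correlate per-video event counts scored by a human with those detected by the model. The file names and counts below are hypothetical placeholders; the real comparison will depend on how your events are scored.

```python
# Minimal sketch of a validation sanity check (hypothetical data).
from scipy.stats import pearsonr

# Placeholder per-video event counts: human-scored vs. model-detected
manual_counts = {"video_01.mp4": 12, "video_02.mp4": 5, "video_03.mp4": 20}
model_counts = {"video_01.mp4": 11, "video_02.mp4": 7, "video_03.mp4": 19}

videos = sorted(manual_counts)
human = [manual_counts[v] for v in videos]
model = [model_counts[v] for v in videos]

# A high correlation suggests the model captures the behaviour at the
# level you care about (e.g., per-video event rates)
r, p = pearsonr(human, model)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```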
1. Is the behaviour visually distinguishable?
This is the most important consideration, and the major shortcoming of the method. The YOLO model only analyzes SINGLE IMAGES, and tries to detect a behaviour frame by frame. If the behaviour you want to detect has a temporal component, is context dependent, or if even you can't tell from a single image whether the behaviour is happening, then the model will likely also fail.

For example, let's say you would like to detect a pigeon spinning around as part of its courtship display. From the video, you will be able to see it turn around, but since YOLO only takes single-frame inputs, every frame of this spinning behaviour will just look like a pigeon standing still or walking. This is an example where temporal information is important, and YOLO-Behaviour might not be appropriate. For such a task, methods that take video input (like DeepEthogram) or methods that first estimate keypoints might be more appropriate.
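A quick way to test whether a behaviour is distinguishable from a single image is to sample random frames from your videos and check whether you (or a naive observer) can label them by eye. Below is a minimal sketch using OpenCV; the video path and sample size are placeholders.

```python
# Sample random frames from a video and save them for visual inspection.
# If a human cannot identify the behaviour from these stills alone,
# a single-frame detector like YOLO likely cannot either.
import random
import cv2

video_path = "example_video.mp4"  # hypothetical path
cap = cv2.VideoCapture(video_path)
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

for idx in sorted(random.sample(range(n_frames), min(10, n_frames))):
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to the sampled frame
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(f"frame_{idx:06d}.png", frame)  # inspect these by eye
cap.release()
```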
As a second example, let's say you want to detect whether Siberian Jays are performing a submissive display or flying (this was a real example that was excluded from the manuscript because the method was not appropriate). The submissive display is a behaviour where the jay rapidly flaps its wings while screaming, sitting on a perch. Visually, this behaviour is very similar to flight, and it was easily confused with birds flying in and out of the frame. This is an example where context matters: which behaviour it is depends on what the bird is doing before and after, and on other cues (like audio).
Hopefully these examples give you an idea of when the method might not work. Just remember: all the model sees is a single frame and its visual features, so anything that cannot be determined from the image alone will likely not work.