Workshop: Statistics and Biodiversity Science Workshop

Date: Wednesday, 17 Oct 2012
Time: 08:30 - 17:30
Location: Mappin Pavilion, Zoological Society of London
Event series: CSML Workshops

When: We will hold the first workshop on 17th Oct 2012 with a second workshop in 2013 to present progress.

Where: Mappin Pavilion, Zoological Society of London, London NW1 4RY.

Agenda for Oct 2012:
8.30-9am: Coffee
9-10am: Welcome and Academic Speed Dating
10-10.30am: Coffee
10.30am-1pm: Analytical problems facing computational statisticians and biodiversity scientists (10 mins per talk plus 5 mins for discussion).
1-2pm: Lunch
2-4pm: Group sessions (break-out groups to discuss the posed problems)
4-4.30pm: Coffee
4.30-5.30pm: Group session reports
5.30pm: Workshop close
7-9pm: Dinner with participants

Tickets must be booked by 9 October.

For directions please see:
For more on the Mappin Pavilion please see:,463,AR.html

Collaborative opportunities for computational statistics and biodiversity science

Aim: This workshop brings together leading UCL statistics and machine learning researchers and leading UK biodiversity scientists (Zoological Society of London, Microsoft Research Cambridge, UCL Department of Genetics, Evolution and Environment, and CoMPLEX UCL). We will explore how to build efficient, mutually beneficial collaborations that challenge the state of the art in both statistical methods and biodiversity science. The workshop will establish collaborations between the Zoological Society of London, Microsoft Research and UCL with a view to forming an ‘Advanced Technology for Nature Unit’.

Background: Wild nature is declining rapidly as humans use more of earth’s resources. Better tools are needed to document, monitor, and conserve wildlife populations. Advances in methods to collect biodiversity data (such as global citizen science programmes) are creating a critical need for better statistical approaches to process and analyze this information. From a computational statistics perspective, these new biodiversity data sources are presenting a challenge to available tools and there is a great opportunity to develop new methodological approaches.

The workshop will be written up as a ‘Workshop Report’ in Biology Letters with all the participants as authors.

Problem Summary (so far)
Problem 1 (Bio). Identifying individuals and species from images and video in the wild – Marcus Rowcliffe, Kirsty Kemp, Chris Carbone, Alasdair Davies, Rajan Amin, Olivia Needham, Tim Wacher. Camera traps are emerging as a central tool for monitoring terrestrial vertebrate diversity, abundance and behaviour. However, while camera trap technology has recently improved significantly, the analytical tools for dealing with the resulting information are under-developed in two main ways. First, processing images to produce data is still a time-consuming manual process, and there is great scope for computational tools to assist. Key problems here are to identify when animals are present in photographs, what taxa they belong to, how big they are and where they are in relation to the camera. Second, the resulting data are in many ways unsuitable for classical statistical analysis, and there is huge scope for the development of more sophisticated and appropriate statistical tools. Key problems here are to use data on the timing of records to measure patterns and amounts of animal activity and interaction, to use data on animal positions to measure movement parameters and camera sensitivity, and to use data on record rates to provide robust estimates of animal abundance.
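To make the point about timing data concrete, here is a minimal Python sketch of one reason classical summaries can mislead: time of day is circular, so an arithmetic mean of record times can land at midday for a nocturnal animal. The record times are invented for illustration, and the function names are hypothetical; this is not a method proposed by the problem's authors.

```python
import math

def _sin_cos_sums(hours):
    """Sum of sines and cosines after mapping clock hours onto a circle."""
    angles = [2 * math.pi * h / 24.0 for h in hours]
    return sum(math.sin(a) for a in angles), sum(math.cos(a) for a in angles)

def circular_mean_hour(hours):
    """Mean time of day in decimal hours, treating 24:00 and 00:00 as the same point."""
    s, c = _sin_cos_sums(hours)
    mean = (math.atan2(s, c) * 24.0 / (2 * math.pi)) % 24.0
    return 0.0 if mean >= 24.0 else mean  # guard against rounding up to 24.0

def resultant_length(hours):
    """Concentration of records: near 0 = spread around the clock, 1 = all at one time."""
    s, c = _sin_cos_sums(hours)
    return math.hypot(s, c) / len(hours)

# Hypothetical camera trap record times (decimal hours) for a nocturnal species.
record_hours = [22.5, 23.0, 23.5, 0.5, 1.0, 1.5]

arithmetic_mean = sum(record_hours) / len(record_hours)   # falls at midday
circular_mean = circular_mean_hour(record_hours)          # falls at (or next to) midnight
print(arithmetic_mean, circular_mean, resultant_length(record_hours))
```

The circular mean and resultant length are only the simplest descriptive summaries; measuring amounts of activity or interaction, as the problem asks, would need considerably more than this.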

Problem 2 (Stats). How to find specialists among so many "experts"? - Gabriel Brostow, Oisin Mac Aodha, Mike Terry. As an analogy, imagine a patient arrives at a hospital complaining of an illness. One of several doctors could potentially treat him/her, and the choice of doctor depends on the symptoms. We would like to consult the most suitable doctor, but not at any cost (some cases don't require the world's best expert). For our research, the doctor's role in that analogy can be filled either by humans giving input through some user interface or by programmed algorithms that run automatically. To explore our new model of Non-Disjoint Classification, we are looking for visual data where i) there are experts/doctors to choose between, and ii) the ground truth is (or can be) known for a fairly sizeable training set: how many X's are in this image, where are all the Y's in this video?

Problem 3 (Bio). Recognising behaviour from multi-dimensional sensor data in the wild – Robin Freeman. Our understanding of the distribution, behaviour and environmental context of animals in the wild is often based on sparse presence/absence data. However, modern advances in bio-logging and telemetry now enable individual animal movement to be recorded at high resolution with multiple sensor types. From flight dynamics to global migration, we are gathering a rich and rapidly growing set of information from animals in the wild. However, understanding these data in combination with extremely large environmental datasets can be challenging. We are currently exploring new methods for analysing and interpreting these data in order to gain a better understanding of the behaviour and movement of animals and their response to environmental change. The wealth of techniques from machine learning and pattern recognition can be key here, enabling us to explicitly incorporate the intrinsic spatiotemporal dynamics of movement and behaviour in a changing environmental context.

Problem 4 (Stats). Thinking about the observation model as carefully as the data model - Ioanna Manolopoulou, Richard Hahn. In many applications, data are not collected at random, but come to us with built-in bias. For example, when building and fitting a statistical model to predict crime, available databases reflect both previously implemented modeling efforts and incidental data-collection biases (e.g. closely monitored populations are more likely to be arrested and easy-to-prosecute crimes will be over-represented in the sample). Similarly, when one is interested in the distribution of a particular species, the choice of sampling locations will depend on habitat, environment and other prior information and, moreover, some habitats inherently present better opportunities for a sighting (e.g. shallow water versus deep water). We propose to couple carefully tuned observation models with standard data models in order to disentangle observational effects from the substantive phenomenon of interest. Specifically, we are interested in implementing our methods on datasets with binary outcomes (presence/absence, for example) which have been sampled non-uniformly according to additional covariates such as environment or habitat.
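The effect described in Problem 4 can be illustrated with a deliberately simple Python sketch. It assumes two habitats, survey effort concentrated in the easier one, and detection probabilities that are treated as known; all numbers, class names and function names are hypothetical, and the authors' actual models are not specified in this summary. In practice the detection probabilities would themselves have to be estimated, which is the hard part.

```python
from dataclasses import dataclass

@dataclass
class HabitatSurvey:
    name: str
    area_share: float    # fraction of the study region covered by this habitat
    n_sites: int         # sites surveyed in this habitat
    n_detections: int    # surveyed sites where the species was recorded
    detect_prob: float   # assumed P(recorded | present), treated as known here

def naive_prevalence(surveys):
    """Pooled detection rate, ignoring where and how the data were collected."""
    return sum(s.n_detections for s in surveys) / sum(s.n_sites for s in surveys)

def corrected_prevalence(surveys):
    """Area-weighted occupancy after dividing out the assumed detection probability."""
    total = 0.0
    for s in surveys:
        detection_rate = s.n_detections / s.n_sites
        occupancy = min(1.0, detection_rate / s.detect_prob)  # observation model step
        total += s.area_share * occupancy                     # undo uneven sampling
    return total

# Hypothetical data: most effort goes to shallow water, where sightings are easier.
surveys = [
    HabitatSurvey("shallow", area_share=0.2, n_sites=80, n_detections=36, detect_prob=0.9),
    HabitatSurvey("deep", area_share=0.8, n_sites=20, n_detections=2, detect_prob=0.5),
]

print("naive:", naive_prevalence(surveys))
print("corrected:", corrected_prevalence(surveys))
```

With these invented numbers the pooled estimate is noticeably higher than the area-weighted, detection-corrected one, because the over-sampled habitat is also the one where the species is both more common and easier to see.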

Problem 5 (Bio). Are there better ways of predicting species distributions? - Helen Chatterjee, Charlotte Walters. We are using Ecological Niche Modelling (e.g. Maxent) to predict the environmental requirements and geographic distributions of endangered species of primates. ENM uses current species distributions in association with environmental variables (bioclimatic variables available from WorldClim) to predict habitat suitability now and in the future, under various climate change scenarios. For future projections we are using an ensemble approach combining several climate models and emission scenarios. The problem with ENM and other ecological models (e.g. vegetation models, process models) is that they are unable to also capture dynamic ecosystem variables (e.g. species dispersal, biotic interactions, land-use and topography). An integrated ecosystem model which can capture species distribution, environmental variables and other dynamic variables, employing an ensemble approach, would be extremely beneficial to tackle a variety of biodiversity/conservation questions.

Problem 6 (Bio). How can you include similarity of species due to a shared evolutionary history in machine learning approaches? - Lucie Bland.
Understanding the processes driving extinction and threat has become a major goal of conservation biology, and comparative studies of species extinction risk based on large, multispecies datasets have become increasingly popular in recent years. However, most regression models constructed to date have achieved low explanatory power. Machine Learning methods have the potential to shed new light on these processes. However, data from different species are not independent because species share an evolutionary history, which often leads to a correlated data structure. While methods of accounting for phylogenetic non-independence in regression have been developed, such methods are not currently available in a Machine Learning framework.

Problem 7 (Bio). How could we use a smartphone to process sounds from nature to give a species identification in the field? - Kate Jones. Smartphones are ever increasing in power and popularity. Although some conservation citizen science programmes are beginning to use them to collect information on wild nature from the public in the field, fewer are using smartphone capabilities to process that information and return it to the user. Could we record wild sounds (birds, bats) on smartphones, apply an algorithm to detect and parameterise the sounds, and then return a classification of the sounds to the user?
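The detect, parameterise and classify steps in Problem 7 can be sketched in a few lines of Python. This is a toy under strong assumptions: the "recording" is a synthetic pure tone, the single parameter is a frequency estimated from the zero-crossing rate, and the reference frequencies and labels are hypothetical placeholders rather than real species data. A working system would more likely use spectrogram features and a trained classifier.

```python
import math

def synth_tone(freq_hz, sample_rate, duration_s):
    """Stand-in for a recording: a pure sine tone."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def detect(samples, threshold=0.1):
    """Detect: is the RMS amplitude above a (hypothetical) noise threshold?"""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return rms > threshold

def dominant_frequency(samples, sample_rate):
    """Parameterise: estimate frequency from the number of zero crossings."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    duration_s = len(samples) / sample_rate
    return crossings / (2 * duration_s)  # two crossings per cycle

def classify(freq_hz, references):
    """Classify: label of the nearest reference frequency."""
    return min(references, key=lambda label: abs(references[label] - freq_hz))

# Hypothetical reference frequencies in Hz, for illustration only.
REFERENCES = {"bird-like call": 4000.0, "bat-like call": 45000.0}

SAMPLE_RATE = 192000  # assumed high enough to capture ultrasonic calls
recording = synth_tone(4200.0, SAMPLE_RATE, 0.05)
if detect(recording):
    freq = dominant_frequency(recording, SAMPLE_RATE)
    print(round(freq), classify(freq, REFERENCES))
```

Each stage is cheap enough to run on a phone, which is the design question the problem raises; whether a realistic classifier would be is left open here.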

Problem 8 (Bio). Learning about the effectiveness of enforcement from ranger patrols - Aidan Keane. Learning about the effectiveness of enforcement measures for deterring illegal behaviour (e.g. poaching in and around protected areas) is a significant challenge in conservation. Ranger patrol reports have been advocated as a readily available, and therefore relatively cheap, source of data. However, interpreting these data is problematic - increasing patrol effort is generally expected to increase the proportion of infractions that are detected, but it is also likely to deter new infractions from being committed. Poachers also face considerable incentives to avoid detection, meaning that they may alter their behaviour in response to the proximity of patrols. Similarly, rangers often face incentives to turn a blind eye to some infractions, or to overstate their efforts. Given these complexities, what are likely to be the best strategies for learning from patrol data, and how effective can they be?
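The interpretive difficulty in Problem 8 can be shown with a toy Python model. It assumes, purely for illustration, that infractions committed decline exponentially with patrol effort (deterrence) while the probability of detecting each one saturates with effort; the functional forms, parameter values and function names are hypothetical and are not taken from the problem statement.

```python
import math

def infractions_committed(effort, baseline=100.0, deterrence=0.5):
    """Assumed deterrence effect: more patrol effort, fewer infractions committed."""
    return baseline * math.exp(-deterrence * effort)

def detection_probability(effort, search_rate=1.0):
    """Assumed detection effect: more patrol effort, larger share detected."""
    return 1.0 - math.exp(-search_rate * effort)

def expected_detections(effort):
    """What patrol reports would record on average under both assumptions."""
    return infractions_committed(effort) * detection_probability(effort)

# Effort is in arbitrary units; compare committed and detected infractions.
for effort in [0.25, 0.5, 1.0, 2.0, 3.0]:
    print(
        effort,
        round(infractions_committed(effort), 1),
        round(expected_detections(effort), 1),
    )
```

Under these assumptions the recorded count first rises and then falls as effort increases, so a similar number of detections can arise at low effort with many infractions or at high effort with few. Raw patrol counts alone therefore cannot separate detection from deterrence, even before poachers' avoidance behaviour or rangers' reporting incentives are considered.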
