We aim to set the global standard for a truly interdisciplinary approach to contemporary data-driven research challenges. Established in 2015, the Data Science Institute (DSI) has over 300 members and has raised £50 million in research grants.
10-year anniversary of DSI – “Decade of Data Science”
In 2025, the Data Science Institute (DSI) at Lancaster University proudly marks its 10th anniversary. Since its founding in 2015, the DSI has established itself as a leading hub for cutting-edge research, interdisciplinary collaboration, and real-world impact in data science and artificial intelligence. Over the past decade, our researchers and partners have tackled some of the most pressing challenges in society, science, and industry: advancing the foundations of data science, fostering ethical and trustworthy AI, driving innovation across sectors, and training hundreds of data science practitioners.
As we celebrate this milestone, we reflect on the achievements of our vibrant research community and the transformative projects that have shaped the field. Looking ahead, the DSI remains committed to pushing the boundaries of data science and AI research, strengthening global collaborations, and supporting the next generation of data scientists.
About us
We are working to create a world-class Data Science Institute at Lancaster University (DSI@Lancaster) that sets the global standard for a truly interdisciplinary approach to contemporary data-driven research challenges. DSI@Lancaster aims to have an internationally recognised and distinctive strength in providing an end-to-end interdisciplinary research capability, from infrastructure and fundamentals through to globally relevant problem domains and the social, legal and ethical issues raised by the use of Data Science.
The Institute is initially focusing on the fundamentals of Data Science, including security and privacy, together with cross-cutting theme areas of environment, resilience and sustainability; health and ageing; and data and society, creating a world-leading institute with over 300 affiliated academics, researchers, and students.
Our data science, health data science and business analytics programmes have launched the careers of hundreds of data professionals over the last 10 years. Students from our programmes have progressed to data science roles at Amazon, PwC, Ernst & Young, Hawaiian Airlines, eBay, Zurich Insurance, the Co-operative Group, N Brown, the NHS and many others. Please see our Education pages for further details of the courses on offer.
DSI is pleased to announce the N8CIR internships programme for 2025
DSI is pleased to announce the internships programme for 2025. This programme is aimed at 2nd and 3rd year undergraduates interested in investigating research software engineering as a career, and we are offering up to two 8-week positions starting on 18th June with a £3,500 stipend for the period. Interested students should read the list of projects below and contact the relevant supervisors ahead of the nomination deadline of 17th April. Candidates will then be invited to an interview, at which the recipients of the internships will be selected.
Please email dsi-enquiries@lancaster.ac.uk if you have questions.
1. Optimising on-disc storage of Monte Carlo output for machine learning problems
In statistics, machine learning, and AI, models commonly lead to complex multi-dimensional probability distributions as the focus of interest. These probability distributions are often represented as a collection of random numbers – “Monte Carlo samples” – which must be stored for further processing and analysis. In many situations, the number and complexity of these Monte Carlo samples means that they must be streamed out of RAM to non-volatile storage, for example a binary file, column database, or object storage. However, the order in which we might write such samples is often incompatible with efficient read access in the future.
This project will explore efficient storage patterns across a variety of modern on-disk and cloud storage formats. A typical Monte Carlo sampler will be used as a test case, with different storage formats being investigated on a range of hardware from personal computers to HPC to cloud. Storage formats include, but are not limited to, HDF5, Zarr, Parquet, and TileDB. The project will culminate in a model that enables researchers (or even a machine) to choose the best storage format for their Monte Carlo samples, with the chance to design a library that abstracts the storage formats to provide a consistent interface for the user.
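As a flavour of the trade-offs involved, the sketch below writes batches of samples to an HDF5 file (one of the formats listed above) with h5py and then reads back a single parameter's trace. The array sizes, chunk shape and file name are illustrative assumptions only, not part of the project specification.

```python
import numpy as np
import h5py  # HDF5, one of the storage formats listed above

n_dims, batch_size, n_batches = 50, 1_000, 100

with h5py.File("mc_samples.h5", "w") as f:
    # Chunk shape is the key tuning knob: (batch_size, n_dims) favours
    # reading whole iterations back, whereas (n_batches * batch_size, 1)
    # would favour reading the full trace of a single parameter.
    dset = f.create_dataset("samples",
                            shape=(n_batches * batch_size, n_dims),
                            chunks=(batch_size, n_dims), dtype="f8")
    for b in range(n_batches):
        # Stand-in for one batch of Monte Carlo output (e.g. from an MCMC run).
        samples = np.random.standard_normal((batch_size, n_dims))
        dset[b * batch_size:(b + 1) * batch_size, :] = samples

with h5py.File("mc_samples.h5", "r") as f:
    # Reading one parameter's trace touches every chunk under this layout,
    # which is exactly the write-order/read-pattern mismatch the project studies.
    trace_of_param_3 = f["samples"][:, 3]
```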
2. Distributed Search Space Reduction for Program Synthesis
Supervisor: Barry Porter <b.f.porter@lancaster.ac.uk> (SCC)
RSE Mentor: John Vidler (SCC)
Genetic programming (GP) is an approach to synthesising new programs for novel problems by searching through theoretical program space following a reward signal. Compared to large language models, this approach allows the synthesis of novel programs for entirely unseen problems. GP typically starts from an empty program and navigates outwards in various directions to try to find improved candidates. In this project you will develop an alternative approach, in which distributed parallel computing is used to incrementally narrow a search space. Your project will use an existing, novel program search space framework as a starting point; this framework is able to represent all of program search space as a regular rectangle, and operates part of its search process across GPUs. Your system will start by splitting the total theoretical search space into a number of equally sized regions for parallel distributed search, and sampling random points from within each region. The most promising of these regions will then be selected as the area of focus, itself being split into a number of equally sized regions for further parallel distributed search. Starting from our existing framework, you will focus specifically on the distributed systems aspect and the implementation of parallel search sampling. The resulting framework should be driveable and observable using a REST API.
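A minimal Python sketch of the narrowing idea, assuming a toy two-dimensional "program space" and a made-up reward function; the real framework decodes candidate programs from points in the space, runs partly on GPUs, and would expose this loop through a REST API.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def reward(point):
    # Hypothetical reward signal; in the real framework this would come from
    # evaluating a candidate program decoded from this point in program space.
    x, y = point
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def sample_region(bounds, n=200):
    # Sample random points inside one rectangular region, return its best score.
    (x0, x1), (y0, y1) = bounds
    points = [(random.uniform(x0, x1), random.uniform(y0, y1)) for _ in range(n)]
    return max(reward(p) for p in points)

def split(bounds):
    # Split a rectangular region into four equally sized quadrants.
    (x0, x1), (y0, y1) = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [((x0, xm), (y0, ym)), ((xm, x1), (y0, ym)),
            ((x0, xm), (ym, y1)), ((xm, x1), (ym, y1))]

def narrow(bounds, rounds=5):
    # Each round: split, sample the regions in parallel, keep the best one.
    for _ in range(rounds):
        regions = split(bounds)
        with ProcessPoolExecutor() as pool:
            scores = list(pool.map(sample_region, regions))
        bounds = regions[scores.index(max(scores))]
    return bounds

if __name__ == "__main__":
    print(narrow(((0.0, 1.0), (0.0, 1.0))))  # shrinks towards (0.3, 0.7)
```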
3. gemlib: a python library for epidemic modelling
Come and contribute to the development of gemlib, an open-source Python library for simulating and calibrating epidemic models using real-world outbreak data! Epidemic models are used to understand and predict how infections spread in different settings. During the COVID-19 pandemic, real-time modelling was used to improve understanding of the pathogen, forecast disease dynamics, and evaluate interventions. Epidemic models are fundamentally Markov state-transition models, in which a population of individuals is divided into mutually exclusive disease states and individuals move between these states according to time-varying transition rates. Models such as these quickly become complex, often including spatial features and individual interactions, and stratifying the population by demographic characteristics at various scales. The parameters that govern these models are often unknown and need to be estimated. Bayesian inference methods such as MCMC and SMC are often used to account for censored data (such as unknown infection times) and estimate parameters of interest. These methods are computationally complex, and their implementation is technically challenging and time-consuming.
gemlib presents a unified framework for expressing and simulating models, as well as automatic generation of probability functions for parameter inference. The library enables researchers to rapidly spin up epidemic models during emerging outbreaks in a robust, reproducible manner. gemlib is built on the machine learning library TensorFlow, allowing complex models to be optimised on a GPU when needed. This project will enable an intern to contribute to open-source software development, implementing Bayesian inference algorithms as new classes in the library. There will be opportunities to expand skills in functional programming, high-performance computing, and functional testing.
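To illustrate the state-transition structure described above (and not gemlib's actual interface), here is a minimal discrete-time stochastic SIR simulation in plain NumPy; all parameter values are arbitrary.

```python
import numpy as np

def simulate_sir(n=1000, i0=5, beta=0.4, gamma=0.2, days=120, rng=None):
    """Discrete-time stochastic SIR model: a minimal illustration of the
    Markov state-transition structure described above (not gemlib's API)."""
    rng = np.random.default_rng() if rng is None else rng
    S, I, R = n - i0, i0, 0
    history = []
    for _ in range(days):
        # Transition rates converted to daily transition probabilities.
        p_inf = 1.0 - np.exp(-beta * I / n)
        p_rec = 1.0 - np.exp(-gamma)
        new_inf = rng.binomial(S, p_inf)   # S -> I transitions
        new_rec = rng.binomial(I, p_rec)   # I -> R transitions
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
        history.append((S, I, R))
    return np.array(history)

print(simulate_sir()[-1])  # final (S, I, R) counts
```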
4. Pipeline for fitting thermal responses of mosquito traits
Mechanistic models for the impact of climate on transmission of vector-borne diseases rely on thermal responses that describe how vector and pathogen life history traits respond to temperature. This project will train a student intern in developing a data analysis pipeline to fit these thermal responses using a dataset of mosquito trait data digitised from previously published lab experiments. Over 8 weeks, the student will work with the project lead to: 1) Perform basic quality checks on the trait data and collect relevant meta-data from the original articles, such that the completed dataset can be uploaded to the VecTraits database; 2) Write a pipeline in R to allow anyone to retrieve this data from VecTraits using the ohvbd package; 3) Fit a series of thermal performance curves (TPCs) to the data using the rTPC package; and 4) Visualise and interpret trends in these TPC fits using ggplot2 and conduct an appropriate statistical analysis. The training will emphasise writing pipelines that are open, reproducible, and flexible. The final pipeline will be hosted on GitHub.
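The pipeline itself will be written in R with the packages named above; purely as an illustration of what fitting a thermal performance curve involves, the sketch below fits a Brière-1 curve to made-up trait data with SciPy. The data values and starting parameters are invented for demonstration.

```python
import numpy as np
from scipy.optimize import curve_fit

def briere1(T, a, Tmin, Tmax):
    """Brière-1 thermal performance curve, a common choice for insect traits."""
    out = a * T * (T - Tmin) * np.sqrt(np.clip(Tmax - T, 0.0, None))
    return np.where((T > Tmin) & (T < Tmax), out, 0.0)

# Hypothetical digitised trait data: temperature (°C) vs development rate.
temp = np.array([15, 18, 21, 24, 27, 30, 33, 36], dtype=float)
rate = np.array([0.02, 0.05, 0.09, 0.13, 0.16, 0.17, 0.12, 0.01])

params, _ = curve_fit(briere1, temp, rate, p0=[2e-4, 10.0, 38.0], maxfev=10000)
print(dict(zip(["a", "Tmin", "Tmax"], np.round(params, 4))))
```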
5. Integrated data plotting for quantum electronics experiments
Supervisor: Edward Laird <e.a.laird@lancaster.ac.uk> (Physics)
RSE mentor: John Fozard (SMS)
Versatile and easy-to-use measurement software is essential for experiments in quantum electronics, which is one of the most rapidly growing areas of physics. For data acquisition, this need is now met by the open-source QCoDeS framework, which has been adopted by most of the groups in the field, including mine. However, data inspection must be done outside this framework, either by exporting data step-by-step to an analysis program such as Matlab, or by writing ad-hoc plotting programs that need to be changed with each experiment.
In this project, the intern will develop a set of generalised plotting routines that interface with the existing generalised sweep routines that are part of QCoDeS. The aim will be to automatically plot all measurement results in matplotlib, regardless of what is being measured; for example, we should be able to live plot the transition intensity, frequency, and coherence time of a qubit regardless of what parameter(s) we are sweeping. All these things can be done with existing libraries, but these are not well interfaced to the code that actually runs experiments, which makes it difficult to make evaluations on the fly.
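As a rough sketch of the live-plotting behaviour being aimed for (not the QCoDeS integration itself), the matplotlib loop below redraws a figure as (setpoint, reading) pairs arrive from a stand-in acquisition generator; in the project these values would come from the QCoDeS sweep routines.

```python
import numpy as np
import matplotlib.pyplot as plt

def acquire(n=200):
    # Stand-in for an instrument sweep yielding (setpoint, reading) pairs.
    for x in np.linspace(-1, 1, n):
        yield x, np.exp(-x**2 / 0.1) + 0.02 * np.random.randn()

plt.ion()
fig, ax = plt.subplots()
line, = ax.plot([], [], ".")
ax.set_xlabel("swept parameter")
ax.set_ylabel("measured value")

xs, ys = [], []
for x, y in acquire():
    xs.append(x)
    ys.append(y)
    line.set_data(xs, ys)
    ax.relim()
    ax.autoscale_view()   # rescale axes as new data arrive
    plt.pause(0.01)       # let the GUI redraw without blocking acquisition
plt.ioff()
plt.show()
```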
The intern will be embedded in my research group of seven experimentalists and will immediately be able to see successive versions of their code in use. The output of the project should be submitted for inclusion in the QCoDeS library and I predict that it will be widely used in my group and ultimately by quantum electronics researchers worldwide.
6. Modelling photosynthesis
Supervisor: Samuel Taylor <s.taylor19@lancaster.ac.uk> (LEC)
RSE Mentor: Dr. Supreeta Vijaykumar (LEC)
Models of photosynthesis are important tools for understanding plant responses to global change and are increasingly used to predict opportunities for targeted engineering of core metabolic processes like photosynthesis, in support of improved agricultural productivity or carbon storage. In the project PhotoBoost (https://www.photoboost.org/), a digital twin of photosynthetic metabolism, e-Photosynthesis, is used to explore opportunities for engineering next-level photosynthesis in potato and rice. An objective is to evaluate molecular biology interventions that could enhance photosynthetic carbon assimilation. We want to do this to fuel improved crop yields and more resilient crop growth, taking account of key environmental controls on photosynthesis, including light, water and atmospheric carbon dioxide. By participating, you will learn about the fundamentals of widely used leaf-level models applicable not only to simulations in crop biology, but also to ecology and global change modelling, using this understanding to apply quality control and parameterise non-linear models specific to the target crops. A key goal of the internship will be to evaluate novel data that describes photosynthetic responses to light and carbon dioxide. These will be used to test simulations produced by an advanced version of e-Photosynthesis that models metabolic regulation affecting the central carbon-fixing enzyme Rubisco. You will develop your skills in the use of R and MATLAB for programming, analysis and visualisation of data. You will work alongside an experienced researcher mentor on a day-to-day basis, with weekly support from your supervisor and weekly small group team meetings where skills in data science are shared.
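For context, the leaf-level models referred to above include relationships such as the Rubisco-limited assimilation term of the Farquhar-von Caemmerer-Berry model; the short sketch below evaluates that term with illustrative C3 parameter values. These numbers are assumptions for demonstration only, not PhotoBoost or e-Photosynthesis outputs.

```python
import numpy as np

def rubisco_limited_A(Ci, Vcmax=80.0, gamma_star=42.0, Kc=405.0, Ko=278.0,
                      O=210.0, Rd=1.0):
    """Rubisco-limited net CO2 assimilation (µmol m^-2 s^-1) from the
    Farquhar-von Caemmerer-Berry leaf model. Ci and Kc in µmol mol^-1,
    O and Ko in mmol mol^-1; parameter values are illustrative C3 magnitudes."""
    return Vcmax * (Ci - gamma_star) / (Ci + Kc * (1.0 + O / Ko)) - Rd

# Assimilation rises with intercellular CO2 and gradually saturates.
Ci = np.linspace(50, 800, 6)
print(np.round(rubisco_limited_A(Ci), 1))
```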
7. Automated pest detection for companion planting
As our global climate changes, our reliance on pesticides to grow crops is ever-increasing. One method to reduce the overuse of pesticides may be the age-old tradition of “companion planting”, where a second plant is grown alongside the first to attract predators that eat the pests. To determine if this approach can be applied at scale we have partnered with RHS Wisley. If shown to be successful, this method can be applied to fruit and vegetable crops to improve both food security and food sustainability.
However, to determine if this approach effectively reduces pest infestation we must track how pest numbers change across the growing season. Traditionally this requires regular, by-hand ‘bug counting days’: a highly inefficient technique that cannot be applied at scale. To optimise this process for wide-scale use we will combine high-resolution photography with source detection techniques developed in astronomy.
In this internship we will apply a set of AI algorithms, with a training set built from labels determined by the general public (“citizen science”), to automate the detection of invasive pests. The astrophysics group at Lancaster University has already developed and applied such techniques to a diverse range of challenges spanning global security, healthcare and catastrophe management. In this 8-week study, we will develop the framework required to store, analyse and interpret each image collected, and develop machine learning algorithms that will automate the detection of pests. If successful, this technique will be rolled out across the RHS to quantify the importance of companion planting across all types of crop.
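A toy sketch of the astronomy-style source-detection step on a synthetic image, using thresholding and connected-component labelling from SciPy; the image, threshold and “pest” positions are invented for illustration, and the project's actual pipeline and training data will differ.

```python
import numpy as np
from scipy import ndimage

# Build a synthetic image: background noise plus three bright blobs standing
# in for pests, analogous to stars in an astronomical frame.
rng = np.random.default_rng(0)
image = rng.normal(0.0, 0.05, size=(256, 256))
yy, xx = np.mgrid[0:256, 0:256]
for y, x in [(60, 40), (120, 200), (200, 100)]:
    image += np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / 20.0)

# Simple source detection: threshold, then label connected bright regions.
threshold = image.mean() + 5 * image.std()
labels, n_sources = ndimage.label(image > threshold)
centres = ndimage.center_of_mass(image, labels, range(1, n_sources + 1))
print(n_sources, [tuple(np.round(c, 1)) for c in centres])
```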
8. Applying new methods for historical spelling normalisation to Early English Books Online
Early Modern English (EModE, c. 1500–1700) is the earliest period of the English language that can be analysed at a large scale, thanks to the introduction of the printing press by Caxton in 1476. The Early English Books Online (EEBO) project set out to digitise, and then transcribe, every printed book available during the EModE period, resulting in a dataset of c. 1.1 billion words. Dealing with a corpus of this size presents challenges for Digital Humanities. Previous efforts include the Linguistic DNA project, and UCREL has processed an instance on CQPWeb.
Spelling variation, prevalent in the EModE period due to the lack of language standardisation, is an issue for any linguistic analysis of EModE texts, with word frequencies split between spelling variants, and key tasks such as Part-of-Speech tagging having considerably reduced accuracy. The standard process developed has been to introduce a spelling normalisation step within the processing pipeline, which normalises spellings to a modern form, to improve the accuracy of downstream tasks.
This project aims to develop and evaluate new methods for this spelling normalisation step, focusing particularly on translation models, which have previously shown success in the task. The goal is to establish which methods are most appropriate for the task without introducing spurious normalisations, which add artificial noise. The best method will then be applied to EEBO, producing an enriched version that can be processed for further linguistic analysis.
The first stage of the research will be to evaluate existing translation methods previously used on historical texts: Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) using bidirectional LSTMs, and also to apply a newer approach, fine-tuning a Transformer-based language model such as BiBERT or DistilBERT. This evaluation will focus only on historical English, using previously prepared corpora: ICAMET, Shakespeare, Newsbooks, and CEEC. The intern will implement these models, utilising existing code bases, and apply them to the listed corpora, reporting which method performs best for each corpus in terms of normalisations made and the number of spurious normalisations. Provided the first stage is successful, the second stage will be to apply the best-performing normalisation method to the very large EEBO dataset. This will involve the intern building a pipeline to process the texts efficiently, performing the normalisation, and outputting the results in a format amenable to further analysis, with original spellings and normalised forms aligned.
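A small sketch of how the first-stage evaluation could count normalisations made versus spurious normalisations, assuming token-aligned original, predicted and gold-standard forms; the example tokens are invented, and the real evaluation would run over the corpora listed above.

```python
def normalisation_report(original, predicted, gold):
    """Compare a model's output against gold-standard normalisations, counting
    correct changes, missed changes, wrong changes, and spurious changes
    (tokens altered that should have been left alone)."""
    correct = missed = wrong = spurious = 0
    for orig, pred, ref in zip(original, predicted, gold):
        if orig == ref:                      # no normalisation needed
            if pred != ref:
                spurious += 1
        else:                                # a normalisation was needed
            if pred == ref:
                correct += 1
            elif pred == orig:
                missed += 1
            else:
                wrong += 1
    return {"correct": correct, "missed": missed,
            "wrong": wrong, "spurious": spurious}

# Hypothetical Early Modern English tokens with modernised gold forms.
original  = ["loue", "haue", "king", "maiestie", "peace"]
gold      = ["love", "have", "king", "majesty",  "peace"]
predicted = ["love", "haue", "king", "majesty",  "piece"]
print(normalisation_report(original, predicted, gold))
# {'correct': 2, 'missed': 1, 'wrong': 0, 'spurious': 1}
```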
New Workshop Page
We have a new workshop page - please do take a look - full information on these workshops and sign-up opportunities can be found there.
Data Dialogues - 12-1pm in the Sky Lounge, InfoLab
Data Dialogues is an informal, discussion-driven event where members of the DSI and the broader university community share insights into their work, spark interdisciplinary conversations and explore potential collaborations. The focus is on interactive engagement rather than formal presentations—so no slides (or just a few, if needed)! Instead, the idea is to introduce your work in an accessible way, followed by an open discussion and Q&A with attendees.
Bring your lunch and come to the Sky Lounge to hear more about some of the exciting developments in Data Science and AI going on in the university. Get fresh perspectives and think about new ways of approaching your own research, meet new people and explore potential research collaborations. Come be part of the DSI community!
9 April - Henry Moss (School of Mathematical Sciences) - Accelerating Scientific Discovery in the Age of AI
23 April - Alex Bush, Cassio Nunes and Oliver Metcalf (LEC) - Confidence and Misclassification in Automated ML in ecology: challenges of scaling
30 April - Naveed Iqbal, MD – CEO at Triton Health - Supporting Intellectual Disability with AI
7 May - Jun Liu (Digital Health, SCC)
21 May - Luigi Sedda (CHICAS)
4 June - Leonardo De Sousa Miranda (LEC)
18 June - Georgina Brown (Forensic Linguistics)
If you would like to present in the 25/26 season please get in touch.
Latest News
Call for Proposals: AI for Innovation Pilot Studies (Data Science Institute)
The Data Science Institute is pleased to announce a new funding call aimed at empowering academic staff members to explore, test, and demonstrate innovative ways in which artificial intelligence (AI) can be leveraged to enhance, extend, or improve the University’s activities in research, teaching and engagement. Projects selected under this call will receive up to £5,000 each in support and are expected to yield practical insights, prototypes, or demonstrable solutions that can be scaled across the University where applicable. Projects that deliver improvements to the student experience or that speak to our international aspirations are also welcome. Note that for this call we are not seeking new research into developing new tools and techniques, but rather proposals exploring the application and impact of such developments.
Please see the full call for details and eligibility requirements.
Call for Applications: Gender Diversity in Data Science Grant
DSI is delighted to announce the Gender Diversity in Data Science Grant to support the research activities of women and other marginalised genders (including non-binary and gender-diverse individuals) in the fields of data science and AI. Women are significantly underrepresented in this area, with only 22% of data science professionals being women (Wajcman et al, 2024, Turing Institute Report). This grant aims to address gender disparities and improve retention of women and marginalised gender researchers by providing financial support for research, career development, and increased visibility in the field. Apply now to help build a more inclusive future in data science.
Please see the full guidance before making this application.
Please send your application, along with a short CV (max. 2 pages using the template in the document), to j.carradus1@lancaster.ac.uk with the subject heading “Gender Diversity in Data Science Grant Proposal” by 17:00 April 14th 2025.
If you would like an informal chat about your eligibility for this grant (e.g., “am I a data scientist?”), please get in touch with Sal Keith (sally.keith@lancaster.ac.uk).
Katherine Richardson (University of Copenhagen) Planetary Boundaries: A tool to guide management of Human-Earth interactions - 2nd May
The climate and biodiversity crises show that our societies cannot continue to flourish unless we actively manage our relationship with the Earth and its resources. Such management requires guardrails to identify how much perturbation of critical Earth system processes is “too much”. The planetary boundaries framework, first introduced in 2009 and twice updated since, identifies science-based limits for human perturbation of Earth system processes. The most recent update shows that 6 of the 9 boundaries are transgressed and that transgression is increasing. It also shows, however, that human perturbation of the ozone layer - a boundary transgressed or nearly transgressed in the 1900s - is now back within a “safe operating space”. The framework and how it can be used for management of the Human-Earth relationship will be presented here.
2nd May, 1.30-3pm, Management School - Lecture Theatre 3
Sign up via
Biography
Katherine Richardson is a professor in biological oceanography at the University of Copenhagen and, for more than three decades, has actively contributed to the development of Earth system science. She is one of the main architects behind the “planetary boundaries” framework and led the 2023 update, which has now been downloaded over half a million times. She is extremely active at the science-policy and science-society interfaces and chaired the commission that produced a plan for how Denmark can become independent of fossil fuels. She was a member of the Independent Group of Scientists that prepared the 2019 UN Global Sustainable Development Report and currently chairs the High-Level EU Expert Group on the economic and societal impact of research and innovation (ESIR).
Lecture: Distant Viewing and the Multimodal Turn - 21st May 3.15-5pm
Management School – Lecture Theatre 3
Lauren Tilton and Taylor Arnold, University of Richmond
How do computers view? How can we harness AI to view images at scale?
Distant viewing offers a theory and method for the large-scale analysis of images using computer vision. This talk will introduce the concept and then turn to specific AI methods for the analysis of images. We will then turn to how distant viewing can support multimodal analysis, specifically looking at multimodal large language models.
Sign up for the lecture.
Learn more about Tilton and Arnold’s work:
Taylor Arnold and Lauren Tilton, Distant Viewing: Computational Exploration of Digital Images. MIT Press, 2023.
Workshop: Distant Viewing Explorer - 20th May 10am-12pm
Led by Lauren Tilton and Taylor Arnold, University of Richmond
Charles Carter - A15 Seminar Room
This workshop will focus on DV Explorer (distantviewing.org/dvexplorer), which introduces the ways computer vision and related AI technologies can support the analysis of images. We will end with how to scale up analysis using DVScripts (distantviewing.org/dvscripts), a guide to using Python for distant viewing.
Research Themes
Data Science at Lancaster was founded in 2015 on the University’s historic research strengths in Computer Science, Statistics and Operational Research. The environment is further enriched by a broad community of data-driven researchers in a variety of other disciplines, including the environmental sciences, health and medicine, sociology and the creative arts.
Foundations research sits at the interface of methods and application, with the aim of developing novel methodology inspired by real-world challenges. These could be studies of the transportation of people, goods and services, energy consumption, or the impact of changes to global weather patterns.
The Health theme has a wide scope. Current areas of strength include spatial and spatiotemporal methods in global public health, design and analysis of clinical trials, epidemic forecasting and demographic modelling, health informatics and genetics.
Data Science has brought new approaches to understanding long-standing social problems concerning energy use, climate change, crime, migration, the knowledge economy, ecologies of media, design and communication in everyday life, and the distribution of wealth in financialised economies.
The focus of the environment theme has been to seek methodological innovations that can transform our understanding and management of the natural environment. Data Science will help us understand how the environment has evolved to its current state and how it might change in the future.
The Data Engineering theme aims to explore how we can utilise digital technologies to accelerate and enhance our research processes across the University.
Research Software Engineering
Within the Data Science Institute, our aim is to improve the reproducibility and replicability of research by improving the reusability, sustainability and quality of research software developed across the University. We are currently funded by the N8CIR and work closely with our partner institutions across the N8 Research Partnership.