This project will explore how new generative large language models, like those used in ChatGPT and other AI chatbots, can be used to automatically extract insights from free-text notes in police incident data.
The project aims to develop and test the robustness of these methods in recognising and classifying incidents that potentially involve vulnerable people, such as people with mental health problems.
The police have always dealt with vulnerable people, but over the past decade the nature and extent of this involvement has changed dramatically. The ESRC Vulnerability & Policing Futures Research Centre is undertaking research to better understand this changing landscape.
Grasping the nature and level of these demands is essential to support evidence-based responses and allocation of resources. Yet current attempts to quantify police demand often rely on the use of structured police data, such as incident flags. These specific markers are used to categorise the nature of incidents, such as “mental health” or “human trafficking”.
While these are essential for administrative purposes, they often present significant challenges for broader strategic analysis. For example, the incident types used to describe calls for service are often inadequate to cover the complex and diverse situations in which police find themselves. In addition, the application of incident flags or qualifiers is often unreliable or inconsistent.
In contrast to incident flags, unstructured data such as free-text incident logs can provide detailed descriptions of events as they unfold. These might contain crucial information about vulnerabilities, which can be used to identify specific incidents involving vulnerable people.
A common exercise in qualitative analyses is “coding” free text, which involves categorising and interpreting text to identify themes and patterns. In the context of police reports, coding can be used to identify and label issues such as “mental health problems”, “homelessness”, and “substance abuse”. However, manual coding by researchers is very slow and resource-intensive, making it impractical to analyse thousands of records.
Advances in language modelling, exemplified by the hugely popular ChatGPT, offer an opportunity to automate qualitative text analysis. These new language models can perform a very broad range of tasks simply by being given straightforward instructions in free text. Large language models (LLMs) might enable the coding of unstructured text data at scale, analysing thousands of narratives on a par with structured datasets, such as incident flags. By adopting such an approach, more nuanced definitions for vulnerabilities can be established and the extent to which they are encountered in routine policing can be analysed.
This project aims to assess the ability of large language models to label police incident narratives accurately and reliably for situations involving vulnerable people, such as people with mental health problems, people who are homeless, and people with substance abuse problems.
Using LLMs for coding
The general principle of this research involves using LLMs to analyse and code unstructured text data through a process known as “prompting”. By providing text instructions, known as a prompt, LLMs can perform a variety of tasks without requiring specific training on each task.
The following example demonstrates how language models might be prompted to perform a coding task:
Example prompt to an LLM:
Read the following definition:
Alcoholism: [Definition]
You will be provided with police incident reports. Use the definition to classify the report. Highlight relevant quotes and link them to the definitions. Classify the report as either POSITIVE, INCONCLUSIVE, or NEGATIVE.
Given a police report, the LLM might then respond like so:
Policing Narrative: “Officer observed individual on Dorchester Ave. Individual admitted to having a drinking problem and being homeless. Individual was found sleeping in a public park and engaging with local homelessness services.”
Example Output: Alcoholism: “Individual admitted to having a drinking problem.” Classification: POSITIVE
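As a rough illustration of how such a prompt might be issued programmatically, the sketch below sends a single narrative to a hosted model. It assumes the OpenAI Python SDK and the “gpt-4o” model identifier; the codebook definition, narrative text, and prompt wording are placeholders rather than the project’s actual materials.

```python
# Minimal sketch: asking an LLM to code one narrative against a codebook entry.
# The definition, narrative, and prompt wording are placeholders; the model name
# and client (OpenAI Python SDK) are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODEBOOK_ENTRY = "Alcoholism: [definition drawn from the project codebook]"

PROMPT_TEMPLATE = (
    "Read the following definition:\n\n{definition}\n\n"
    "You will be provided with police incident reports. Use the definition to "
    "classify the report. Highlight relevant quotes and link them to the "
    "definitions. Classify the report as either POSITIVE, INCONCLUSIVE, or NEGATIVE."
)

narrative = (
    "Officer observed individual on Dorchester Ave. Individual admitted to having "
    "a drinking problem and being homeless."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": PROMPT_TEMPLATE.format(definition=CODEBOOK_ENTRY)},
        {"role": "user", "content": narrative},
    ],
    temperature=0,  # reduce variability for this single-shot illustration
)

print(response.choices[0].message.content)
```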
Dataset
The first phase of this research will use anonymised narrative data from the Boston Police Department’s Field Interrogation and Observation (FIO) dataset in the US. These narratives provide free-text accounts of police interactions, with sufficient detail to describe circumstances and behaviours indicative of individuals experiencing vulnerabilities. The dataset is publicly available under an Open Data Commons Public Domain Dedication and License (PDDL), allowing it to be freely used and shared for research purposes.
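As an illustration of how the narratives might be read into an analysis environment, the sketch below loads a downloaded export with pandas. The file name and column name are placeholders, not the actual field names used by the open data portal.

```python
# Minimal sketch: loading a downloaded FIO export with pandas.
# "fio_records.csv" and the "narrative" column are placeholder names.
import pandas as pd

fio = pd.read_csv("fio_records.csv")      # placeholder path to the downloaded export
narratives = fio["narrative"].dropna()    # placeholder column holding the free text

print(f"{len(narratives)} narratives loaded")
print(narratives.iloc[0][:200])           # preview the start of the first narrative
```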
Codebook development
The team is developing a codebook, a document that defines how data is categorised and interpreted, focusing on four vulnerabilities:
- Mental Health Difficulties
- Drug Abuse
- Alcoholism
- Homelessness
These vulnerabilities have been selected because they can be clearly defined, are recognisable concepts to non-experts, and are sufficiently common in the narratives, based on preliminary examination.
Human vs LLM analysis
The primary objective of this study is to compare the effectiveness of LLMs with that of human coders in identifying vulnerabilities in police narratives. The aim is to determine whether LLMs can replicate or surpass human analysis. The analysis will be structured as follows:
- Human Coding Benchmark: Two non-expert human coders are independently analysing a subset of the narratives using the codebook. Their results will be reviewed by the research team to reach a consensus, which will serve as the benchmark for evaluating the LLMs’ performance.
- LLM Variations: Various LLMs of different sizes and complexities will be tested, including:
  - GPT-4o: A proprietary model from OpenAI, reputed to have over 1 trillion parameters. This model represents the state of the art in LLM performance.
  - LLaMA 8B and 70B: Open-source models by Meta, with 8 billion and 70 billion parameters, respectively. These models are more accessible and cost-effective compared to larger, proprietary models.
- Variability Analysis: Even when the same LLM and prompt are used, different outputs can be produced. Ten outputs will therefore be generated for each narrative, allowing the team to assess the consistency and reliability of the models’ classifications (see the sketch after this list).
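As a rough sketch of how the benchmark comparison and variability checks might be computed, the example below takes hypothetical human consensus labels and ten repeated LLM classifications per narrative, derives a majority-vote label, and reports chance-corrected agreement (Cohen’s kappa) alongside a simple per-narrative consistency rate. The data and metrics are illustrative assumptions, not the project’s finalised evaluation design.

```python
# Minimal sketch: comparing LLM classifications with a human consensus benchmark
# and summarising run-to-run variability. All labels below are invented examples.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Hypothetical results: one human consensus label per narrative, plus ten
# repeated LLM classifications (POSITIVE / INCONCLUSIVE / NEGATIVE).
human_consensus = ["POSITIVE", "NEGATIVE", "INCONCLUSIVE", "NEGATIVE"]
llm_runs = [
    ["POSITIVE"] * 9 + ["INCONCLUSIVE"],      # narrative 1
    ["NEGATIVE"] * 10,                        # narrative 2
    ["NEGATIVE"] * 6 + ["INCONCLUSIVE"] * 4,  # narrative 3
    ["NEGATIVE"] * 10,                        # narrative 4
]

# Majority vote across the ten runs gives each narrative a single LLM label.
majority_labels = [Counter(runs).most_common(1)[0][0] for runs in llm_runs]

# Chance-corrected agreement between the LLM majority labels and the benchmark.
kappa = cohen_kappa_score(human_consensus, majority_labels)

# Consistency: proportion of runs that match the modal label for each narrative.
consistency = [Counter(runs).most_common(1)[0][1] / len(runs) for runs in llm_runs]

print(f"Cohen's kappa vs human consensus: {kappa:.2f}")
print("Per-narrative consistency:", [round(c, 2) for c in consistency])
```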
Bias analysis
LLMs are trained on vast quantities of text sourced from the internet. It is well understood that such training data may not represent all sections of society equally: certain voices, perspectives, and cultural narratives may be overrepresented, while others are marginalised or excluded. This imbalance has the potential to perpetuate bias and reinforce stereotypes in model outputs.
To identify potential biases in this approach, the team will conduct counterfactual analyses to assess the impact of demographic characteristics on vulnerability classifications. Counterfactual analyses involve creating hypothetical scenarios by altering specific details in the data while keeping other factors constant. This method allows the team to isolate and examine the effects of race, sex, and other person-specific descriptors on model outputs while controlling for other contextual factors within the narratives.
- Counterfactual Scenarios: The team will select a subset of narratives and generate counterfactual versions, systematically altering the race and sex descriptors across a predefined set of demographics. This process will yield a set of narratives identical in content to the originals, differing only in the demographic descriptors of the subject (a simple sketch of this substitution follows this list).
- Analysis: The team will apply its classification methodology to these counterfactual narratives and analyse the results to identify any systematic differences in vulnerability classifications based on demographic characteristics.
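To make the counterfactual procedure concrete, the sketch below generates demographic variants of a template narrative by swapping race and sex descriptors while leaving all other content unchanged. The descriptor lists and substitution rules are illustrative placeholders, not the project’s actual scheme; each variant would then be classified with the same prompt and model so the outputs can be compared.

```python
# Minimal sketch: generating counterfactual versions of a narrative by swapping
# demographic descriptors while holding everything else constant. The descriptor
# terms and replacement logic are illustrative placeholders.
import itertools
import re

narrative = "Officer observed a white male individual sleeping in a public park."

races = ["white", "Black", "Hispanic", "Asian"]
sexes = ["male", "female"]

def make_counterfactual(text: str, race: str, sex: str) -> str:
    """Replace the race and sex descriptors in a template narrative."""
    text = re.sub(r"\b(white|Black|Hispanic|Asian)\b", race, text)
    text = re.sub(r"\b(male|female)\b", sex, text)
    return text

# One counterfactual per combination of descriptors.
counterfactuals = [
    make_counterfactual(narrative, race, sex)
    for race, sex in itertools.product(races, sexes)
]

for version in counterfactuals:
    print(version)
```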
Bias analyses will provide crucial insights into the fairness and reliability of LLM-based vulnerability classification in policing narratives. They are critical in ensuring responsible development and application of this technology in sensitive contexts.
Lead investigators
- Professor Dan Birks (University of Leeds)
- Professor Charlie Lloyd (University of York)
Postdoctoral researcher
- Sam Relins (University of York)