CHATGPT TRIALS IN AUDIT AT BEEVER AND STRUTHERS

This blog post presents the findings from a ChatGPT pilot conducted at Beever and Struthers in July and August 2024. The purpose was to explore the feasibility of adopting ChatGPT into the firm's audit operations. The trial involved volunteers from the Manchester and London offices who actively engaged with ChatGPT and recorded their experiences in diaries and interviews. The interviews, conducted at the end of the trial, focused on the volunteers' reported use cases, benefits, challenges and implications for wider implementation.

At the start of the trial, volunteers were briefed and tasked with using ChatGPT in their daily work for a range of activities such as sourcing and summarizing regulatory information, solving simple problems, summarizing meeting minutes, and generating accounting-related reports.

The volunteers were self-selected following a request for participation sent to the audit group and were representative of younger employees (under 30 years old). This demographic information was collected using a concise survey, which also revealed that the volunteers had limited previous exposure to ChatGPT and that all were uncertain about its use cases in audit prior to the trial. Concerns were raised before the trial regarding the privacy and confidentiality of data entered into ChatGPT, although compliance staff had cleared the usage and the enterprise version, which has enhanced security and privacy features, was used.

The trial was preceded by a brief introduction to ChatGPT and an information pack on use cases drawn from the ICAEW website and the cpa.com “Generative AI Toolkit.”

The timeline of the trial is outlined in Figure 1. At the end of the trial, semi-structured interviews were conducted individually with each volunteer on a Zoom call using a set of specific, pre-determined questions; however, the conversation was kept open-ended to allow volunteers to fully articulate their experiences over the weeks of using ChatGPT in their work. All interviews were recorded and transcribed.

Figure 1 Timeline of the pilot

Table 1: Volunteer characteristics

FINDINGS

Use Cases

Volunteers reported using ChatGPT for various activities, including summarizing accounting standards, solving issues with Excel formulas, generating reports, and academic purposes related to studying for accounting qualifications (see Figure 2).

Figure 2 Use cases of ChatGPT identified by volunteers.

KX highlighted the usefulness of ChatGPT in summarizing client meeting minutes:

“Sometimes clients will give you your board minutes in separate documents, so it’s just kind of useful to upload them all and then extract it.”

Similarly, BH mentioned using ChatGPT to summarize FRS102, a lengthy and time-consuming document:

“So FRS102 is the main document that I need, and it is hundreds of pages long. You can fall asleep reading there.”

All volunteers stated that ChatGPT performed well in these tasks, delivering prompt and accurate responses.

A group of volunteers who primarily worked with Excel used ChatGPT to correct Excel formulas. DK explained that he found it quicker to use ChatGPT to resolve errors:

“And instead of me going through it and trying to find where I’ve made that mistake, I can just put the formula into ChatGPT, and it usually corrects the formula that I can put it back in.”

TC used ChatGPT to generate a payroll report, a task that had not previously been tested with the tool, and observed:

“So, it generates a really accurate description of the test going on. And I have pulled out things that are relevant for my working paper. That is probably saved me 30 minutes work.”

ChatGPT was also reported to perform well in generating or solving quantitative problems, but only when given accurate prompts. While TC obtained correct results with precise prompting, the Excel users encountered a number of inaccurate outcomes.

DK, who had some prior knowledge of ChatGPT, noted a tendency toward “hallucinations”, a common complaint about ChatGPT in the literature:

“You can tell when it’s actually making up stuff, or it’s actually taking things out of the text and really understanding it.”

Work Efficiency

During the trial period, volunteers had the opportunity to explore ChatGPT and assess its impact on work efficiency. Volunteers stated:

“A lot of tasks would take maybe 15–20 minutes, I could see like cut down to maybe 5, maybe less than that.” (MD)

“It is quite effective, quick and good. There was nothing that I tried to use, and it did not work, or it took more time than I thought.” (HA)

However, the trial also highlighted a recurring challenge: many users struggled due to a lack of knowledge about specific use cases and insufficient training. This gap in understanding often led to difficulties in effectively utilizing ChatGPT.

The effectiveness of ChatGPT was reported as dependent on the specificity of the prompts provided. The tool performed well when users gave clear and accurate instructions, emphasizing the importance of precise input. As KX said:

“I was asking it to summarize anything to do with expenditure for residential care and grants and stuff, but I didn't define exactly what that was, so it wasn't too useful because a lot of things go into it.” (KX)

This also underscored the need for training support to help newer users grasp how to apply ChatGPT effectively in practical scenarios.
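To illustrate the difference specificity makes, a hypothetical reworking of KX’s request (the document, period and output format below are invented for illustration) might look like this:

Vague prompt: “Summarize anything to do with expenditure for residential care and grants.”

Specific prompt: “From the attached management accounts for the year ended 31 March 2024, summarize total expenditure on residential care and on grant funding, list each line item with its value in a two-column table, and state clearly if any item cannot be found rather than estimating it.”

The second version names the source document, the period, the required output format and how to handle missing information, leaving the model far less room to return an unfocused or invented answer.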

Trust in Outputs

Another significant finding was the low level of trust users placed in ChatGPT’s outputs. As expected, most participants felt the need to verify answers, sometimes two or three times, before relying on them, reflecting a cautious approach to adoption.


STRATEGIES FOR WIDER IMPLEMENTATION

Customized Training Initiatives

Findings underscore the importance of implementing a tailored and comprehensive training program to maximize the potential of ChatGPT. The generic training provided initially was insufficient: participants were not fully equipped to explore the extensive functionalities of ChatGPT, which led to mistakes and inefficiencies and hindered its effective use in daily operations.

Several participants expressed this as a challenge, with one stating:

“The first week was tough because I struggled with some use cases.”

This feedback highlights the need for training programs that are specifically designed to meet the unique requirements of the firm. Research by Boothby et al. (2010) supports this, showing that organizations that introduce targeted, in-depth training when adopting innovative technologies experience significant gains in productivity.

We therefore recommend creating a detailed and adaptable training manual, allowing auditors to learn at their own pace and integrate this learning into their regular work activities. This approach ensures that all employees, regardless of their prior experience with the technology, can fully exploit its advantages, leading to enhanced performance and more seamless technology adoption across the firm.

Prompt Engineering

Prompts are crucial for the accuracy and efficiency of ChatGPT. Grounding requests in domain-specific context, or fine-tuning the model with domain-specific data where that option is available, can significantly reduce errors and enhance performance in specialized tasks.

Volunteers often found themselves needing to refine their prompts to correct initial mistakes. As one participant described:

“The only problem I encountered was when I asked ChatGPT to find which board minutes a figure came from—it gave me the wrong file initially, but after rephrasing my request, it provided the correct one.”

This example illustrates the importance of precise prompt engineering to achieve accurate outcomes.
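As a hypothetical illustration of such a rephrasing (the figure and meeting dates below are invented), the improvement often comes simply from adding context and a verification request:

Initial prompt: “Which board minutes does the £250,000 figure come from?”

Rephrased prompt: “I have uploaded three sets of board minutes (January, April and July). Searching only these documents, tell me which meeting’s minutes mention a figure of £250,000 and quote the sentence in which it appears.”

Constraining the search to the uploaded files and asking for the supporting sentence also makes the answer easier to verify, which speaks to the trust concerns noted earlier.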

Employing ‘chain-of-thought’ prompting can further refine ChatGPT performance by breaking down tasks into smaller steps, allowing it to build on previous reasoning and arrive at more accurate solutions for complex issues (Kojima et al., 2022; Wei et al., 2022).

Wei et al. (2022) have outlined a six-step framework for chain-of-thought prompting:

Task explanation, action explanation, input data explanation, output data explanation, input-output example, and task execution.
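As a sketch of how this framework might be applied to an audit task (the data, threshold and file contents below are invented for illustration):

1. Task explanation: “You are assisting an audit team with payroll testing.”
2. Action explanation: “Compare the monthly payroll totals in the attached report against the ledger totals I provide and flag any difference greater than 5%.”
3. Input data explanation: “The attachment is a twelve-month payroll summary; my figures are the monthly totals per the general ledger.”
4. Output data explanation: “Return a table with columns for month, payroll report total, ledger total, difference and flag.”
5. Input-output example: “For example: April, 102,000, 100,000, 2,000, not flagged.”
6. Task execution: “Now perform the comparison and list any months you could not match.”

Supplying the context, the data and the expected output before asking for execution in this way reduces the need for repeated corrective prompts.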

Applying these techniques will not only reduce errors but also empower teams to fully leverage ChatGPT’s capabilities.

Demographic Influences

Our research also considered the role of demographic factors in the speed and effectiveness of technology adoption, a topic with mixed findings in the literature. While some studies suggest that older individuals are more inclined to embrace technology if they see its value and benefits (Mitzner et al., 2010), this is not universally accepted. To address demographic challenges, we suggest encouraging all employees to use LLMs (ChatGPT and the many other LLM variants) outside of the workplace, thus helping them become more familiar with the technology in everyday situations.

Additionally, open communication in tailored training sessions, addressing any concerns related to privacy and data protection and explaining the benefits of ChatGPT, may reduce resistance to change.

Mitigating Technological Aversion

Participant BH mentioned a prior negative experience with the integration of another AI tool, Data Snipper, which may have caused lingering reluctance among some employees:

“You must go on the server, and Data Snipper had numerous issues on our server before, so I think some people do not want to use it because of those past problems, even though it generally works now. It is about reassuring those who are hesitant.”

This type of ‘computer anxiety’ has been identified in the literature as a barrier to technology adoption (Czaja et al., 2006). Improving communication, especially when it comes to addressing and resolving software issues as they arise, could significantly reduce such anxieties and ease the process of integrating innovative technologies. Figure 3 summarizes the suggested approach to wider implementation based on the trial.

Figure 3 – Wider implementation considerations for ChatGPT

REFERENCES AND BIBLIOGRAPHY

Boothby, D., Dufour, A., & Tang, J. (2010). Technology adoption, training and productivity performance. Research Policy, 39(5), 650-661.

Czaja, S. J., Charness, N., Fisk, A. D., Hertzog, C., Nair, S. N., Rogers, W. A., & Sharit, J. (2006). Factors predicting the use of technology: Findings from the Center for Research and Education on Aging and Technology Enhancement (CREATE). Psychology and Aging, 21(2), 333.

Emett, S., Eulerich, M., Lipinski, E., Prien, N. and Wood, D.A. (2024). Leveraging ChatGPT for Enhancing the Internal Audit Process—A Real-World Example from Uniper, a Large Multinational Company. Accounting Horizons, pp. 1–11. doi: https://doi.org/10.2308/horizons-2023-111.

Kojima, T., Gu, S., Reid, M., Matsuo, Y. and Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv (Cornell University). doi: https://doi.org/10.48550/arxiv.2205.11916.

Mitzner, T.L., Boron, J.B., Fausset, C.B., Adams, A.E., Charness, N., Czaja, S.J., Dijkstra, K., Fisk, A.D., Rogers, W.A. and Sharit, J. (2010). Older adults talk technology: Technology usage and attitudes. Computers in Human Behavior, 26(6), pp. 1710–1721. doi: https://doi.org/10.1016/j.chb.2010.06.020.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. doi: https://doi.org/10.48550/arxiv.2201.11903.

Acknowledgements

The trial and evaluation were conducted by the following students as part of their MSc Accounting and Finance individual dissertations, under the supervision of Professor Brian Nicholson and Dr Sung Hwan Chai at Alliance Manchester Business School, University of Manchester, UK:

Tanmay Joshi

Chian-Yi Huang

Hashim Ali

Ramya Chandra Shekar

Mengyue Li
