The Arup case, and what it actually demonstrated

The Arup incident, first reported by Hong Kong police in early 2024 and confirmed publicly later that year [1], deserves close reading because it is among the first publicly documented cases at significant scale to combine real-time video deepfakes, voice deepfakes and a multi-party social engineering scenario. The targeted employee initially suspected phishing on receiving an email from the supposed CFO, but was convinced by a subsequent video conference in which the CFO and several colleagues all appeared and sounded authentic. The transferred funds, approximately 200 million Hong Kong dollars (roughly 25 million US dollars), were moved across multiple transactions and bank accounts.

What the case demonstrated is not that deepfakes have become technically perfect: investigation later identified subtle artefacts that a trained observer might have caught. It demonstrated that, in a high-pressure operational context with established trust relationships and apparent multi-party corroboration, the human verification heuristic that protects most organisations from voice-only phishing breaks down. The attack succeeded not because the deepfake was undetectable but because the social engineering structure made detection unlikely [2].

Comparable cases have surfaced at lower scale and with less public detail: voice-deepfake fraud against treasury staff in financial firms has been documented in multiple jurisdictions through 2024 and 2025. ENISA's threat landscape work classifies real-time voice synthesis as a mainstream rather than emerging technique [3].

The technical state of the art

Voice cloning has moved from requiring tens of minutes of training audio in 2020 to requiring three to five seconds in 2025. Public model releases produce cloned voices that are indistinguishable from the source to most listeners after a single demonstration sentence. Real-time streaming variants, which clone a target voice and synthesise new content with end-to-end latency below 300 milliseconds, are available both commercially and as community projects [2].

Real-time video deepfakes that operate on a live camera feed no longer exceed the compute budget of off-the-shelf hardware. A laptop with a modern consumer GPU can run a real-time face-swap model at video conferencing resolution with sub-150 ms latency. Integration with standard videoconferencing platforms via virtual camera drivers is trivial. The Arup case is consistent with this technology stack being deployed against a real target [1].

The defensive picture is asymmetric. Detection models trained on a given generation of synthesis methods reliably catch that generation in laboratory conditions, but degrade rapidly against newer generators and against adversarially post-processed media. The arms race favours the attacker structurally: synthesis quality must only exceed the perceptual threshold of a non-expert observer, while detection must keep pace with every new generator that emerges [6][7].

Where the enterprise attack surface actually lives

The high-impact attack surface for voice deepfakes in an enterprise is narrow but valuable: treasury and finance functions, where authorised personnel can initiate large transfers on verbal authority; M&A and legal teams, where confidential information has immediate market or transactional value; executive assistant workflows, where calendar manipulation and credential acquisition follow from a single convincing call; and help-desk password resets, where a deepfaked executive can request elevated access that bypasses normal IT controls.

What unites these surfaces is reliance on voice or video as the primary out-of-band verification channel for high-trust actions. The historical assumption, that a phone call from a known voice constitutes meaningful authentication, has been the operational backstop for decades. That assumption is now demonstrably wrong and must be replaced [2][3].

The lateral attack surface is less commonly discussed but matters as much. Deepfaked colleagues calling other colleagues to extract small pieces of information (server names, access patterns, internal terminology) produce intelligence that strengthens subsequent attacks. Deepfaked voicemails left for executives generate replies that can be harvested for biometric material. Each interaction trains the attacker's model of the organisation.

Why the obvious countermeasures are insufficient

Many enterprises have responded to the deepfake threat with policy changes that, examined honestly, do not work. "Always call back on a known number" fails when the attacker has compromised the target's personal mobile phone (a routine assumption in any credible threat model). "Use code words" fails because most code words leak through normal operational chatter and because the attacker can convincingly request a new code word in the same call. "Look for visual artefacts" fails because the artefacts of current-generation models are not reliably visible to non-experts [2].

Liveness detection and behavioural biometrics are sometimes proposed as technical countermeasures. They have a role, but a limited one. Liveness checks based on prompt-and-response can be defeated by attackers who have automated the prompt handling. Behavioural biometrics work in retrospect for fraud analytics but rarely in real time against a determined attacker. Neither addresses the core problem, which is that the channel itself does not authenticate the participant [5][6].

The honest conclusion is that human perceptual detection of deepfakes is not a viable defence and policy alone cannot compensate. Defence requires either binding the conversation to a cryptographic identity at the device level, or restructuring the operational workflow so that no high-trust action can be authorised on a verbal channel without an out-of-band cryptographic confirmation. The former is a product strategy; the latter is an organisational redesign [4].

Cryptographic identity at the endpoint

The only defence that survives an arbitrarily good deepfake is one that does not depend on recognising the voice or the face. If both endpoints of a call are bound to a hardware-rooted cryptographic identity, and the call is authenticated end-to-end at the device level rather than at the application level, the content of the audio becomes irrelevant to authentication. The attacker can synthesise a perfect voice clone; without possession of the target's hardware identity, the call cannot be placed as an authenticated call from the target.

This is the design space in which sovereign hardware voice endpoints, hardware security keys for voice, and PQC-protected enterprise telephony operate. The cryptographic primitives are well understood (mutual authentication via ML-DSA-87 or comparable signature schemes, session establishment via hybrid ML-KEM-1024, transport encryption via AES-256-GCM), but the operational challenge is that the protection only attaches to the channel between two enrolled devices. A call to an unenrolled phone is unprotected by definition.
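The principle that authentication rests on possession of a device key, not on what the audio sounds like, can be illustrated with a minimal challenge-response sketch. It is an illustration under loud assumptions: Python's standard library has no ML-DSA implementation, so a shared-key HMAC stands in for the signature scheme, and the enrolment table, device identifier and function names are all hypothetical. A real endpoint would keep a private key in secure hardware, publish only the public key, and run the exchange in both directions for mutual authentication.

```python
import hashlib
import hmac
import secrets

# Hypothetical enrolment table: device id -> secret held in the device's
# secure hardware. HMAC is a stdlib stand-in for an ML-DSA signature; a real
# verifier would store public keys only.
ENROLLED_DEVICES = {"cfo-handset-01": secrets.token_bytes(32)}


def new_challenge() -> bytes:
    """Return a fresh random nonce so an old response cannot be replayed."""
    return secrets.token_bytes(32)


def device_response(device_key: bytes, challenge: bytes) -> bytes:
    """Computed on the calling device; the key itself is never transmitted."""
    return hmac.new(device_key, challenge, hashlib.sha256).digest()


def verify_caller(device_id: str, challenge: bytes, response: bytes) -> bool:
    """Accept the call only if the response proves possession of the key."""
    key = ENROLLED_DEVICES.get(device_id)
    if key is None:
        return False  # unenrolled device: no protection attaches to the call
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Nothing in the check inspects voice or video, so a perfect clone gains the attacker nothing unless the enrolled device itself is also compromised.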

For organisations whose threat model justifies it, the practical implication is a tiered communication policy. High-trust roles operate on enrolled hardware-authenticated devices for any conversation involving authorisation, confidential information or executive decision-making. Conventional telephony remains available for routine traffic. The boundary between the two tiers becomes a procedural control: certain action classes (transfers above a threshold, M&A discussions, credential operations) are simply not authorisable over conventional channels.
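The tier boundary can be expressed as a simple policy check. The following is a minimal sketch, not a reference implementation: the action names, the `Channel` enum and the threshold value are hypothetical placeholders that each organisation would set for itself.

```python
from enum import Enum


class Channel(Enum):
    CONVENTIONAL = "conventional telephony"
    ENROLLED = "hardware-authenticated endpoint"


# Hypothetical values for illustration only.
TRANSFER_THRESHOLD = 10_000
RESTRICTED_ACTIONS = {"ma_discussion", "credential_operation"}


def is_authorisable(action: str, channel: Channel, amount: float = 0.0) -> bool:
    """Return True if the action class may be authorised over this channel."""
    if channel is Channel.ENROLLED:
        return True  # enrolled devices carry every action class
    if action in RESTRICTED_ACTIONS:
        return False  # never authorisable over conventional channels
    if action == "transfer" and amount > TRANSFER_THRESHOLD:
        return False  # transfers above the threshold need the enrolled tier
    return True  # routine traffic stays on conventional telephony
```

The point of encoding the rule is that the refusal is procedural: the person on the call has no discretion to grant an exception, however convincing the caller sounds.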

Operational mitigations for the next 24 months

Most organisations cannot deploy hardware-authenticated voice endpoints across their entire workforce in the short term. In the interim, several operational mitigations meaningfully reduce exposure without eliminating it. The most important is a strict separation between the channel that initiates a high-trust action and the channel that authorises it: a voice call may request a wire transfer, but the authorisation must occur through a second channel (an authenticated web portal, a hardware MFA token, a face-to-face confirmation) that the attacker would need to compromise independently [2].
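Channel separation reduces to a small invariant that a workflow system can enforce. The sketch below assumes hypothetical channel labels and a hypothetical `Request` record; it shows only the rule that the authorising channel must be an approved one and must differ from the initiating channel.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for channels accepted for authorisation.
AUTHORISING_CHANNELS = {"web_portal", "hardware_mfa", "in_person"}


@dataclass(frozen=True)
class Request:
    action: str
    initiated_via: str  # e.g. "voice_call"


def may_execute(request: Request, authorised_via: Optional[str]) -> bool:
    """Allow execution only with authorisation on a separate, approved channel."""
    if authorised_via is None:
        return False  # a request alone never authorises itself
    if authorised_via not in AUTHORISING_CHANNELS:
        return False  # voice and video are not accepted for authorisation
    return authorised_via != request.initiated_via
```

For example, `may_execute(Request("wire_transfer", "voice_call"), "hardware_mfa")` passes, while the same request with authorisation given on the call itself is refused.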

Transfer thresholds tied to multi-person authorisation, with at least one authoriser using a different communication modality from the requester, defeat the single-call deepfake scenario that has produced most documented losses. Mandatory cooling-off periods on novel beneficiaries (no transfers to new accounts within 24 hours of first request, regardless of source) eliminate the time pressure that most deepfake-driven fraud relies on [3].
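Both controls can be combined in one check. This is a sketch under stated assumptions: the 24-hour cooling-off period comes from the text, while the threshold value, the requirement of two approvers, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

COOLING_OFF = timedelta(hours=24)  # from the text
MULTI_PERSON_THRESHOLD = 50_000    # hypothetical value
MIN_APPROVERS = 2                  # hypothetical; "multi-person" read as two


@dataclass(frozen=True)
class Approval:
    approver: str
    modality: str  # e.g. "voice", "web_portal", "in_person"


def transfer_allowed(amount: float, requester: str, request_modality: str,
                     beneficiary_first_requested: datetime, now: datetime,
                     approvals: List[Approval]) -> bool:
    """Apply the cooling-off period, then the multi-person rule."""
    # No transfer to a beneficiary within 24 hours of its first request.
    if now - beneficiary_first_requested < COOLING_OFF:
        return False
    if amount <= MULTI_PERSON_THRESHOLD:
        return True
    # The requester cannot count as an approver of their own request.
    independent = [a for a in approvals if a.approver != requester]
    if len({a.approver for a in independent}) < MIN_APPROVERS:
        return False
    # At least one approver must use a different modality from the request.
    return any(a.modality != request_modality for a in independent)
```

A single convincing call cannot satisfy this check: it supplies neither the elapsed time nor an approval on a second modality.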

Training matters less than process: organisations that have invested heavily in employee training to detect deepfakes have not measurably reduced their loss rate, while organisations that have rebuilt their authorisation workflows around channel separation have. The lesson from the Arup case and its successors is that the attackers do not need to defeat trained employees; they need to find one untrained employee in a high-pressure operational context. Process change removes the vulnerability without depending on human perceptual performance.

What the next two years will look like

Three trends will shape the threat landscape through 2027. First, the cost of high-quality real-time voice and video synthesis will continue to fall, with the technology shifting from skilled operator to push-button tooling. Second, attackers will increasingly combine synthesis with operational reconnaissance derived from open-source intelligence, leaked corporate data and prior compromises, producing scenarios that are difficult to distinguish from legitimate internal communication [8]. Third, regulatory and insurance pressure will harden authorisation workflows in financial services, healthcare and critical infrastructure, with knock-on effects on adjacent sectors.

Defenders should expect the threat to broaden from current high-value targets to mid-market enterprises within 12 to 24 months. The cost-benefit calculation that today restricts deepfake attacks to seven-figure transfers will shift toward five- and six-figure transfers as the per-attack cost drops. Process controls calibrated to today's threat will be inadequate to that environment [3][6].

What this means for you

If your organisation can be materially harmed by an attacker who can convincingly impersonate any of your senior executives in a voice or video call, your current control set is almost certainly inadequate. The remediation is not better detection: it is restructuring authorisation flows so that voice channels are not load-bearing for high-trust actions, and, for the highest-stakes communication, deploying hardware-authenticated endpoints that do not depend on human recognition [4].

Concrete priorities for the next 90 days: enumerate every workflow in which a verbal request from a recognised voice can initiate an action with material financial, reputational or security consequences; require channel separation for authorisation of every such workflow; remove the discretion of individual employees to bypass that separation under time pressure; assess the case for hardware-authenticated voice for the small subset of conversations whose strategic value justifies the investment.

The Arup case will not be the last deepfake loss on the order of 25 million US dollars. It will, with high probability, not be the largest. Treating it as the leading indicator of a structural shift in the threat landscape, rather than as an isolated incident, is the difference between organisations that will be in the news in 2027 and those that will not [1][2][3].

The deepfake threat to enterprise voice communications