UAB - The University of Alabama at Birmingham

Wiretapping via Mimicry

Crypto Phones are mobile apps that claim to offer end-to-end VoIP security guarantees based on a purely peer-to-peer, user-centric mechanism. To secure voice, video or even text communications, Crypto Phones require a cryptographic key, which the end parties agree on using a special-purpose key exchange protocol. This protocol results in a short (e.g., 16-bit or 2-word) checksum, called a Short Authenticated String (SAS), for each party. Each SAS is encoded, e.g., into numbers or words, and displayed on the user’s device. The users then verbally exchange and compare each other’s SAS values, and accordingly accept or reject the secure association attempt (i.e., detect the presence of a man-in-the-middle (MITM) attack). The following figure presents a simple Crypto Phone protocol.

Figure 1: Crypto Phone protocol (simplified)
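
The SAS step of the protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction used by any particular Crypto Phone: it assumes the SAS is a SHA-256 digest of the key-exchange transcript truncated to 16 bits, and the names `derive_sas`, `encode_numeric`, `encode_words` and the placeholder vocabulary `WORDS` are hypothetical.

```python
import hashlib

# Placeholder 256-entry vocabulary (hypothetical). A real deployment would
# use a fixed, published list such as the PGP word list.
WORDS = [f"word{i:03d}" for i in range(256)]


def derive_sas(transcript: bytes, bits: int = 16) -> int:
    """Assumed derivation: keep the top `bits` bits of SHA-256(transcript)."""
    digest = hashlib.sha256(transcript).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def encode_numeric(sas: int) -> str:
    """Encode a 16-bit SAS as a zero-padded decimal number."""
    return f"{sas:05d}"


def encode_words(sas: int) -> str:
    """Encode a 16-bit SAS as two words, one per byte."""
    high, low = sas >> 8, sas & 0xFF
    return f"{WORDS[high]} {WORDS[low]}"


def users_accept(local_sas: int, sas_spoken_by_peer: int) -> bool:
    """The users accept the session only if both SAS values agree."""
    return local_sas == sas_spoken_by_peer


if __name__ == "__main__":
    # Without an attacker, both parties hash the same transcript.
    transcript = b"alice-public-value|bob-public-value"
    alice_sas = derive_sas(transcript)
    bob_sas = derive_sas(transcript)
    print(encode_numeric(alice_sas), "/", encode_words(alice_sas))
    print("accept:", users_accept(alice_sas, bob_sas))
```

With a man-in-the-middle, the two parties see different transcripts, so under these assumptions their 16-bit SAS values agree by chance with probability of only about 2^-16 (1 in 65,536). The comparison is therefore only as trustworthy as the voice channel that carries the spoken SAS.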



The security of Crypto Phones (Cfones) crucially relies on the assumption that the human voice channel, over which SAS values are communicated and validated by the users, provides the properties of integrity and source authentication. In this work, we challenge this assumption and report on automated SAS voice imitation man-in-the-middle attacks that can compromise the security of Crypto Phones in both two-party and multi-party settings, even when users exercise due diligence. The first attack, called the short voice reordering attack, builds arbitrary SAS strings in a victim’s voice by reordering previously eavesdropped SAS strings spoken by the victim. The second attack, called the short voice morphing attack, builds arbitrary SAS strings in a victim’s voice from a few previously eavesdropped sentences (less than 3 minutes of speech) spoken by the victim. The following figure illustrates the attack.

Figure 2: Our short voice imitation MITM attack scenario for 2-Cfone – attack succeeds because of voice impersonation
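
The idea behind the short voice reordering attack can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this work: it assumes the eavesdropped SAS utterances have already been segmented into per-word audio clips (for example with a speech recognition tool), it represents each clip as a plain list of samples, and the names `build_clip_bank` and `forge_sas` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Clip = List[float]  # stand-in for a segment of audio samples


def build_clip_bank(recordings: List[Tuple[List[str], List[Clip]]]) -> Dict[str, Clip]:
    """Index eavesdropped per-word clips by the SAS word they contain.

    Each recording is (words, clips), where clips[i] is the victim saying
    words[i]. Word-level segmentation is assumed to have been done already.
    """
    bank: Dict[str, Clip] = {}
    for words, clips in recordings:
        for word, clip in zip(words, clips):
            bank.setdefault(word, clip)  # keep the first clip seen per word
    return bank


def forge_sas(target: List[str], bank: Dict[str, Clip]) -> Optional[Clip]:
    """Concatenate stored clips in the order of the target SAS.

    Returns None if the victim was never recorded saying some target word;
    reordering alone cannot produce that SAS.
    """
    if any(word not in bank for word in target):
        return None
    forged: Clip = []
    for word in target:
        forged.extend(bank[word])
    return forged


if __name__ == "__main__":
    eavesdropped = [
        (["alpha", "bravo"], [[0.1], [0.2]]),
        (["charlie", "delta"], [[0.3], [0.4]]),
    ]
    bank = build_clip_bank(eavesdropped)
    print(forge_sas(["delta", "alpha"], bank))
    print(forge_sas(["echo", "alpha"], bank))
```

In this sketch, reordering works only when every word of the target SAS appears in the recorded material. The morphing attack avoids this limit by synthesizing the victim’s voice from a few eavesdropped sentences.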



We design and implement our attacks using off-the-shelf speech recognition/synthesis tools, and comprehensively evaluate them with respect to both manual detection (via a user study with 30 participants) and automated detection. The results demonstrate the effectiveness of our attacks against three prominent forms of SAS encodings: numbers, PGP word lists and Madlib sentences. These attacks can be used by a wiretapper to compromise the confidentiality and privacy of Crypto Phone voice, video and text communications (and also the authenticity of text conversations).

We further extend this work to analyze the accuracy of machine-based speaker verification (voice biometrics) systems against voice imitation attacks. Using automated speaker verification systems in the context of Crypto Phones can be a natural near-future deployment scenario. The results of this evaluation show that even state-of-the-art machine verification systems fail to detect the attacked voices, although they can accurately distinguish an original speaker’s voice from a different speaker’s voice. The evaluation also shows that an attacked SAS is quantitatively more similar to the original SAS when the SAS string is shorter. Shorter SAS strings are therefore more prone to our attacks: both users and machines find them more difficult to detect than longer impersonated speech.