Defeating Google's audio reCaptcha with 85% accuracy

The internet is a weird and wonderful place where two or more parties share a link to communicate with each other. Services that assume a human is interacting with them leave significant scope for abuse by programs designed to take advantage of them. Sending data over the internet is computationally very cheap, and any service that does work on the data it receives, unless it returns a static zero-length message, will incur more cost than the sender.

We discuss how bot writers can use tools to defeat reCaptcha and wreak havoc.

Introduction

Imagine I create an online service that produces cryptographically secure 2048-bit prime numbers for use in an RSA encryption scheme, and I call it TotallySecurePrimes.com. This service would take a few hundred seconds of processing time to generate each prime. For the good people of the internet this would be an interesting service, securing them against all evil. For others it is a cheap source of computation, and they could create scripts, or bots, that exploit my service to do something more compelling, hopefully, than a distributed denial of service (DDoS) attack.
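To make the cost concrete, here is a minimal sketch of the kind of work such a service might do for each request. The function names, the use of Python's `secrets` module, and the choice of the Miller-Rabin test are all assumptions for illustration; nothing above says how TotallySecurePrimes.com would actually be built.

```python
import secrets


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test (illustrative sketch)."""
    if n < 2:
        return False
    # Cheap trial division by small primes before the expensive part.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # this base proves n is composite
    return True


def random_prime(bits: int = 2048) -> int:
    """Draw random odd candidates of exactly `bits` bits until one passes."""
    while True:
        # Force the top bit (exact size) and the low bit (odd).
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate
```

Each call tries a random number of candidates, and every candidate costs many modular exponentiations, so the server does far more work answering a request than the client does sending it.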

Now, if I am in any way security conscious, I could employ a number of techniques to prevent bots from exploiting my service. An easy method, employed by millions of websites, is to use a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). Typically, these reverse Turing tests are easy for a human to complete but difficult for a computer to answer correctly. As computers have become better at completing the tests, the tests themselves have evolved. Google’s reCaptcha system, a fascinating system that has also been used to make texts legible and to classify objects, is in widespread use; I would highly recommend Luis von Ahn’s Massive-scale online collaboration TED Talk to see how this system is used for good.

Demo of simple reCAPTCHA

These systems are primarily visual because of the difficulty that computers have had with vision, image context, and object categorisation. However, for reasons of accessibility there also needs to be an alternative way to prove that you are human for those of us who do not have sufficient vision. These alternatives are aural tests that attempt to bamboozle computers to the same degree as their visual counterparts.

Demo of audio reCAPTCHA system

Incorporating a CAPTCHA system on my site lets me feel safe that only the good people of the internet are using it. Sure, my users need to do a little test to show they are human, but they want something from me, they get what they asked for, and everyone is happy. Well, almost everybody. You see, those pesky bot makers are furious that their cheap source of crypto primes has dried up.

Breaking reCAPTCHA

The reCAPTCHA system is no fool. Most of the time it will simply ask you to check a box, as shown above, to prove you are human. The system uses many metrics to assess the risk you pose before selecting a challenge for you to complete. The higher the risk, the harder the challenge will be.

Demo of reCaptcha advanced challenge: an example of a higher-order challenge

The system chooses numbers to be spoken in different accents and at varying pitches over background noise. An attack on this system first splits the audio track into a number of sub-tracks, one for each spoken number it identifies, and uploads each sub-track to several online transcription services. The results for each sub-track are then collated, and the final series of numbers is selected using a carefully chosen heuristic. The series is then entered into the captcha service to complete the attack.
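The collation step described above can be sketched as a simple ensemble vote. This is only an illustration: the text does not say which heuristic the attack uses, so plain majority voting is an assumption here, as are the `WORD_TO_DIGIT` table and the function names. The sketch works purely on strings that have already been returned for each sub-track and does not talk to any service.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical table of words a transcription might contain for each digit.
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def normalise(transcript: str) -> Optional[str]:
    """Map one raw transcription to a digit, or None if it is not a digit."""
    token = transcript.strip().lower()
    if token in WORD_TO_DIGIT:
        return WORD_TO_DIGIT[token]
    if len(token) == 1 and token in "0123456789":
        return token
    return None


def vote(candidates: List[str]) -> Optional[str]:
    """Majority vote over the usable transcriptions of one sub-track.

    Ties go to whichever digit was seen first.
    """
    digits = [d for d in map(normalise, candidates) if d is not None]
    if not digits:
        return None
    return Counter(digits).most_common(1)[0][0]


def collate(per_segment: List[List[str]]) -> str:
    """Join the winning digit of each sub-track; '?' marks no usable answer."""
    return "".join(vote(candidates) or "?" for candidates in per_segment)
```

For example, `collate([["one", "1", "won"], ["Seven", "seven", "heaven"], ["noise"]])` keeps the answers that most transcriptions agree on and flags the last sub-track as unresolved. A real heuristic could weight each service differently or accept near-homophones such as "won", which this sketch discards.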

This relatively simple attack is accurate in 85% of cases, and Google’s reCaptcha system will trust us more with every correct answer we give it. A tool called unCaptcha has been developed to automate these attacks. Further information is available via the USENIX WOOT '17 slides, and you can even watch a video showing how it exploits Reddit.