While running a series of unit tests that make API calls to Amazon Web Services (AWS), I noticed something strange: tests were failing unpredictably. Sometimes all the tests would pass, then on the next run, a few would fail, and the time after that, a different set would fail.
The errors I was getting didn’t seem to make any sense:
Aws::EC2::Errors::AuthFailure: AWS was not able to validate the provided access credentials
Aws::RDS::Errors::SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
Aws::Lambda::Errors::InvalidSignatureException: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.
These exceptions seemed to indicate that the tests were using incorrect credentials, but I knew that wasn’t true. I had used these credentials many times in the past, and they had worked every time. Furthermore, each one of these unpredictable unit tests had passed at least once—so the credentials must have been valid. Why, then, was I getting these errors?
These tests did make a large number of requests. Could it be rate limiting (with decidedly unhelpful error messages)? This seemed like a reasonable cause, but I was unable to reproduce the problem, even with a higher volume of requests than the tests.
Maybe the credentials really were invalid some of the time. As a sanity check, I modified the tests to print out the credentials being used for each API call. Was this simply a case of incorrect, unpredictable code?
Nope. No luck. The credentials were indeed valid, each and every time.
Well, except for that one unit test that tested invalid credentials, of course.
$ ruby aws_test.rb
Using valid credentials:
Well, the issue certainly was real. Something interesting I noticed was that the first API call to AWS would usually take about half a second, but subsequent ones would be considerably faster. If I waited for about five seconds, though, the next request would again be (comparatively) slow.
This seemed to indicate some sort of caching issue. I dug through the code for the AWS Ruby SDK, but found nothing related to response caching. The difference in request speeds turned out to be caused by connection pooling—the AWS Ruby SDK keeps HTTP connections alive for five seconds by default. Disabling connection pooling didn't make the issue go away, though, and even a script I wrote to manually build and send HTTP requests hit the same failures. The problem persisted.
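For the record, this is roughly how connection reuse can be turned off in the Ruby SDK (a configuration sketch; `http_idle_timeout` is the real client option behind the five-second keep-alive, while the region and the API call here are just placeholders):

```ruby
require "aws-sdk-ec2"

# The SDK keeps idle HTTP connections alive for 5 seconds by default;
# setting the idle timeout to 0 forces a fresh connection per request.
ec2 = Aws::EC2::Client.new(
  region: "us-east-1",
  http_idle_timeout: 0
)

ec2.describe_instances # still failed intermittently either way
```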
When I mentioned this to our CTO, he had an immediate hypothesis for what was going on: AWS was tarpitting me. The service was intentionally rejecting even valid credentials when preceded by too many failed authentication attempts, in order to prevent brute force attacks. This fit the pattern of facts we were seeing, and was the simplest explanation we could come up with.
The solution? A simple retry algorithm with exponential backoff. It's not an ideal solution, though. Because there is no way to distinguish a response to a request with invalid credentials from a response to a blocked request, both scenarios must be retried. That means legitimately invalid credentials will cause considerable delays when authenticating: we can't know whether the credentials are actually invalid or we're just being tarpitted, so there's no choice but to wait until the maximum number of retries is exceeded.
All that can really be done is to tune the retry algorithm based on when and where the API calls are taking place. For tasks running in the background, it's usually okay to have large numbers of retries and long delays. But when processing a request from a user, you'll have to use only a few retries and live with the possibility that the user could see an incorrect error about invalid credentials and have to try again.
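A minimal sketch of the retry-with-backoff approach (the `AuthFailure` class and the operation are stand-ins; in practice you'd rescue the SDK's actual auth exceptions):

```ruby
# Hypothetical stand-in for the SDK's auth errors (Aws::EC2::Errors::AuthFailure, etc.)
class AuthFailure < StandardError; end

# Retry a block with exponential backoff, since a tarpitted request
# is indistinguishable from one with genuinely bad credentials.
def with_backoff(max_retries: 5, base_delay: 0.5)
  attempts = 0
  begin
    yield
  rescue AuthFailure
    attempts += 1
    raise if attempts > max_retries # the credentials may really be invalid
    sleep(base_delay * 2**(attempts - 1)) # 0.5s, 1s, 2s, 4s, ...
    retry
  end
end

# Example: an operation that is "tarpitted" twice before succeeding.
calls = 0
result = with_backoff(base_delay: 0) do
  calls += 1
  raise AuthFailure, "AWS was not able to validate the provided access credentials" if calls < 3
  :authenticated
end
```

Note the trade-off baked into `max_retries`: it is exactly the number of extra round trips a user with genuinely bad credentials must sit through before seeing an error.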
Should I implement this kind of tarpitting in my service?
This is not the Right Way™ to do things, and this is not the time to follow in Amazon’s footsteps. Not being able to differentiate legitimately invalid credentials from rate limiting or tarpitting is a serious problem for usability. Here are some ways AWS could remedy this problem while still preventing brute force attacks:
Return a different error message when too many authentication attempts have been made.
Introduce a network delay when too many authentication attempts have been made.
Require callers to process request signatures with a key derivation function (although this has the effect of slowing all requests and increasing CPU usage).
Use secret keys long enough so that brute forcing is computationally intractable, and don’t have any delays or tarpitting (just normal rate limiting to reduce server load). Remember that these secret keys are automatically generated—there is no possibility of users choosing weak passwords.
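To illustrate the key-derivation idea from the list above (a sketch with made-up parameters; AWS's real SigV4 signing uses plain HMAC chains, not a slow KDF—the secret key shown is the documentation example key):

```ruby
require "openssl"

secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
string_to_sign = "GET\n/\nhost:example.amazonaws.com"

# Derive the actual signing key with a deliberately slow KDF, so each
# brute-force guess costs real CPU time instead of one cheap HMAC.
signing_key = OpenSSL::KDF.pbkdf2_hmac(
  secret_key,
  salt: "20240101/us-east-1/ec2", # e.g. the request's credential scope
  iterations: 100_000,            # tunable work factor
  length: 32,
  hash: "SHA256"
)

signature = OpenSSL::HMAC.hexdigest("SHA256", signing_key, string_to_sign)
```

As the list item notes, the cost cuts both ways: the server must also pay the KDF's CPU price on every legitimate request.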
Or, the best solution: don’t use HMACs for authentication! If you’re using HMACs for authentication, you’re necessarily storing the secret keys in plaintext. By now, haven’t we learned that storing plaintext passwords is a bad idea?
Instead, generate an asymmetric keypair, give the private key to the user, and store only the public key on the server. Assuming you used secure parameters to generate the keys (you didn’t use an exponent of 3, did you?) and you use a secure signature scheme (let’s maybe avoid textbook RSA), the user can securely sign requests with this key, with the assurance that it is computationally intractable to forge signatures (until we have quantum computers, at least…).
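A sketch of that scheme in Ruby's standard OpenSSL bindings (the request string is invented; RSA-2048 with SHA-256 stands in for whatever secure signature scheme you'd pick):

```ruby
require "openssl"

# Server side: generate a keypair. The user receives the private key;
# the service stores only the public key.
private_key = OpenSSL::PKey::RSA.new(2048) # public exponent defaults to 65537, not 3
public_key  = private_key.public_key

request = "GET /instances HTTP/1.1\nhost: api.example.com\nx-date: 20240101T000000Z"

# Client side: sign the canonical request with the private key
# (a padded, hashed scheme -- never textbook RSA).
signature = private_key.sign(OpenSSL::Digest.new("SHA256"), request)

# Server side: verify with the stored public key. A leaked database
# of public keys lets an attacker forge nothing.
verified = public_key.verify(OpenSSL::Digest.new("SHA256"), signature, request)
tampered = public_key.verify(OpenSSL::Digest.new("SHA256"), signature, request + "!")
```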