The Case of the Apparent NSS Memory Corruption

This is a story of my encounter with an apparent memory corruption issue in the Netscape Security Services library.

The source I’m discussing can be found on Github.


© Alzay | Dreamstime.com – Computer test


Usually, when I try to get acquainted with a new API, I start to write simple program, one API call by call, which I compile and run after each step.

Imagine my surprise, when after adding the following function call (the only thing I added)

  PK11_FindKeyByAnyCert( certificate, passwd );

I got this memory corruption error.

  dblfree(56630,0x7fff73f61300) malloc: *** error for object 0x7fd39250ce70: pointer being freed was not allocated
  *** set a breakpoint in malloc_error_break to debug
  zsh: abort      ./dblfree

The above error is taken from my minimal example of the problem, not the actual program I was working on at the time. The only difference is the name of the binary and the hex numbers.

So what is happening here? I didn’t know. And to find out, it’s really important to use the right tool for the job.

So the first thing I did was to instrument my code with the built-in OS X tools, instruments(1). That didn’t tell me much; either because it doesn’t help in this particular instance, or that I just don’t know how to use it.

I will make a note that some people suggested Valgrind. I didn’t go that way because the problem seems to be adequately described with the Clang Address Sanitizer.

The Clang Address Sanitizer

Zeroth thing zeroth, before I could begin to build NSS with the Address Sanitizer enabled, I had to have a compiler for which it worked. I started by installing the latest Clang in Macports.

  sudo port install clang-4.0

However, when I tried to use it to build my ready-to-compile hello world with the Address Sanitizer, I got this error when I tried to execute ./hello.

  dyld: Library not loaded: /opt/local/libexec/llvm-4.0/lib/libclang_rt.asan_osx_dynamic.dylib
    Referenced from: ./hello
    Reason: image not found

So the sanitizer is expecting a runtime in this directory, /opt/local/libexec/llvm-4.0/lib/ but it’s not there. It turns out it’s installed in a subdirectory, quite below it. The solution I used was to symlink the relevant files into this directory.

  cd /opt/local/libexec/llvm-4.0/lib/

  sudo sh -c 'for f in clang/4.0.0/lib/darwin/*.dylib; do ln -s $f; done'

Now my hello world would both compile and run.

Building NSS with the Address Sanitizer

That turned out to be non-trivial to build NSS with the Address Sanitizer; and took some trial, error and Duck Duck Go on my part to get it right — No, I don’t normally use Google.

I already had a recent checkout of NSS and NSPR, so I didn’t update it before I started on this venture. My file system layout is simple; somewhere I have the directory src which is where I usually check out things from the respective master repositories. There I have src/nss and src/nspr; with a checkout from yesterday or the day before.

So, in src/nss, I do this, to build Address Sanitizer enabled library.

  make USE_64=1 USE_ASAN=1 nss_build_all \
    CC=clang-mp-4.0 CXX=clang++-mp-4.0 NSS_DISABLE_GTESTS=1

Troubleshooting the NSS Build

I needed the NSS_DISABLE_GTESTS=1 because otherwise it would try to build or link with g++ (I don’t know why); and this is the error message I got without it.

  g++ -arch x86_64 -o Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/gtest/src/gtest-all.o -c -g -g -fPIC  -fno-common -pipe -DDARWIN -DHAVE_STRERROR -DHAVE_BSD_FLOCK  -fsanitize=address  -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wall -Qunused-arguments -Wno-parentheses-equality -Werror -Wsign-compare -DXP_UNIX -DDEBUG -UNDEBUG -DDEBUG_johann -DNSS_NO_INIT_SUPPORT -DUSE_UTIL_DIRECTLY -DNO_NSPR_10_SUPPORT -DSSL_DISABLE_DEPRECATED_CIPHER_SUITE_NAMES -Igtest/include/ -Igtest -I../../../dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/include -I../../../dist/public/gtest -I../../../dist/private/gtest  -std=c++0x gtest/src/gtest-all.cc
  clang: error: unsupported argument 'address' to option 'fsanitize='

Note that it’s using g++ directly, which is my default Clang, and it’s too old to support -fsanitize=address. This may not be a problem on the latest OS X Sierra with the most recent Xcode.

Also, without using USE_ASAN=1 (i.e. using CFLAGS=-fsanitize=address etc.) I was unable to get it going; but note that according to bug 1182382 USE_ASAN will be removed in the near future. The comment that says so is dated 2017.02.02.

This is the error message I got when I tried without USE_ASAN.

    Undefined symbols for architecture x86_64:
    "___asan_init", referenced from:
        _asan.module_ctor in now.o
    "___asan_register_globals", referenced from:
        _asan.module_ctor in now.o
    "___asan_report_load8", referenced from:
        _main in now.o
    "___asan_unregister_globals", referenced from:
        _asan.module_dtor in now.o
    "___asan_version_mismatch_check_v8", referenced from:
        _asan.module_ctor in now.o
  ld: symbol(s) not found for architecture x86_64

Installing the New NSS Build

By default, a successful build of NSS is “installed” into a dist directory in the same place I have src. That is, my installation was in src/../dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ; quite an ugly and unsightly name, yes.

There we have the usual directories bin, lib, and include. However, the files in them are just symlinks to the real files. More importantly, the way the OS X dynamic linker works, the names in the object files are wrong for a typical Unix-centric installation.

So I created an nss.asan directory in the same place my src directory is, and copied everything to there. From src/nss,

  cd ../..

  mkdir nss.asan

  cd nss.asan

  cp -pRL ../src/dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/ .

to copy all the files and not use symbolic links.

And for some reason the NSS header files are not included, so add them in an extra step with,

  find ../src/nss -name '*.h' -exec cp '{}' include/ ';'

Then I fixed the object names, so that the dynamic linker finds everything.

  #!/bin/zsh
  
  for file in lib/lib*.dylib; do
      foo=`otool -D ${file}`;
      for l in ${(f)foo}; do
        if ( [[ $l =~ "^@executable_path/lib" ]] ) {
                install_name_tool -id \
                    ${l/@executable_path/@executable_path/../lib} ${file}
            }
      done
  done
  
  for file in lib/lib*.dylib bin/*; do
      foo=`otool -L ${file} | awk '{ print $1; }'`
      for l in ${(f)foo}; do
        if ( [[ $l =~ "^@executable_path" ]] )  {
                install_name_tool -change ${l} \
                ${l/@executable_path/@executable_path/../lib} \
                "${file}";
            }
      done;
  done

And now I can use this build in the normal Unix way with relative run paths.

As I write this, I realize I probably forgot to alter the names in the executable binaries; however, since we’re not using any of the NSS commands in this tutorial (that is, we don’t need the address sanitized versions of them) that doesn’t matter.

I leave it as an exercise for the reader to alter the script to fix the commands too — if needed.

Building and Running the Demonstration Program

At this point, I can build my demonstration program with this command,

  clang-mp-4.0 -o dblfree.asan \
    -fsanitize=address -O1 -fno-omit-frame-pointer -g \
    -Wall -Wextra \
    -I${ROOT_PATH}/nss.asan/include dblfree.c \
    -L${ROOT_PATH}/nss.asan/lib \
    -lnss3 -lssl3 -lnspr4 -lplc4 \
    -rpath ${ROOT_PATH}/nss.asan/lib

where ROOT_PATH is set to the location of my src and nss.asan directories.

And when I run it I get,

  % ./dblfree.asan
  Password is: foo
  =================================================================
  ==66093==ERROR: AddressSanitizer: attempting double-free on 0x602000005b30 in thread T0:
      #0 0x102abaf89 in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89)
      #1 0x103b73742 in PR_Free (libnspr4.dylib:x86_64+0x14742)
      #2 0x102a59be5 in main dblfree.c:95
      #3 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8)
  
  0x602000005b30 is located 0 bytes inside of 4-byte region [0x602000005b30,0x602000005b34)
  freed by thread T0 here:
      #0 0x102abaf89 in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89)
      #1 0x103b73742 in PR_Free (libnspr4.dylib:x86_64+0x14742)
      #2 0x103bef9e2 in PORT_Free_Util (libnssutil3.dylib:x86_64+0x179e2)
      #3 0x1038ad2ed in PK11_DoPassword (libnss3.dylib:x86_64+0x2e2ed)
      #4 0x1038ad557 in PK11_Authenticate (libnss3.dylib:x86_64+0x2e557)
      #5 0x1038aebfa in PK11_FindKeyByAnyCert (libnss3.dylib:x86_64+0x2fbfa)
      #6 0x102a59bd8 in main dblfree.c:89
      #7 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8)
  
  previously allocated by thread T0 here:
      #0 0x102abadaf in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58daf)
      #1 0x103b72d8d in PR_Malloc (libnspr4.dylib:x86_64+0x13d8d)
      #2 0x102a59b5a in main dblfree.c:63
      #3 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8)
  
  SUMMARY: AddressSanitizer: double-free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89) in wrap_free
  ==66093==ABORTING
  zsh: abort      ./dblfree.asan

And now we start to get a clear picture of what’s happening.

There is plainly a double free() here, and it’s caused by something down the line from PK11_FindKeyByAnyCert(). So now that we know what’s happening, why is it happening?

The NSPR functions are documented in their respective header files; the NSS functions are not. The online documentation mentions nothing about freeing any memory.

The next thing I did was to look at the functions in-order. They’re not very interesting, and you can skip to the conclusion below without losing the thread.

The code you are about to see is Mozilla Public License 2; if you don’t like that license, avert your eyes.

TL;DR: Skip the source code listings.


  SECKEYPrivateKey *
  PK11_FindKeyByAnyCert(CERTCertificate *cert, void *wincx)
  {
      CK_OBJECT_HANDLE certHandle;
      CK_OBJECT_HANDLE keyHandle;
      PK11SlotInfo *slot = NULL;
      SECKEYPrivateKey *privKey = NULL;
      PRBool needLogin;
      SECStatus rv;
      int err;
  
      certHandle = PK11_FindObjectForCert(cert, wincx, &slot);
      if (certHandle == CK_INVALID_HANDLE) {
           return NULL;
      }
      /*
       * prevent a login race condition. If slot is logged in between
       * our call to pk11_LoginStillRequired and the 
       * PK11_MatchItem. The matchItem call will either succeed, or
       * we will call it one more time after calling PK11_Authenticate 
       * (which is a noop on an authenticated token).
       */
      needLogin = pk11_LoginStillRequired(slot,wincx);
      keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY);
      if ((keyHandle == CK_INVALID_HANDLE) &&  needLogin &&
                          (SSL_ERROR_NO_CERTIFICATE == (err = PORT_GetError()) ||
                           SEC_ERROR_TOKEN_NOT_LOGGED_IN == err ) ) {
          /* authenticate and try again */
          rv = PK11_Authenticate(slot, PR_TRUE, wincx);
          if (rv == SECSuccess) {
              keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY);
          }
      }
      if (keyHandle != CK_INVALID_HANDLE) { 
          privKey =  PK11_MakePrivKey(slot, nullKey, PR_TRUE, keyHandle, wincx);
      }
      if (slot) {
          PK11_FreeSlot(slot);
      }
      return privKey;
  }

That didn’t tell me anything, so I looked at the next one.

  /*
   * make sure a slot is authenticated...
   * This function only does the authentication if it is needed.
   */
  SECStatus
  PK11_Authenticate(PK11SlotInfo *slot, PRBool loadCerts, void *wincx) {
      if (!slot) {
          return SECFailure;
      }
      if (pk11_LoginStillRequired(slot,wincx)) {
          return PK11_DoPassword(slot, slot->session, loadCerts, wincx,
                                  PR_FALSE, PR_FALSE);
      }
      return SECSuccess;
  }

This one is even less interesting, let’s look at the last one.

  SECStatus
  PK11_DoPassword(PK11SlotInfo *slot, CK_SESSION_HANDLE session,
                          PRBool loadCerts, void *wincx, PRBool alreadyLocked,
                          PRBool contextSpecific)
  {
      SECStatus rv = SECFailure;
      char * password;
      PRBool attempt = PR_FALSE;
  
      if (PK11_NeedUserInit(slot)) {
          PORT_SetError(SEC_ERROR_IO);
          return SECFailure;
      }
  
  
      /*
       * Central server type applications which control access to multiple
       * slave applications to single crypto devices need to virtuallize the
       * login state. This is done by a callback out of PK11_IsLoggedIn and
       * here. If we are actually logged in, then we got here because the
       * higher level code told us that the particular client application may
       * still need to be logged in. If that is the case, we simply tell the
       * server code that it should now verify the clients password and tell us
       * the results.
       */
      if (PK11_IsLoggedIn(slot,NULL) && 
                          (PK11_Global.verifyPass != NULL)) {
          if (!PK11_Global.verifyPass(slot,wincx)) {
              PORT_SetError(SEC_ERROR_BAD_PASSWORD);
              return SECFailure;
          }
          return SECSuccess;
      }
  
      /* get the password. This can drop out of the while loop
       * for the following reasons:
       *  (1) the user refused to enter a password. 
       *                  (return error to caller)
       *  (2) the token user password is disabled [usually due to
       *     too many failed authentication attempts].
       *                  (return error to caller)
       *  (3) the password was successful.
       */
      while ((password = pk11_GetPassword(slot, attempt, wincx)) != NULL) {
          /* if the token has a protectedAuthPath, the application may have
           * already issued the C_Login as part of it's pk11_GetPassword call.
           * In this case the application will tell us what the results were in 
           * the password value (retry or the authentication was successful) so
           * we can skip our own C_Login call (which would force the token to
           * try to login again).
           * 
           * Applications that don't know about protectedAuthPath will return a 
           * password, which we will ignore and trigger the token to 
           * 'authenticate' itself anyway. Hopefully the blinking display on 
           * the reader, or the flashing light under the thumbprint reader will 
           * attract the user's attention */
          attempt = PR_TRUE;
          if (slot->protectedAuthPath) {
              /* application tried to authenticate and failed. it wants to try
               * again, continue looping */
              if (strcmp(password, PK11_PW_RETRY) == 0) {
                  rv = SECWouldBlock;
                  PORT_Free(password);
                  continue;
              }
              /* applicaton tried to authenticate and succeeded we're done */
              if (strcmp(password, PK11_PW_AUTHENTICATED) == 0) {
                  rv = SECSuccess;
                  PORT_Free(password);
                  break;
              }
          }
          rv = pk11_CheckPassword(slot, session, password, 
                                  alreadyLocked, contextSpecific);
          PORT_Memset(password, 0, PORT_Strlen(password));
          PORT_Free(password);
          if (rv != SECWouldBlock) break;
      }
      if (rv == SECSuccess) {
          if (!PK11_IsFriendly(slot)) {
              nssTrustDomain_UpdateCachedTokenCerts(slot->nssToken->trustDomain,
                                                slot->nssToken);
          }
      } else if (!attempt) PORT_SetError(SEC_ERROR_BAD_PASSWORD);
      return rv;
  }

OK, so now we see that the authentication function expects to free the value returned by our callback. Let’s look at the documentation.

This callback function returns one of these values:

If successful, a pointer to the password. This memory must have
been allocated with PR_Malloc or PL_strdup.

So now we know. A thoughtlessness on my part, not to check the documentation before diving into the deep end of the pool.

We fix the problem with this altered password function.

  char *passwd_function( PK11SlotInfo __unused *info, PRBool retry, void *arg ) {
    if ( retry == PR_TRUE ) return 0;
  
    // Fixing the double free with strdup().
    return PL_strdup( arg );
  }

And everything is working correctly again. Kudos to the OS X C Library team to include a rudimentary memory debugger in the normal everyday C Library. Without it, I would probably not have noticed this nearly as fast.

Thanks for reading.

Postscript

This text is noteworthy for including how to get a working Address Sanitizer in OS X Yosemite, and to build NSS with it; that’s why I chose to publish it — and I hope it’ll help someone to do the same.

The effort I put into updating Clang and getting the Address Sanitizer to work paid off immediately because it allowed me to home in on the actual problem.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: