The Case of the Apparent NSS Memory Corruption
This is a story of my encounter with an apparent memory corruption issue in the Netscape Security Services library.
The source I’m discussing can be found on Github.
Usually, when I try to get acquainted with a new API, I start to write simple program, one API call by call, which I compile and run after each step.
Imagine my surprise, when after adding the following function call (the only thing I added)
PK11_FindKeyByAnyCert( certificate, passwd );
I got this memory corruption error.
dblfree(56630,0x7fff73f61300) malloc: *** error for object 0x7fd39250ce70: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug zsh: abort ./dblfree
The above error is taken from my minimal example of the problem, not the actual program I was working on at the time. The only difference is the name of the binary and the hex numbers.
So what is happening here? I didn’t know. And to find out, it’s really important to use the right tool for the job.
So the first thing I did was to instrument my code with the built-in OS X tools, instruments(1). That didn’t tell me much; either because it doesn’t help in this particular instance, or that I just don’t know how to use it.
I will make a note that some people suggested Valgrind. I didn’t go that way because the problem seems to be adequately described with the Clang Address Sanitizer.
The Clang Address Sanitizer
Zeroth thing zeroth, before I could begin to build NSS with the Address Sanitizer enabled, I had to have a compiler for which it worked. I started by installing the latest Clang in Macports.
sudo port install clang-4.0
However, when I tried to use it to build my ready-to-compile hello world with the Address Sanitizer, I got this error when I tried to execute ./hello.
dyld: Library not loaded: /opt/local/libexec/llvm-4.0/lib/libclang_rt.asan_osx_dynamic.dylib Referenced from: ./hello Reason: image not found
So the sanitizer is expecting a runtime in this directory, /opt/local/libexec/llvm-4.0/lib/
but it’s not there. It turns out it’s installed in a subdirectory, quite below it. The solution I used was to symlink the relevant files into this directory.
cd /opt/local/libexec/llvm-4.0/lib/ sudo sh -c 'for f in clang/4.0.0/lib/darwin/*.dylib; do ln -s $f; done'
Now my hello world would both compile and run.
Building NSS with the Address Sanitizer
That turned out to be non-trivial to build NSS with the Address Sanitizer; and took some trial, error and Duck Duck Go on my part to get it right — No, I don’t normally use Google.
I already had a recent checkout of NSS and NSPR, so I didn’t update it before I started on this venture. My file system layout is simple; somewhere I have the directory src which is where I usually check out things from the respective master repositories. There I have src/nss and src/nspr; with a checkout from yesterday or the day before.
So, in src/nss, I do this, to build Address Sanitizer enabled library.
make USE_64=1 USE_ASAN=1 nss_build_all \ CC=clang-mp-4.0 CXX=clang++-mp-4.0 NSS_DISABLE_GTESTS=1
Troubleshooting the NSS Build
I needed the NSS_DISABLE_GTESTS=1
because otherwise it would try to build or link with g++
(I don’t know why); and this is the error message I got without it.
g++ -arch x86_64 -o Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/gtest/src/gtest-all.o -c -g -g -fPIC -fno-common -pipe -DDARWIN -DHAVE_STRERROR -DHAVE_BSD_FLOCK -fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wall -Qunused-arguments -Wno-parentheses-equality -Werror -Wsign-compare -DXP_UNIX -DDEBUG -UNDEBUG -DDEBUG_johann -DNSS_NO_INIT_SUPPORT -DUSE_UTIL_DIRECTLY -DNO_NSPR_10_SUPPORT -DSSL_DISABLE_DEPRECATED_CIPHER_SUITE_NAMES -Igtest/include/ -Igtest -I../../../dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/include -I../../../dist/public/gtest -I../../../dist/private/gtest -std=c++0x gtest/src/gtest-all.cc clang: error: unsupported argument 'address' to option 'fsanitize='
Note that it’s using g++
directly, which is my default Clang, and it’s too old to support -fsanitize=address
. This may not be a problem on the latest OS X Sierra with the most recent Xcode.
Also, without using USE_ASAN=1
(i.e. using CFLAGS=-fsanitize=address
etc.) I was unable to get it going; but note that according to bug 1182382 USE_ASAN
will be removed in the near future. The comment that says so is dated 2017.02.02.
This is the error message I got when I tried without USE_ASAN
.
Undefined symbols for architecture x86_64: "___asan_init", referenced from: _asan.module_ctor in now.o "___asan_register_globals", referenced from: _asan.module_ctor in now.o "___asan_report_load8", referenced from: _main in now.o "___asan_unregister_globals", referenced from: _asan.module_dtor in now.o "___asan_version_mismatch_check_v8", referenced from: _asan.module_ctor in now.o ld: symbol(s) not found for architecture x86_64
Installing the New NSS Build
By default, a successful build of NSS is “installed” into a dist directory in the same place I have src. That is, my installation was in src/../dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ
; quite an ugly and unsightly name, yes.
There we have the usual directories bin
, lib
, and include
. However, the files in them are just symlinks to the real files. More importantly, the way the OS X dynamic linker works, the names in the object files are wrong for a typical Unix-centric installation.
So I created an nss.asan
directory in the same place my src
directory is, and copied everything to there. From src/nss
,
cd ../.. mkdir nss.asan cd nss.asan cp -pRL ../src/dist/Darwin14.5.0_clang-mp-4.0_64_ASAN_DBG.OBJ/ .
to copy all the files and not use symbolic links.
And for some reason the NSS header files are not included, so add them in an extra step with,
find ../src/nss -name '*.h' -exec cp '{}' include/ ';'
Then I fixed the object names, so that the dynamic linker finds everything.
#!/bin/zsh for file in lib/lib*.dylib; do foo=`otool -D ${file}`; for l in ${(f)foo}; do if ( [[ $l =~ "^@executable_path/lib" ]] ) { install_name_tool -id \ ${l/@executable_path/@executable_path/../lib} ${file} } done done for file in lib/lib*.dylib bin/*; do foo=`otool -L ${file} | awk '{ print $1; }'` for l in ${(f)foo}; do if ( [[ $l =~ "^@executable_path" ]] ) { install_name_tool -change ${l} \ ${l/@executable_path/@executable_path/../lib} \ "${file}"; } done; done
And now I can use this build in the normal Unix way with relative run paths.
As I write this, I realize I probably forgot to alter the names in the executable binaries; however, since we’re not using any of the NSS commands in this tutorial (that is, we don’t need the address sanitized versions of them) that doesn’t matter.
I leave it as an exercise for the reader to alter the script to fix the commands too — if needed.
Building and Running the Demonstration Program
At this point, I can build my demonstration program with this command,
clang-mp-4.0 -o dblfree.asan \ -fsanitize=address -O1 -fno-omit-frame-pointer -g \ -Wall -Wextra \ -I${ROOT_PATH}/nss.asan/include dblfree.c \ -L${ROOT_PATH}/nss.asan/lib \ -lnss3 -lssl3 -lnspr4 -lplc4 \ -rpath ${ROOT_PATH}/nss.asan/lib
where ROOT_PATH
is set to the location of my src
and nss.asan
directories.
And when I run it I get,
% ./dblfree.asan Password is: foo ================================================================= ==66093==ERROR: AddressSanitizer: attempting double-free on 0x602000005b30 in thread T0: #0 0x102abaf89 in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89) #1 0x103b73742 in PR_Free (libnspr4.dylib:x86_64+0x14742) #2 0x102a59be5 in main dblfree.c:95 #3 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8) 0x602000005b30 is located 0 bytes inside of 4-byte region [0x602000005b30,0x602000005b34) freed by thread T0 here: #0 0x102abaf89 in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89) #1 0x103b73742 in PR_Free (libnspr4.dylib:x86_64+0x14742) #2 0x103bef9e2 in PORT_Free_Util (libnssutil3.dylib:x86_64+0x179e2) #3 0x1038ad2ed in PK11_DoPassword (libnss3.dylib:x86_64+0x2e2ed) #4 0x1038ad557 in PK11_Authenticate (libnss3.dylib:x86_64+0x2e557) #5 0x1038aebfa in PK11_FindKeyByAnyCert (libnss3.dylib:x86_64+0x2fbfa) #6 0x102a59bd8 in main dblfree.c:89 #7 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8) previously allocated by thread T0 here: #0 0x102abadaf in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58daf) #1 0x103b72d8d in PR_Malloc (libnspr4.dylib:x86_64+0x13d8d) #2 0x102a59b5a in main dblfree.c:63 #3 0x7fff8a90d5c8 in start (libdyld.dylib:x86_64+0x35c8) SUMMARY: AddressSanitizer: double-free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x58f89) in wrap_free ==66093==ABORTING zsh: abort ./dblfree.asan
And now we start to get a clear picture of what’s happening.
There is plainly a double free()
here, and it’s caused by something down the line from PK11_FindKeyByAnyCert()
. So now that we know what’s happening, why is it happening?
The NSPR functions are documented in their respective header files; the NSS functions are not. The online documentation mentions nothing about freeing any memory.
The next thing I did was to look at the functions in-order. They’re not very interesting, and you can skip to the conclusion below without losing the thread.
The code you are about to see is Mozilla Public License 2; if you don’t like that license, avert your eyes.
TL;DR: Skip the source code listings.
SECKEYPrivateKey * PK11_FindKeyByAnyCert(CERTCertificate *cert, void *wincx) { CK_OBJECT_HANDLE certHandle; CK_OBJECT_HANDLE keyHandle; PK11SlotInfo *slot = NULL; SECKEYPrivateKey *privKey = NULL; PRBool needLogin; SECStatus rv; int err; certHandle = PK11_FindObjectForCert(cert, wincx, &slot); if (certHandle == CK_INVALID_HANDLE) { return NULL; } /* * prevent a login race condition. If slot is logged in between * our call to pk11_LoginStillRequired and the * PK11_MatchItem. The matchItem call will either succeed, or * we will call it one more time after calling PK11_Authenticate * (which is a noop on an authenticated token). */ needLogin = pk11_LoginStillRequired(slot,wincx); keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY); if ((keyHandle == CK_INVALID_HANDLE) && needLogin && (SSL_ERROR_NO_CERTIFICATE == (err = PORT_GetError()) || SEC_ERROR_TOKEN_NOT_LOGGED_IN == err ) ) { /* authenticate and try again */ rv = PK11_Authenticate(slot, PR_TRUE, wincx); if (rv == SECSuccess) { keyHandle = PK11_MatchItem(slot,certHandle,CKO_PRIVATE_KEY); } } if (keyHandle != CK_INVALID_HANDLE) { privKey = PK11_MakePrivKey(slot, nullKey, PR_TRUE, keyHandle, wincx); } if (slot) { PK11_FreeSlot(slot); } return privKey; }
That didn’t tell me anything, so I looked at the next one.
/* * make sure a slot is authenticated... * This function only does the authentication if it is needed. */ SECStatus PK11_Authenticate(PK11SlotInfo *slot, PRBool loadCerts, void *wincx) { if (!slot) { return SECFailure; } if (pk11_LoginStillRequired(slot,wincx)) { return PK11_DoPassword(slot, slot->session, loadCerts, wincx, PR_FALSE, PR_FALSE); } return SECSuccess; }
This one is even less interesting, let’s look at the last one.
SECStatus PK11_DoPassword(PK11SlotInfo *slot, CK_SESSION_HANDLE session, PRBool loadCerts, void *wincx, PRBool alreadyLocked, PRBool contextSpecific) { SECStatus rv = SECFailure; char * password; PRBool attempt = PR_FALSE; if (PK11_NeedUserInit(slot)) { PORT_SetError(SEC_ERROR_IO); return SECFailure; } /* * Central server type applications which control access to multiple * slave applications to single crypto devices need to virtuallize the * login state. This is done by a callback out of PK11_IsLoggedIn and * here. If we are actually logged in, then we got here because the * higher level code told us that the particular client application may * still need to be logged in. If that is the case, we simply tell the * server code that it should now verify the clients password and tell us * the results. */ if (PK11_IsLoggedIn(slot,NULL) && (PK11_Global.verifyPass != NULL)) { if (!PK11_Global.verifyPass(slot,wincx)) { PORT_SetError(SEC_ERROR_BAD_PASSWORD); return SECFailure; } return SECSuccess; } /* get the password. This can drop out of the while loop * for the following reasons: * (1) the user refused to enter a password. * (return error to caller) * (2) the token user password is disabled [usually due to * too many failed authentication attempts]. * (return error to caller) * (3) the password was successful. */ while ((password = pk11_GetPassword(slot, attempt, wincx)) != NULL) { /* if the token has a protectedAuthPath, the application may have * already issued the C_Login as part of it's pk11_GetPassword call. * In this case the application will tell us what the results were in * the password value (retry or the authentication was successful) so * we can skip our own C_Login call (which would force the token to * try to login again). * * Applications that don't know about protectedAuthPath will return a * password, which we will ignore and trigger the token to * 'authenticate' itself anyway. Hopefully the blinking display on * the reader, or the flashing light under the thumbprint reader will * attract the user's attention */ attempt = PR_TRUE; if (slot->protectedAuthPath) { /* application tried to authenticate and failed. it wants to try * again, continue looping */ if (strcmp(password, PK11_PW_RETRY) == 0) { rv = SECWouldBlock; PORT_Free(password); continue; } /* applicaton tried to authenticate and succeeded we're done */ if (strcmp(password, PK11_PW_AUTHENTICATED) == 0) { rv = SECSuccess; PORT_Free(password); break; } } rv = pk11_CheckPassword(slot, session, password, alreadyLocked, contextSpecific); PORT_Memset(password, 0, PORT_Strlen(password)); PORT_Free(password); if (rv != SECWouldBlock) break; } if (rv == SECSuccess) { if (!PK11_IsFriendly(slot)) { nssTrustDomain_UpdateCachedTokenCerts(slot->nssToken->trustDomain, slot->nssToken); } } else if (!attempt) PORT_SetError(SEC_ERROR_BAD_PASSWORD); return rv; }
OK, so now we see that the authentication function expects to free the value returned by our callback. Let’s look at the documentation.
This callback function returns one of these values:
If successful, a pointer to the password. This memory must have
been allocated with PR_Malloc or PL_strdup.
So now we know. A thoughtlessness on my part, not to check the documentation before diving into the deep end of the pool.
We fix the problem with this altered password function.
char *passwd_function( PK11SlotInfo __unused *info, PRBool retry, void *arg ) { if ( retry == PR_TRUE ) return 0; // Fixing the double free with strdup(). return PL_strdup( arg ); }
And everything is working correctly again. Kudos to the OS X C Library team to include a rudimentary memory debugger in the normal everyday C Library. Without it, I would probably not have noticed this nearly as fast.
Thanks for reading.
Postscript
This text is noteworthy for including how to get a working Address Sanitizer in OS X Yosemite, and to build NSS with it; that’s why I chose to publish it — and I hope it’ll help someone to do the same.
The effort I put into updating Clang and getting the Address Sanitizer to work paid off immediately because it allowed me to home in on the actual problem.