Closed Bug 478336 Opened 15 years ago Closed 15 years ago

rt->state assertion failure in js_DestroyContext creating/destroying many contexts

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: paul.barnetta, Assigned: igor)

References

Details

(Keywords: fixed1.9.0.11, fixed1.9.1, Whiteboard: fixed-in-tracemonkey)

Attachments

(4 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2009010509 Gentoo Firefox/3.0.5
Build Identifier: Current tip

I have a multi-threaded application that periodically crashes, giving the following assertion error:

$ ./a.out 
Assertion failure: rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING, at jscntxt.cpp:465

I've attached a test program which demonstrates this (see below). The program spawns many threads, each of which create and then destroy a context before exiting. I'd expect the number of contexts active at any time to range between [0..THREADS], possibly transitioning between 0 and non-zero values many times and showing a race condition in the code?

Reproducible: Always

Steps to Reproduce:
Below is a simple application that exhibits the problem 90+% of the time (for me) when run directly from the command line:

8<----

#include <stdlib.h>
#include <pthread.h>

#include "jsapi.h"

static JSRuntime *rt;

#define THREADS 100

static void * testfunc(void *ignored) {

    JSContext *cx = JS_NewContext(rt, 0x1000);
    if (cx) {
        JS_BeginRequest(cx);
        JS_DestroyContext(cx);
    }

    return NULL;
}

int main(void) {

    rt = JS_NewRuntime(0x100000);
    if (rt == NULL)
        return 1;

    /* Uncommenting this to guarantee there's always at least
     * one context in the runtime prevents this problem. */
//  JSContext *cx = JS_NewContext(rt, 0x1000);

    int i;
    pthread_t thread[THREADS];
    for (i = 0; i < THREADS; i++) {
        pthread_create(&thread[i], NULL, testfunc, NULL);
    }

    for (i = 0; i < THREADS; i++) {
        pthread_join(thread[i], NULL);
    }

    return 0;
}

8<----

It seems to be very sensitive to timings as I have trouble reproducing the issue in gdb. For me to trigger it there I just need create/destroy more contexts per thread, but YMMV.

8<----

static void * testfunc(void *ignored) {

    int i = 0;
    for (i = 0; i < 100; ++i) {
        JSContext *cx = JS_NewContext(rt, 0x1000);
        if (cx) {
            JS_BeginRequest(cx);
            JS_DestroyContext(cx);
        }
    }

    return NULL;
}

8<----
Actual Results:  
(gdb) run
Starting program: a.out 
[Thread debugging using libthread_db enabled]
[New Thread 0xb7aed6d0 (LWP 15658)]
[New Thread 0xb7aecb90 (LWP 15661)]
[New Thread 0xb72ebb90 (LWP 15662)]
[New Thread 0xb6aeab90 (LWP 15663)]
[New Thread 0xb62e9b90 (LWP 15664)]
[New Thread 0xb58ffb90 (LWP 15665)]
[New Thread 0xb50feb90 (LWP 15666)]
Assertion failure: rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING, at jscntxt.cpp:465

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0xb6aeab90 (LWP 15663)]
JS_Assert (s=0xb7f3ed18 "rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING", 
    file=0xb7f3e971 "jscntxt.cpp", ln=465) at jsutil.cpp:63
63          abort();
Current language:  auto; currently c++
(gdb) bt
#0  JS_Assert (s=0xb7f3ed18 "rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING", 
    file=0xb7f3e971 "jscntxt.cpp", ln=465) at jsutil.cpp:63
#1  0xb7dda629 in js_DestroyContext (cx=0xb5918a50, mode=JSDCM_FORCE_GC)
    at jscntxt.cpp:465
#2  0xb7dc11ff in JS_DestroyContext (cx=0xb5918a50) at jsapi.cpp:1082
#3  0x08048653 in testfunc (ignored=0x0) at test.c:17
#4  0xb7c48192 in start_thread () from /lib/libpthread.so.0
#5  0xb7d1e20e in clone () from /lib/libc.so.6
(gdb) up
#1  0xb7dda629 in js_DestroyContext (cx=0xb5918a50, mode=JSDCM_FORCE_GC)
    at jscntxt.cpp:465
465         JS_ASSERT(rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING);
(gdb) p rt->state
$1 = JSRTS_LANDING


Expected Results:  
The program shouldn't crash :)

I can work around the issue by always ensuring one context exists (stopping the number of contexts/runtime going zero, non-zero, zero, non-zero etc); but presumably I shouldn't *have* to do this?
Oh, this also happens regardless of whether or not I call JS_BeginRequest before the JS_DestroyContext (I know there's been a lot of discussion around that recently).
For completeness I've also detailed some findings in the mailing list; see http://groups.google.com/group/mozilla.dev.tech.js-engine/browse_thread/thread/0be80cf2fb9ce305

In particular the test program runs without any problem in SpiderMonkey 1.7.0 but always encounters SEGVs when running against TraceMonkey rev < 24662 (whether or not the number of contexts is allowed to dip to zero).

Beginning with rev 24662 (a patch for bug 477021) the program no longer crashes if the number of contexts remains > 0; and now asserts failures and aborts (instead of silently crashing with a SEGV) if the number of contexts is allowed to dip to zero.
Following the announcement of SpiderMonkey 1.8 Release Candidate 1 I thought I would re-test the test case above.

Running with many threads (> 2 on my dual-core laptop) I now get the following assertion regularly:

Assertion failure: JS_PROPERTY_CACHE(cx).disabled >= 0, at jsinterp.c:509

but no longer see the rt->state assertion in the title. The assertions still trigger even if I have at least one context always present (unlike before); but don't trigger if I run with THREADS set to 1.
In hg tip, the test program now gives me this:

#0  JS_Assert (s=0x3d3fb8 "op == JS_DHASH_LOOKUP || RECURSION_LEVEL(table) == 0", file=0x3d3e32 "../jsdhash.cpp", ln=602) at ../jsutil.cpp:68
#1  0x002740ae in JS_DHashTableOperate (table=0x34be4, key=0xb1df4e90, op=JS_DHASH_ADD) at ../jsdhash.cpp:602
#2  0x00264e6e in js_AtomizeString (cx=0x5131c0, str=0xb1df4e90, flags=9) at ../jsatom.cpp:677
#3  0x00265304 in js_Atomize (cx=0x5131c0, bytes=0x3d2b9b "TypeError", length=9, flags=1) at ../jsatom.cpp:783
#4  0x002653b5 in js_InitCommonAtoms (cx=0x5131c0) at ../jsatom.cpp:488
#5  0x00267184 in js_NewContext (rt=0x24000, stackChunkSize=4096) at ../jscntxt.cpp:334
#6  0x002506c0 in JS_NewContext (rt=0x24000, stackChunkSize=4096) at ../jsapi.cpp:1083
#7  0x00001e93 in testfunc (ignored=0x0) at /Users/jason/dev/moz/spidermonkey-1.8/testapp.cpp:13
#8  0x9169b6f5 in _pthread_start ()
#9  0x9169b5b2 in thread_start ()

with all 99 other threads in jscntxt.cpp, not jsdhash.  I can also get these crashes to happen:

0x0029b1ca in js_TraceContext (trc=0xb0e36e84, acx=0x75ff) at ../jsgc.cpp:3016
3016	        FREE_OLD_ARENAS(acx->stackPool);
(gdb) bt
#0  0x0029b1ca in js_TraceContext (trc=0xb0e36e84, acx=0x75ff) at ../jsgc.cpp:3016
#1  0x0029ba54 in js_TraceRuntime (trc=0xb0e36e84, allAtoms=0) at ../jsgc.cpp:3153
#2  0x0029c259 in js_GC (cx=0x50f070, gckind=GC_NORMAL) at ../jsgc.cpp:3562
#3  0x00266e77 in js_DestroyContext (cx=0x50f070, mode=JSDCM_FORCE_GC) at ../jscntxt.cpp:541
#4  0x002506db in JS_DestroyContext (cx=0x50f070) at ../jsapi.cpp:1089
#5  0x00001eb2 in testfunc (ignored=0x0) at /Users/jason/dev/moz/spidermonkey-1.8/testapp.cpp:16
#6  0x9169b6f5 in _pthread_start ()
#7  0x9169b5b2 in thread_start ()

#0  JS_Assert (s=0x3d6e44 "rt->gcMarkingTracer == trc", file=0x3d60dc "../jsgc.cpp", ln=2682) at ../jsutil.cpp:68
#1  0x00299e26 in JS_CallTracer (trc=0xb0bace84, thing=0x39088, kind=2) at ../jsgc.cpp:2682
#2  0x00264aca in js_pinned_atom_tracer (table=0x34be4, hdr=0x80fe00, number=0, arg=0xb0bace84) at ../jsatom.cpp:551
#3  0x00274548 in JS_DHashTableEnumerate (table=0x34be4, etor=0x264a12 <js_pinned_atom_tracer>, arg=0xb0bace84) at ../jsdhash.cpp:742
#4  0x00264b52 in js_TraceAtomState (trc=0xb0bace84, allAtoms=0) at ../jsatom.cpp:566
#5  0x0029ba23 in js_TraceRuntime (trc=0xb0bace84, allAtoms=0) at ../jsgc.cpp:3147
#6  0x0029c259 in js_GC (cx=0x50e6c0, gckind=GC_NORMAL) at ../jsgc.cpp:3562
#7  0x00266e77 in js_DestroyContext (cx=0x50e6c0, mode=JSDCM_FORCE_GC) at ../jscntxt.cpp:541
#8  0x002506db in JS_DestroyContext (cx=0x50e6c0) at ../jsapi.cpp:1089
#9  0x00001eb2 in testfunc (ignored=0x0) at /Users/jason/dev/moz/spidermonkey-1.8/testapp.cpp:16
#10 0x9169b6f5 in _pthread_start ()
#11 0x9169b5b2 in thread_start ()
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: general → igor
At least one problem that I can see from the code is that js_GC does the check:

if (rt->state != JSRTS_UP && gckind != GC_LAST_CONTEXT)
    return;

outside the GC lock. Now suppose there are 3 threads, A, B, C. Threads A and B calls js_DestroyContext and thread C calls js_NewContext. 

First thread A removes its context from the runtime list. That context is not the last one so thread does not touch rt->state and eventually calls js_GC. The latter skips the above check and tries to to take the GC lock.

Before this moment the thread B takes the lock, removes its context from the runtime list, discovers that it is the last, sets rt->state to LANDING, runs the-last-context-cleanup, runs the GC and then sets rt->state to DOWN.

At this stage the thread A gets the GC lock, setup itself as the thread that runs the GC and releases the GC lock to proceed with the GC when rt->state is DOWN.

Now the thread C enters the picture. It discovers under the GC lock in js_NewContext that the newly allocated context is the first one. Since rt->state is DOWN, it releases the GC lock and starts the first context initialization procedure. That procedure includes the allocation of the initial atoms and it will happen when the thread A runs the GC. This may lead precisely to the first stack trace from the comment 4.
With the test program on 64-bit Linux I could not reproduce the bug from the comment 4 but I do see assert from the comment 0 after bumping the number of threads to 1000. The assert is indeed rare, about 2-3% of all runs and I could not reproduce it under GDB. On the other hand, good old printfs have shown what was going on. The problem comes from the following code in js_NewContext:

   JS_LOCK_GC(rt);
    for (;;) {
        first = (rt->contextList.next == &rt->contextList);
        if (rt->state == JSRTS_UP) {
            JS_ASSERT(!first);

            /* Ensure that it is safe to update rt->contextList below. */
            js_WaitForGC(rt);
            break;
        }
...
        JS_WAIT_CONDVAR(rt->stateChange, JS_NO_TIMEOUT);
    }
    JS_APPEND_LINK(&cx->link, &rt->contextList);
    JS_UNLOCK_GC(rt);

Here, when rt->state is up, the code waits for a possible GC on another thread after setting the "first" flag. But this is wrong as during the wait not only that GC may finish, but also yet another thread may execute js_DestroyContext for the last context. Then, when the thread finally wakes up, it would leave the lock with the "first" flag set to false while having the context as the first added to the runtime list.

When this thread later enters js_DestroyContext, it can see both LANDING and DOWN as rt->state depending on its race with that third thread for the previous last context.
Attached patch v1Splinter Review
The fix is to move js_WaitForGC before the code reads the state to set the "first" flag. It should should fix both scenarios from comment 5 and comment 6. 

The patch does not address that issue when the GC for the last context may delegate its job to the GC from another thread, but that at worst may lead to a leak and can be addressed in another bug.

To Jason: could you test if the patch indeed fixes the asserts and segfaults you have observed?
Attachment #366564 - Flags: review?(brendan)
Silly mistake--though I was building against JS_THREADSAFE sources, I was running the test program against a non-threadsafe library.  So the stacks in comment 4 are bogus.

Using the threadsafe library now, I can reproduce the assertions Igor sees (in hg tip, under gdb, pretty reliably).  I just run 4 threads, each looping 1000 times.

#0  JS_Assert (s=0x3e6edc "rt->state == JSRTS_UP || rt->state == JSRTS_LAUNCHING", file=0x3e6db8 "../jscntxt.cpp", ln=464) at ../jsutil.cpp:68
#1  0x002758dc in js_DestroyContext (cx=0x821a00, mode=JSDCM_FORCE_GC) at ../jscntxt.cpp:464
#2  0x0025bc49 in JS_DestroyContext (cx=0x821a00) at ../jsapi.cpp:1089
#3  0x00001ec6 in testfunc (ignored=0x0) at /Users/jason/dev/moz/spidermonkey-1.8/testapp.cpp:16
#4  0x9169b6f5 in _pthread_start ()
#5  0x9169b5b2 in thread_start ()

With hg tip + the patch, no crashes.
Attached patch v1 for SM1.8 (obsolete) — Splinter Review
Applying this patch to SpiderMonkey 1.8 RC 1 fixes the assertions for me.  If v1 gets r+ I'll request that this be landed in the branch.
Like Jason I was unable to reproduce the problem with Igor's patch against the current tip; so thanks for that.

However, after applying the SpiderMonkey 1.8 RC 1 patch I got SEGVs almost every time I ran my test case above; doing one JS_NewContext/JS_DestroyContext per thread:

$ while true; do ./a.out && echo -n . || echo -n X; done 2>/dev/null
.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXX.XXXX.X.XXXXX.XXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX.XX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXX.XXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXX.XXXXXXX.XX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XXX.XXXXXXXXXXXXXXXXXXXXXX.
XXXXXXX..XXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXX..XXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XX.XXX.XXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXX.XXXXXXXXXXXXXXXXXXXX.XXXXXXX.XXXXXXXXXXXXXXX
XX.XXXXXXXX.^C

each X being a SEGV. To trigger the program under GDB I needed to do the extra loops for thread again; then I was able to get the following:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb5c74b90 (LWP 13031)]
0xb7eda74d in js_TraceContext (trc=0xb5c742d0, acx=0x8084730) at jsgc.c:2768
2768	        if (a == acx->stackPool.first.next &&
(gdb) bt
#0  0xb7eda74d in js_TraceContext (trc=0xb5c742d0, acx=0x8084730) at jsgc.c:2768
#1  0xb7edad56 in js_TraceRuntime (trc=0xb5c742d0, allAtoms=0) at jsgc.c:2875
#2  0xb7edb7c1 in js_GC (cx=0x804d000, gckind=GC_NORMAL) at jsgc.c:3235
#3  0xb7eaaed6 in js_DestroyContext (cx=0x804d000, mode=JSDCM_FORCE_GC) at jscntxt.c:443
#4  0xb7e958e8 in JS_DestroyContext (cx=0x804d000) at jsapi.c:1035
#5  0x08048673 in testfunc (ignored=0x0) at test.c:17
#6  0xb7ce61b2 in start_thread () from /lib/libpthread.so.0
#7  0xb7dc2a5e in clone () from /lib/libc.so.6
(gdb) p acx
$1 = (JSContext *) 0x8084730
(gdb) p acx->stackPool
$2 = {first = {next = 0x0, base = 134760308, limit = 134760308, avail = 134760308}, current = 0x8084764, arenasize = 4096, mask = 3, 
  quotap = 0x808475c}
(gdb) list
2763	        /*
2764	         * Release stackPool here, if it has been in existence for longer than
2765	         * the limit specified by gcStackPoolLifespan.
2766	         */
2767	        a = acx->stackPool.current;
2768	        if (a == acx->stackPool.first.next &&
2769	            a->avail == a->base + sizeof(int64)) {
2770	            age = JS_Now() - *(int64 *) a->base;
2771	            if (age > (int64) acx->runtime->gcStackPoolLifespan * 1000)
2772	                JS_FinishArenaPool(&acx->stackPool);
(gdb) p a
$3 = (JSArena *) 0x0
(gdb)

so deferencing a->avail triggered the SEGV. Any thoughts/comments?

Is there a trail of other patches that need to be included (I presume the fix for my other bug 476934 isn't in 1.8 or js_WaitForGC wouldn't only now be included in this patch for instance)? or should I just wait for 1.8.1?
Paul, sorry if this is all wrong, but did you clobber by hand? Until the new autoconf build system unified Makefile.ref and Makefile.in, there was no automatic dependency inference for Makefile.ref. Again, apologies if I'm barking up the wrong tree.

I know of no way to cause the crash you're seeing, if you have a consistent build. Someone will have to reproduce and debug, or you could try remote debugger buddyign. Stop by #jsapi on irc.mozilla.org and call for help if you can get things in gdb.

/be
Attachment #366564 - Flags: review?(brendan) → review+
Comment on attachment 366564 [details] [diff] [review]
v1

All this goes back a ways:

revision 3.65
date: 2002/04/02 04:23:12;  author: brendan%mozilla.org;  state: Exp;  lines: +14 -7
branches:  3.65.2;
Fix next-to-last vs. last context GC race, plus ClaimScope vs. js_DestroyContext race; removed js_ForceGC from the FRIEND JS API (133773, sr=jband&shaver, a=asa).

Thanks for finally fixing it.

/be
Flags: wanted1.9.1?
Brendan, I can reproduce the problem with the following commands:

$ mkdir /tmp/bug478336
$ cd $_
$ wget http://ftp.mozilla.org/pub/mozilla.org/js/js-1.8.0-rc1.tar.gz
$ gunzip -c js-1.8.0-rc1.tar.gz | tar xf -
$ cd js/src
$ wget -O- https://bug478336.bugzilla.mozilla.org/attachment.cgi?id=366581 | patch -p0
$ LD=gcc XCFLAGS="$(nspr-config --cflags)" XLDFLAGS="$(nspr-config --libs)" JS_THREADSAFE=1 make -f Makefile.ref
$ cat > test.c <<!
>      (test app)
> !
$ gcc -g -DXP_UNIX -DJS_THREADSAFE -I. -I Linux_All_DBG.OBJ/ test.c -L Linux_All_DBG.OBJ/ -ljs
$ export LD_LIBRARY_PATH=Linux_All_DBG.OBJ/
$ ./a.out

ie, using a fresh copy of the RC1 distribution and patch file. The output of running the commands above (followed by gdb) is attached. I'll try and get on #jsapi shortly unless someone can see something immediately wrong with the above..
Flags: wanted1.9.1? → wanted1.9.1+
Updated version of the script to include downloading, compiling and building against a fresh version of NSPR, as per Brendan's request on IRC.

Works under Linux; YMMV -- but should be easy to tweak if needed.

I still get a SEGV from the last line..

./doit.sh: line 78:  5264 Segmentation fault      LD_LIBRARY_PATH=. ../a.out
Attachment #366737 - Attachment is obsolete: true
I should probably also mention that the script and built application work fine for me with 1.7.0.

Tested with the minor modifications shown below:

$ diff doit.sh.original doit.sh
23,24c23,24
< wget http://ftp.mozilla.org/pub/mozilla.org/js/js-1.8.0-rc1.tar.gz
< gunzip -c js-1.8.0-rc1.tar.gz | tar xf -
---
> wget http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz
> gunzip -c js-1.7.0.tar.gz | tar xf -
26d25
< wget -O- https://bug478336.bugzilla.mozilla.org/attachment.cgi?id=366581 | patch -p0

doit.sh.original (1.8.0 RC 1 + patch) causes SEGVs; the modified doit.sh (1.7.0) is fine.
Whiteboard: fixed-in-tracemonkey
(In reply to comment #9)
> Created an attachment (id=366581) [details]
> v1 for SM1.8

This does not include the full backport of the patch from the bug 477021.
Depends on: 477021
I suspect that to fully fix the issue for JS18 release it would be necessary to backport to the 1.9.0 branch not only this bug, but also bug 476934 and bug 477021.
Paul, please test this stack of patches:

  first apply attachment 366871 [details] [diff] [review] (for bug 476934)
   then apply attachment 366884 [details] [diff] [review] (for bug 467441)
   then apply attachment 366905 [details] [diff] [review] (for bug 477021)
   then apply attachment 366912 [details] [diff] [review] (for this bug)
http://hg.mozilla.org/mozilla-central/rev/e0a0a14fd1d3
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Hi Jason, these patches apply cleanly and fix both my bugs for me :) Many thanks for gathering them together for the 1.8 release (and to Igor for the initial fixes).
Flags: in-testsuite-
Comment on attachment 366912 [details] [diff] [review]
backport to 1.9.0 (for SpiderMonkey 1.8 source release) v2

Patch 4/4 to fix a crashing bug for embedders that blocks the SpiderMonkey 1.8 source release.  See bug 478336 comment 21 for the full list.
Attachment #366912 - Flags: approval1.9.0.10?
Comment on attachment 366912 [details] [diff] [review]
backport to 1.9.0 (for SpiderMonkey 1.8 source release) v2

Approved for 1.9.0.10, a=dveditz for release-drivers
Attachment #366912 - Flags: approval1.9.0.10? → approval1.9.0.10+
landed to 1.9.0:

Checking in jscntxt.c;
/cvsroot/mozilla/js/src/jscntxt.c,v  <--  jscntxt.c
new revision: 3.140; previous revision: 3.139
done
Keywords: fixed1.9.0.10
Backed out from 1.9.0 - the landed patch depends on the patch from bug 476934 that has caused compilation error on Windows.
Keywords: fixed1.9.0.10
landed to 1.9.0:

Checking in jscntxt.c;
/cvsroot/mozilla/js/src/jscntxt.c,v  <--  jscntxt.c
new revision: 3.144; previous revision: 3.143
done
Keywords: fixed1.9.0.10
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: