Opened 6 years ago

Closed 5 years ago

#5891 closed Bug (fixed)

Crash on session shutdown (evdns_getaddrinfo_cancel)

Reported by: mike.dld Owned by: mike.dld
Priority: Normal Milestone: 2.90
Component: libtransmission Version: 2.84+
Severity: Normal Keywords:
Cc:

Description

Transmission crashes from time to time (tested on Qt and Mac clients) on quit with SIGSEGV and backtrace similar to:

Thread 10 Crashed:
0   ???                           	000000000000000000 0 + 0
1   libevent-2.0.5.dylib          	0x0000000103b8cb87 evdns_getaddrinfo_cancel + 36
2   transmission-qt               	0x0000000103911e8a tau_tracker_free + 74
3   transmission-qt               	0x000000010394cdc0 tr_ptrArrayForeach + 304
4   transmission-qt               	0x000000010394cc79 tr_ptrArrayDestruct + 217
5   transmission-qt               	0x0000000103911e2b tr_tracker_udp_close + 75
6   transmission-qt               	0x0000000103962a6c sessionCloseImpl + 508
7   transmission-qt               	0x0000000103977d26 readFromPipe + 470
8   libevent-2.0.5.dylib          	0x0000000103b70b61 event_base_loop + 1795
9   transmission-qt               	0x0000000103977748 libeventThreadFunc + 248
10  transmission-qt               	0x000000010394b280 ThreadFunc + 48
11  libsystem_pthread.dylib       	0x00007fff93aef2fc _pthread_body + 131
12  libsystem_pthread.dylib       	0x00007fff93aef279 _pthread_start + 176
13  libsystem_pthread.dylib       	0x00007fff93aed4b1 thread_start + 13

Attachments (2)

transmission-qt_2014-12-31-212538_Mikes-MacBook-Pro.crash (54.7 KB) - added by mike.dld 6 years ago.
Crash report for Qt client on Mac
5891-session-close.patch (4.9 KB) - added by mike.dld 6 years ago.

Download all attachments as: .zip

Change History (19)

Changed 6 years ago by mike.dld

Crash report for Qt client on Mac

comment:1 Changed 6 years ago by cfpp2p

related? r14237

comment:2 Changed 6 years ago by x190

---

Last edited 6 years ago by x190 (previous) (diff)

comment:3 follow-up: Changed 6 years ago by x190

https://forum.transmissionbt.com/viewtopic.php?f=4&t=16994#p71159

Thread 5 Crashed:
0   org.m0k.transmission                0x00000001000c87e6 evdns_cancel_request + 82
1   org.m0k.transmission                0x00000001000cbdcf evdns_getaddrinfo_cancel + 81
2   org.m0k.transmission                0x00000001000a17e4 tau_tracker_free + 37
3   org.m0k.transmission                0x000000010007cc85 tr_ptrArrayForeach + 62
4   org.m0k.transmission                0x000000010007cbf5 tr_ptrArrayDestruct + 41
5   org.m0k.transmission                0x00000001000a17aa tr_tracker_udp_close + 44
6   org.m0k.transmission                0x000000010006a201 sessionCloseImpl + 335
7   org.m0k.transmission                0x000000010007d6b2 readFromPipe + 323
8   org.m0k.transmission                0x00000001000b7ce4 event_base_loop + 1848

Also: https://forum.transmissionbt.com/viewtopic.php?t=16639&start=15#p70030 https://bugs.launchpad.net/ubuntu/+source/transmission/+bug/1037960

From mike.dld's report:

libevent-2.0.5.dylib          	0x0000000103b70b61 event_base_loop + 1795

https://raw.githubusercontent.com/libevent/libevent/release-2.0.22-stable/ChangeLog

Changes in version 2.0.11-stable (27 Apr 2011) ...

o Fix possible NULL-deref in evdns_cancel_request (5208544 Sebastian Hahn)

Last edited 6 years ago by x190 (previous) (diff)

Changed 6 years ago by mike.dld

comment:4 Changed 6 years ago by mike.dld

Attached is a working patch which prevents crashes on session close (at least the ones related to evdns_getaddrinfo_cancel). It's not pretty at all, I don't like 3-stage session close approach. The idea is to let libevent call DNS callbacks for requests we cancelled before we free the tau_tracker data. Maybe someone else could make a better patch based on that.

comment:5 in reply to: ↑ 3 Changed 6 years ago by mike.dld

Replying to x190:

From mike.dld's report:

libevent-2.0.5.dylib          	0x0000000103b70b61 event_base_loop + 1795

https://raw.githubusercontent.com/libevent/libevent/release-2.0.22-stable/ChangeLog

Changes in version 2.0.11-stable (27 Apr 2011) ...

o Fix possible NULL-deref in evdns_cancel_request (5208544 Sebastian Hahn)

Currently testing with 2.0.22 (don't even know where did I find 2.0.5 before), same problem. See the patch above.

comment:6 Changed 6 years ago by cfpp2p

Replying to mike.dld:

Attached is a working patch which prevents crashes on session close (at least the ones related to evdns_getaddrinfo_cancel). It's not pretty at all, I don't like 3-stage session close approach. The idea is to let libevent call DNS callbacks for requests we cancelled before we free the tau_tracker data. Maybe someone else could make a better patch based on that.

Is this a rework of r14237 ?

comment:7 Changed 6 years ago by mike.dld

It's an addition to that revision, yes.

comment:8 follow-up: Changed 6 years ago by mike.dld

Here's a more complete story of what's going on (in terms of current unpatched code). First (in sessionCloseImpl), we free evdns:

  evdns_base_free (session->evdns_base, 0);
  session->evdns_base = NULL;

Then on next line, we close UDP tracker(s):

  tr_tracker_udp_close (session);

During the close, tau_tracker_free is being called for each tracker, which in turn tries to cancel corresponding DNS request if one is currently in progress:

    if (t->dns_request != NULL)
        evdns_getaddrinfo_cancel (t->dns_request);

Cancel function looks like this:

void
evdns_getaddrinfo_cancel(struct evdns_getaddrinfo_request *data)
{
  EVDNS_LOCK(data->evdns_base);
  ...
  EVDNS_UNLOCK(data->evdns_base);
}

Since evdns is already freed, *data->evdns_base contains garbage. Then we have

#define EVDNS_LOCK(base)                        \
        EVLOCK_LOCK((base)->lock, 0)

and

#define EVLOCK_LOCK(lockvar,mode)                                       \
        do {                                                            \
                if (lockvar)                                            \
                        _evthread_lock_fns.lock(mode, lockvar);         \
        } while (0)

Since data->evdns_base->lock (a.k.a. (base)->lock and lockvar above) is garbage (so, not NULL most of the time), _evthread_lock_fns.lock() call takes place, but function pointer is NULL unless we call evthread_use_pthreads or evthread_use_windows_threads (and we don't).

comment:9 in reply to: ↑ 8 Changed 6 years ago by cfpp2p

Replying to mike.dld:

Here's a more complete story of what's going on (in terms of current unpatched code). First (in sessionCloseImpl), we free evdns:

  evdns_base_free (session->evdns_base, 0);
  session->evdns_base = NULL;

Here's what I see going on in terms of current unpatched code.

session.c uses evdns_base_free with fail_requests set to 0

evdns_base_free (session->evdns_base, 0);

fail_requests if zero, active requests will be aborted; if non-zero, active requests will return DNS_ERR_SHUTDOWN.

http://www.wangafu.net/~nickm/libevent-2.1/doxygen/html/dns_8h.html#ab221bbadb18451c9007e8dc98bc97786

void evdns_base_free ( struct evdns_base * base,
                       int                 fail_requests ) 			

Shut down the asynchronous DNS resolver and
terminate all active requests.

If the 'fail_requests' option is enabled,
all active requests will return an empty result
with the error flag set to DNS_ERR_SHUTDOWN.
Otherwise, the requests will be silently discarded.

Parameters:
    	evdns_base 	the evdns base to free
    	fail_requests 	if zero, active requests
                        will be aborted;
                        if non-zero, active requests
                        will return DNS_ERR_SHUTDOWN.

The difference is that "If its fail_requests argument is true, it will make all in-flight requests get their callbacks invoked with a canceled error code before it frees the base". If fail_requests argument is false we'll simply abort without any callbacks.

See also "Creating and configuring an evdns_base" http://www.wangafu.net/~nickm/libevent-book/Ref9_dns.html and specifically "Note that the callback function will eventually be invoked whether evdns_getaddrinfo() returns NULL or not, and whether evdns_getaddrinfo_cancel() is called or not."

Since fail_requests is zero, active requests will be aborted by evdns_base_free. But then later on tr_tracker_udp_close calls tau_tracker_free which in turn calls evdns_getaddrinfo_cancel unnecessarily because evdns_base_free does what it's supposed to "Shut down the asynchronous DNS resolver and terminate all active requests" making the evdns_getaddrinfo_cancel redundant and cause the "garbage" you refer to. evdns_base_free does more than just free memory.

The way I see right now and the way I saw it back when I worked on a cygwin UDP tracker issue (ref. #5698 , http://sourceforge.net/p/transmissiondaemon/discussion/general/thread/a43bf816/#bb71 ) is that the evdns_getaddrinfo_cancel is not necessary after we already issue evdns_base_free.

comment:10 Changed 6 years ago by mike.dld

If you're implying that this revision wasn't needed, I could only partially agree with you. Here's the crash I didn't have to work hard to get with r14236, which is essentially the same issue:

Thread 0:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib  0x00007fff904aa48a __semwait_signal + 10
1   transmission-qt         0x0000000101885b54 tr_wait_msec + 84 (utils.c:747)
2   transmission-qt         0x000000010186a912 tr_sessionClose + 354 (session.c:1872)
3   transmission-qt         0x00000001017e20bf Session::stop() + 111 (session.cc:314)
4   transmission-qt         0x00000001017e1ec6 Session::~Session() + 38 (session.cc:276)
5   transmission-qt         0x00000001017e20f5 Session::~Session() + 21 (session.cc:277)
6   transmission-qt         0x00000001017e2118 Session::~Session() + 24 (session.cc:275)
7   transmission-qt         0x000000010176861b MyApp::~MyApp() + 267 (app.cc:343)
8   transmission-qt         0x0000000101768995 MyApp::~MyApp() + 21 (app.cc:350)
9   transmission-qt         0x0000000101769374 main + 1892 (app.cc:536)
10  transmission-qt         0x0000000101761904 start + 52

...

Thread 5 Crashed:
0   libevent.dylib          0x0000000101a8f46e evdns_cancel_request + 49
1   libevent.dylib          0x0000000101a92975 evdns_getaddrinfo_timeout_cb + 123
2   libevent.dylib          0x0000000101a766a1 event_base_loop + 1631
3   transmission-qt         0x0000000101882c0f libeventThreadFunc + 239 (trevent.c:245)
4   transmission-qt         0x0000000101854360 ThreadFunc + 48 (platform.c:107)
5   libsystem_pthread.dylib 0x00007fff837e2268 _pthread_body + 131
6   libsystem_pthread.dylib 0x00007fff837e21e5 _pthread_start + 176
7   libsystem_pthread.dylib 0x00007fff837e041d thread_start + 13

So the issues was there even before r14237, and should've been fixed one way or another.

The fix in r14237 was made after my own crash report to @jordan on IRC (which was similar if not identical to the above), and has nothing to do with mentioned UDP issues (or at least they aren't what this revision actually tried to fix).

Below is the chat transcript (dpaste links don't work already), date is Jan 21 '15. Hopefully, @jordan wouldn't mind. Crash mentioned here occured on GNU/Linux.

<mikedld> I've got a crash in tau_tracker_fail_all (with r14160), probably worth
          looking at, maybe I need to create a ticket or something. the crash is
          inside the first for() loop, on the very first iteration, and `reqs`
          has a very interesting value of {items = 0x0, n_items = 192, n_alloc = 0}
<jordan>  can you pastebin the bt?
<mikedld> http://dpaste.com/1561664/plain/
<jordan>  hmm
<jordan>  the ptrarray destructor doesn't NULL out the items pointer
<jordan>  so this smells less like a destructed ptrarray, and more like the tracker's
          scrapes pointer is pointing out into space
<jordan>  (or the tracker pointer itself)
<jordan>  if you look at the 'tracker' pointer's fields in gdb at
          tau_tracker_fail_all(), is *tracker->announces sane?
<mikedld> http://dpaste.com/1561680/plain/
<mikedld> it looks so, yes
<jordan>  1389937523 was on Friday, so I guess that's a reasonable time_t
<jordan>  is this crash from a few days ago? :)
<mikedld> yep, Jan 17
<mikedld> I just managed to find it today :-P
<jordan>  okay, so ``tracker'' is legit
<mikedld> (the core dump)
<mikedld> but I assume you can't have this {items = 0x0, n_items = 192, n_alloc = 0}
          in an ideal world, right? :)
<jordan>  did this crash happen during shutdown?
<mikedld> no, normal operation (only seeding actually, no leaching at all). I left it
          unattended and then noticed it crashed
<jordan>  darn. I was thinking maybe shutdown called tr_tracker_udp_close() before
          the dns lookup callback returned
<mikedld> http://dpaste.com/1561697/plain/
<jordan>  tr_sessionClose()
<jordan>  that looks like it happened on shutdown. Are you sure this is the same
          backtrace from the crash you're thinking of?
<mikedld> it's the only one I had actually... well maybe I'm wrong after all and it
          crashed some other time w/o me noticing :(
<jordan>  well, this looks like a legitimate bug here... let's fix it
<jordan>  if you have other corefiles, we can handle them separately
<jordan>  if you don't, then we can see if you still get crashes after the fix
<jordan>  announcer-udp.c needs to call evdns_getaddrinfo_cancel() when it frees the
          tau_tracker if is_asking_dns is true
<mikedld> I think I'll need to set up another core naming scheme before I get more
          files, since right now the name is just "core" :)
<jordan>  actually we need to keep the request pointer returned by
          evdns_getaddrinfo(), so we could replace the 'is_asking_dns' field with a
          'dns_request' field, and test it for non-NULL to see if dns is pending
<mikedld> that's out of my understanding for now :)
<jordan>  thanks for reporting this
...
<jordan>  are you able to reproduce this crash?
<mikedld> didn't try yet, but will tomorrow. time to sleep already
<jordan>  http://paste.ubuntu.com/6788748/
<mikedld> k, will try it out once I'm able to reproduce

comment:11 Changed 6 years ago by cfpp2p

It's very helpful to have the actual reason for the r14237 which I had not been aware of previous to now. I use the trac system and forums, I don't use IRC for research. Any implied implications can be attributed to that lack of information.

If it were tried using in a patch with

evdns_base_free (session->evdns_base, 1);

then in-flight requests get their callbacks invoked and maybe that could be worked.

I might suggest that the crash is partially due to threading changes that took place with the 2.8x branch.

I felt with the prior research I'd done, (although flawed due to lack of available information), that I should share it somehow. Sorry for any misunderstandings.

comment:12 Changed 6 years ago by mike.dld

I'm guessing that just changing 0 to 1 won't help much, but I'm not at home so cannot check yet. When I was at home earlier today though, I took a look at evdns.c and got the impression that both 0 and 1 require event loop to be running in order for callbacks to be called (they are scheduled, not called directly), which is part of the patch I've attached (waiting for announcer to become idle with event timer, not plain loop within event thread).

comment:13 Changed 5 years ago by mike.dld

Closed #5959 as duplicate of this ticket.

comment:14 Changed 5 years ago by Dowser101

Hello, i posted #5959 after searching with search engine and managed to find only a very old ticket (2009) on old version, sorry about this. I’m not a programmer and maybe i should not have got involved.

I just want to let you know i don’t endure crashes on released version except upon quitting the app, it ’s not a big deal for me. But as i don’t remember the last time i could quit without crashing, i thought it could matter to the devs. (Thank you 2 very much by the way).

I did save my apple generated message and can lurk into my Console if anybody is interested in some of its log content.

As i said in original post, i tried the latest nightly build and it crashed few hours later. Since that i reloaded it and it did not crash until i tried to quit... Just my 2 more cents

comment:15 Changed 5 years ago by mike.dld

Closed #6023 as duplicate of this ticket.

comment:16 Changed 5 years ago by mike.dld

  • Owner changed from jordan to mike.dld
  • Status changed from new to assigned

comment:17 Changed 5 years ago by mike.dld

  • Milestone changed from None Set to 2.90
  • Resolution set to fixed
  • Status changed from assigned to closed

Since no one came up with a nicer solution, committing the attached patch, r14648+r14649.

Note: See TracTickets for help on using tickets.