Opened 12 years ago

Closed 12 years ago

#1760 closed Bug (fixed)

transmission daemon doesn’t respond

Reported by: nzamb1 Owned by: charles
Priority: Normal Milestone: 1.50
Component: Daemon Version: 1.42
Severity: Major Keywords: patch
Cc:

Description

I run transmission 1.50b1 on popcorn hour A-110. I started downloading some large files. I have set don't limit the global upload speed. Then I have restarted transmission daemon. Now it's checking my files (during whole night). Maybe it already finished checking but I don’t know because transmission daemon doesn’t respond through transmission-remote program. It’s very bad because I can’t pause downloading or checking files and I don’t want to check my files again after restarting transmission daemon.

Attachments (4)

1.txt (793 bytes) - added by nzamb1 12 years ago.
transmission-unlock_verify.diff (312 bytes) - added by turbo 12 years ago.
Try this patch - when verify is complete, 'node' is empty and we break out of the loop and hence the thread. But the lock remains (done a few lines above)
transmission-handle_rpc-inEventThread.diff (6.1 KB) - added by turbo 12 years ago.
Ugly proof-of-consept on running the handle_rpc in an event thread.
transmission-break_verify.diff (959 bytes) - added by turbo 12 years ago.
This fix doesn't seem to be enough 1. Issue a verify on torrent 2. Move torrent 3. Try to do a list. => hang With some extra debugging (I don't know how to make a fancy gdb trace like you did :): tr_verifyRemove(): about to lock. tr_verifyRemove(): got lock. <TORRENT#1>: Queued for verification <TORRENT#1>: Verifying torrent <TORRENT#1>: Queued for move from /share/Downloads.test2 to /share/Downloads.test1 moveThread == NULL <TORRENT#1>: Moving torrent Dir: /share/Downloads.test2/<TORRENT#1> tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume" verifyState: 2 tr_verifyRemove(): about to lock. tr_verifyRemove(): got lock. tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume" The 'verifyState: 2' comes from stopTorrent() - which is called via tr_torrentStop() from moveThreadFunc() just before the copy stage... Now it hangs in tr_verifyRemove() instead - the stopCurrent is never set back to FALSE... The r7830 only moved the problem into the next function…

Download all attachments as: .zip

Change History (26)

Changed 12 years ago by nzamb1

Changed 12 years ago by turbo

Try this patch - when verify is complete, 'node' is empty and we break out of the loop and hence the thread. But the lock remains (done a few lines above)

comment:1 Changed 12 years ago by turbo

  • Keywords patch added

comment:2 Changed 12 years ago by charles

turbo: I don't think that patch is needed. The `break' statement there ends the loop, and the lock is released immediately below the loop...

nzamb1: Just to clarify what your bug report... are you saying that transmission-daemon stops responding to requests during torrent verification? or that it stopped responding and the reason is unknown?

IMO to move this report forward nzamb1 will need to run transmission-daemon in the foreground (with the -f option), preferably in gdb to see where transmission-daemon is spending its cycles when it *should* be handling rpc requests.

(Note: when starting transmission at the gdb prompt, you need to type "handle SIGPIPE nostop noprint" at the (gdb) prompt before running transmission. Transmission handles SIGPIPE, but for some reason gdb un-handles it...)

comment:3 Changed 12 years ago by nzamb1

I don't know where problem exactly is. I suppose that the problem is that device Popcorn Hour A-110 has slow CPU and transmission stops responding during torrent verification. I can only start transmission with –f option because Popcorn doesn’t have gdb.

comment:4 Changed 12 years ago by turbo

I have the same on my Linux Debian Etch on a 2.66GHz P4 w/ 2Gb mem. Very often, the client just can't connect. If I Ctrl-C and try again, it usually work eventually.

To me, this looks more like a thread lock that's gone haywire (or not released correctly). I got the same behaviour with my #920 patch work...

comment:5 Changed 12 years ago by turbo

I see that there's only one RPC thread running/started. How about having that send back a reply - 'please talk to me on port xyz' and then start a thread on that port...

How does other daemons do it?

Changed 12 years ago by turbo

Ugly proof-of-consept on running the handle_rpc in an event thread.

comment:6 Changed 12 years ago by turbo

Didn't do much good (mostly because I found that if the session is already in a thread, it won't start a new one). While checking, the daemon is completley unresponsive..

Yes, got a high load (4.27, 2.80, 1.63), but still...

comment:7 Changed 12 years ago by turbo

Putting some more debugging, it looks like the daemon actually GET the request! It just won't reply (in a timley mannor/at all)...

comment:8 Changed 12 years ago by turbo

However, replacing tr_runInEventThread() in my patch with tr_threadNew() seems to make it better. I'm going to have to do more testing before I can say for sure...

comment:9 Changed 12 years ago by charles

I don't understand what this patch does. RPC handling is already done in the libtransmission thread.

comment:10 Changed 12 years ago by turbo

It is/was supposed to fork of and process the request i a separate thread. Something is blocking the main thread(s) from doing the job... At least that's my working idea...

It didn't seem to work perfectly (although, it worked better than before). My machine now have a load average of '1.66, 2.30, 2.94', but the client (remote) don't get anything from the daemon. The daemon receives the request but doesn't reply.

comment:11 Changed 12 years ago by charles

I think I see the problem here. It's a garden-variety mutex deadlock.

I got the following backtrace when executing a stop' while the torrent in question is being verified. tr_torrentStop() gets a libtransmission mutex lock, then calls stopTorrent() -> tr_verifyRemove(), and since the torrent is currently being verified, spins its cycles until the abort' flag gets honored.

Meanwhile the verify thread is blocked in tr_torrentSetHasPiece() waiting for a libtransmission mutex lock, which it will never get because the other thread's tr_torrentStop() has it, and it won't return to honor the abort flag until that mutex lock is finished...

The `verify' thread:

#0  0x0000003b93a0ce74 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b93a088c5 in _L_lock_965 () from /lib64/libpthread.so.0
#2  0x0000003b93a08727 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000040a84f in tr_lockLock (l=0x62aa650) at platform.c:189
#4  0x000000000040da92 in tr_globalLock (session=0x62ad1c0) at session.c:628
#5  0x0000000000412e89 in tr_torrentLock (tor=0x632cfa0) at torrent.h:253
#6  0x0000000000413932 in tr_torrentSetHasPiece (tor=0x632cfa0, pieceIndex=0, has=0 '\0') at torrent.c:1047
#7  0x000000000041ce96 in checkFile (tor=0x632cfa0, buffer=0x632e110, buflen=32768, fileIndex=0, abortFlag=0x66f160) at verify.c:83
#8  0x000000000041d21e in verifyThreadFunc (unused=0x0) at verify.c:149
#9  0x000000000040a723 in ThreadFunc (_t=0x632dfc0) at platform.c:106
#10 0x0000003b93a06367 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003b92ed30ad in clone () from /lib64/libc.so.6

The `libtransmission' thread:

#0  0x0000003b92e99761 in nanosleep () from /lib64/libc.so.6
#1  0x0000003b92eccda4 in usleep () from /lib64/libc.so.6
#2  0x000000000041c144 in tr_wait (delay_milliseconds=100) at utils.c:979
#3  0x000000000041d55f in tr_verifyRemove (tor=0x632cfa0) at verify.c:238
#4  0x0000000000413fae in stopTorrent (vtor=0x632cfa0) at torrent.c:1232
#5  0x000000000041a019 in tr_runInEventThread (session=0x62ad1c0, func=0x413f6b <stopTorrent>, user_data=0x632cfa0) at trevent.c:361
#6  0x000000000041402f in tr_torrentStop (tor=0x632cfa0) at torrent.c:1249
#7  0x0000000000440ae2 in torrentStop (session=0x62ad1c0, args_in=0x6325f90, args_out=0x6325880, idle_data=0x0) at rpcimpl.c:190
#8  0x0000000000444153 in request_exec (session=0x62ad1c0, request=0x41c32eb0, callback=0x436752 <rpc_response_func>, callback_user_data=0x63261a0)
    at rpcimpl.c:1006
#9  0x000000000044437d in tr_rpc_request_exec_json (session=0x62ad1c0, request_json=0x63259d0, request_len=55, callback=0x436752 <rpc_response_func>, 
    callback_user_data=0x63261a0) at rpcimpl.c:1049
#10 0x00000000004368a9 in handle_rpc (req=0x6326680, server=0x62c6580) at rpc-server.c:419
#11 0x0000000000436be5 in handle_request (req=0x6326680, arg=0x62c6580) at rpc-server.c:501
#12 0x000000000044df6a in evhttp_read_body (evcon=0x6326370, req=0x6326680) at http.c:895
#13 0x000000000044e1f3 in evhttp_get_body (evcon=0x6326370, req=0x6326680) at http.c:1569
#14 0x000000000044ab47 in event_base_loop (base=0x62ac940, flags=0) at event.c:392
#15 0x0000000000419b89 in libeventThreadFunc (veh=0x62aacf0) at trevent.c:238
#16 0x000000000040a723 in ThreadFunc (_t=0x62aa6b0) at platform.c:106
#17 0x0000003b93a06367 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003b92ed30ad in clone () from /lib64/libc.so.6

comment:12 Changed 12 years ago by charles

stupid trac formatting :)

comment:13 Changed 12 years ago by charles

A quick & dirty hack -- that appears solve the problem, and be harmless -- is to remove the lock/unlock calls in tr_torrentSetHasPiece()

comment:14 Changed 12 years ago by turbo

The block happens when moving a torrent to...

comment:15 Changed 12 years ago by turbo

How about a 'torrent->break{Verify,Move}' which the verify/move checks every loop and if set, breaks... ?

comment:16 Changed 12 years ago by charles

  • Milestone changed from None Set to 1.50
  • Status changed from new to assigned
  • Version set to 1.42

r7830: don't bother with the lock/unlock mutex in tr_torrnentSetHasPiece()

turbo: probably the same issue... does moving the torrent call tr_verifyRemove()? And, does r7830 fix it for moving torrents?

comment:17 Changed 12 years ago by turbo

No, but tr_moveRemove() which works the same way...

No, since I'm not calling tr_torrnentSetHasPiece() - I have no interesst in pieces, just complete files...

I was thinking of setting 'torrent->break{Verify,Move}' _before_ tr_{verify,move}Remove(), give it a few sec to be 'discovered', then continue with the *Remove()... ?

You could even put it in checkFile() - it will still give a slight block time, but as soon as the piece check/file moved is done, it will cancel...

comment:18 Changed 12 years ago by turbo

Sorry, the _check_ should go in checkFile(), not the _set_. That still probably need to go before the *Remove().

Changed 12 years ago by turbo

This fix doesn't seem to be enough 1. Issue a verify on torrent 2. Move torrent 3. Try to do a list. => hang With some extra debugging (I don't know how to make a fancy gdb trace like you did :): tr_verifyRemove(): about to lock. tr_verifyRemove(): got lock. <TORRENT#1>: Queued for verification <TORRENT#1>: Verifying torrent <TORRENT#1>: Queued for move from /share/Downloads.test2 to /share/Downloads.test1 moveThread == NULL <TORRENT#1>: Moving torrent Dir: /share/Downloads.test2/<TORRENT#1> tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume" verifyState: 2 tr_verifyRemove(): about to lock. tr_verifyRemove(): got lock. tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume" The 'verifyState: 2' comes from stopTorrent() - which is called via tr_torrentStop() from moveThreadFunc() just before the copy stage... Now it hangs in tr_verifyRemove() instead - the stopCurrent is never set back to FALSE... The r7830 only moved the problem into the next function...

comment:19 Changed 12 years ago by turbo

WTF!? Again with the trac formating!!

This fix doesn't seem to be enough

1. Issue a verify on torrent
2. Move torrent
3. Try to do a list.
   => hang

With some extra debugging (I don't know how to make a fancy gdb trace like you did :):

tr_verifyRemove(): about to lock.
tr_verifyRemove(): got lock.
<TORRENT#1>: Queued for verification
<TORRENT#1>: Verifying torrent
<TORRENT#1>: Queued for move from /share/Downloads.test2 to /share/Downloads.test1
moveThread == NULL
<TORRENT#1>: Moving torrent
Dir: /share/Downloads.test2/<TORRENT#1>
tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume"
verifyState: 2
tr_verifyRemove(): about to lock.
tr_verifyRemove(): got lock.
tr_bencSaveFile saved "/home/turbo/.config/transmission-daemon.test/resume/<TORRENT#1>.fbd2d419b71c1d6d.resume"

The 'verifyState: 2' comes from stopTorrent() - which is called via tr_torrentStop() from moveThreadFunc() just before the copy stage...

Now it hangs in tr_verifyRemove() (in the while() loop waiting for stopCurrent to be false) instead - the stopCurrent is never set back to FALSE...

The r7830 only moved the problem into the next function...

comment:20 Changed 12 years ago by turbo

  • Priority changed from Normal to High
  • Severity changed from Normal to Major

Even if we dissregard all my posts about 'move' and 'verify' at the same time, it's still not fixed...

I just restarted my daemon - because I couldn't get a list (I was moving a whole bunch of torrents). Now, I'm not doing anything - just waiting for it to come up.

In the output, I see a lot of scrape's etc, but the remote can't get a list...

[...]
* About to connect() to localhost port 9091 (#0)
*   Trying 127.0.0.1... * connected
* Connected to localhost (127.0.0.1) port 9091 (#0)
> POST /transmission/rpc HTTP/1.1
User-Agent: transmission-remote/1.42+ (7848)
Host: localhost:9091
Accept: */*
Content-Length: 424
Content-Type: application/x-www-form-urlencoded

That's it. It just hangs there... Load is 2.47 (on a Debian GNU/Linux etc kernel 2.6.18-6-686). And the webinterface can't connect either.

I was thinking that 2.47 was to much for T (!!) so I reniced T to -10, that should give it enough prio to answer 'no matter what'... It didn't help...

And regarding #comment:9, the RPC server is run in the same thread as 'everything else' - not in a separate thread all by itself. My patch is/was supposed to fix that (but I disabled it after your comment - going to test it further).

I'm bumping the severity and priority, because I feel this is a showstopper. Are there really only two people that notice this!?

comment:21 Changed 12 years ago by charles

turbo: regarding comment #20, can you please post a recipe to make this occur with a clean build from trunk?

comment:22 Changed 12 years ago by charles

  • Priority changed from High to Normal
  • Resolution set to fixed
  • Status changed from assigned to closed

09:39 < FransUrbo?> http://transmission.pastebin.com/d41601721
09:40 <@charles_> FransUrbo?: is the disk busy at the time?
09:41 <@charles_> FransUrbo?: this doesn't look frozen; just slow due to IO
09:44 < FransUrbo?> Actually makes sense...
09:44 <@charles_> FransUrbo?: it looks like one thread is verifying the checksum of a piece that was just downloaded, and the other thread is in the middle of verifying a whole torrent
09:46 < FransUrbo?> ... which deals with RPC reqs etc...
09:48 < FransUrbo?> Trying the same thing again, remote can connect without so much as a hickup!
10:10 <@charles_> FransUrbo?: other than the already-known issue that Transmission should batch up writes to avoid excessive disk IO, I'm not sure I see the problem here... I don't see anything deadlocking
10:12 < FransUrbo?> Agree. Sorry for all the noice. It seems like I need faster SATA interface(s)...

Note: See TracTickets for help on using tickets.