Opened 10 years ago

Closed 7 years ago

#4531 closed Bug (fixed)

daemon sometimes hangs and has to be killed and restarted

Reported by: kringe
Owned by:
Priority: Normal
Milestone: 2.83
Component: Daemon
Version: 2.33+
Severity: Normal
Keywords:
Cc: gvdl@…

Description

i have noticed for the past couple of weeks that about once a day transmission-daemon will hang and have to be killed and restarted. after it hangs, it consumes no cpu, produces no i/o, and generates no log messages. the last message in the debug log was 'Announcing to tracker'.

using r12930 (2.40b2) with libevent 2.0.13 (from macports) on Mac OS X Lion.

Attachments (4)

backtrace.log (9.5 KB) - added by kringe 10 years ago.
(gdb) thread apply all bt full
settings.json (2.2 KB) - added by kringe 9 years ago.
activity-monitor-sample.txt (11.7 KB) - added by shellbound 9 years ago.
activity-monitor-sample2.txt (11.6 KB) - added by shellbound 9 years ago.


Change History (86)

comment:1 Changed 10 years ago by jordan

Hi Kringe,

Thanks for reporting this bug and helping to make Transmission better.

I'm not seeing this behavior. Maybe you could run transmission-daemon inside of gdb so that you can get a backtrace after it hangs?

(Note: you'll need to handle SIGPIPE when running Transmission in gdb.)
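Something along these lines should work -- this is off the top of my head, so adjust the daemon path and flags for your own setup:

    $ gdb --args transmission-daemon --foreground --log-debug
    (gdb) handle SIGPIPE nostop noprint pass
    (gdb) run
    ... wait for the hang, then hit Ctrl-C ...
    (gdb) thread apply all bt full

The "handle SIGPIPE" line keeps gdb from stopping every time a peer connection breaks.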

comment:2 follow-up: Changed 10 years ago by kringe

i'll do my best if you can provide me with instructions.

also, i noticed even after killing the transmission-daemon process with -9 (none of the other signals work), the process frequently continues as a zombie (according to ps) for about 10 minutes and performs a lot of i/o (according to iotop) before finally dying. not sure if that issue is specific to the mac.

comment:3 in reply to: ↑ 2 ; follow-up: Changed 10 years ago by x190

Replying to kringe:

i'll do my best if you can provide me with instructions.

also, i noticed even after killing the transmission-daemon process with -9 (none of the other signals work), the process frequently continues as a zombie (according to ps) for about 10 minutes and performs a lot of i/o (according to iotop) before finally dying. not sure if that issue is specific to the mac.

Specific to the mac?--in all likelihood, yes! I'll bet you've been adding some rather large torrents during the current session, and since OS X doesn't support sparse files, any unwritten-to preallocated space for those torrents has to be zeroed out upon pause or quit. That's my best theory anyway, and I do see it regularly. Just wait it out when pausing or quitting (roughly 1 min. per 2GB of torrent size) and Transmission will recover and pause/quit normally.

I'm curious about why you use transmission-daemon as opposed to the Mac Client.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 10 years ago by kringe

Replying to x190:

i'll do my best if you can provide me with instructions.

found gdb instructions here: https://wiki.ubuntu.com/Backtrace

Specific to the mac?--in all likelihood, yes! I'll bet you've been adding some rather large torrents during the current session, and since OS X doesn't support sparse files, any unwritten-to preallocated space for those torrents has to be zeroed out upon pause or quit. That's my best theory anyway, and I do see it regularly. Just wait it out when pausing or quitting (roughly 1 min. per 2GB of torrent size) and Transmission will recover and pause/quit normally.

bingo! google confirms you hit the head on the nail.

I'm curious about why you use transmission-daemon as opposed to the Mac Client.

i'm running transmission on a mac mini that is mostly headless. also, the daemon uses far fewer resources (both memory and cpu). and i find it a lot easier to manage my many torrents with a set of scripts talking through transmission-remote than with a gui that is prone to freezing (no separate i/o thread).
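for example, most of my scripts are just thin wrappers around one-liners like these (host/port are the defaults; the torrent file and id are made up):

    transmission-remote localhost:9091 -l                # list all torrents
    transmission-remote localhost:9091 -a foo.torrent    # add a torrent
    transmission-remote localhost:9091 -t 42 --stop      # pause torrent 42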

comment:5 follow-up: Changed 10 years ago by livings124

Hey kringe, are we all good to resolve this?

comment:6 in reply to: ↑ 4 Changed 10 years ago by x190

Replying to kringe:

bingo! google confirms you hit the head on the nail.

Yeah, one of these days I've got to buy a hammer--my head's really starting to hurt. :lol:

comment:7 Changed 10 years ago by livings124

  • Resolution set to invalid
  • Status changed from new to closed

I wouldn't recommend a hammer then.

comment:8 in reply to: ↑ 5 Changed 10 years ago by kringe

Replying to livings124:

Hey kringe, are we all good to resolve this?

i just found the instructions on how to use gdb; now i have to wait for it to happen again.

comment:9 Changed 10 years ago by livings124

  • Resolution invalid deleted
  • Status changed from closed to reopened

Changed 10 years ago by kringe

(gdb) thread apply all bt full

comment:10 Changed 10 years ago by kringe

attached backtrace produced from r12959

comment:11 follow-up: Changed 10 years ago by x190

Hi kringe,

While you are waiting for the developers to examine your log, I'd like to ask some questions that might provide more info.

#1 How long did the daemon hang in this case?

#2 Were you interacting with the daemon in any way just before the hang--adding/deleting torrents, moving data, etc?

#3 Are there any relevant entries in your transmission-daemon log file corresponding to the time of the hang?

#4 The last line of the log you posted is "The program is running. Exit anyway? (y or n)". Does that mean the daemon recovered on its own?

Thanks!

comment:12 in reply to: ↑ 11 Changed 10 years ago by kringe

Replying to x190:

#1 How long did the daemon hang in this case?

it had been hanging for at least 30 minutes before i issued a kill signal and produced the backtrace

#2 Were you interacting with the daemon in any way just before the hang--adding/deleting torrents, moving data, etc?

no. i had Transmission Remote GUI running at the time, but it only polls the RPC API every 5 seconds.

#3 Are there any relevant entries in your transmission-daemon log file corresponding to the time of the hang?

each time this happens, the last line in the debug log is 'Announcing to tracker'

#4 The last line of the log you posted is "The program is running. Exit anyway? (y or n)". Does that mean the daemon recovered on its own?

no, the program was hanging, so i had to issue a kill signal to produce the backtrace.

comment:13 Changed 10 years ago by kringe

do i need to provide any other information?

comment:14 follow-up: Changed 10 years ago by x190

Okay so I'm no expert, but I think you need to run or attach to transmission-daemon in gdb and then ask for the backtrace while it is hanging.

https://wiki.ubuntu.com/Backtrace

Step: 4
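Roughly (and again, I'm no expert, so treat this as a sketch), attaching to the already-hung daemon would be:

    $ ps ax | grep transmission-daemon     # note the pid
    $ sudo gdb /usr/local/bin/transmission-daemon <pid>
    (gdb) thread apply all bt full
    (gdb) detach
    (gdb) quit

with /usr/local/bin/transmission-daemon replaced by wherever your binary actually lives.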

[EDIT: I apologize if that is exactly what you did.]

Lines 104 and 185 of your log provided in comment:9 indicate "out of bounds" errors. That may or may not be significant.

EDIT: kringe: Please see the following forum post for some settings that might help. https://forum.transmissionbt.com/viewtopic.php?f=4&t=11486#p54325

Last edited 10 years ago by x190

Changed 9 years ago by kringe

comment:15 in reply to: ↑ 14 Changed 9 years ago by kringe

Replying to x190:

[EDIT: I apologize if that is exactly what you did.]

Lines 104 and 185 of your log provided in comment:9 indicate "out of bounds" errors. That may or may not be significant.

right. i saw that also. actually, there are 3 instances of the "out of bounds" errors. 2 of them relate to the settings from daemon.c and one of them comes from dht.c. i am attaching my settings.json in case that is the issue.

EDIT: kringe: Please see the following forum post for some settings that might help. https://forum.transmissionbt.com/viewtopic.php?f=4&t=11486#p54325

doesn't seem to apply here. i have no network issues -- with either the computer or the router -- when this happens. and limiting to 10 active torrents would limit the usefulness of transmission. i currently have 166 active torrents.

i'm using r12982 now and still seeing the problem, though not as frequently (about once every three or four days).

comment:16 follow-up: Changed 9 years ago by x190

kringe: I went thru your settings.json and found some anomalies.

  • "open-file-limit": 32, This is very low. I suggest an increase.
  • "peer-socket-tos": "default", PeerSocketTOS: Number (Default = 0)
  • "prefetch-enabled": 1, Should be true or false.
  • "ratio-limit": 1.1000, Okay, but this means a ratio of 1.1

https://trac.transmissionbt.com/wiki/EditConfigFiles#EditingConfigurationFiles

comment:17 in reply to: ↑ 16 ; follow-up: Changed 9 years ago by kringe

Replying to x190:

kringe: I went thru your settings.json and found some anomalies.

  • "open-file-limit": 32, This is very low. I suggest an increase.

32 is the default. i just bumped this up to 1024

  • "peer-socket-tos": "default", PeerSocketTOS: Number (Default = 0)

the wiki page to which you linked (which i thought was definitive) says this field should be a string. a different page says it should be a number. and this seems to side with the string setting:

transmission.h:#define TR_DEFAULT_PEER_SOCKET_TOS_STR      "default"
  • "prefetch-enabled": 1, Should be true or false.

there's an edit message for the wiki page saying "changed the example boolean from '1' to 'true'", so 1 was probably the default when this setting was added. changing it to true now.

  • "ratio-limit": 1.1000, Okay, but this means a ratio of 1.1

that's how transmission stored it when i set it to 1.1 via rpc.

comment:18 in reply to: ↑ 17 Changed 9 years ago by x190

Replying to kringe:

Replying to x190:

kringe: I went thru your settings.json and found some anomalies.

  • "peer-socket-tos": "default", PeerSocketTOS: Number (Default = 0)

the wiki page to which you linked (which i thought was definitive) says this field should be a string. a different page says it should be a number. and this seems to side with the string setting:

transmission.h:#define TR_DEFAULT_PEER_SOCKET_TOS_STR      "default"
  • "prefetch-enabled": 1, Should be true or false.

there's an edit message for the wiki page saying "changed the example boolean from '1' to 'true'", so 1 was probably the default when this setting was added. changing it to true now.

You are absolutely correct. Funny how stuff can get updated overnight isn't it? :-)

EDIT: I just saw what confused me. Scroll to bottom of wiki page and you'll see "Options" which are for Mac Client Preference files. Too bad there isn't a complete list for us Mac Client users. :-( Hmm... I bet that transmission.h file is the ticket! But I digress--Good Luck with your Backtrace.

Last edited 9 years ago by x190

comment:19 Changed 9 years ago by kringe

still seeing this on r13087. jordan, was the backtrace sufficient?

comment:20 Changed 9 years ago by cfpp2p

  • "prefetch-enabled": 1, is CORRECT, the wiki has not been updated properly, start from a non existent settings.json (v 2.42, and I believe also v 2.33) and you will see defaults to: "prefetch-enabled": 1

Also:

Legacy Options

Only keys that differ from above are listed here. These options have been replaced in newer versions of Transmission. 2.31 (and older)

open-file-limit: Number (default = 32)

So for v 2.33, open-file-limit is an invalid option, as it has been totally eliminated.

I seem to remember a ticket describing open-file-limit being eliminated but I can't seem to find it now...

comment:21 Changed 9 years ago by kringe

i've been experiencing this for over two months now and since there's no sign of this problem even being investigated, i'm switching to a more stable client. so long, and thanks for all the fish.

comment:22 Changed 9 years ago by Luzen

I am having similar problems. Mac 10.7.2; transmission-daemon 2.42 (13168); libevent 2.0.16.

Hangs about once a week for me, and I see the same thing in the log -- the last message was an announce.

Searched trac and found several other tickets about hanging. ticket:3818 mentions that libevent might be the issue, so I'm going to try setting EVENT_NOKQUEUE=1
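In case it helps anyone else, setting it is just a matter of putting the variable in the daemon's environment before it starts, e.g. from a shell (if you launch it via launchd you'd need an EnvironmentVariables entry in the plist instead):

    $ EVENT_NOKQUEUE=1 transmission-daemon --foreground --log-debug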

comment:23 Changed 9 years ago by Luzen

9 days and no hanging. So it looks like kqueue was the cause. Is there a way to help debug this so you can either fix on your end or have libevent fix upstream?

comment:24 Changed 9 years ago by jordan

I'm not sure what would make sense at the Transmission code level... it just uses whatever settings libevent has set up.

You might want to take this issue upstream and ask the libevent developers if they have an opinion on what would be The Right Thing here.

comment:25 Changed 9 years ago by Luzen

I opened a ticket on the libevent bugtracker and they need more detailed information that only a developer can provide. If you need me to do something, please spell it out step by step- I'm only an end user.

comment:26 Changed 9 years ago by x190

Luzen: I think it would be helpful if you could produce a backtrace in gdb while the daemon is hanging.

comment:27 Changed 9 years ago by Luzen

Considering I was experiencing the exact same symptoms as kringe, and he already posted a backtrace, can't that be used for now? Rereading this thread, it looks like nobody even looked at his backtrace. I'd rather avoid, if I can, having to restart transmission, wait about a week for it to happen again, and then figure out how to do the backtrace thing.

comment:28 Changed 9 years ago by Luzen

libevent devs response:

I'm afraid debugging this is probably going to need some knowledge of Transmission internals.

Looking at the backtrace, the only threads that seem to me to be possibly blocked are the ones that are in pthread_mutex_wait(), but I don't know what might be causing that. Unless of course the nanosleep()ing threads are the cause.

The best approach for the developers to try to track this down might be to figure out what exactly libevent seems to be doing wrong when running with kqueue. Is there some event that isn't getting activated? If so, doing a dtrace with special attention to poll() or kqueue() events might show what's going on, but you'd probably need to cross-reference that with a log of events from transmission to see *which* fd is getting confused.

So is there more debug info you can enable so I can try and help track this down? I am already running the daemon with --log-debug.

comment:29 Changed 9 years ago by Luzen

Just got a hang, even though kqueue is disabled. Took two weeks for transmission to hang this time. Same thing in logs, last message is announcing to tracker.

comment:30 follow-up: Changed 9 years ago by x190

Luzen: Do you have problems accessing the internet when this issue occurs? Could this relate to "router saturation" as is frequently mentioned in the forum?

https://forum.transmissionbt.com/viewtopic.php?f=2&t=12752

Last edited 9 years ago by x190

comment:31 in reply to: ↑ 30 Changed 9 years ago by Luzen

Replying to x190:

Luzen: Do you have problems accessing the internet when this issue occurs? Could this relate to "router saturation" as is frequently mentioned in the forum?

No. In fact my connection is more responsive, since transmission isn't using any bandwidth at all. It is completely wedged.

comment:32 Changed 9 years ago by x190

For those experiencing this issue, please do some testing after making the following settings adjustments in settings.json. Stop the daemon before editing the settings file.

  • Test your available bandwidth and set speed limits to 60% of the tested value (in KB/s).
  • Lower global peer connections to 100 or less, ~10 per torrent, and ~10 active torrents.
  • Test without µTP and DHT.

Note: I'm not suggesting you should disable DHT permanently. This is only for testing to see if the volume or rate of connections has a bearing on this issue.
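To be concrete, the test configuration would be a settings.json fragment along these lines. Key names are per the EditConfigFiles wiki, the numbers are only placeholders (substitute 60% of your own tested bandwidth), and "download-queue-size" covers the "~10 active" part only if your build is new enough (2.40+) to have queue support:

    "speed-limit-down": 600,
    "speed-limit-down-enabled": true,
    "speed-limit-up": 60,
    "speed-limit-up-enabled": true,
    "peer-limit-global": 100,
    "peer-limit-per-torrent": 10,
    "download-queue-size": 10,
    "dht-enabled": false,
    "utp-enabled": false,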

comment:33 Changed 9 years ago by shellbound

The same problem here. The only workaround for reliable usage is a script that checks every 10 minutes and runs killall -9 transmission-daemon if transmission-remote doesn't respond (roughly the sketch below). But I think I have another fact to add. Not only does my log show that the last thing the daemon was doing before hanging was making announces, but when restarted, I see that those announces were for one or more new torrents that had just been added. When the daemon is restarted, I see the new torrent(s) are in a paused state with nothing yet downloaded.

@x190: tested with your suggestions and no change.
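The watchdog itself is nothing fancy -- roughly this, run from cron every 10 minutes (a from-memory sketch, not a verbatim copy of my script; gtimeout is GNU timeout from MacPorts coreutils, since stock OS X has no timeout command):

    #!/bin/sh
    # if the daemon doesn't answer an RPC listing within 30s, kill it hard and relaunch
    if ! gtimeout 30 transmission-remote localhost:9091 -l >/dev/null 2>&1; then
        killall -9 transmission-daemon
        sleep 5
        transmission-daemon
    fi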

comment:34 follow-up: Changed 9 years ago by x190

shellbound: Could you possibly run the daemon in gdb and produce a backtrace while it is hanging? Activity Monitor's "Sample Process" might be helpful if taken while the daemon is "Not Responding". Also, if you have Apple's Developer Tools installed, you could get a Shark trace while daemon is NR. https://trac.transmissionbt.com/wiki/Shark
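(For reference, the command-line equivalent of "Sample Process" is the sample tool that ships with OS X -- from memory, something like:

    $ sample transmission-daemon 30 -file ~/Desktop/transmission-sample.txt

which samples the process for 30 seconds and writes the report to the given file.)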

Since new torrents are involved and you find them in a paused state on restart, I suspect pre-allocation might be involved. Have you tried running the daemon without your "killall" script and just waiting for an extended period to see if the daemon will recover on its own? Please see #1753 for more info on this subject.

comment:35 in reply to: ↑ 34 Changed 9 years ago by shellbound

Replying to x190:

shellbound: Could you possibly run the daemon in gdb and produce a backtrace while it is hanging?

I can try, but re-reading this ticket it looks like the other backtrace was never examined by a developer. Would I be wasting my time?

Activity Monitor's "Sample Process" might be helpful if taken while the daemon is "Not Responding".

That looks easy- I can definitely do that one.

Also, if you have Apple's Developer Tools installed, you could get a Shark trace while daemon is NR. https://trac.transmissionbt.com/wiki/Shark

Shark doesn't exist anymore in newer versions (4+?) of Xcode. It's been replaced by something called Instruments, but the Shark instructions don't look relevant anymore. Can somebody update the instructions for using Instruments?

Have you tried running the daemon without your "killall" script and just waiting for an extended period to see if the daemon will recover on its own? Please see #1753 for more info on this subject.

I started using the script after repeatedly coming home to find transmission had been frozen for hours or days, and after finding this ticket. #1753 doesn't seem related, because these torrents were less than 1GB in size and there were only between 1 and 3 of them recently added (from a python program using the transmissionrpc package).

Changed 9 years ago by shellbound

comment:36 Changed 9 years ago by shellbound

I managed to catch it after it had been frozen for only 2.5 hours. Attaching the output of the Activity Monitor sampling. But the last entries in the log were announces for old torrents and there were no new torrents when I restarted the daemon, so this didn't have anything to do with preallocation.

comment:37 Changed 9 years ago by x190

Jordan or livings124, do you have any insight on this issue?

comment:38 Changed 9 years ago by jordan

The reason the first backtrace wasn't acted upon is that there's nothing in it that appears wrong. All the threads are in an idle state waiting for something to happen such as a timer saying that it's time to reannounce or make a new peer connection, etc.

If transmission isn't responding at all to anything, such as an RPC ping, that's probably an easier issue to investigate since one of the threads will probably show it's blocked on some task. However that /might/ also be a different issue than the one kringe was reporting... it's hard to know.

comment:39 Changed 9 years ago by x190

Thanks. Since shellbound is the current active complainant, what exact steps should he take to provide useful information?

comment:40 Changed 9 years ago by shellbound

Still active here. I had disabled my kill script because I thought I would be helping to debug the problem, and today found it frozen for 8 hours. Did the attached activity monitor sampling show anything? When this happens transmission does not respond to anything except kill -9. Repeating x190's question: what exact steps should I take to provide useful information?

comment:41 Changed 9 years ago by x190

My observations:

kringe's backtrace using r12959, Thread 2 (process 2564). Note: Line 332 in dht.c is "fflush(dht_debug);"

    #5 0x0000000100085ec0 in debugf (format=0x803 <Address 0x803 out of bounds>) at dht.c:332
       args = {{ gp_offset = 8, fp_offset = 48,
                 overflow_arg_area = 0x1005c9830, reg_save_area = 0x1005c9750 }}

(backtrace lines 105-110)

Also see line 32 of kringe's backtrace: ptr = 0x84 <Address 0x84 out of bounds>

==========================================================================

  • Both kringe's backtrace and shellbound's Activity Monitor sample indicate work being done in dht including dht_periodic followed by curl_multi_perform (See kringe's backtrace Thread 2 and Thread 3).
  • The switch(message) code in dht_periodic (dht.c r12959), which begins at Line 1926, looks a little problematic to me: the "case ANNOUNCE_PEER:" branch (Line 2086) does not appear to have a terminating "break;" should the conditionals all fail, and there is no "default:" case for the switch statement.
  • dht_periodic calls, at Line 2071:

    debugf("Sending nodes for get_peers.\n");
    send_closest_nodes(from, fromlen,
                       tid, tid_len, info_hash, want, 0, NULL, token, TOKEN_SIZE);

debugf at this point then goes out of bounds ("format=0x803 <Address 0x803 out of bounds>") according to kringe's backtrace (lines 105-110).

  • All three posters are running transmission-daemon on a mac, which is not a very common occurrence, I don't think.

kringe's backtrace shows an o.o.b. problem in daemon.c:542:

    str = {
        buf = "Ñ\000\000\000\000\000\000\000Ñ\000\000\000\000\000\000",
        ptr = 0x84 <Address 0x84 out of bounds>
    }

Here's an interesting article written by an IBM oldtimer about how oob errors can eventually cause problems, including hangs, on certain systems.

http://lists.linux.org.au/archives/linuxcprogramming/2003-August/msg00005.html

Last edited 9 years ago by x190

comment:42 Changed 9 years ago by shellbound

x190, that is great detective work! I have modified my local copy of third-party/dht/dht.c to incorporate your suggested fixes and for the past 4 days I have had no problem (I would normally expect to see the problem every couple of days). I will report back if I see the problem again in the future.

comment:43 Changed 9 years ago by shellbound

had another hang just now, with the last debug log message being an announce. to help narrow this down, i'm disabling several options (dht, pex, lpd, utp) for now and will provide another activity monitor sample when this happens again.

Changed 9 years ago by shellbound

comment:44 follow-ups: Changed 9 years ago by shellbound

attaching a sample of transmission-daemon hanging, with dht, pex, lpd, utp disabled. i think viewing the web interface triggered it this time.

comment:45 in reply to: ↑ 44 ; follow-up: Changed 9 years ago by x190

Replying to shellbound:

attaching a sample of transmission-daemon hanging, with dht, pex, lpd, utp disabled. i think viewing the web interface triggered it this time.

Thanks for posting. What version of transmission-daemon are you using? Is "Path: /usr/local/bin/transmission-daemon" the correct path? I ask this because that path does not exist on Snow Leopard but perhaps Lion is different? You say you disabled DHT for this test, but your "sample2" shows the same DHT callstack. Please read the following carefully for info on how to edit your settings.json correctly. Note that the daemon must be stopped before editing or changes will be overwritten.

https://trac.transmissionbt.com/wiki/EditConfigFiles

comment:46 in reply to: ↑ 44 ; follow-up: Changed 9 years ago by rb07

Replying to shellbound:

i think viewing the web interface triggered it this time.

This, and the description, sounds very close to what happens on NASes.

I wouldn't expect it to be happening on Mac OSX, but... The problem in NASes is libevent; we have to set the environment variable EVENT_NOEPOLL=1 (which disables the use of epoll, something Linux has). The problem also changed between versions of libevent -- at least I think it did: it went away (with the latest 1.4.x versions), then came back (with the 2.0.x versions).

So you could test with the latest libevent (2.0.19, released this week), and you can test with epoll disabled. If I remember correctly some Mac users do disable kqueue, which is another option, but it's better to research or ask the libevent forum whether there are any known problems on Mac OS X.

And the simple test case is to access the Web client: 100% of the time it freezes the daemon immediately if libevent is not working right. BTW you don't have to restart the daemon; a simple "killall -HUP transmission-daemon" (or whatever is available in Mac OSX, pkill maybe?) does un-freeze it -- I used to have a script running like a watchdog to wake up the daemon.

Last edited 9 years ago by rb07

comment:47 in reply to: ↑ 45 ; follow-up: Changed 9 years ago by shellbound

Replying to x190:

Replying to shellbound:

attaching a sample of transmission-daemon hanging, with dht, pex, lpd, utp disabled. i think viewing the web interface triggered it this time.

Thanks for posting. What version of transmission-daemon are you using?

2.51 (13294)

Is "Path: /usr/local/bin/transmission-daemon" the correct path? I ask this because that path does not exist on Snow Leopard but perhaps Lion is different?

i think you're right that the directory doesn't exist by default. but i'm installing transmission from svn on a unix-like system, and /usr/local is the traditional location for user-installed software.

You say you disabled DHT for this test, but your "sample2" shows the same DHT callstack. Please read the following carefully for info on how to edit your settings.json correctly. Note that the daemon must be stopped before editing or changes will be overwritten.

https://trac.transmissionbt.com/wiki/EditConfigFiles

i disabled the settings through the web interface, but apparently that doesn't work. i'll edit the config file now for next time.

comment:48 in reply to: ↑ 46 Changed 9 years ago by shellbound

Replying to rb07:

So you could test with the latest libevent (2.0.19, released this week), and you can test with epoll disabled. If I remember correctly some Mac users do disable kqueue, which is another option, but it's better to research or ask the libevent forum whether there are any known problems on Mac OS X.

osx doesn't even have epoll, so that can't be the problem. i tried disabling kqueue in the past as suggested by Luzen earlier in the thread, but that didn't help.

And the simple test case is to access the Web client, 100% of the time freezes the daemon immediately if libevent is not working right.

never had a problem with the web client before, and had been using the web client several times before the hang. but macports did just update libevent to 2.0.18. whatever the problem is, nothing seems to trigger it immediately; that's why it's annoying and hard to debug.

comment:49 in reply to: ↑ 47 Changed 9 years ago by x190

Replying to shellbound:

i disabled the settings through the web interface, but apparently that doesn't work. i'll edit the config file now for next time.

Also recheck comment:32 while you're at it.

comment:50 Changed 9 years ago by shellbound

after disabling dht, pex, lpd and utp, i experienced no problems. after re-reading the above comments, dht seemed to be the most likely suspect, so i re-enabled pex, lpd, and utp, and kept dht disabled. still no problems. so i think that pretty much validates the assumption that the dht code is to blame for this.

comment:51 follow-up: Changed 9 years ago by x190

Thanks for reporting back. I don't think the dht code is the problem as there would be tonnes of reports, if it was. I think, however, as I've mentioned several times in this ticket, that your router and/or OS is having trouble with the volume of connections. comment:32 is a good place to start, only in your case, I would cut those numbers in half. Since DHT does create a lot of connections, and since I think you know how to edit source code and build, try changing:

/* max number of peers to ask for per second overall.
 * this throttle is to avoid overloading the router */
MAX_CONNECTIONS_PER_SECOND = 12,

in libtransmission/peer-mgr.c to 6 or less as well.

comment:52 in reply to: ↑ 51 Changed 9 years ago by shellbound

Replying to x190:

Thanks for reporting back. I don't think the dht code is the problem as there would be tonnes of reports, if it was.

the 3 of us who have made the effort to report on this thread are all, according to you as remarked in comment:41, using an uncommon combination of transmission-daemon and osx.

I think, however, as I've mentioned several times in this ticket, that your router and/or OS is having trouble with the volume of connections. comment:32 is a good place to start, only in your case, I would cut those numbers in half. Since DHT does create a lot of connections, and since I think you know how to edit source code and build, try changing:

my router has no issues when the problem occurs. all other network programs function fine and i've never experienced similar issues with other torrent clients. remember, transmission itself hangs and stops responding, not the network, not the OS, etc.

comment:53 follow-up: Changed 9 years ago by x190

What do you see using "lsof -p <daemon process number> | wc -l" before reaching a NR state? Reference: https://trac.transmissionbt.com/ticket/3504#comment:39.
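For example (a sketch; substitute however you prefer to find the pid):

    $ ps ax | grep transmission-daemon      # note the pid
    $ lsof -p <pid> | wc -l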

comment:54 Changed 9 years ago by shellbound

it's 70 with dht disabled. after re-enabling dht (and reloading the settings via SIGHUP) and downloading a public torrent for a while (the rest of my 50 active torrents are all private), the number of open files didn't change much -- it jumped to 76, then back down to 71. now that dht is enabled i expect it to hang within the week; i'll check lsof on the process after it happens (can't predict when that will be, so checking the number before doesn't make too much sense).

comment:55 Changed 9 years ago by x190

Did you read #3504, comments 42 and 43, and do they apply?

comment:56 Changed 9 years ago by shellbound

don't think it applies because my libcurl is compiled with --enable-ares

comment:57 Changed 9 years ago by gvdl

  • Cc gvdl@… added

comment:58 in reply to: ↑ 53 Changed 9 years ago by shellbound

Replying to x190:

What do you see using "lsof -p <daemon process number> | wc -l" before reaching a NR state?

just found transmission-daemon hanging again. it wasn't holding open any files and i was randomly checking the count before the hang and it never got above 80, so that theory doesn't seem to be supported. i'm going to disable dht for now, because i think there is enough circumstantial evidence pointing to it as the culprit.

comment:59 Changed 9 years ago by Luzen

I decided to give transmission another try after seeing some movement on this ticket. After disabling DHT, the problem is gone.

comment:60 Changed 9 years ago by shellbound

so after 3 months with dht disabled (and no other change), not a single recurrence of this problem. the conclusion is there's a bug in dht that causes it.

comment:61 Changed 8 years ago by Bugmenot

I was about to switch clients because I was having the same problem with the daemon on my iMac, but disabling DHT fixed it for me too.

comment:62 Changed 8 years ago by Bugmenot

I started experiencing hangs as well. The daemon would just stop responding and the log no longer updated. iotop showed transmission had no i/o whatsoever, so it was truly hanging.

I think I figured out the cause. I had recently upgraded the curl library, though only to a minor point version. Later, when I recompiled transmission, the hangs stopped. Is this expected behavior? Is it possible there was some incompatibility between the version of curl that transmission was compiled against and a minor version upgrade? If so, is it possible to get an error instead and have it die?

comment:63 Changed 8 years ago by helpmoi

Have this problem several times a day. Compiled transmission on a fresh OS install, and no libraries have changed since it was compiled. I'm also using the latest source from subversion.

The bug is 2 years old and it's not considered important?

comment:64 follow-up: Changed 8 years ago by jordan

This bug is still open & unresolved because the developers aren't seeing this behavior and because there doesn't seem to be any consistent way to repeat it. Some reporters say that DHT is the culprit, others say that an old version of libcurl is to blame.

helpmoi, do you see the behavior if DHT is disabled?

comment:65 in reply to: ↑ 64 Changed 8 years ago by helpmoi

Replying to jordan:

This bug is still open & unresolved because the developers aren't seeing this behavior and because there doesn't seem to be any consistent way to repeat it. Some reporters say that DHT is the culprit, others say that an old version of libcurl is to blame.

helpmoi, do you see the behavior if DHT is disabled?

Two days of stability after disabling DHT. Thanks.

P.S. I read the bit about curl in previous posts to mean that curl wasn't necessarily an old version, just that curl had been updated to a slightly newer version than the one that transmission was compiled with.

comment:66 follow-up: Changed 8 years ago by x190

Same issue? Only this time on Linux.

https://forum.transmissionbt.com/viewtopic.php?f=9&t=15001

Worth a try: comment:51.

comment:67 in reply to: ↑ 66 Changed 8 years ago by helpmoi

Replying to x190:

Same issue? Only this time on Linux.

https://forum.transmissionbt.com/viewtopic.php?f=9&t=15001

Worth a try: comment:51.

That doesn't seem to be the same problem. Everybody else in this ticket experiences transmission becoming completely unresponsive, without any issue with the router or operating system. The only mention of router saturation in this ticket is by you, and several others have confirmed that it is not an issue. I can add my own confirmation to that tally.

comment:68 Changed 7 years ago by mike.dld

Let me add another two cents. I looked over the backtraces provided and even wrote a small program in an attempt to reproduce the issue, but had no luck.

Anyway, the idea was that there is some kind of deadlock (suggested by the libevent developers in comment:28 as well). What I noticed from the third-party/dht/dht.c code is that the debugf() function uses the dht_debug variable, which is initially set to NULL and is then set to a valid file handle in libtransmission/tr-dht.c in case the TR_DHT_VERBOSE environment variable is set. So unless this environment variable is set, and I don't think it is for the people in this ticket, it's safe to assume that at the time the issue is reproduced dht_debug is NULL. This means that in the debugf() function, fflush() is being called with a NULL argument, and man 3 fflush says that

If the stream argument is NULL, fflush() flushes all open output streams.

Now I'm saying that in this particular case fflush() shouldn't be called at all, since clearly flushing all the open files is not the intention.

Could someone experiencing the issue check if removing (for the sake of testing) fflush(dht_debug); (third-party/dht/dht.c, line ~332) helps in any way?
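For reference, the relevant code is shaped roughly like this (paraphrased from memory, not a verbatim copy of dht.c):

    #include <stdarg.h>
    #include <stdio.h>

    FILE *dht_debug = NULL;   /* only set to a log file by libtransmission/tr-dht.c
                                 when TR_DHT_VERBOSE is in the environment */

    static void
    debugf(const char *format, ...)
    {
        va_list args;
        va_start(args, format);
        if(dht_debug)
            vfprintf(dht_debug, format, args);
        va_end(args);
        fflush(dht_debug);    /* called even when dht_debug is NULL, which per the
                                 man page means flushing *all* open output streams */
    }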

comment:69 follow-up: Changed 7 years ago by x190

"Now I'm saying that in this particular case fflush() shouldn't be called at all, since clearly flushing all the open files is not the intention." Or do: Line 332 dht.c

if (dht_debug != NULL)
  fflush(dht_debug);

comment:70 Changed 7 years ago by mike.dld

This is the idea for an upstream patch, yes. Actually, even

if (dht_debug == NULL)
  return;

at the beginning of the function. But we still need to know whether it helps in this particular case or not.

comment:71 Changed 7 years ago by Bugmenot

I'm running the dht patch now. I'll report back soon.

comment:72 follow-up: Changed 7 years ago by Bugmenot

Tested for 4 days and not a single hang. mike.dld, great job on solving a 2-year-old bug!

comment:73 in reply to: ↑ 69 Changed 7 years ago by jch

Replying to x190:

Line 332 dht.c

if (dht_debug != NULL)
  fflush(dht_debug);

Yeah, that's a bug, although it's unlikely to cause the issue at hand. I'll fix it upstream.

(Note that I have a number of minor fixes to the DHT code that have accumulated, but none that would explain the issue you're having.)

--jch

comment:74 in reply to: ↑ 72 ; follow-up: Changed 7 years ago by jch

Replying to Bugmenot:

Tested for 4 days and not a single hang.

If this patch fixes your issue, then it's only hiding a real bug that's somewhere else.

--jch

comment:75 in reply to: ↑ 74 ; follow-up: Changed 7 years ago by mike.dld

Replying to jch:

If this patch fixes your issue, then it's only hiding a real bug that's somewhere else.

Clearly this ticket in its current state didn't help anyone (interested enough) much in finding the root cause. At least we might be able to get other stacktraces after this small patch is applied...

comment:76 Changed 7 years ago by cfpp2p

my current opinion on the issue is:

1) that it is somehow related to #1753
2) that it is somehow related to running the uncommon transmission daemon on MAC
3) that it is somehow related to the user's network setup
4) that it is somehow related to dht and router saturation (well documented here https://forum.transmissionbt.com/viewtopic.php?f=9&t=15001#p65662 as well as several other posts where disabling dht alleviated network issues)
5) mike.dld's suggested test patch https://github.com/mikedld/dht/commit/e972d80072fa3bda3de03f0932b0a5feb1f526a2 should help sort out how IO-related issues play into it all
6) this is an obscure bug

comment:77 follow-up: Changed 7 years ago by jordan

FWIW, I've committed the fflush fix in r14232.

jch, I don't see any other of the upstream changes you mentioned -- are they not in the master branch at git://git.wifi.pps.univ-paris-diderot.fr/dht-bootstrap ?

comment:78 in reply to: ↑ 75 Changed 7 years ago by Bugmenot

Replying to mike.dld:

Replying to jch:

If this patch fixes your issue, then it's only hiding a real bug that's somewhere else.

Clearly this ticket in its current state didn't help anyone (interested enough) much in finding the root cause. At least we might be able to get other stacktraces after this small patch is applied...

Still no hang after 3 weeks.

comment:79 Changed 7 years ago by x190

Bugmenot: How often did you get hangs pre r14232? Can we mark this one as fixed by r14232?

comment:80 Changed 7 years ago by cfpp2p

thanks Bugmenot. :)

it somehow seems/might be related to a compiler bug, but I'm very glad it's finally fixed. :) For information on the possibility of its relationship to a compiler bug, see https://github.com/jech/dht/pull/2#issuecomment-32739372

comment:81 in reply to: ↑ 77 Changed 7 years ago by jch

Replying to jordan:

jch, I don't see any other of the upstream changes you mentioned -- are they not in the master branch at git://git.wifi.pps.univ-paris-diderot.fr/dht-bootstrap ?

The dht-bootstrap tree is a bootstrap server and is unrelated to Transmission; it's not the dht library, which is in the dht repository.

However, most of the changes are in libtransmission itself, they're mostly changes to timeouts to make the dht a little more aggressive (I was overly cautious when I first wrote the code, and I've learnt a lot about the behaviour of Transmission since then). I'd like to commit them myself (in which case you'll need to send me a new password), or I can send you a patch series if you promise to commit them under my name.

-- Juliusz

comment:82 Changed 7 years ago by livings124

  • Milestone changed from None Set to 2.83
  • Resolution set to fixed
  • Status changed from reopened to closed

Finally closing this one out.
