Opened 12 years ago
Closed 12 years ago
#3196 closed Bug (fixed)
Daemon hangs on ARM based NAS
Reported by: | rb07 | Owned by: | |
---|---|---|---|
Priority: | Normal | Milestone: | None Set |
Component: | Daemon | Version: | 1.93 |
Severity: | Normal | Keywords: | daemon ARM NAS |
Cc: | nhfnc@… |
Description
Versions 1.92 and 1.93 of the daemon stop working (time varies from hours to days), become totally unresponsive: no upload/download, no RPC (tested with transmission-remote, Web client, Qtr).
Environment: Native build on an HP MediaVault, Linux HPMediaVault 2.6.12.6-arm1 #1 Fri Feb 8 15:45:19 MST 2008 armv5tejl GNU/Linux, using gcc 3.4.4, libcurl 7.19.7, libevent 1.4.13, openssl 0.9.8j .
The process and its threads are still 'active', just doing nothing useful; low CPU usage, a strace snippet:
stat64("/opt/var/transmission/torrentwatch/", {st_mode=S_IFDIR|S_ISGID|0755, st_size=17592186044416, ...}) = 0 open("/opt/var/transmission/torrentwatch/", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 252 fstat64(252, {st_mode=S_IFDIR|S_ISGID|0755, st_size=17592186044416, ...}) = 0 fcntl64(252, F_SETFD, FD_CLOEXEC) = 0 getdents64(252, /* 2 entries */, 4096) = 48 getdents64(252, /* 0 entries */, 4096) = 0 close(252) = 0 time(NULL) = 1273248797 nanosleep({1, 0}, NULL) = 0 time(NULL) = 1273248798 nanosleep({1, 0}, NULL) = 0 t ... (repeats over and over)
Attaching GDB and looking at what its doing:
(gdb) info threads 3 Thread 0x40d31520 (LWP 5748) 0x403f3e10 in read () from /opt/lib/libpthread.so.0 2 Thread 0x416fa520 (LWP 5786) 0x404b8adc in select () from /lib/libc.so.6 * 1 Thread 0x40530b00 (LWP 5747) 0x40490574 in nanosleep () from /lib/libc.so.6 (gdb) bt #0 0x40490574 in nanosleep () from /lib/libc.so.6 #1 0x404b91bc in usleep () from /lib/libc.so.6 #2 0x0000bd58 in main (argc=2492072, argv=0x23c778) at daemon.c:463 (gdb) thread 2 [Switching to thread 2 (Thread 0x416fa520 (LWP 5786))]#0 0x404b8adc in select () from /lib/libc.so.6 (gdb) bt #0 0x404b8adc in select () from /lib/libc.so.6 #1 0x416f9e00 in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 0x40d31520 (LWP 5748))]#0 0x403f3e10 in read () from /opt/lib/libpthread.so.0 (gdb) bt #0 0x403f3e10 in read () from /opt/lib/libpthread.so.0 #1 0x00023b90 in readFromPipe (fd=3, eventType=2, veh=0x698c8) at trevent.c:164 #2 0x400254c8 in event_base_loop (base=0x6ab70, flags=0) at event.c:385 #3 0x00023dc8 in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230 #4 0x00014504 in ThreadFunc (_t=0x69958) at platform.c:109 #5 0x403ee0a8 in start_thread (arg=0xfffffe00) at pthread_create.c:263 #6 0x404bf7e0 in clone () from /lib/libc.so.6 Backtrace stopped: frame did not save the PC
I've read the other similar reports, the main difference is I have had no crash at all, don't cross-compile and don't use third-party packages except for the compiler.
The fact is up to version 1.91 everything works fine (I did rebuild 1.91 after 1.92 started doing this and changed with no problem until 1.93 came out and I changed again).
Any ideas about how to debug this are welcome (but a better place to just discuss ideas is the forum, this tracker should be used for hard data only).
Change History (29)
comment:1 Changed 12 years ago by rb07
comment:2 Changed 12 years ago by rb07
Second freeze on version 1.93, about an hour ago after a little more than a day working normally, looks identical to the first:
(gdb) thread apply all bt Thread 3 (Thread 0x40d31520 (LWP 7064)): #0 0x403f3e10 in read () from /opt/lib/libpthread.so.0 #1 0x00023b90 in readFromPipe (fd=3, eventType=2, veh=0x698c8) at trevent.c:164 #2 0x400254c8 in event_base_loop (base=0x6ab70, flags=0) at event.c:385 #3 0x00023dc8 in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230 #4 0x00014504 in ThreadFunc (_t=0x69958) at platform.c:109 #5 0x403ee0a8 in start_thread (arg=0xfffffe00) at pthread_create.c:263 #6 0x404bf7e0 in clone () from /lib/libc.so.6 Backtrace stopped: frame did not save the PC Thread 2 (Thread 0x416fa520 (LWP 7067)): #0 0x404b8adc in select () from /lib/libc.so.6 #1 0x416f9e00 in ?? () Thread 1 (Thread 0x40530b00 (LWP 7063)): #0 0x40490574 in nanosleep () from /lib/libc.so.6 #1 0x404b91bc in usleep () from /lib/libc.so.6 #2 0x0000bd58 in main (argc=0, argv=0xa6c98) at daemon.c:463 #0 0x40490574 in nanosleep () from /lib/libc.so.6 (gdb) info share From To Syms Read Shared Object Library 0x40023c70 0x40031d68 Yes /opt/lib/libevent-1.4.so.2 0x40047bd8 0x40084898 Yes /opt/lib/libcurl.so.4 0x40091b84 0x40099f08 Yes /opt/lib/libidn.so.11 0x400cf46c 0x400f552c Yes /opt/lib/libssh2.so.1 0x4010a358 0x4012a984 Yes /opt/lib/libldap-2.4.so.2 0x40138460 0x4013cde8 Yes /opt/lib/liblber-2.4.so.2 0x40149244 0x40153328 Yes /opt/lib/libresolv.so.2 0x4016316c 0x40172b9c Yes /opt/lib/libsasl2.so.2 0x4017edb0 0x40183ab0 Yes /opt/lib/librt.so.1 0x401a3cf8 0x401cec04 Yes /opt/lib/libssl.so.0.9.8 0x402146b8 0x402ccef0 Yes /opt/lib/libcrypto.so.0.9.8 0x4030da44 0x4030e5dc Yes /opt/lib/libdl.so.2 0x4031a63c 0x40324eec Yes /opt/lib/libz.so.1 0x4032d200 0x4039911c Yes /opt/lib/libm.so.6 0x403d9f40 0x403dfe38 Yes /opt/lib/libgcc_s.so.1 0x403ecd30 0x403f60a4 Yes /opt/lib/libpthread.so.0 0x4041b5d0 0x404f5928 Yes /lib/libc.so.6 0x4051f278 0x405264b0 Yes /opt/lib/liblber-2.3.so.0 0x400006e0 0x40013160 Yes /lib/ld-linux.so.3
Which one looks more suspicious, the select() or the read() ? It must be blocking indefinitely on one or both of them.
comment:3 Changed 12 years ago by charles
Neither of these look suspicious to me.
comment:4 Changed 12 years ago by Aeron
+1, confirm this issue. I have WD World Edition 1TB (WhiteLight?) NAS with installed thru IPKG Transmission 1.93 with libcurl 7.20.1 and openssl 1.0.0 (which was cross-compiled and packed in proper ipk’s by my own). WD NAS environment is Linux 2.6.24.4 armv5tejl. The problem is the same. After starting transmission the «top» shows me 1 active (running) process of transmission-daemon and 3 inactive (sleeping) processes. Further it can hang in any moment… in a couple of minutes or in a couple of hours, randomly. Logs just can’t show anything useful except usual transmission routine. After it hang the «top» shows 4 inactive daemon processes.
The 1.92 has the same issue, but it more stable in work, it can works without hangs for hours and, sometimes, days.
Transmission and both libs was cross-compiled and packed on CentOS (Linux 2.6.18-128.el5 x86_64 GNU/Linux). I can give links to ipk’s and cross-compiles if it can help.
comment:5 follow-up: ↓ 6 Changed 12 years ago by charles
What happens to the CPU when transmission freezes?
Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?
comment:6 in reply to: ↑ 5 ; follow-ups: ↓ 7 ↓ 15 Changed 12 years ago by rb07
Replying to charles:
What happens to the CPU when transmission freezes?
Nothing unusual, the daemon uses very low CPU, all other processes (I have a mail server, ftp server, etc.) continue working normally.
Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?
The gdb output above is all you get, I've done continue, let it alone for a few minutes, Ctrl-C and the threads are in the exact same state. That's why I wrote that it must be blocking on select() and read(); the select is probably from libevent.
If you meant you want to interact with gdb, no problem, let me know what you need.
comment:7 in reply to: ↑ 6 Changed 12 years ago by Aeron
Replying to rb07:
Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?
I think this part of message is for me.
Replying to charles:
Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?
Yep, i’m working on it. Just waiting for daemon hangs.
Replying to charles:
What happens to the CPU when transmission freezes?
CPU is just fine, no overloads or something like that. If daemon hangs, it’s processes CPU usage is 0%.
comment:8 Changed 12 years ago by Aeron
Ok, done:
~ # /opt/bin/gdb dlopen failed on 'libthread_db.so.1' - libthread_db.so.1: cannot open shared object file: No such file or directory GDB will not be able to debug pthreads. GNU gdb 6.8 Copyright (C) 2008 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "arm-none-linux-gnueabi". (gdb) file /usr/local/bin/transmission-daemon Reading symbols from /usr/local/bin/transmission-daemon...done. (gdb) run -g /root/.config/transmission-daemon/ Starting program: /usr/local/bin/transmission-daemon -g /root/.config/transmission-daemon/ BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx' Program exited normally. (gdb) attach 6332 Attaching to program: /usr/local/bin/transmission-daemon, process 6332 warning: process 6332 is a cloned process BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' Loaded symbols for /lib/libnsl.so.1 Loaded symbols for /lib/libresolv.so.2 Loaded symbols for /lib/libcurl.so.4 Loaded symbols for /lib/librt.so.1 Loaded symbols for /lib/libm.so.6 Loaded symbols for /lib/libgcc_s.so.1 Loaded symbols for /lib/libpthread.so.0 Loaded symbols for /lib/libc.so.6 Loaded symbols for /lib/ld-linux.so.3 Reading symbols from /lib/libnss_files.so.2...done. Loaded symbols for /lib/libnss_files.so.2 Reading symbols from /lib/libnss_dns.so.2...done. Loaded symbols for /lib/libnss_dns.so.2 0x400ffb9c in bn_mul_add_words () from /lib/libcurl.so.4 (gdb) bt #0 0x400ffb9c in bn_mul_add_words () from /lib/libcurl.so.4 #1 0x40165684 in bn_mul_normal () from /lib/libcurl.so.4 #2 0x40165574 in BN_mul () from /lib/libcurl.so.4 #3 0x40168938 in BN_mod_mul_montgomery () from /lib/libcurl.so.4 #4 0x401635ac in BN_mod_exp_mont_consttime () from /lib/libcurl.so.4 #5 0x40162c0c in BN_mod_exp_mont () from /lib/libcurl.so.4 #6 0x40108674 in dh_bn_mod_exp () from /lib/libcurl.so.4 #7 0x401083d8 in generate_key () from /lib/libcurl.so.4 #8 0x40108264 in DH_generate_key () from /lib/libcurl.so.4 #9 0x0002dd4c in ensureKeyExists (crypto=0x484020) at crypto.c:139 #10 0x0002e0a8 in tr_cryptoGetMyPublicKey (crypto=0x484020, setme_len=0xbe7ff774) at crypto.c:215 #11 0x0004a200 in tr_handshakeNew (io=0x347ab8, encryptionMode=TR_ENCRYPTION_PREFERRED, doneCB=0x369bc <myHandshakeDoneCB>, doneUserData=0x79ac8) at handshake.c:335 #12 0x00039fd8 in reconnectPulse (foo=388177907, bar=-26536, vmgr=0x6db392eb) at peer-mgr.c:2787 #13 0x0005bb50 in event_base_loop (base=0x78618, flags=0) at event.c:385 #14 0x0002364c in libeventThreadFunc (veh=0x17231ff3) at trevent.c:230 #15 0x000140d0 in ThreadFunc (_t=0x7a890) at platform.c:109 #16 0x402dce84 in ?? () from /lib/libpthread.so.0 (gdb) info share From To Syms Read Shared Object Library 0x40021304 0x4002c7d8 Yes /lib/libnsl.so.1 0x4003b2b0 0x40046204 Yes /lib/libresolv.so.2 0x400a95c8 0x401ae308 Yes /lib/libcurl.so.4 0x4020aab0 0x4020fed0 Yes /lib/librt.so.1 0x402271e8 0x40286f0c Yes /lib/libm.so.6 0x402c8df8 0x402cea10 Yes /lib/libgcc_s.so.1 0x402dab20 0x402e481c Yes /lib/libpthread.so.0 0x40343750 0x404171fc Yes /lib/libc.so.6 0x400005f0 0x40011ab8 Yes /lib/ld-linux.so.3 0x40437b14 0x4043e46c Yes /lib/libnss_files.so.2 0x40448d1c 0x4044ada0 Yes /lib/libnss_dns.so.2
Sorry, but I never used GDB before. May be I doing it wrong.
P.S.: 6332 — it’s an active daemon process PID
comment:9 Changed 12 years ago by nhfnc
- Cc nhfnc@… added
This issue happens to me as well. As Aeron I am running this on a WD World Edition White light box. Compiled it using WD tool chain (followed instructions on this page: http://bengreen.eu/pmwiki/pmwiki.php?n=Linux.CrossCompileMyBookWorldWhiteLight) on a Ubuntu 9.04 box and just copied the executables over to the WD box. By the way, the 1.93 seems to have fixed a previous issue I had http://trac.transmissionbt.com/ticket/2842, but now I have this new issue instead... Unsure if I can get any gdb traces from my setup, but i am happy to try anything you guys may want me to do.
comment:10 follow-up: ↓ 11 Changed 12 years ago by charles
Aeron: what happens if you force encryption to be turned off via settings.json?
comment:11 in reply to: ↑ 10 Changed 12 years ago by Aeron
Replying to charles:
Aeron: what happens if you force encryption to be turned off via settings.json?
All the same. Daemon just hangs as usually.
GDB output of hanged active daemon process not seems to be very useful:
(gdb) attach 8298 Attaching to program: /usr/local/bin/transmission-daemon, process 8298 warning: process 8298 is a cloned process BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx' BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libcurl.so.4...done. Loaded symbols for /lib/libcurl.so.4 BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx' Reading symbols from /lib/librt.so.1...done. Loaded symbols for /lib/librt.so.1 Reading symbols from /lib/libm.so.6...done. Loaded symbols for /lib/libm.so.6 BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx' Reading symbols from /lib/libgcc_s.so.1...done. Loaded symbols for /lib/libgcc_s.so.1 Reading symbols from /lib/libpthread.so.0...done. Loaded symbols for /lib/libpthread.so.0 BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx' Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx' Reading symbols from /lib/ld-linux.so.3...done. Loaded symbols for /lib/ld-linux.so.3 Reading symbols from /lib/libnss_files.so.2...done. Loaded symbols for /lib/libnss_files.so.2 Reading symbols from /lib/libnss_dns.so.2...done. Loaded symbols for /lib/libnss_dns.so.2 0x402e33b4 in read () from /lib/libpthread.so.0 (gdb) bt #0 0x402e33b4 in read () from /lib/libpthread.so.0 #1 0x00023410 in readFromPipe (fd=4, eventType=-1345, veh=0x78050) at trevent.c:164 #2 0x0005bb50 in event_base_loop (base=0x78618, flags=0) at event.c:385 #3 0x0002364c in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230 #4 0x000140d0 in ThreadFunc (_t=0x7a890) at platform.c:109 #5 0x402dce84 in ?? () from /lib/libpthread.so.0
comment:12 Changed 12 years ago by rb07
I noticed that the problem is repeatable, just like some people have been saying on the forum (and NAS related forums) but in many cases they reported crashes not unresponsiveness: With one or more torrents downloading at 'high' speed the daemon becomes unresponsive whiting an hour.
Of course 'high' is relative, in my case around 200 KB/s (that's kilo bytes).
On another front, bad news, Valgrind doesn't work on this NAS, it compiles (only the svn version has ARM support) but it crashes on any real work.
I made a 'diff' between the version that works, 1.91, and 1.93, and the changes on the deamon are few but I didn't notice anything that could be the cause on a quick review.
comment:13 Changed 12 years ago by charles
I wonder if maybe the libevent web server socket is closing somehow.
comment:14 Changed 12 years ago by nhfnc
In my case I am compiling using the built-in libevent. I will try to compile again with a "normal" libevent.
comment:15 in reply to: ↑ 6 Changed 12 years ago by rb07
Replying to rb07:
Replying to charles:
What happens to the CPU when transmission freezes?
Nothing unusual, the daemon uses very low CPU, all other processes (I have a mail server, ftp server, etc.) continue working normally.
Correction: It actually is using no CPU, stays at 0%.
Another correction: It does write one line to the log when I stop the daemon, the one that says "Closing session".
I tried testing if a socket closes unexpectedly by enabling SIGPIPE handling, but the daemon had problems starting up.
comment:16 follow-up: ↓ 17 Changed 12 years ago by rb07
Another interesting detail: The tracker shows twice the number of real torrents after restarting an unresponsive daemon.
As I said above, the unresponsive daemon writes "Closing session" but according to the tracker it never did w.r.t. that one tracker. --- False alarm: Its a problem in the tracker, not transmission. I just tested a normal daemon shutdown and the tracker doesn't update its info, after start the tracker shows twice the torrents.
comment:17 in reply to: ↑ 16 ; follow-up: ↓ 18 Changed 12 years ago by Aeron
Replying to rb07:
Another interesting detail: The tracker shows twice the number of real torrents after restarting an unresponsive daemon.
It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.
comment:18 in reply to: ↑ 17 ; follow-up: ↓ 19 Changed 12 years ago by rb07
Replying to Aeron:
It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.
No, there is only one daemon running.
In fact I don't see how you could run more than one daemon since it binds to specific ports (yes, I know how to start more than one daemon with different configurations).
Your comment makes no sense, if it "wasn't restarted" then how could you assume "the daemon is running twice"? and yes, it was restarted.
comment:19 in reply to: ↑ 18 Changed 12 years ago by Aeron
Replying to rb07:
Replying to Aeron:
It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.
No, there is only one daemon running.
In fact I don't see how you could run more than one daemon since it binds to specific ports.
It is strange, but it’s true. My NAS shows me the daemon as a four separate processes (may be 5, then one of the torrents verifying his data for example)(every thread of the daemon it’s a process in my case (haven’t libthread_db or not support?)), if I run him again, it’s duplicate daemon processes (up to 8) and trackers shows me double torrents count.
comment:20 Changed 12 years ago by rb07
Threads are not processes, and you can run several daemons because you don't bind to a specific port (except probably the 9091 which means the 2nd daemon can't be reached through RPC).
The point here is I know what I'm doing, you're only adding noise. This comment is noise, doesn't add anything useful, I'm stopping the noise on my end.
comment:21 Changed 12 years ago by rb07
I have been using version 2.00b1 for the last 5 days, no problem until today at around 6:00 AM the daemon became unresponsive... I don't know exactly what it was doing but it wasn't much: 2 active torrents, one downloading slowly, one seeding; 7 torrents total, 2 of them paused.
About 9:00 AM I first tried to contact the daemon on the NAS from my PC using QTransmission (also version 2.00b1) with no response. Then I tried to get some info with netstat and lsof, some of the results might be interesting:
$ netstat -np Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name tcp 0 0 192.168.10.8:2310 87.217.160.43:53192 SYN_RECV - tcp 0 0 192.168.10.8:2310 114.253.4.110:64400 SYN_RECV - tcp 0 0 192.168.10.8:2310 87.210.124.204:2725 SYN_RECV - tcp 0 0 192.168.10.8:2310 69.142.163.198:4347 SYN_RECV - tcp 0 0 192.168.10.8:2310 92.0.237.99:2007 SYN_RECV - tcp 0 0 192.168.10.8:2310 76.125.5.2:1125 SYN_RECV - tcp 0 0 192.168.10.8:2310 82.236.234.66:62449 SYN_RECV - tcp 0 0 192.168.10.8:2310 71.15.83.169:55886 SYN_RECV - tcp 0 0 192.168.10.8:2310 78.69.176.59:63375 SYN_RECV - tcp 0 0 192.168.10.8:2310 82.39.168.84:1439 SYN_RECV - tcp 0 0 192.168.10.8:2310 71.108.106.125:50178 SYN_RECV - tcp 253 0 127.0.0.1:9091 127.0.0.1:3058 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:1858 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:1200 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:4896 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:4192 CLOSE_WAIT - tcp 80272 0 192.168.10.8:2310 88.149.180.60:58218 ESTABLISHED 12825/transmission- tcp 176 0 192.168.10.8:9091 192.168.10.2:1053 ESTABLISHED - tcp 253 0 127.0.0.1:9091 127.0.0.1:3717 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:3003 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:4715 CLOSE_WAIT - tcp 58387 0 192.168.10.8:2310 70.185.250.55:3934 ESTABLISHED 12825/transmission- tcp 253 0 127.0.0.1:9091 127.0.0.1:1561 CLOSE_WAIT - tcp 253 0 127.0.0.1:9091 127.0.0.1:3726 CLOSE_WAIT - tcp 0 0 127.0.0.1:3726 127.0.0.1:9091 FIN_WAIT2 - tcp 103935 0 192.168.10.8:2310 58.111.249.179:61892 ESTABLISHED 12825/transmission- tcp 253 0 127.0.0.1:9091 127.0.0.1:1406 CLOSE_WAIT - tcp 328 0 192.168.10.8:2310 64.180.200.150:60712 ESTABLISHED 12825/transmission- Active UNIX domain sockets (w/o servers) Proto RefCnt Flags Type State I-Node PID/Program name Path unix 2 [ ] DGRAM 5668678 12825/transmission- unix 3 [ ] STREAM CONNECTED 5668489 12825/transmission- unix 3 [ ] STREAM CONNECTED 5668488 12825/transmission-
The size of the receive queues clearly show there is a network problem. This supports my theory about the daemon being stuck waiting somewhere. Strange thing is that it opens new connections, just doesn't seem to do anything with them.
With lsof there are only a few strange lines, I'm not sure if they are important but they don't show with a daemon working normally (I also show a couple of normal lines, the elided lines show the usual files and network connections):
$ lsof -nP -c transmission-da COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME ... transmiss 12825 nobody 41u inet 7144456 TCP 192.168.10.8:2310->58.111.249.179:61892 (ESTABLISHED) transmiss 12825 nobody 43u sock 0,4 7199566 can't identify protocol ... transmiss 12825 nobody 54u sock 0,4 7196008 can't identify protocol transmiss 12825 nobody 74u sock 0,4 7166237 can't identify protocol transmiss 12825 nobody 100u sock 0,4 7060040 can't identify protocol transmiss 12825 nobody 116u inet 7007388 TCP 192.168.10.8:2310->70.185.250.55:3934 (ESTABLISHED) ...
comment:22 Changed 12 years ago by rb07
Another small update:
If I -HUP the unresponsive daeamon it becomes responsive (and of course reads settings.json using the last parameters stored).
Just to be clear, I used
killall -HUP transmission-daemon
comment:23 Changed 12 years ago by rb07
Just an update, no real testing, I'm just using it:
- Version 2.00b2 --- very unstable, became unresponsive 7 times within a 2 hr span with one torrent.
- Version 2.00 --- stable, just once unresponsive.
- Version 2.01 --- unstable again, unresponsive when only seeding (which is new).
No idea what is changing other than different torrents and trackers, but the context is that I'm not really testing, so it could be coincidence. The fact stands: version 1.91 or below == stable, 1.93 and above != stable.
comment:24 Changed 12 years ago by Aeron
Version 2.00 is unstable too, can hangs out when seeding. When Transmission downloads something it becomes more stable and can work fine all downloading time.
UPD: 2.00 is unstable entirely, but relatively more stable than 1.93 and now hangs every 6–8 hours. Now it works with libevent 1.4.14b and curl 7.21.0, whereas 1.93 worked with libevent 1.4.13 and curl 7.20.1, may be this is irrelevant, but…
UPD2: 2.01 more unstable than 2.00, it can hang every 5 min during download and works less than hour when seeding only.
comment:25 follow-up: ↓ 26 Changed 12 years ago by charles
There were absolutely no changes between 2.00 and 2.01 that could cause the behavior being described here. I think were are chasing phantom bugs caused by build issues or something else not code-related
comment:26 in reply to: ↑ 25 Changed 12 years ago by rb07
Replying to charles:
There were absolutely no changes between 2.00 and 2.01 that could cause the behavior being described here. I think were are chasing phantom bugs caused by build issues or something else not code-related
Sorry, I don't agree. But I know it is difficult to find the problem or even understand it.
First point: There is no real testing being done... so differences between 2.00 and 2.01 are subjective. In my case 2.00 was used for a relatively short time.
Second point: There are no "build issues", the exact same build environment produces a 1.91 that is 100% stable, and all newer versions that are not.
I do agree that there is something in the general environment, an ARM based Linux. The same versions I also build under x86_64 Linux and those have never failed. But that's the reason I put in the title its an ARM problem, I've also seen the (at least 2) other bug reports, and NAS related forum reports which are from different build environments (I don't even know how those versions where made).
Of course I don't know the exact problem, could be a bug in libc (which is pretty bad but I wasn't able to change it when I tried), could be a bug on the threads library, something else. The point is: it could be a problem outside Transmission, its just was brought up by something that changed in Transmission btw. 1.91 and 1.93 (yes, I've also read a couple of times the diff btw. those versions, don't see anything obvious).
I'll keep trying to find the problem, currently its only an annoyance since I'm using a watchdog to kick the daemon when it hangs.
comment:27 Changed 12 years ago by rb07
I might have found the problem...
The context: I was running the daemon inside gdb, I also had a watchdog in cron checking if the daemon was unresponsive, which it did then sent a HUP (which usually wakes up the daemon) caught by gdb and the back trace is:
Program received signal SIGHUP, Hangup. 0x40443984 in nanosleep () from /lib/libpthread.so.0 (gdb) bt #0 0x40443984 in nanosleep () from /lib/libpthread.so.0 #1 0x0002d9b0 in tr_wait_msec (msec=1000) at utils.c:822 #2 0x0000c914 in main (argc=28, argv=0xbed6fc74) at daemon.c:495 (gdb) thread apply all bt Thread 2 (LWP 23400): #0 0x40443984 in nanosleep () from /lib/libpthread.so.0 #1 0x0002d9b0 in tr_wait_msec (msec=1000) at utils.c:822 #2 0x0000c914 in main (argc=28, argv=0xbed6fc74) at daemon.c:495 Thread 1 (LWP 23401): #0 0x40442db0 in read () from /lib/libpthread.so.0 #1 0x0002ba98 in readFromPipe (fd=6, eventType=2, veh=0x8f5d0) at trevent.c:164 #2 0x4035e62c in event_base_loop (base=0x95008, flags=0) at event.c:395 #3 0x0002bd60 in libeventThreadFunc (veh=0x8f5d0) at trevent.c:229 #4 0x00015208 in ThreadFunc (_t=0x8f660) at platform.c:109 #5 0x4043db60 in ?? () from /lib/libpthread.so.0 Cannot access memory at address 0x7d0f00 #0 0x40443984 in nanosleep () from /lib/libpthread.so.0
If I'm not wrong thread 1 is the one blocked on a read(), inside trevent.c :
readFromPipe( int fd, short eventType, void * veh ) { char ch; int ret; tr_event_handle * eh = veh; dbgmsg( "readFromPipe: eventType is %hd", eventType ); /* read the command type */ ch = '\0'; do { ret = piperead( fd, &ch, 1 ); } while( !eh->die && ret < 0 && errno == EAGAIN );
piperead() per macro is just read().
I still do not understand either what the code is supposed to do, or what is the real problem.
- If libevent calls readFromPipe() when it knows there is something to read (I assume there is a select() somewhere), there shouldn't be any blocking... but what is the loop for?
- The loop implies that the pipe is in non-blocking mode, that's why it tests for EAGAIN, but per POSIX standard http://www.opengroup.org/onlinepubs/9699919799/functions/pipe.html that is not true, unless it was changed.
- All of this has nothing to do with changes to version 1.91; did some other (external) change break things?
Right now I'm just confused. I tested the daemon with the read fd on the pipe set to non-blocking: nothing happens, nothing works, the daemon doesn't connect to the trackers, only one tracker query is put on queue.
Still looking for the solution.
comment:28 Changed 12 years ago by rb07
Solved !
Workaround:The old trick for NASes to disable epoll for libevent.
Just run
EVENT_NOEPOLL=1; export EVENT_NOEPOLL
before starting the daemon.
comment:29 Changed 12 years ago by rb07
- Resolution set to fixed
- Status changed from new to closed
I forgot to add, when the daemon becomes unresponsive it also stops writing to the log.
Usually it writes every two minutes that it has saved the stats.json and every one of the resume files for active torrents, plus the DHT messages with a different periodicity.