Opened 11 years ago

Closed 10 years ago

#3196 closed Bug (fixed)

Daemon hangs on ARM based NAS

Reported by: rb07 Owned by:
Priority: Normal Milestone: None Set
Component: Daemon Version: 1.93
Severity: Normal Keywords: daemon ARM NAS
Cc: nhfnc@…

Description

Versions 1.92 and 1.93 of the daemon stop working (time varies from hours to days), become totally unresponsive: no upload/download, no RPC (tested with transmission-remote, Web client, Qtr).

Environment: Native build on an HP MediaVault, Linux HPMediaVault 2.6.12.6-arm1 #1 Fri Feb 8 15:45:19 MST 2008 armv5tejl GNU/Linux, using gcc 3.4.4, libcurl 7.19.7, libevent 1.4.13, openssl 0.9.8j .

The process and its threads are still 'active', just doing nothing useful; low CPU usage, a strace snippet:

stat64("/opt/var/transmission/torrentwatch/", {st_mode=S_IFDIR|S_ISGID|0755, st_size=17592186044416, ...}) = 0
open("/opt/var/transmission/torrentwatch/", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 252
fstat64(252, {st_mode=S_IFDIR|S_ISGID|0755, st_size=17592186044416, ...}) = 0
fcntl64(252, F_SETFD, FD_CLOEXEC)       = 0
getdents64(252, /* 2 entries */, 4096)  = 48
getdents64(252, /* 0 entries */, 4096)  = 0
close(252)                              = 0
time(NULL)                              = 1273248797
nanosleep({1, 0}, NULL)                 = 0
time(NULL)                              = 1273248798
nanosleep({1, 0}, NULL)                 = 0
t
... (repeats over and over)

Attaching GDB and looking at what its doing:

(gdb) info threads
  3 Thread 0x40d31520 (LWP 5748)  0x403f3e10 in read ()
   from /opt/lib/libpthread.so.0
  2 Thread 0x416fa520 (LWP 5786)  0x404b8adc in select () from /lib/libc.so.6
* 1 Thread 0x40530b00 (LWP 5747)  0x40490574 in nanosleep ()
   from /lib/libc.so.6

(gdb) bt
#0  0x40490574 in nanosleep () from /lib/libc.so.6
#1  0x404b91bc in usleep () from /lib/libc.so.6
#2  0x0000bd58 in main (argc=2492072, argv=0x23c778) at daemon.c:463

(gdb) thread 2
[Switching to thread 2 (Thread 0x416fa520 (LWP 5786))]#0  0x404b8adc in select
    () from /lib/libc.so.6
(gdb) bt
#0  0x404b8adc in select () from /lib/libc.so.6
#1  0x416f9e00 in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x40d31520 (LWP 5748))]#0  0x403f3e10 in read ()
   from /opt/lib/libpthread.so.0
(gdb) bt
#0  0x403f3e10 in read () from /opt/lib/libpthread.so.0
#1  0x00023b90 in readFromPipe (fd=3, eventType=2, veh=0x698c8)
    at trevent.c:164
#2  0x400254c8 in event_base_loop (base=0x6ab70, flags=0) at event.c:385
#3  0x00023dc8 in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230
#4  0x00014504 in ThreadFunc (_t=0x69958) at platform.c:109
#5  0x403ee0a8 in start_thread (arg=0xfffffe00) at pthread_create.c:263
#6  0x404bf7e0 in clone () from /lib/libc.so.6
Backtrace stopped: frame did not save the PC

I've read the other similar reports, the main difference is I have had no crash at all, don't cross-compile and don't use third-party packages except for the compiler.

The fact is up to version 1.91 everything works fine (I did rebuild 1.91 after 1.92 started doing this and changed with no problem until 1.93 came out and I changed again).

Any ideas about how to debug this are welcome (but a better place to just discuss ideas is the forum, this tracker should be used for hard data only).

Change History (29)

comment:1 Changed 11 years ago by rb07

I forgot to add, when the daemon becomes unresponsive it also stops writing to the log.

Usually it writes every two minutes that it has saved the stats.json and every one of the resume files for active torrents, plus the DHT messages with a different periodicity.

Last edited 11 years ago by rb07 (previous) (diff)

comment:2 Changed 11 years ago by rb07

Second freeze on version 1.93, about an hour ago after a little more than a day working normally, looks identical to the first:

(gdb) thread apply all bt

Thread 3 (Thread 0x40d31520 (LWP 7064)):
#0  0x403f3e10 in read () from /opt/lib/libpthread.so.0
#1  0x00023b90 in readFromPipe (fd=3, eventType=2, veh=0x698c8)
    at trevent.c:164
#2  0x400254c8 in event_base_loop (base=0x6ab70, flags=0) at event.c:385
#3  0x00023dc8 in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230
#4  0x00014504 in ThreadFunc (_t=0x69958) at platform.c:109
#5  0x403ee0a8 in start_thread (arg=0xfffffe00) at pthread_create.c:263
#6  0x404bf7e0 in clone () from /lib/libc.so.6
Backtrace stopped: frame did not save the PC

Thread 2 (Thread 0x416fa520 (LWP 7067)):
#0  0x404b8adc in select () from /lib/libc.so.6
#1  0x416f9e00 in ?? ()

Thread 1 (Thread 0x40530b00 (LWP 7063)):
#0  0x40490574 in nanosleep () from /lib/libc.so.6
#1  0x404b91bc in usleep () from /lib/libc.so.6
#2  0x0000bd58 in main (argc=0, argv=0xa6c98) at daemon.c:463
#0  0x40490574 in nanosleep () from /lib/libc.so.6

(gdb) info share
From        To          Syms Read   Shared Object Library
0x40023c70  0x40031d68  Yes         /opt/lib/libevent-1.4.so.2
0x40047bd8  0x40084898  Yes         /opt/lib/libcurl.so.4
0x40091b84  0x40099f08  Yes         /opt/lib/libidn.so.11
0x400cf46c  0x400f552c  Yes         /opt/lib/libssh2.so.1
0x4010a358  0x4012a984  Yes         /opt/lib/libldap-2.4.so.2
0x40138460  0x4013cde8  Yes         /opt/lib/liblber-2.4.so.2
0x40149244  0x40153328  Yes         /opt/lib/libresolv.so.2
0x4016316c  0x40172b9c  Yes         /opt/lib/libsasl2.so.2
0x4017edb0  0x40183ab0  Yes         /opt/lib/librt.so.1
0x401a3cf8  0x401cec04  Yes         /opt/lib/libssl.so.0.9.8
0x402146b8  0x402ccef0  Yes         /opt/lib/libcrypto.so.0.9.8
0x4030da44  0x4030e5dc  Yes         /opt/lib/libdl.so.2
0x4031a63c  0x40324eec  Yes         /opt/lib/libz.so.1
0x4032d200  0x4039911c  Yes         /opt/lib/libm.so.6
0x403d9f40  0x403dfe38  Yes         /opt/lib/libgcc_s.so.1
0x403ecd30  0x403f60a4  Yes         /opt/lib/libpthread.so.0
0x4041b5d0  0x404f5928  Yes         /lib/libc.so.6
0x4051f278  0x405264b0  Yes         /opt/lib/liblber-2.3.so.0
0x400006e0  0x40013160  Yes         /lib/ld-linux.so.3

Which one looks more suspicious, the select() or the read() ? It must be blocking indefinitely on one or both of them.

comment:3 Changed 11 years ago by charles

Neither of these look suspicious to me.

comment:4 Changed 11 years ago by Aeron

+1, confirm this issue. I have WD World Edition 1TB (WhiteLight?) NAS with installed thru IPKG Transmission 1.93 with libcurl 7.20.1 and openssl 1.0.0 (which was cross-compiled and packed in proper ipk’s by my own). WD NAS environment is Linux 2.6.24.4 armv5tejl. The problem is the same. After starting transmission the «top» shows me 1 active (running) process of transmission-daemon and 3 inactive (sleeping) processes. Further it can hang in any moment… in a couple of minutes or in a couple of hours, randomly. Logs just can’t show anything useful except usual transmission routine. After it hang the «top» shows 4 inactive daemon processes.

The 1.92 has the same issue, but it more stable in work, it can works without hangs for hours and, sometimes, days.

Transmission and both libs was cross-compiled and packed on CentOS (Linux 2.6.18-128.el5 x86_64 GNU/Linux). I can give links to ipk’s and cross-compiles if it can help.

comment:5 follow-up: Changed 11 years ago by charles

What happens to the CPU when transmission freezes?

Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?

comment:6 in reply to: ↑ 5 ; follow-ups: Changed 11 years ago by rb07

Replying to charles:

What happens to the CPU when transmission freezes?

Nothing unusual, the daemon uses very low CPU, all other processes (I have a mail server, ftp server, etc.) continue working normally.

Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?

The gdb output above is all you get, I've done continue, let it alone for a few minutes, Ctrl-C and the threads are in the exact same state. That's why I wrote that it must be blocking on select() and read(); the select is probably from libevent.

If you meant you want to interact with gdb, no problem, let me know what you need.

comment:7 in reply to: ↑ 6 Changed 11 years ago by Aeron

Replying to rb07:

Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?

I think this part of message is for me.

Replying to charles:

Any chance of you running Transmission inside of gdb so that we can find out what it's doing when it stops responding?

Yep, i’m working on it. Just waiting for daemon hangs.

Replying to charles:

What happens to the CPU when transmission freezes?

CPU is just fine, no overloads or something like that. If daemon hangs, it’s processes CPU usage is 0%.

comment:8 Changed 11 years ago by Aeron

Ok, done:

~ # /opt/bin/gdb


dlopen failed on 'libthread_db.so.1' - libthread_db.so.1: cannot open shared object file: No such file or directory
GDB will not be able to debug pthreads.

GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-none-linux-gnueabi".
(gdb) file /usr/local/bin/transmission-daemon 
Reading symbols from /usr/local/bin/transmission-daemon...done.
(gdb) run -g /root/.config/transmission-daemon/
Starting program: /usr/local/bin/transmission-daemon -g /root/.config/transmission-daemon/
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx'

Program exited normally.
(gdb) attach 6332
Attaching to program: /usr/local/bin/transmission-daemon, process 6332
warning: process 6332 is a cloned process
BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
Loaded symbols for /lib/libnsl.so.1
Loaded symbols for /lib/libresolv.so.2
Loaded symbols for /lib/libcurl.so.4
Loaded symbols for /lib/librt.so.1
Loaded symbols for /lib/libm.so.6
Loaded symbols for /lib/libgcc_s.so.1
Loaded symbols for /lib/libpthread.so.0
Loaded symbols for /lib/libc.so.6
Loaded symbols for /lib/ld-linux.so.3
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0x400ffb9c in bn_mul_add_words () from /lib/libcurl.so.4
(gdb) bt
#0  0x400ffb9c in bn_mul_add_words () from /lib/libcurl.so.4
#1  0x40165684 in bn_mul_normal () from /lib/libcurl.so.4
#2  0x40165574 in BN_mul () from /lib/libcurl.so.4
#3  0x40168938 in BN_mod_mul_montgomery () from /lib/libcurl.so.4
#4  0x401635ac in BN_mod_exp_mont_consttime () from /lib/libcurl.so.4
#5  0x40162c0c in BN_mod_exp_mont () from /lib/libcurl.so.4
#6  0x40108674 in dh_bn_mod_exp () from /lib/libcurl.so.4
#7  0x401083d8 in generate_key () from /lib/libcurl.so.4
#8  0x40108264 in DH_generate_key () from /lib/libcurl.so.4
#9  0x0002dd4c in ensureKeyExists (crypto=0x484020) at crypto.c:139
#10 0x0002e0a8 in tr_cryptoGetMyPublicKey (crypto=0x484020, 
    setme_len=0xbe7ff774) at crypto.c:215
#11 0x0004a200 in tr_handshakeNew (io=0x347ab8, 
    encryptionMode=TR_ENCRYPTION_PREFERRED, 
    doneCB=0x369bc <myHandshakeDoneCB>, doneUserData=0x79ac8)
    at handshake.c:335
#12 0x00039fd8 in reconnectPulse (foo=388177907, bar=-26536, vmgr=0x6db392eb)
    at peer-mgr.c:2787
#13 0x0005bb50 in event_base_loop (base=0x78618, flags=0) at event.c:385
#14 0x0002364c in libeventThreadFunc (veh=0x17231ff3) at trevent.c:230
#15 0x000140d0 in ThreadFunc (_t=0x7a890) at platform.c:109
#16 0x402dce84 in ?? () from /lib/libpthread.so.0
(gdb) info share
From        To          Syms Read   Shared Object Library
0x40021304  0x4002c7d8  Yes         /lib/libnsl.so.1
0x4003b2b0  0x40046204  Yes         /lib/libresolv.so.2
0x400a95c8  0x401ae308  Yes         /lib/libcurl.so.4
0x4020aab0  0x4020fed0  Yes         /lib/librt.so.1
0x402271e8  0x40286f0c  Yes         /lib/libm.so.6
0x402c8df8  0x402cea10  Yes         /lib/libgcc_s.so.1
0x402dab20  0x402e481c  Yes         /lib/libpthread.so.0
0x40343750  0x404171fc  Yes         /lib/libc.so.6
0x400005f0  0x40011ab8  Yes         /lib/ld-linux.so.3
0x40437b14  0x4043e46c  Yes         /lib/libnss_files.so.2
0x40448d1c  0x4044ada0  Yes         /lib/libnss_dns.so.2

Sorry, but I never used GDB before. May be I doing it wrong.

P.S.: 6332 — it’s an active daemon process PID

comment:9 Changed 11 years ago by nhfnc

  • Cc nhfnc@… added

This issue happens to me as well. As Aeron I am running this on a WD World Edition White light box. Compiled it using WD tool chain (followed instructions on this page: http://bengreen.eu/pmwiki/pmwiki.php?n=Linux.CrossCompileMyBookWorldWhiteLight) on a Ubuntu 9.04 box and just copied the executables over to the WD box. By the way, the 1.93 seems to have fixed a previous issue I had http://trac.transmissionbt.com/ticket/2842, but now I have this new issue instead... Unsure if I can get any gdb traces from my setup, but i am happy to try anything you guys may want me to do.

comment:10 follow-up: Changed 11 years ago by charles

Aeron: what happens if you force encryption to be turned off via settings.json?

comment:11 in reply to: ↑ 10 Changed 11 years ago by Aeron

Replying to charles:

Aeron: what happens if you force encryption to be turned off via settings.json?

All the same. Daemon just hangs as usually.

GDB output of hanged active daemon process not seems to be very useful:

(gdb) attach 8298
Attaching to program: /usr/local/bin/transmission-daemon, process 8298
warning: process 8298 is a cloned process
BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx'
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libcurl.so.4...done.
Loaded symbols for /lib/libcurl.so.4
BFD: /lib/librt.so.1: warning: sh_link not set for section `.ARM.exidx'
Reading symbols from /lib/librt.so.1...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
BFD: /lib/libgcc_s.so.1: warning: sh_link not set for section `.ARM.exidx'
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
BFD: /lib/libc.so.6: warning: sh_link not set for section `.ARM.exidx'
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
BFD: /lib/ld-linux.so.3: warning: sh_link not set for section `.ARM.exidx'
Reading symbols from /lib/ld-linux.so.3...done.
Loaded symbols for /lib/ld-linux.so.3
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0x402e33b4 in read () from /lib/libpthread.so.0
(gdb) bt
#0  0x402e33b4 in read () from /lib/libpthread.so.0
#1  0x00023410 in readFromPipe (fd=4, eventType=-1345, veh=0x78050)
    at trevent.c:164
#2  0x0005bb50 in event_base_loop (base=0x78618, flags=0) at event.c:385
#3  0x0002364c in libeventThreadFunc (veh=0xfffffe00) at trevent.c:230
#4  0x000140d0 in ThreadFunc (_t=0x7a890) at platform.c:109
#5  0x402dce84 in ?? () from /lib/libpthread.so.0

comment:12 Changed 11 years ago by rb07

I noticed that the problem is repeatable, just like some people have been saying on the forum (and NAS related forums) but in many cases they reported crashes not unresponsiveness: With one or more torrents downloading at 'high' speed the daemon becomes unresponsive whiting an hour.

Of course 'high' is relative, in my case around 200 KB/s (that's kilo bytes).

On another front, bad news, Valgrind doesn't work on this NAS, it compiles (only the svn version has ARM support) but it crashes on any real work.

I made a 'diff' between the version that works, 1.91, and 1.93, and the changes on the deamon are few but I didn't notice anything that could be the cause on a quick review.

comment:13 Changed 11 years ago by charles

I wonder if maybe the libevent web server socket is closing somehow.

comment:14 Changed 11 years ago by nhfnc

In my case I am compiling using the built-in libevent. I will try to compile again with a "normal" libevent.

comment:15 in reply to: ↑ 6 Changed 11 years ago by rb07

Replying to rb07:

Replying to charles:

What happens to the CPU when transmission freezes?

Nothing unusual, the daemon uses very low CPU, all other processes (I have a mail server, ftp server, etc.) continue working normally.

Correction: It actually is using no CPU, stays at 0%.

Another correction: It does write one line to the log when I stop the daemon, the one that says "Closing session".

I tried testing if a socket closes unexpectedly by enabling SIGPIPE handling, but the daemon had problems starting up.

comment:16 follow-up: Changed 11 years ago by rb07

Another interesting detail: The tracker shows twice the number of real torrents after restarting an unresponsive daemon.

As I said above, the unresponsive daemon writes "Closing session" but according to the tracker it never did w.r.t. that one tracker. --- False alarm: Its a problem in the tracker, not transmission. I just tested a normal daemon shutdown and the tracker doesn't update its info, after start the tracker shows twice the torrents.

Last edited 11 years ago by rb07 (previous) (diff)

comment:17 in reply to: ↑ 16 ; follow-up: Changed 11 years ago by Aeron

Replying to rb07:

Another interesting detail: The tracker shows twice the number of real torrents after restarting an unresponsive daemon.

It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.

comment:18 in reply to: ↑ 17 ; follow-up: Changed 11 years ago by rb07

Replying to Aeron:

It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.

No, there is only one daemon running.

In fact I don't see how you could run more than one daemon since it binds to specific ports (yes, I know how to start more than one daemon with different configurations).

Your comment makes no sense, if it "wasn't restarted" then how could you assume "the daemon is running twice"? and yes, it was restarted.

Last edited 11 years ago by rb07 (previous) (diff)

comment:19 in reply to: ↑ 18 Changed 11 years ago by Aeron

Replying to rb07:

Replying to Aeron:

It happens when the daemon is running twice. And it means that the daemon wasn’t restarted.

No, there is only one daemon running.

In fact I don't see how you could run more than one daemon since it binds to specific ports.

It is strange, but it’s true. My NAS shows me the daemon as a four separate processes (may be 5, then one of the torrents verifying his data for example)(every thread of the daemon it’s a process in my case (haven’t libthread_db or not support?)), if I run him again, it’s duplicate daemon processes (up to 8) and trackers shows me double torrents count.

comment:20 Changed 11 years ago by rb07

Threads are not processes, and you can run several daemons because you don't bind to a specific port (except probably the 9091 which means the 2nd daemon can't be reached through RPC).

The point here is I know what I'm doing, you're only adding noise. This comment is noise, doesn't add anything useful, I'm stopping the noise on my end.

comment:21 Changed 11 years ago by rb07

I have been using version 2.00b1 for the last 5 days, no problem until today at around 6:00 AM the daemon became unresponsive... I don't know exactly what it was doing but it wasn't much: 2 active torrents, one downloading slowly, one seeding; 7 torrents total, 2 of them paused.

About 9:00 AM I first tried to contact the daemon on the NAS from my PC using QTransmission (also version 2.00b1) with no response. Then I tried to get some info with netstat and lsof, some of the results might be interesting:

$ netstat -np
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
tcp        0      0 192.168.10.8:2310       87.217.160.43:53192     SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       114.253.4.110:64400     SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       87.210.124.204:2725     SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       69.142.163.198:4347     SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       92.0.237.99:2007        SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       76.125.5.2:1125         SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       82.236.234.66:62449     SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       71.15.83.169:55886      SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       78.69.176.59:63375      SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       82.39.168.84:1439       SYN_RECV    -                   
tcp        0      0 192.168.10.8:2310       71.108.106.125:50178    SYN_RECV    -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:3058          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:1858          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:1200          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:4896          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:4192          CLOSE_WAIT  -                   
tcp    80272      0 192.168.10.8:2310       88.149.180.60:58218     ESTABLISHED 12825/transmission- 
tcp      176      0 192.168.10.8:9091       192.168.10.2:1053       ESTABLISHED -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:3717          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:3003          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:4715          CLOSE_WAIT  -                   
tcp    58387      0 192.168.10.8:2310       70.185.250.55:3934      ESTABLISHED 12825/transmission- 
tcp      253      0 127.0.0.1:9091          127.0.0.1:1561          CLOSE_WAIT  -                   
tcp      253      0 127.0.0.1:9091          127.0.0.1:3726          CLOSE_WAIT  -                   
tcp        0      0 127.0.0.1:3726          127.0.0.1:9091          FIN_WAIT2   -                   
tcp   103935      0 192.168.10.8:2310       58.111.249.179:61892    ESTABLISHED 12825/transmission- 
tcp      253      0 127.0.0.1:9091          127.0.0.1:1406          CLOSE_WAIT  -                   
tcp      328      0 192.168.10.8:2310       64.180.200.150:60712    ESTABLISHED 12825/transmission- 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  2      [ ]         DGRAM                    5668678 12825/transmission- 
unix  3      [ ]         STREAM     CONNECTED     5668489 12825/transmission- 
unix  3      [ ]         STREAM     CONNECTED     5668488 12825/transmission- 

The size of the receive queues clearly show there is a network problem. This supports my theory about the daemon being stuck waiting somewhere. Strange thing is that it opens new connections, just doesn't seem to do anything with them.

With lsof there are only a few strange lines, I'm not sure if they are important but they don't show with a daemon working normally (I also show a couple of normal lines, the elided lines show the usual files and network connections):

$ lsof -nP -c transmission-da
COMMAND     PID   USER   FD   TYPE     DEVICE      SIZE    NODE NAME
...
transmiss 12825 nobody   41u  inet    7144456               TCP 192.168.10.8:2310->58.111.249.179:61892 (ESTABLISHED)
transmiss 12825 nobody   43u  sock        0,4           7199566 can't identify protocol
...
transmiss 12825 nobody   54u  sock        0,4           7196008 can't identify protocol
transmiss 12825 nobody   74u  sock        0,4           7166237 can't identify protocol
transmiss 12825 nobody  100u  sock        0,4           7060040 can't identify protocol
transmiss 12825 nobody  116u  inet    7007388               TCP 192.168.10.8:2310->70.185.250.55:3934 (ESTABLISHED)
...

comment:22 Changed 10 years ago by rb07

Another small update:

If I -HUP the unresponsive daeamon it becomes responsive (and of course reads settings.json using the last parameters stored).

Just to be clear, I used

killall -HUP transmission-daemon

comment:23 Changed 10 years ago by rb07

Just an update, no real testing, I'm just using it:

  • Version 2.00b2 --- very unstable, became unresponsive 7 times within a 2 hr span with one torrent.
  • Version 2.00 --- stable, just once unresponsive.
  • Version 2.01 --- unstable again, unresponsive when only seeding (which is new).

No idea what is changing other than different torrents and trackers, but the context is that I'm not really testing, so it could be coincidence. The fact stands: version 1.91 or below == stable, 1.93 and above != stable.

comment:24 Changed 10 years ago by Aeron

Version 2.00 is unstable too, can hangs out when seeding. When Transmission downloads something it becomes more stable and can work fine all downloading time.

UPD: 2.00 is unstable entirely, but relatively more stable than 1.93 and now hangs every 6–8 hours. Now it works with libevent 1.4.14b and curl 7.21.0, whereas 1.93 worked with libevent 1.4.13 and curl 7.20.1, may be this is irrelevant, but…

UPD2: 2.01 more unstable than 2.00, it can hang every 5 min during download and works less than hour when seeding only.

Last edited 10 years ago by Aeron (previous) (diff)

comment:25 follow-up: Changed 10 years ago by charles

There were absolutely no changes between 2.00 and 2.01 that could cause the behavior being described here. I think were are chasing phantom bugs caused by build issues or something else not code-related

comment:26 in reply to: ↑ 25 Changed 10 years ago by rb07

Replying to charles:

There were absolutely no changes between 2.00 and 2.01 that could cause the behavior being described here. I think were are chasing phantom bugs caused by build issues or something else not code-related

Sorry, I don't agree. But I know it is difficult to find the problem or even understand it.

First point: There is no real testing being done... so differences between 2.00 and 2.01 are subjective. In my case 2.00 was used for a relatively short time.

Second point: There are no "build issues", the exact same build environment produces a 1.91 that is 100% stable, and all newer versions that are not.

I do agree that there is something in the general environment, an ARM based Linux. The same versions I also build under x86_64 Linux and those have never failed. But that's the reason I put in the title its an ARM problem, I've also seen the (at least 2) other bug reports, and NAS related forum reports which are from different build environments (I don't even know how those versions where made).

Of course I don't know the exact problem, could be a bug in libc (which is pretty bad but I wasn't able to change it when I tried), could be a bug on the threads library, something else. The point is: it could be a problem outside Transmission, its just was brought up by something that changed in Transmission btw. 1.91 and 1.93 (yes, I've also read a couple of times the diff btw. those versions, don't see anything obvious).

I'll keep trying to find the problem, currently its only an annoyance since I'm using a watchdog to kick the daemon when it hangs.

comment:27 Changed 10 years ago by rb07

I might have found the problem...

The context: I was running the daemon inside gdb, I also had a watchdog in cron checking if the daemon was unresponsive, which it did then sent a HUP (which usually wakes up the daemon) caught by gdb and the back trace is:

Program received signal SIGHUP, Hangup.
0x40443984 in nanosleep () from /lib/libpthread.so.0    
(gdb) bt    
#0  0x40443984 in nanosleep () from /lib/libpthread.so.0    
#1  0x0002d9b0 in tr_wait_msec (msec=1000) at utils.c:822    
#2  0x0000c914 in main (argc=28, argv=0xbed6fc74) at daemon.c:495    
(gdb) thread apply all bt
    
Thread 2 (LWP 23400):    
#0  0x40443984 in nanosleep () from /lib/libpthread.so.0              
#1  0x0002d9b0 in tr_wait_msec (msec=1000) at utils.c:822    
#2  0x0000c914 in main (argc=28, argv=0xbed6fc74) at daemon.c:495    
    
Thread 1 (LWP 23401):    
#0  0x40442db0 in read () from /lib/libpthread.so.0    
#1  0x0002ba98 in readFromPipe (fd=6, eventType=2, veh=0x8f5d0) at trevent.c:164    
#2  0x4035e62c in event_base_loop (base=0x95008, flags=0) at event.c:395    
#3  0x0002bd60 in libeventThreadFunc (veh=0x8f5d0) at trevent.c:229    
#4  0x00015208 in ThreadFunc (_t=0x8f660) at platform.c:109    
#5  0x4043db60 in ?? () from /lib/libpthread.so.0    
Cannot access memory at address 0x7d0f00    
#0  0x40443984 in nanosleep () from /lib/libpthread.so.0

If I'm not wrong thread 1 is the one blocked on a read(), inside trevent.c :

readFromPipe( int    fd,
              short  eventType,
              void * veh )
{
    char              ch;
    int               ret;
    tr_event_handle * eh = veh;

    dbgmsg( "readFromPipe: eventType is %hd", eventType );

    /* read the command type */
    ch = '\0';
    do
    {
        ret = piperead( fd, &ch, 1 );
    }
    while( !eh->die && ret < 0 && errno == EAGAIN );

piperead() per macro is just read().

I still do not understand either what the code is supposed to do, or what is the real problem.

  • If libevent calls readFromPipe() when it knows there is something to read (I assume there is a select() somewhere), there shouldn't be any blocking... but what is the loop for?
  • All of this has nothing to do with changes to version 1.91; did some other (external) change break things?

Right now I'm just confused. I tested the daemon with the read fd on the pipe set to non-blocking: nothing happens, nothing works, the daemon doesn't connect to the trackers, only one tracker query is put on queue.

Still looking for the solution.

comment:28 Changed 10 years ago by rb07

Solved !

Workaround:The old trick for NASes to disable epoll for libevent.

Just run

EVENT_NOEPOLL=1; export EVENT_NOEPOLL

before starting the daemon.

comment:29 Changed 10 years ago by rb07

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.