#2994 closed Bug (invalid)
Regular crash in transmission-daemon near tr_bandwidthClamp()
| Reported by: | TheBear | Owned by: | |
|---|---|---|---|
| Priority: | Normal | Milestone: | None Set |
| Component: | Daemon | Version: | 1.91 |
| Severity: | Major | Keywords: | |
| Cc: | | | |
Description
I've had regular crashes with transmission-daemon (versions 1.83, 1.90 and 1.91).
System: Slackware 11.0, Headless P4 server, using WWW interface to control.
Detailed Environment: Kernel 2.6.21.1-smp, gcc 3.4.6, (g)libc 2.3.6
The crash occurs regularly, but seems to be related to download bandwidth use (i.e. it has stayed up for a week when working on very slow (<32K/sec) torrents, but crashes within minutes with popular (100K+) torrents).
Under a debugger, it reports either a segmentation fault or:
"*** glibc detected *** double free or corruption (!prev): 0x083ec3e8 ***"
Run in the foreground under gdb, the report/backtrace is: <attached>
In frame 1 of the stack trace, the "howmuch" argument to tr_evbuffer_write() looks a lot like a negative number, and nothing like the 68 passed to tr_peerIoTryWrite(). Looking at the code, the only thing in between is tr_bandwidthClamp(), so that's my best guess; there my understanding of the code stops...
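To illustrate the suspicion, here's a stand-alone demo (not Transmission code, and the pointer value below is made up) of what a size_t byte count looks like once its stack slot has been overwritten with a pointer value:

    /* Stand-alone demo, NOT Transmission code: what a size_t "howmuch"
       looks like after its stack slot is overwritten with a pointer value.
       The address used here is made up for the example. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        size_t    howmuch = 68;          /* the sane byte count       */
        uintptr_t stray   = 0xb7e41000u; /* a made-up pointer value   */

        howmuch = (size_t) stray;        /* the corruption, simulated */

        printf("howmuch as unsigned: %zu\n", howmuch);      /* enormous       */
        printf("howmuch as signed:   %d\n", (int) howmuch); /* looks negative */
        return 0;
    }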
Happy to co-operate further if you have any ideas, it's a shame, as, when it works, it makes a really good headless BT client!
Attachments (4)
Change History (15)
Changed 13 years ago by TheBear
comment:1 Changed 13 years ago by TheBear
OK, I've now added a second backtrace, corresponding to the "double free" error seen without the debugger. This definitely looks like corruption of the tree that the tr_bandwidth*() functions use.
Changed 13 years ago by TheBear
Changed 13 years ago by TheBear
Attached another backtrace and the config.log from the build...
comment:2 Changed 13 years ago by charles
FWIW, the 'SIGPIPE' issue in backtrace01 is a red herring. Transmission handles SIGPIPE correctly, but for some reason gdb intercepts it anyway. To stop gdb from breaking on SIGPIPE, use the following command in gdb:
handle SIGPIPE nostop noprint pass
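If you'd rather not type that every time, the same line can go in your ~/.gdbinit, which gdb reads at startup:

    # ~/.gdbinit
    handle SIGPIPE nostop noprint pass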
comment:3 follow-ups: ↓ 4 ↓ 5 Changed 13 years ago by charles
TheBear: the second backtrace indicates that some kind of memory corruption is happening, but it doesn't really pin down where. Is there any chance you could run Transmission in valgrind on your system to try to pin it down? A log from something like this:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=48 --log-file=x-valgrind --show-reachable=yes ./transmission-daemon -f
would be extremely useful...
comment:4 in reply to: ↑ 3 Changed 13 years ago by TheBear
Replying to charles:
TheBear: the second backtrace indicates that some kind of memory corruption is happening, but it doesn't really pin down where. Is there any chance you could run Transmission in valgrind on your system to try to pin it down? A log from something like this:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=48 --log-file=x-valgrind --show-reachable=yes ./transmission-daemon -f
would be extremely useful...
Not got valgrind compiled up, but I'll go see what I can do.
Meanwhile, I did some further investigation, and I'm reasonably sure you are right that it is stack corruption: the value of the howmuch parameter to tr_evbuffer_write() matches the pointer (tr_bandwidth *) io->bandwidth.parent, and that's too much of a coincidence for my liking!
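For anyone reproducing this, the comparison is easy to make in gdb (the frame number and variable names are from my particular backtrace, so treat them as illustrative):

    (gdb) frame 1                      # the tr_evbuffer_write() frame
    (gdb) print howmuch
    (gdb) print io->bandwidth.parent   # prints the same bit pattern as howmuch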
Since we're into the libevent stuff, I've tried the suggestions from the other bug related to embedded devices (setting environment variables to control which event mechanism it uses), but none appear to help. I've tried forcing my local libevent, but it's not up to the job (compile errors), so I'm currently running the regression test suite for the bundled libevent, which is, erm, still running since yesterday...
I have a core dump from gdb, but I'd rather e-mail it than attach it here, and I'm not sure how much more help it will be anyway. I'm still trying to catch the heap corruption (the double-free error at runtime) under gdb so I can get a stack trace of that too.
Hope that helps.
comment:5 in reply to: ↑ 3 Changed 13 years ago by TheBear
Replying to charles:
TheBear: the second backtrace indicates that some kind of memory corruption is happening, but it doesn't really pin down where. Is there any chance you could run Transmission in valgrind on your system to try to pin it down? A log from something like this:
valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=48 --log-file=x-valgrind --show-reachable=yes ./transmission-daemon -f
would be extremely useful...
OK, that was less painful than I thought!
valgrind log attached (bz2-ed)
Notes:
- I've already rebuilt both Transmission and libevent with -O0, just in case this was a problem with the version of gcc Slackware ships with (no change, it still crashes).
- I had to build a suppression file for valgrind, as it was spamming the log with errors about the (entirely intentional) use of uninitialised data in SHA1_Update, and that was causing the log to stop before it got interesting (a rough example of such a suppression is sketched after these notes).
- It ran for about 5-10 minutes before it crapped out, i.e. not long after it had finished verifying the half-done torrents.
- It ran much faster under valgrind than I expected ;-) !
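For reference, a suppression for those SHA1_Update complaints looks roughly like the sketch below; the exact error kinds are best generated with valgrind's --gen-suppressions=yes, and the file is then passed in with --suppressions=<file>:

    # sha1.supp -- indicative only; regenerate the entries with --gen-suppressions=yes
    {
       sha1-update-uninitialised-cond
       Memcheck:Cond
       fun:SHA1_Update
    }
    {
       sha1-update-uninitialised-value
       Memcheck:Value4
       fun:SHA1_Update
    }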
Hope this helps; I don't know enough about the code/data structures to do much more debugging, I'm afraid.
Cheers.
Changed 13 years ago by TheBear
comment:6 Changed 13 years ago by charles
==28724== Invalid read of size 1
==28724==    at 0x80AA1C1: select_add (select.c:299)
==28724==    by 0x809E95D: event_add (event.c:729)
==28724==    by 0x8075545: event_enable (peer-io.c:458)
==28724==    by 0x8076C4E: tr_peerIoReconnect (peer-io.c:645)
==28724==    by 0x808D0CA: gotError (handshake.c:1153)
==28724==    by 0x80776F0: tr_peerIoTryRead (peer-io.c:921)
==28724==    by 0x8077A88: tr_peerIoFlush (peer-io.c:965)
==28724==    by 0x806DC77: phaseOne (bandwidth.c:221)
==28724==    by 0x806DEB5: tr_bandwidthAllocate (bandwidth.c:278)
==28724==    by 0x807CF88: bandwidthPulse (peer-mgr.c:3109)
==28724==    by 0x809E18A: event_process_active (event.c:385)
==28724==    by 0x809E432: event_base_loop (event.c:525)
==28724==    by 0x809E2C6: event_loop (event.c:461)
==28724==    by 0x809E1B7: event_dispatch (event.c:399)
==28724==    by 0x8064034: libeventThreadFunc (trevent.c:230)
==28724==    by 0x805417F: ThreadFunc (platform.c:109)
==28724==    by 0x42F420D: start_thread (in /lib/tls/libpthread-2.3.6.so)
==28724==    by 0x43CA0DD: clone (in /lib/tls/libc-2.3.6.so)
==28724==  Address 0x256e4557 is not stack'd, malloc'd or (recently) free'd
==28724==
==28724==
==28724== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==28724==  Access not within mapped region at address 0x256E4557
==28724==    at 0x80AA1C1: select_add (select.c:299)
==28724==    by 0x809E95D: event_add (event.c:729)
==28724==    by 0x8075545: event_enable (peer-io.c:458)
==28724==    by 0x8076C4E: tr_peerIoReconnect (peer-io.c:645)
==28724==    by 0x808D0CA: gotError (handshake.c:1153)
==28724==    by 0x80776F0: tr_peerIoTryRead (peer-io.c:921)
==28724==    by 0x8077A88: tr_peerIoFlush (peer-io.c:965)
==28724==    by 0x806DC77: phaseOne (bandwidth.c:221)
==28724==    by 0x806DEB5: tr_bandwidthAllocate (bandwidth.c:278)
==28724==    by 0x807CF88: bandwidthPulse (peer-mgr.c:3109)
==28724==    by 0x809E18A: event_process_active (event.c:385)
==28724==    by 0x809E432: event_base_loop (event.c:525)
==28724==    by 0x809E2C6: event_loop (event.c:461)
==28724==    by 0x809E1B7: event_dispatch (event.c:399)
==28724==    by 0x8064034: libeventThreadFunc (trevent.c:230)
comment:7 Changed 13 years ago by charles
- Summary changed from "Regular crash in transmission-daemon, ?problems with tr_bandwidthClamp()?" to "Regular crash in transmission-daemon near tr_bandwidthClamp()"
comment:8 Changed 13 years ago by charles
The more I look at these crash reports, the less sense they make. I suspect there's some random memory corruption going on here. For example, I bet that if you ran valgrind a handful of times, the crashes would be in different places.
This bug is looking more and more like #2842, so much so that I'm tempted to close it, but I'm not positive yet.
Nevertheless, IMO, since you're willing and able to make your own builds and test, you might want to read to the end of the comments on #2842 and join in the testing there... ;)
comment:9 Changed 13 years ago by TheBear
Agreed - I've looked at #2842, and it was the closest in symptoms to my crash; however, since that bug relates to embedded devices, they haven't been able to get much in the way of crash reports to confirm it.
I can see why you're thinking random corruption; however, I've sat over the daemon with gdb for over 20 crashes, and it always crashes in one of two places, always in exactly the same way. That is far too deterministic for, say, a loose pointer corrupting the heap.
In terms of stats (off the top of my head, admittedly), about 90% are similar to the first crash report I posted, where the argument to tr_evbuffer_write() is corrupted to something stupid (which seems to be what causes the SEGV, since it then tries to read past the end of the allocated buffer and, indeed, the data segment).
I'll do some more runs in the next few days and try to confirm that the corruption always involves the same pointer. If so, that - in my mind - looks less like random corruption and more like some specific but rare condition in the code, possibly in tandem with some interaction with my version of the standard libraries, since it's so rare.
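My rough gdb plan for those runs (the function and variable names are taken from the existing backtraces, so they may need adjusting):

    (gdb) handle SIGPIPE nostop noprint pass
    (gdb) break tr_peerIoTryWrite
    (gdb) run
    # ...once stopped in tr_peerIoTryWrite:
    (gdb) watch howmuch
    (gdb) continue
    # gdb stops the moment anything rewrites that stack slot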
It's nice to work with you guys - just wish I had a better handle on the code.
comment:10 Changed 13 years ago by charles
- Resolution set to invalid
- Status changed from new to closed
We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information. Thanks again!
comment:11 Changed 13 years ago by TheBear
Update:
Got some time (and some disk space) to play again, and updated to 2.03, which I can confirm works in my config (O/S etc. unchanged). The build forced me to download and rebuild libevent as 1.4.1b, so that *may* have been it, but either way, there's no evidence of problems with 2.03!
Thanks for the help along the way, and for whatever you did between 1.93 and 2.03!