Opened 10 years ago

Closed 10 years ago

Last modified 7 years ago

#4164 closed Enhancement (fixed)

__FD_SETSIZE impact on open-file-limit and peer-limit-global

Reported by: romanr Owned by: jordan
Priority: Normal Milestone: None Set
Component: libtransmission Version: 2.03
Severity: Normal Keywords:
Cc:

Description

I have regular segfaults in the same place. Here's backtraces:

Program terminated with signal 11, Segmentation fault.
bt full
#0  0x080694f1 in createEasy (s=0x80, task=0xaf869e0) at web.c:146
        addr = 0xaf869e0
        e = 0xb3f1b138
        verbose = 0
#1  0x08069b08 in tr_webThreadFunc (vsession=0x9f5be80) at web.c:276
        msec = 1000
        msg = 0x0
        mcode = CURLM_OK
        task = 0xaf869e0
        unused = 0
        multi = 0x9f7ea40
        web = 0x9f61058
        taskCount = 0
        session = 0x80
#2  0x080537bf in ThreadFunc (_t=0x9f5b2e0) at platform.c:109
        t = 0x9f5b2e0
#3  0x0060a955 in start_thread () from /mnt/data2/new/transmission/libs/libpthread.so.0
#4  0x001dbe7e in clone () from /mnt/data2/new/transmission/libs/libc.so.6
Program terminated with signal 11, Segmentation fault.
bt full
#0  0x080694f1 in createEasy (s=0x4, task=0xa444e70) at web.c:146
        addr = 0xa444e70
        e = 0xb48a8dc0
        verbose = 0
#1  0x08069b08 in tr_webThreadFunc (vsession=0x9b1ae80) at web.c:276
        msec = 1000
        msg = 0x0
        mcode = CURLM_OK
        task = 0xa444e70
        unused = 0
        multi = 0x9b3d5d0
        web = 0x9b1abe0
        taskCount = 0
        session = 0x4
#2  0x080537bf in ThreadFunc (_t=0x9b1a2e0) at platform.c:109
        t = 0x9b1a2e0
#3  0x0029d955 in start_thread () from /mnt/data2/new/transmission/libs/libpthread.so.0
#4  0x0037ce7e in clone () from /mnt/data2/new/transmission/libs/libc.so.6
Program terminated with signal 11, Segmentation fault.
bt full
#0  0x080694f1 in createEasy (s=0x10780e80, task=0x9de0530) at web.c:146
        addr = 0x9de0530
        e = 0xb4983660
        verbose = 0
#1  0x08069b08 in tr_webThreadFunc (vsession=0x8f80e80) at web.c:276
        msec = 1000
        msg = 0x0
        mcode = CURLM_OK
        task = 0x9de0530
        unused = 0
        multi = 0x8fa3a40
        web = 0x8f86058
        taskCount = 0
        session = 0x10780e80
#2  0x080537bf in ThreadFunc (_t=0x8f802e0) at platform.c:109
        t = 0x8f802e0
#3  0x00d41955 in start_thread () from /mnt/data2/new/transmission/libs/libpthread.so.0
#4  0x00774e7e in clone () from /mnt/data2/new/transmission/libs/libc.so.6

The reason is a code in tr_webThreadFunc function

...
    tr_session * session = vsession;
...
            curl_multi_add_handle( multi, createEasy( session, task ));
...
            max_fd = 0;
            FD_ZERO( &r_fd_set );
            FD_ZERO( &w_fd_set );
            FD_ZERO( &c_fd_set );
            curl_multi_fdset( multi, &r_fd_set, &w_fd_set, &c_fd_set, &max_fd );
...

I use

"open-file-limit": 4096,
"peer-limit-global": 1024,

so actual file descriptor velue can be greater than __FD_SETSIZE (1024 for my build). I use default debian sid transmission build. createEasy opens new handle (which is greater than 1024), then curl_multi_add_handle adds it to CURLM struct, than curl_multi_fdset fills an fd_set with a descriptor greater than __FD_SETSIZE, so it overflows and rewrites ''session'' stack variable so we have a segfault on the next for-loop iteration.
Would be better to validate __FD_SETSIZE on startup and do not allow open-file-limit + peer-limit-global to be greater than __FD_SETSIZE.

Attachments (3)

transmission-day.png (22.5 KB) - added by romanr 10 years ago.
file & socket descriptor count daily graph. We can see that faults occur when descriptor count goes above 1024 (the default for FD_SETSIZE).
transmission (534 bytes) - added by romanr 10 years ago.
This is a munin plugin for that "nifty graph".
fdlimit.c.diff (773 bytes) - added by romanr 10 years ago.
suggested fix

Download all attachments as: .zip

Change History (24)

Changed 10 years ago by romanr

file & socket descriptor count daily graph. We can see that faults occur when descriptor count goes above 1024 (the default for FD_SETSIZE).

comment:1 Changed 10 years ago by KyleK

Off-topic, but may I ask how you created that nifty graph?

Changed 10 years ago by romanr

This is a munin plugin for that "nifty graph".

comment:2 follow-up: Changed 10 years ago by jordan

r12363 -- not a full fix, but a decent band-aid

Changed 10 years ago by romanr

suggested fix

comment:3 Changed 10 years ago by romanr

That fix wont help when there is descriptor leak (seems in 2.03 but not in 2.22). Probably a warning to logfile during startup would be enough. Also setrlimit(RLIMIT_NOFILE, FD_SETSIZE) is a good idea, I think.

comment:4 Changed 10 years ago by jordan

  • Resolution set to fixed
  • Status changed from new to closed

Added to r12373, though I did change it to MIN( FD_SETSIZE, sysconf( _SC_OPEN_MAX ) ) just for extra paranoia's sake...

Thanks again for reporting this issue. This overflow explains a couple of crashes I've seen reported over the last year or so. :)

comment:5 in reply to: ↑ 2 Changed 10 years ago by rb07

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to jordan:

r12363 -- not a full fix, but a decent band-aid

This change and the next r12364 creates a crash when built with MinGW.

Don't know it it affects other builds, MinGW has a very small FD_SETSIZE (64) so max gets a negative value, which causes a SIGSEGV.

I propose for fdlimit.c:778

    max = MAX( FD_SETSIZE - session->fdInfo->socket_limit - buffer_slots, limit );
Last edited 10 years ago by rb07 (previous) (diff)

comment:6 follow-up: Changed 10 years ago by jordan

rb07, but that will still possibly cause memory corruption in curl's select.

It seems to me we'll need to research how other people approach curl + select + windows. Alternately, if there's an alternative to select, such as simply pulsing curl every 10th of a second or so...

comment:7 in reply to: ↑ 6 Changed 10 years ago by rb07

Replying to jordan:

rb07, but that will still possibly cause memory corruption in curl's select.

I don't follow, after that change (from what I saw in the debugger) limit is set to 32 (the original parameter), isn't that what is used?

comment:8 Changed 10 years ago by jordan

Maybe I'm misreading the code, but if a higher value of `limit' is passed in then the MAX will allow it to be used.

Regardless, even if memory corruption doesn't take place we don't know what the values of the fd ints for the curl select() statement will be unless we limit the overall number of FDs (both sockets + local files) to FD_SETSIZE. That means on win32 we'd have to crunch the file cache, AND all the peers, AND the listening sockets, AND the http scrapes & announces, into 64 slots... which is a really terrible proposition.

comment:9 Changed 10 years ago by rb07

Quoting MS's documentation http://msdn.microsoft.com/en-us/library/ms740141%28v=vs.85%29.aspx:

"The variable FD_SETSIZE determines the maximum number of descriptors in a set. (The default value of FD_SETSIZE is 64, which can be modified by defining FD_SETSIZE to another value before including Winsock2.h.) Internally, socket handles in an fd_set structure are not represented as bit flags as in Berkeley Unix. Their data representation is opaque. Use of these macros will maintain software portability between different socket environments."

That means I could define FD_SETSIZE myself as 1024... and that should work for Transmission. On the other hand cURL is made with the default value, I just checked part of their code lib/select.c, and their protection will trigger unless I also build libcurl with a higher FD_SETSIZE.

MS' select() has other restrictions, you can's use it for sockets and files: "The sockets contained within the fd_set structures must be associated with a single service provider." So, file cache?

So far it seems to be working on Windows as it was (limit set to 32), at least nobody has complained of anything related to this. The easier way would be to make a test build with "-DFD_SETSIZE=1024" for both libcurl and libtransmission and see if it still works. Then do a stress test.

comment:10 follow-up: Changed 10 years ago by jordan

That's great news.

Sounds good wrt routing around FD_SETSIZE on win32... do you want to make a patch?

comment:11 in reply to: ↑ 10 Changed 10 years ago by rb07

Replying to jordan:

Sounds good wrt routing around FD_SETSIZE on win32... do you want to make a patch?

No patch required, with more detail what I said is to use:

./configure ... CFLAGS="... -DFD_SETSIZE=1024" ...

both in building libcurl and transmission.

comment:12 follow-up: Changed 10 years ago by jordan

That only works as long as you (and whoever else) remember to do that every time. Seems like adding a patch to CPPFLAGS in the "*mingw32*" section of configure.ac would be safer, at least for the Windows build.

Does libcurl's source refer to FD_SETSIZE by name?

comment:13 in reply to: ↑ 12 Changed 10 years ago by rb07

Replying to jordan:

That only works as long as you (and whoever else) remember to do that every time.

Yes, that is correct.

Seems like adding a patch to CPPFLAGS in the "*mingw32*" section of configure.ac would be safer, at least for the Windows build.

I agree, but it also has to correspond with libcurl... what about using "curl-config --cflags" (usually empty).

Does libcurl's source refer to FD_SETSIZE by name?

Yes, in lib/select.c and it doesn't re-define it or anything (in fact their configure script probably misses the existence of select(), this from a quick read, they do test with select(0, 0, 0, 0) with never works on Windows, probably they force its use). Anyway, it looks like they are happy with the value of 64.

I made a new build of revision 12398 with both libraries as described, currently downloading from 56 peers of 194 connected. The only glitch was in Windows itself, I hit the tcpip connection limit (which I already had increased to 256)... oops! spoke too soon, Transmission-Qt just stopped downloading/uploading, still has the connections... second Windows tcpip warning, perhaps that cause the stopage.

comment:14 Changed 10 years ago by rb07

Repeating the test (now with 2.30b4), after increasing the tcpip limit on Windows, went fine: downloading from 120 of 230 peers.

Conclusion: Changing FD_SETSIZE in Windows looks fine. I would keep the protection on max, just in case it gets nonsense; as I said, I don't know if other OS/environments (besides MinGW there is the off-shot MinGW-w64 which I might use in the future) affected.

comment:15 follow-up: Changed 10 years ago by jordan

rb07, is there anything left to do on this ticket?

Unless I've missed something, the resolution to the reopening in comment:5 is just to patch FD_SETSIZE by hand when you do a Windows build.

comment:16 in reply to: ↑ 15 Changed 10 years ago by rb07

Replying to jordan:

rb07, is there anything left to do on this ticket?

Perhaps add a #warning or #error on the code if FD_SETSIZE is small. The result will crash, so its just better to warn the builder.

Unless I've missed something, the resolution to the reopening in comment:5 is just to patch FD_SETSIZE by hand when you do a Windows build.

That's my current workaround, no problems reported on Windows.

I don't see an alternative, my proposal above was not very good and I don't have any test that shows how performance is affected, the old versions do connect to more than the default FD_SETSIZE peers, I'm not sure if that ever caused memory corruption in curl since I never tested or experienced that problem..

comment:17 Changed 10 years ago by jordan

Tickets #4294 and #4311 have been closed as unwanted, but necessary, side-effects of this ticket.

If anyone has a better solution for this issue I'd be happy to hear it.

comment:18 Changed 10 years ago by romanr

I've hacked my system headers (Linux Ubuntu karmic), set __FD_SETSIZE to 32768 and everything have worked fine for several months. Linux seems does not allow build-time changing of FD_SETSIZE. I use transmission-daemon 2.22 from Debian Sid sources.

comment:19 Changed 10 years ago by jordan

  • Resolution set to fixed
  • Status changed from reopened to closed

See also: r12514

comment:20 Changed 10 years ago by jordan

r12548 /trunk/ (gtk/tr-prefs.c qt/prefs-dialog.cc): (trunk gtk, qt) #4165 "FD_SETSIZE impact on open-file-limit and peer-limit-global" -- cap the per-torrent and per-session peer limits at FD_SETSIZE.

Last edited 10 years ago by jordan (previous) (diff)

comment:21 Changed 7 years ago by zenwryly

I'm not sure if I'm running into the underlying curl bug or not, but I've worked around this crudely by adding a delay and then manually setting the limits *after* Transmission overrides limits.conf and sets the limits to FD_SETSIZE:

while [[ -z $pid ]] ; do
    pid=$(pgrep transmission-da)
done
while [[ $soft != 1024 ]] ; do
    soft=$(prlimit --pid=$pid --nofile | tail -n 1 | cut -d " " -f 9)
done
exec prlimit --pid=$pid --nofile=8192:
Note: See TracTickets for help on using tickets.