Opened 11 years ago
Closed 11 years ago
#4799 closed Bug (invalid)
FD leak due to failed scrape
| Reported by: | rb07 | Owned by: | |
|---|---|---|---|
| Priority: | Normal | Milestone: | None Set |
| Component: | Transmission | Version: | 2.50 |
| Severity: | Major | Keywords: | fd leak |
| Cc: | | | |
Description
File descriptors are leaked when a tracker fails to respond.
Here's an example log, starting with the end result: Transmission pausing torrents because it has no file descriptors left:
$ tail /var/log/messages
...
Feb 27 17:30:40 transmission-daemon[27444]: Couldn't create socket: Too many open files (fdlimit.c:682)
Feb 27 17:30:54 transmission-daemon[27444]: Couldn't create socket: Too many open files (fdlimit.c:682)
Feb 27 17:30:54 transmission-daemon[27444]: Couldn't create socket: Too many open files (fdlimit.c:682)
...
Feb 27 23:07:33 transmission-daemon[20105]: H Scrape error: Could not connect to tracker (announcer.c:1255)
Feb 27 23:07:33 transmission-daemon[20105]: H Retrying scrape in 950 seconds. (announcer.c:1264)
...
Feb 27 23:23:33 transmission-daemon[20105]: H Scrape error: Could not connect to tracker (announcer.c:1255)
Feb 27 23:23:33 transmission-daemon[20105]: H Retrying scrape in 357 seconds. (announcer.c:1264)

$ lsof -p 20105
...
transmiss 20105 nobody  52u inet 33758572 TCP 192.168.10.3:3197->mail28.oserver.com:6969 (CLOSE_WAIT)
transmiss 20105 nobody  53u inet 33760395 TCP 192.168.10.3:3302->mail31.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  54u inet 33762334 TCP 192.168.10.3:4941->mail31.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  55u inet 33753613 TCP 192.168.10.3:3895->mail29.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  56u inet 33762728 TCP 192.168.10.3:1807->mail31.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  57u inet 33766407 TCP 192.168.10.3:3345->mail26.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  58u inet 33749884 TCP 192.168.10.3:2888->mail27.oserver.com:60000 (CLOSE_WAIT)
transmiss 20105 nobody  59u inet 33749885 TCP 192.168.10.3:1260->mail28.oserver.com:6969 (CLOSE_WAIT)
...
transmiss 20105 nobody  67u inet 33766397 TCP 192.168.10.3:2297->mail31.oserver.com:6969 (CLOSE_WAIT)
transmiss 20105 nobody  68u inet 33766736 TCP 192.168.10.3:2234->mail30.oserver.com:60000 (CLOSE_WAIT)
...
transmiss 20105 nobody 346u inet 33746276 TCP 192.168.10.3:2721->mail29.oserver.com:60000 (CLOSE_WAIT)

$ date && lsof -p 20105 | grep CLOSE_WAIT | wc -l
Tue Feb 28 00:43:56 CST 2012
20
$ date && lsof -p 20105 | grep CLOSE_WAIT | wc -l
Tue Feb 28 13:22:52 CST 2012
99
I modified the tracker name, but notice they are using DNS based redundancy: one tracker name, multiple IP addresses, and several ports, all of which appear in the torrents as primary trackers. Their failure looks temporary, or perhaps some hosts work and some don't.
In the same setup (number of torrents, average activity) version 2.42 did not have this problem. FD leaking started when I changed to 2.50 (but see next paragraph).
The problem (these sockets never being released) could be in libevent; I'm using version 2.0.17, released on 2012-02-10. I haven't tested whether Transmission 2.42 with this version of libevent has the problem.
Change History (14)
comment:1 follow-up: ↓ 12 Changed 11 years ago by x190
comment:2 Changed 11 years ago by x190
- Summary changed from FD leak due to failed scrap to FD leak due to failed scrape
comment:3 follow-up: ↓ 4 Changed 11 years ago by rb07
The sockets in CLOSE_WAIT are increasing, and keep increasing until the daemon hits the set limit (which I guess is the one in settings.json: open-file-limit).
It could be a libcurl problem, because I changed to curl 7.24.0 on Jan 24, so it's also a change that came with Transmission 2.50. But I don't think it's the option you show; IIRC the option SO_LINGER is what's needed for closing idle sockets, which is exactly this case: the connection was opened, there was no response whatsoever and none will come, so hit the time-out, close, and dispose of the socket.
Did I misspell scrape? Thanks for the correction.
comment:4 in reply to: ↑ 3 Changed 11 years ago by x190
comment:5 Changed 11 years ago by x190
comment:6 follow-up: ↓ 7 Changed 11 years ago by rb07
I agree, if the problem is in libcurl I'll report it there.
But the point of this ticket is that I have only seen this problem since version 2.50; I had to restart the daemon 3 times so far. The first was soon after installing 2.50, since my settings had open-file-limit set to 32 (this was the default some time ago).
I'm currently testing 2.42 with the same libraries (libcurl, libevent, etc) used to build 2.50, the result so far after a couple of hours:
$ date && lsof -p 20105 | grep CLOSE_WAIT | wc -l
Wed Feb 29 16:18:14 CST 2012
0
And the relevant part of the log is that there have only been 2 "Scrape error" lines, so it's too early to reach any conclusion.
I'm about to load a torrent with many tracker errors, let's see how that goes.
comment:7 in reply to: ↑ 6 Changed 11 years ago by x190
comment:8 Changed 11 years ago by rb07
That's useless information. Do you really think I don't know that? No, that was not the reason the daemon had too many open files; the reason is in the title of this ticket.
Oh, BTW: after 1.5 days the CLOSE_WAIT count is 55 for version 2.42. It looks like your "curl is the problem" theory is false.
comment:9 Changed 11 years ago by cfpp2p
I've tested with v2.50 and with torrents with lots of bad trackers, plus I edited some trackers to create bogus ones. Once the CLOSE_WAIT count got to 4 and promptly went back to zero shortly thereafter. After several hours the CLOSE_WAIT count stays at zero. I logged lsof every few minutes. So this might not be helpful, but I can't reproduce a buildup of CLOSE_WAIT.
comment:10 Changed 11 years ago by rb07
More info about the problem:
- Curl does have a problem with DNS based redundancy (http://curl.haxx.se/mail/lib-2012-01/0190.html): in short, it divides the timeout value by the number of hosts returned (8 in the case I show, so each host gets only one eighth of the configured timeout), making these time-outs more likely. It also makes simultaneous connections to all hosts, which explains why their number was growing so fast.
- The real problem is that the daemon doesn't reuse those timed-out curl sessions, it just creates a new one (leaking the socket and the curl session). Upon further review this is not true, and it's not a time-out either.
This could be a local problem, given what cfpp2p reports (that the number does decrease).
On the version 2.42 test: I stopped that test with the CLOSE_WAITs growing at half the rate observed with 2.50. I'll try a new test with 2.50 reusing curl sessions; the alternative is to keep creating new ones, but closing the old ones which had the error.
comment:11 Changed 11 years ago by x190
From a careful re-read of #4221:
As for fixing announce-udp... that code doesn't use libcurl. Was your router so full of CLOSE_WAIT that no network connections of any kind were possible? [rb07 is saying, yes as regards transmission-daemon] https://trac.transmissionbt.com/ticket/4221#comment:15
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/netstat.1.html
CLOSE_WAIT: The socket connection has been closed by the remote peer, and
the system is waiting for the local application to close its half of the connection.
As with cfpp2p, I have not seen this cause an issue; however, it would appear that, at least for some, it can and does.
comment:12 in reply to: ↑ 1 ; follow-up: ↓ 13 Changed 11 years ago by rb07
Replying to x190:
Daniel Stenburg in reply to Jordan's code submission re: the CLOSE_WAIT issue 10 months ago.
The name is Daniel Stenberg, and for reference, the post, which I had already read, is at: http://sourceforge.net/tracker/?func=detail&aid=3297118&group_id=976&atid=100976
[snip]
Admittedly, he uses a couple of weasel words such as "probably" and "supposed to".
I disagree: he answered correctly that the pseudo-code looked fine, and that it wasn't a bug report, just a question about how libcurl works.
He also asked some questions to be able to help, which were never answered before the ticket was closed automatically.
Other users who have reported a build-up of CLOSE_WAITs have admitted to having problems with their router.
I have not mentioned a router. In fact, I don't run the daemon on a router, and there is no problem with the server where it's running.
Please see also:
http://curl.haxx.se/mail/tracker-2005-12/0024.html
"Explicit close when perform() is done is forced with CURLOPT_FORBID_REUSE."
See my comment:10 above.
Replying to x190:
rb07 is saying, yes
No, I'm not.
How do you think I was able to show the log and the count of CLOSE_WAITs? It's the daemon that uses up all its allotted connections, not the server.
This is not a forum to exchange points of view and useless comments that only show the problem and the technical facts are not understood. This is a bug tracker; I reported a bug, added relevant information, and if nobody beats me to it, I will post the solution.
I appreciate you trying to help, and cfpp2p reporting things from his side. But I had already read all the similar threads on this tracker. I already knew that the last time this problem was solved was in version 2.33; I do not know when the problem appeared (which is something the people running every revision should have reported, to make things easier). Between 2.42 and 2.50 there seems to be a problem, and I know that the problem is triggered by the tracker, one specific tracker, and it may be just a coincidence that I never saw it with 2.42.
comment:13 in reply to: ↑ 12 Changed 11 years ago by x190
Replying to rb07:
if nobody beats me to it, I will post the solution.
What more could anyone ask? :-)
comment:14 Changed 11 years ago by rb07
- Resolution set to invalid
- Status changed from new to closed
Solved. It's not a Transmission error; I'm marking the ticket as invalid.
The cause of the problem was building curl with the option --enable-threaded-resolver, while the underlying (system) resolver library is not capable of doing the job.
Solution: install c-ares and rebuild curl with option --enable-ares.
Alternative: just don't use the threaded resolver option (but c-ares is recommended).
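For anyone hitting the same thing, the rebuild looks roughly like this (a sketch only: tarball names, versions, and install prefixes are illustrative, not taken from this ticket):

```shell
# 1. Build and install c-ares, the asynchronous DNS resolver library.
tar xzf c-ares-1.7.5.tar.gz && cd c-ares-1.7.5
./configure && make && sudo make install
cd ..

# 2. Rebuild curl against c-ares instead of the threaded resolver.
tar xzf curl-7.24.0.tar.gz && cd curl-7.24.0
./configure --enable-ares
make && sudo make install

# 3. Confirm the resolver backend: an ares-enabled build lists AsynchDNS.
curl-config --features | grep -i asynchdns
```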
Side note: the next curl version (7.25) improves the handling of redundant hosts (like the tracker with DNS based redundancy described above). But this was never the real problem.
Daniel Stenberg in reply to Jordan's code submission re: the CLOSE_WAIT issue 10 months ago.
"The CLOSE_WAIT sockets are probably the ones that libcurl has in its connection cache but that have been "closed" (a FIN was sent) by the server already but not yet by libcurl.
They are not there "indefinitely" (and they really can't be) since the connection cache has a limited size so eventually the old connections should get closed. Also, for each newly created connection when searching if there's an existing one, libcurl is supposed to detect and close "dead" connections such as these. Possibly that detection is somehow not working on Mac OS X.
Your code looks correct and quite frankly, I don't think you describe a bug but mere symptoms of how libcurl works."
https://trac.transmissionbt.com/ticket/4221#comment:17