Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#2964 closed Bug (fixed)

Multiple simultaneous DNS resolve operations fail on some platforms causing unnecessary delays in successful announces.

Reported by: gongloo Owned by: charles
Priority: Normal Milestone: 1.92
Component: libtransmission Version: 1.91
Severity: Normal Keywords:
Cc: felipe.contreras@…

Description

It seems that using libevent for dns queries is unreliable for more than a few people on Debian (as well as other distributions, according to some quick searches).

I had been investigating an issue with my tracker being intermittently unscrapable and found that the issue came down to libevent returning with DNS_ERR_TIMEOUT. I found this by throwing in a debug statement with the following output:

[04:00:14.681] web dns_ipv4_done_cb(err = 67, char = , count = 0, ttl = 0) (web.c:453)

err=67 is, of course, DNS_ERR_TIMEOUT according to the libevent header file.

If this is indeed what's causing the same symptoms on other users' machines (typically chalked up to the tracker being unavailable), perhaps it's time for libevent to be dropped from Transmission?

I thought I'd file this ticket in case this particular issue (libevent returning on timeout even though resolves using other methods return in well under libevent's 5-second default) hasn't already been noted, or has a known workaround that just isn't publicized.

For the time being, I've worked around the issue for my own purposes by setting task->resolved_host to task->host when no resolved host could be retrieved via libevent. I know that this will cause curl to block (perhaps with some exceptions), but it seems to be working beautifully for me as a band-aid until something else can be figured out.

Of course, this could just be (and probably is) a bug in libevent or in my box's configuration, but again, I figure it's better to report such things.

Attachments (1)

web.c.patch (1.3 KB) - added by gongloo 12 years ago.
Patch to web.c for DNS cache changes.


Change History (14)

comment:1 Changed 12 years ago by gongloo

  • Summary changed from libevent unreliable on Debian to Multiple simultaneous DNS resolve operations fail on some platforms causing unnecessary delays in successful announces.

Sorry for the spam on this one. Upon further investigation, it looks like the issue may be that multiple simultaneous libevent resolves are taking too long. Again, this could be a box configuration issue.

It seems that, at startup with several torrents on the same tracker, the first four DNS resolutions succeed immediately and any additional simultaneous resolves fail. The first few create successful DNS cache entries, but that cache is later invalidated by calls to dns_cache_set_fail when the remaining lookups return with a timeout. Subsequent queries against the cache then, of course, return as failed, and since the minimum TTL on the cache is so large (24 hours), it takes quite some time for the system to right itself. A simple workaround was to bump MIN_DNS_CACHE_TIME down.

Proposed Short-Term Solution

I think what's more reasonable is to have a MIN_DNS_CACHE_FAIL_TIME, which applies as the TTL for dns_cache_set_fail calls. That way, failed DNS calls are re-attempted at a shorter interval than 24 hours. If that sounds reasonable to you guys and you'd like me to submit a patch, let me know.

As an aside, I can't think of a reason why there's even a MIN_DNS_CACHE_TIME in the first place. Shouldn't we just respect the TTL that the DNS lookup gives us?

comment:2 Changed 12 years ago by charles

gongloo: I agree with most of what you're saying in this short-term solution, but one thing I'd be interested in knowing is what libevent returns as a TTL in these DNS failures that you're describing on Debian. If we're getting an improper DNS failure, we might want to not honor that TTL.

comment:3 follow-up: Changed 12 years ago by charles

gongloo: does the error only happen when we queue up a lot of DNS lookups for the same host?

comment:4 in reply to: ↑ 3 Changed 12 years ago by gongloo

Replying to charles:

one thing I'd be interested in knowing is what libevent returns as a ttl in these DNS failures that you're describing on Debian. If we're getting an improper DNS failure, we might want to not honor that ttl.

As shown in the original ticket description, my debug statement indicates that the TTL returned by libevent in the timeout case is 0. I think we actually do want to honor that TTL, as in my mind it makes sense to retry timed-out DNS queries immediately. A cursory look at the libevent source suggests that anytime a DNS failure is reported, the TTL is reported as 0. What makes the most sense to me is to either respect the TTL given by libevent in all circumstances (failure or not), or respect it on success and use some new macro such as MIN_DNS_CACHE_FAIL_TIME (set to some low value such as 5, 30, 60, or even 120 seconds, but preferably not higher than that).

Either way I suggest getting rid of MIN_DNS_CACHE_TIME altogether since it, at least in my mind, disregards the whole idea of ttl and caching rules.

Replying to charles:

does the error only happen when we queue up a lot of DNS lookups for the same host?

The error seems to occur regardless of how many different hosts are being looked up as long as the lookups are concurrent.

comment:5 Changed 12 years ago by charles

Both of these alternatives sound reasonable. Of the two, I lean towards the MIN_DNS_CACHE_FAIL_TIME approach, so that we don't wind up looking up invalid hostnames ad infinitum when some torrent's announce list has bad trackers in it. Do you want to make a patch for this?

As for the larger question of why libevent returns on timeout when other resolves work in less time than libevent's 5 second default... you really need to submit that one upstream to libevent's bug tracker.

Changed 12 years ago by gongloo

Patch to web.c for DNS cache changes.

comment:6 Changed 12 years ago by gongloo

In the attached patch, I've done away with MIN_DNS_CACHE_TIME, which was used in three places:

  1. When adding a successful DNS resolution to the DNS cache. The TTL returned by libevent is now honored without being mangled.
  2. When adding a failed DNS resolution to the DNS cache. A new DNS_CACHE_FAIL_TTL constant is now used as the lifetime for the cache entry.
  3. When setting up libcurl options for DNS resolution. This was removed since we're not using libcurl to do any DNS resolution anyway.

I set DNS_CACHE_FAIL_TTL to 120 seconds, which is the default DNS cache lifetime for most libraries and programs (libcurl included). In doing so, I took the liberty of re-ordering the enum contents to reflect their natural ordering.

comment:7 follow-up: Changed 12 years ago by felipec

  • Cc felipe.contreras@… added

Why do we need to resolve hostnames anyway? That's an expensive operation that should be avoided if possible.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 12 years ago by charles

Replying to felipec:

Why do we need to resolve hostnames anyway? That's an expensive operation that should be avoided if possible.

How do you propose we do that?

comment:9 in reply to: ↑ 8 Changed 12 years ago by felipec

Replying to charles:

Replying to felipec:

Why do we need to resolve hostnames anyway? That's an expensive operation that should be avoided if possible.

How do you propose we do that?

Sorry, I was talking about reverse DNS lookups; no need for them.

comment:10 Changed 12 years ago by charles

  • Milestone changed from None Set to 1.92
  • Status changed from new to assigned

Fixed in trunk for 1.92 by r10298. Thanks for the patch. :)

comment:11 Changed 12 years ago by charles

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:12 follow-up: Changed 12 years ago by charles

For #2987, I've committed a revision of web.c that delegates all the announce work to a worker thread, so that we can let libcurl handle most of the details and get rid of a lot of our edge-case code.

This commit is relevant to this ticket because it removes all of our evdns lookup code and custom caching code.

gongloo, I'd like to get some wider testing on this new code before 1.92 is released. Could you test out a nightly build (tarballs @ http://build.transmissionbt.com/job/trunk-linux/) and see if it suffers from the delays reported in this ticket?

comment:13 in reply to: ↑ 12 Changed 12 years ago by gongloo

Replying to charles:

gongloo, I'd like to get some wider testing on this new code before 1.92 is released. Could you test out a nightly build (tarballs @ http://build.transmissionbt.com/job/trunk-linux/) and see if it suffers from the delays reported in this ticket?

Looks like the resolves fail, causing a 'tracker did not respond' error, when multiple simultaneous resolves happen. Each is retried in a matter of seconds, though, and once retried they go through. Looks reasonable to me.
