grant permits in round robin for the first N queries #6119
base: main
Conversation
What metric will be improved by this change, and which metric will get worse? My understanding is that:
Is this a patch meant to address the current urgent PoC, or is it meant as a long-term solution? I think @rdettai's approach using differentiated service was on the right track, but using a separate pool was too wasteful.
the average latency should be exactly the same, just moving it from one query to another.
rdettai-sk
left a comment
For anyone else reading this thread:
- This PR is about (partially) reverting #5509
- The alternative with differentiated services mentioned here is this #6010
> average latency should be exactly the same
Took me a while to understand @fulmicoton's point as to why this isn't true. I had to come up with an example which I think explains it. Say you have 2 queries with 5 splits each, the provider is initially full, and one permit frees up per time unit:
- sequentially, query 1 waits 5 time units and query 2 waits 10, so an average of 7.5
- in round robin, query 1 waits 9 time units and query 2 waits 10, so an average of 9.5
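The arithmetic above can be checked with a tiny simulation (a hypothetical sketch, not Quickwit code) that grants one permit per time unit and records when each query's last split is served:

```rust
// Completion time of each query's last split when permits are granted
// strictly sequentially (query after query).
fn completion_times_sequential(splits_per_query: &[u64]) -> Vec<u64> {
    let mut t = 0u64;
    splits_per_query
        .iter()
        .map(|n| {
            t += n;
            t
        })
        .collect()
}

// Completion time of each query's last split when permits are granted
// round robin, one permit per tick, cycling over unfinished queries.
fn completion_times_round_robin(splits_per_query: &[u64]) -> Vec<u64> {
    let mut remaining: Vec<u64> = splits_per_query.to_vec();
    let mut done = vec![0u64; remaining.len()];
    let mut t = 0u64;
    while remaining.iter().any(|&r| r > 0) {
        for (i, r) in remaining.iter_mut().enumerate() {
            if *r > 0 {
                t += 1;
                *r -= 1;
                if *r == 0 {
                    done[i] = t;
                }
            }
        }
    }
    done
}

fn main() {
    let splits = [5u64, 5];
    println!("{:?}", completion_times_sequential(&splits)); // [5, 10], mean 7.5
    println!("{:?}", completion_times_round_robin(&splits)); // [9, 10], mean 9.5
}
```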
I really like the idea of shortest remaining time. My only concern is regarding its behavior when the remaining time's estimate is bad, i.e. some splits are cached and others are not. But I don't think it would make things worse than the current scheduler.
```rust
let (leaf, leaf_position) = if let Some(leaf_permit_request) =
    self.permits_requests.get_mut(self.next_query_to_serve)
{
    (leaf_permit_request, self.next_query_to_serve)
} else if let Some(leaf_permit_request) = self.permits_requests.front_mut() {
    (leaf_permit_request, 0)
```
nit: for easier proofreading, I think we can mutate `self.next_query_to_serve` instead of returning `leaf_position`
tbh, I don't like shortest remaining time. @fulmicoton what did you mean by "watered down version" of SRTF? Did you have a specific idea of which part you wanted to cut out or change? It feels like (assuming a good heuristic, which again is hard) SRTF is really simple, without any part you can remove.
@trinity-1686a By watered down I did not mean simplify. I meant watering down its property. My main concern was your second point. I don't have any great idea yet unfortunately, but to give you an example: your job could have a budget of "letting other jobs go in front of you". Once this budget is spent, you stop the SRTF behavior and just execute before any other job that arrived after you. I don't think this is a great idea either though.
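The "bypass budget" variant described above could look roughly like this (purely illustrative: `Job`, `BYPASS_BUDGET`, and `pick` are made-up names, not anything in Quickwit):

```rust
#[derive(Debug)]
struct Job {
    arrival: u64,   // arrival order of the job
    remaining: u64, // SRTF heuristic, e.g. remaining splits
    bypassed: u64,  // how many times other jobs went in front of it
}

// Illustrative budget of "times a job tolerates being bypassed".
const BYPASS_BUDGET: u64 = 3;

/// Pick the next job: SRTF by `remaining`, except that a job which has
/// exhausted its budget beats every non-exhausted job, and exhausted
/// jobs are ordered among themselves by arrival.
fn pick(jobs: &mut [Job]) -> Option<&mut Job> {
    let idx = jobs
        .iter()
        .enumerate()
        .min_by_key(|(_, j)| {
            if j.bypassed >= BYPASS_BUDGET {
                (0, j.arrival) // budget spent: fall back to arrival order
            } else {
                (1, j.remaining) // normal SRTF ordering
            }
        })
        .map(|(i, _)| i)?;
    // every job passed over consumes one unit of its budget
    for (i, j) in jobs.iter_mut().enumerate() {
        if i != idx {
            j.bypassed += 1;
        }
    }
    jobs.get_mut(idx)
}

fn main() {
    let mut jobs = vec![
        Job { arrival: 0, remaining: 100, bypassed: BYPASS_BUDGET }, // budget spent
        Job { arrival: 1, remaining: 2, bypassed: 0 },
    ];
    // plain SRTF would pick the short job; here the exhausted one goes first
    assert_eq!(pick(&mut jobs).unwrap().arrival, 0);
}
```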
The worst collapses of search performance I have seen on QW happened in a similar scenario: you end up with all queries starting right before their timeout, most of them just consuming resources before getting cancelled, effectively keeping the searchers in a steady regime of high failure rate. Not to mention that this risk gets higher as you scale the searcher fleet, because a single searcher entering this regime is enough to compromise all searches cluster-wide. I feel like SRTF would better protect us from this while providing good average efficiency. Starvation is the lesser of two evils for us 😅
That kinda sounds like Highest Response Ratio Next. I'm worried SRTF might have some bad behaviours in some cases, but FIFO definitely has bad behaviours in cases that matter to us. My N-first RR will have the same issue as FIFO given enough big queries in parallel, and @rdettai-sk's 2-FIFO is likely very sensitive to its configuration (and isn't work-conserving, though we could fix that).
So here is what I propose: we start with SRTF (heuristic = number of splits at the leaf level). If it ends up working well, we can leave it that way. If it has issues, we make the SearchPermitProvider support multiple schedulers through configuration, so we can more easily test multiple ideas and eventually pick a winner. wdyt?
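A minimal sketch of the proposed policy (SRTF keyed on remaining leaf splits; `PermitRequest` and `next_to_serve` are illustrative names, not the actual SearchPermitProvider types):

```rust
#[derive(Debug)]
struct PermitRequest {
    query_id: u64,
    remaining_splits: usize, // the SRTF heuristic: splits left at the leaf level
}

/// Index of the request to serve next: the query with the fewest
/// remaining splits, breaking ties by arrival order (queue index).
fn next_to_serve(requests: &[PermitRequest]) -> Option<usize> {
    requests
        .iter()
        .enumerate()
        .min_by_key(|(idx, req)| (req.remaining_splits, *idx))
        .map(|(idx, _)| idx)
}

fn main() {
    let queue = vec![
        PermitRequest { query_id: 1, remaining_splits: 40 },
        PermitRequest { query_id: 2, remaining_splits: 3 },
        PermitRequest { query_id: 3, remaining_splits: 3 },
    ];
    // query 2 wins: fewest remaining splits, earliest arrival on tie
    assert_eq!(next_to_serve(&queue), Some(1));
}
```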
Description
When scheduling leaf splits, allow up to N queries to run concurrently so that a single large query doesn't hog all resources on a searcher.
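As a rough illustration of the idea (hypothetical code, not the actual SearchPermitProvider implementation): permits cycle round robin among the first N pending queries, while queries beyond the window wait FIFO until a slot frees up.

```rust
/// Returns the sequence of query indices in which permits would be
/// granted: round robin over the oldest `n` unfinished queries.
fn grant_order(splits_per_query: &[usize], n: usize) -> Vec<usize> {
    let mut remaining: Vec<usize> = splits_per_query.to_vec();
    let mut order = Vec::new();
    loop {
        // the "window": first n queries that still have splits left
        let active: Vec<usize> = remaining
            .iter()
            .enumerate()
            .filter(|(_, &r)| r > 0)
            .map(|(i, _)| i)
            .take(n)
            .collect();
        if active.is_empty() {
            break;
        }
        for i in active {
            remaining[i] -= 1;
            order.push(i);
        }
    }
    order
}

fn main() {
    // 3 queries of 2 splits each, window of 2: query 2 only starts
    // once one of the first two queries is fully served.
    assert_eq!(grant_order(&[2, 2, 2], 2), vec![0, 1, 0, 1, 2, 2]);
}
```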
How was this PR tested?
Added a unit test verifying that permits are emitted in the expected order.