-
Notifications
You must be signed in to change notification settings - Fork 117
Taking weighting seriously #487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…liaStats-master
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #487 +/- ##
==========================================
+ Coverage 95.42% 96.98% +1.56%
==========================================
Files 8 8
Lines 1006 1196 +190
==========================================
+ Hits 960 1160 +200
+ Misses 46 36 -10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hey, Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables). |
not really. But it would be easy to make this a feature. But before digging further on this I would like to know whether there is consensus on the approach of this PR. |
|
FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures. Here's the test case pulled from #432 which passes with the in #487. @testset "collinearity and weights" begin
rng = StableRNG(1234321)
x1 = randn(100)
x1_2 = 3 * x1
x2 = 10 * randn(100)
x2_2 = -2.4 * x2
y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
f = @formula(y ~ x1 + x2 + x3 + x4)
lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
X = [ones(length(y)) x1_2 x2_2]
W = Diagonal(df.weights)
coef_naive = (X'W*X)\X'W*y
@test lm_model.model.pp.chol isa CholeskyPivoted
@test rank(lm_model.model.pp.chol) == 3
@test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
endCan this test set be added? Is there any other feedback for @gragusa ? It would be great to get this merged if good to go. |
|
Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments. |
nalimilan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!
What are you thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.
|
A very nice PR. In the tests can we have some test set that compares the results of |
The The implementation for the Linear Model is in (src/lm.jl:366-372), while the GLM implementation (src/glmfit.jl:816-823) This function is used when calculating the variance under |
|
OK, it seems there are different uses of the expression "moment matrix" in regression, but anyway it's internal for now so we can have this discussion in JuliaStats/StatsAPI.jl#16. See https://documenter.juliadocs.org/stable/man/doctests/#Filtering-Doctests to limit precision in docstests. |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
nalimilan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken the liberty to push a few commits to push the PR over the finish line. One of the commits removes the export of weights constructors for now as @devmotion had reservations about it and I don't want this minor discussion to block merging the PR. We can continue this in another PR/issue. I'd also like to turn deprecation warnings about passing a non-AbstractWeights type into an error on master before tagging 2.0, and deprecate it in 1.x (#619).
Thanks for persisting through 3,5 years and 500+ comments @gragusa! Now we need to finish 2.0 and release it.
If you still have some energy for this, it would be interesting to implement the missing log-likelihood for some weights types that we left aside.
|
Super! |
|
Does this close #540? |
|
I don't think so. |
|
No. But internally the residuals of #540 are implemented. It should be easy expose (and rename) these internal methods to implement what is proposed in #540.
Sent from Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Viral B. Shah ***@***.***>
Sent: Wednesday, December 24, 2025 12:27 AM
To: JuliaStats/GLM.jl ***@***.***>
Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***>
Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)
[https://avatars.githubusercontent.com/u/744411?s=20&v=4]ViralBShah left a comment (JuliaStats/GLM.jl#487)<#487 (comment)>
Does this close #540<#540>?
—
Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAVYSEOAWKA7CDMDIWD4DHFXVAVCNFSM6AAAAACOIB57ICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTMOBYGIYDGNBSGI>.
You are receiving this because you were mentioned.
|
| @test_logs (:warn, | ||
| "Using `wts` of zero length for unweighted regression is deprecated in favor of " * | ||
| "explicitly using `UnitWeights(length(y))`." * | ||
| " Proceeding by coercing `wts` to UnitWeights of size $(N).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gragusa I had missed this, but it turns out it's a no-op. And if I fix it to check the logs from the line above, it fails because we don't print a warning for uweights(0). Do you think we can just remove it?
Below the situation is similar but not exactly the same. The tests for loglikelihood(lm1) are duplicated, and we can't check the deprecation and test pweights at the same time.
|
I think this is a left-out and was not supposed to be there (previous iterations were throwing a warning. Should I remove it? It seems the only viable way since uweight(0) cannot warn.
Sent from Outlook for iOS<https://aka.ms/o0ukef>
________________________________
From: Milan Bouchet-Valat ***@***.***>
Sent: Wednesday, December 24, 2025 10:35:47 PM
To: JuliaStats/GLM.jl ***@***.***>
Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***>
Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487)
@nalimilan commented on this pull request.
________________________________
In test/runtests.jl<#487 (comment)>:
+ @test_logs (:warn,
+ "Using `wts` of zero length for unweighted regression is deprecated in favor of " *
+ "explicitly using `UnitWeights(length(y))`." *
+ " Proceeding by coercing `wts` to UnitWeights of size $(N).")
@gragusa<https://github.com/gragusa> I had missed this, but it turns out it's a no-op. And if I fix it to check the logs from the line above, it fails because we don't print a warning for uweights(0). Do you think we can just remove it?
Below the situation is similar but not exactly the same. The tests for loglikelihood(lm1) are duplicated, and we can't check the deprecation and test pweights at the same time.
—
Reply to this email directly, view it on GitHub<#487 (review)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAQXWBYBKHKBL3G7NDD4DMBLHAVCNFSM6AAAAACOIB57ICVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZTMMJRGU4TSNBSGE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
If you can't see why it's there then better remove it yes. And what about the other test below? |
| v += abs2(y[i] - m) * wts[i] | ||
| end | ||
| end | ||
| return v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this should be changed to match deviance and the GLM method, right? Otherwise r2 isn't correct. Looks like we need a test for LinearModel with pweights that would cover this.
return wts isa ProbabilityWeights ? v ./ (sum(wts) / length(y)) : v|
Something else I noticed: we should probably skip observations with a zero weights with probability weights. For other types of weights the definitions are mathematically correct with those, but for probability weights we shouldn't count them in |
This PR addresses several problems with the current GLM implementation.
Current status
In master, GLM/LM only accepts weights through the keyword
wts. These weights are implicitly frequency weights.With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following
The old behavior -- passing a vector
wts=df.wtsis deprecated and for the moment, the array os coerceddf.wtsto FrequencyWeights.To allow dispatching on the weights,
CholPredtakes a parameterT<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.This PR also implements
residuals(r::RegressionModel; weighted::Bool=false)andmodelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.There are many changes that I had to make to make everything work. Tests are passing, but some new feature needs new tests. Before implementing them, I wanted to ensure that the approach taken was liked.
I have also implemented
momentmatrix, which returns the estimating function of the estimator. I arrived to the conclusion that it does not make sense to have a keyword argumentweighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove such a keyword from the signature.Update
I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for
ProbabilityWeights) also work in the rank deficient case (and so doescooksdistance).Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using
Weightsand updatedjldocwith the new signature ofCholeskyPivoted.To do:
Closes #186, #259.