Taking weighting seriously #487

gragusa · 2022-07-15T16:07:11Z

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following:

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts))
## Probability Weights
lm(@formula(y~x), df; wts=pweights(df.wts))

The old behavior (passing a plain vector, wts=df.wts) is deprecated; for the moment, the array df.wts is coerced to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.
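As a rough illustration of what dispatching on a weights type parameter can look like, here is a minimal sketch. All type and function names below (SketchWeights, SketchPred, crossprod) are hypothetical stand-ins defined only for this example; they are not the types used in GLM.jl or StatsBase.

```julia
using LinearAlgebra

# Hypothetical stand-ins for weight types, defined here so the sketch is self-contained.
abstract type SketchWeights end

struct SketchUnitWeights <: SketchWeights
    n::Int
end

struct SketchFrequencyWeights <: SketchWeights
    values::Vector{Float64}
end

# A predictor type parameterized by the kind of weights it carries (hypothetical name).
struct SketchPred{W<:SketchWeights}
    X::Matrix{Float64}
    wts::W
end

# Dispatch on the weights parameter: the unweighted case skips the row scaling entirely.
crossprod(p::SketchPred{SketchUnitWeights}) = p.X' * p.X
crossprod(p::SketchPred{SketchFrequencyWeights}) = p.X' * (p.wts.values .* p.X)

X = [1.0 2.0; 1.0 3.0; 1.0 5.0]
unweighted = crossprod(SketchPred(X, SketchUnitWeights(3)))
weighted = crossprod(SketchPred(X, SketchFrequencyWeights([1.0, 1.0, 1.0])))
```

With all weights equal to one, both methods return the same cross-product, which is the property an unweighted model relies on.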

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool=false). The new signature for these two methods is pending in StatsAPI.
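To make the weighted keyword concrete, here is a small sketch under one common convention, namely that the weighted model matrix and weighted residuals scale each row by the square root of its weight. That convention is an assumption for illustration; the code uses plain arrays rather than the PR's methods.

```julia
using LinearAlgebra

X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
y = [1.0, 2.5, 2.9, 4.4]
w = [1.0, 0.5, 2.0, 1.0]

# Assumed convention: the weighted quantities scale each row by sqrt(w[i]).
Xw = sqrt.(w) .* X
yw = sqrt.(w) .* y

# Least squares on the scaled data solves the weighted normal equations.
β = Xw \ yw
β_normal = (X' * (w .* X)) \ (X' * (w .* y))

# Unweighted and weighted residuals under the same convention.
resid = y - X * β
resid_weighted = sqrt.(w) .* resid
```

Under this convention the sum of squared weighted residuals equals the weighted sum of squared raw residuals.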

I had to make many changes to get everything working. Tests are passing, but some new features need new tests. Before implementing them, I wanted to make sure that the approach taken is acceptable.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I came to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove that keyword from the signature.

Update

I think I covered all the suggestions/comments, with this exception, which I still have to think about. Maybe this can be addressed later. The new standard errors (the ones for ProbabilityWeights) also work in the rank-deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything that I have implemented. I also added a section in the documentation about using weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

Deal with weighted standard errors with rank deficient designs
Document the new API
Improve testing

Closes #186, #259.

codecov-commenter · 2022-07-16T08:43:43Z

Codecov Report

❌ Patch coverage is 99.52038% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.98%. Comparing base (ae2943f) to head (e3660d9).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/glmfit.jl	99.22%	1 Missing ⚠️
src/negbinfit.jl	92.85%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
+ Coverage   95.42%   96.98%   +1.56%     
==========================================
  Files           8        8              
  Lines        1006     1196     +190     
==========================================
+ Hits          960     1160     +200     
+ Misses         46       36      -10

lrnv · 2022-07-20T07:45:33Z

Hey,

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables).

gragusa · 2022-07-20T17:14:41Z

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

Not really. But it would be easy to make this a feature. Before digging further into this, I would like to know whether there is consensus on the approach of this PR.
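For readers hitting the same length mismatch, a workaround sketch is to build the complete-case mask once and apply it to the weights as well as to the model variables. The variable names below are made up for the example, and this uses plain vectors rather than a DataFrame.

```julia
# Toy data with a missing response in the second row.
y = [1.0, missing, 3.0, 4.0]
x = [0.5, 1.5, 2.5, 3.5]
w = [2.0, 1.0, 1.0, 3.0]

# Rows where every model variable is observed.
keep = .!ismissing.(y) .& .!ismissing.(x)

# Subset the weights with the same mask so the lengths still agree.
y_used = collect(skipmissing(y[keep]))
x_used = x[keep]
w_used = w[keep]
```

After this step the response, regressor, and weights all have the same number of rows and can be passed to the fitting routine together.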

alecloudenback · 2022-08-14T19:14:57Z

FYI this appears to fix #420; a PR was started in #432, and the author closed it for lack of time to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    # Pass rng to every draw so the test data are reproducible.
    x1 = randn(rng, 100)
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa? It would be great to get this merged if it is good to go.

nalimilan · 2022-08-28T18:27:50Z

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

nalimilan

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!

What are your thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

bkamins · 2022-08-31T08:49:28Z

A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights for the same set of data (coefficients, predictions, covariance matrix of the estimates, p-values, etc.)?
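One building block for such a comparison is the identity that frequency weights are equivalent to repeating rows. The sketch below checks it for the coefficients with plain arrays; it does not call GLM.jl, and the data are invented for the example. Weighted least squares point estimates depend only on the weight values, so the weight types are expected to differ in the covariance matrix and p-values rather than in the coefficients.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.1, 4.2]
freq = [1, 3, 2, 1]   # integer frequency weights

# Weighted least squares on the compact data.
X = [ones(length(x)) x]
β_weighted = (X' * (freq .* X)) \ (X' * (freq .* y))

# Ordinary least squares after repeating row i freq[i] times.
idx = reduce(vcat, [fill(i, freq[i]) for i in eachindex(freq)])
β_expanded = X[idx, :] \ y[idx]
```

Here idx has sum(freq) entries, so the expanded regression runs on seven rows while the weighted one runs on four.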

gragusa · 2025-12-18T11:17:46Z

Looks like all new code is tested now! Can you comment on momentmatrix (see above) before I merge?

The momentmatrix function computes the score/estimating equation contributions for each observation in a fitted model.
The function returns an n × p matrix where n is the number of observations and p is the number of parameters. The ith row contains the score contribution (gradient of the log-likelihood with respect to β) for observation i.

The implementation for the linear model is in src/lm.jl:366-372, while the GLM implementation is in src/glmfit.jl:816-823.

This function is used when calculating the variance under ProbabilityWeights, which uses the sandwich formula V = A⁻¹ B A⁻¹,
where A is the Hessian of the log-likelihood (X'X for the linear model, so A⁻¹ = inv(X'X)) and B = mm' * mm (where mm is the momentmatrix evaluated at the estimated parameter value).
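A minimal sketch of this computation for a weighted linear model follows. It assumes that A is the weighted cross-product X'WX and that row i of mm is w[i] * residual[i] * X[i, :]; it omits any finite-sample correction and does not reproduce the code in src/lm.jl.

```julia
using LinearAlgebra

# Toy design matrix (intercept plus one regressor), response, and probability weights.
X = [ones(6) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3]
w = [1.0, 2.0, 1.0, 0.5, 2.0, 1.5]

# Weighted least squares estimate: solves X'WX β = X'Wy.
A = X' * (w .* X)
β = A \ (X' * (w .* y))

# Estimating-function contributions: row i is w[i] * residual[i] * X[i, :].
resid = y - X * β
mm = (w .* resid) .* X

# Sandwich covariance V = A⁻¹ B A⁻¹ with B = mm' * mm.
B = mm' * mm
V = A \ B / A
```

At the estimate, the columns of mm sum to zero up to rounding, which is just the first-order condition of the weighted fit and a cheap sanity check on the moment matrix.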

nalimilan · 2025-12-18T14:06:53Z

OK, it seems there are different uses of the expression "moment matrix" in regression, but anyway it's internal for now so we can have this discussion in JuliaStats/StatsAPI.jl#16.

See https://documenter.juliadocs.org/stable/man/doctests/#Filtering-Doctests to limit precision in doctests.

nalimilan

I've taken the liberty to push a few commits to push the PR over the finish line. One of the commits removes the export of weights constructors for now as @devmotion had reservations about it and I don't want this minor discussion to block merging the PR. We can continue this in another PR/issue. I'd also like to turn deprecation warnings about passing a non-AbstractWeights type into an error on master before tagging 2.0, and deprecate it in 1.x (#619).

Thanks for persisting through 3.5 years and 500+ comments @gragusa! Now we need to finish 2.0 and release it.

If you still have some energy for this, it would be interesting to implement the missing log-likelihood for some weight types that we left aside.

gragusa · 2025-12-23T18:04:22Z

Super!

ViralBShah · 2025-12-23T23:27:31Z

Does this close #540?

nalimilan · 2025-12-24T17:24:25Z

I don't think so.

gragusa · 2025-12-24T18:47:26Z

No. But internally the residuals of #540 are implemented. It should be easy to expose (and rename) these internal methods to implement what is proposed in #540.

nalimilan · 2025-12-24T21:35:26Z

test/runtests.jl

+    @test_logs (:warn,
+                "Using `wts` of zero length for unweighted regression is deprecated in favor of " *
+                "explicitly using `UnitWeights(length(y))`." *
+                " Proceeding by coercing `wts` to UnitWeights of size $(N).")


@gragusa I had missed this, but it turns out it's a no-op. And if I fix it to check the logs from the line above, it fails because we don't print a warning for uweights(0). Do you think we can just remove it?

Below the situation is similar but not exactly the same. The tests for loglikelihood(lm1) are duplicated, and we can't check the deprecation and test pweights at the same time.

gragusa · 2025-12-24T21:48:53Z

I think this is a leftover and was not supposed to be there (previous iterations were throwing a warning). Should I remove it? It seems the only viable way since uweights(0) cannot warn.

nalimilan · 2025-12-25T16:43:33Z

If you can't see why it's there, then it's better to remove it, yes. And what about the other test below?

nalimilan · 2025-12-26T10:56:07Z

src/lm.jl

+            v += abs2(y[i] - m) * wts[i]
        end
    end
    return v


AFAICT this should be changed to match deviance and the GLM method, right? Otherwise r2 isn't correct. Looks like we need a test for LinearModel with pweights that would cover this.

return wts isa ProbabilityWeights ? v ./ (sum(wts) / length(y)) : v
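To show what the suggested rescaling does, here is a self-contained sketch. The helper name weighted_tss and the is_probability flag are hypothetical; the flag stands in for the wts isa ProbabilityWeights check so that no packages are needed, and using the weighted mean for m is an assumption since the snippet above does not show how m is computed.

```julia
# Hypothetical helper mirroring the snippet above; `is_probability` stands in for
# the `wts isa ProbabilityWeights` check so the sketch needs no packages.
function weighted_tss(y::Vector{Float64}, wts::Vector{Float64}; is_probability::Bool=false)
    m = sum(wts .* y) / sum(wts)          # assumed: weighted mean of the response
    v = 0.0
    for i in eachindex(y, wts)
        v += abs2(y[i] - m) * wts[i]
    end
    # Dividing by the average weight makes the result invariant to the overall
    # scale of probability weights.
    return is_probability ? v / (sum(wts) / length(y)) : v
end

y = [1.0, 2.0, 4.0]
w = [1.0, 1.0, 2.0]
tss_small = weighted_tss(y, w; is_probability=true)
tss_large = weighted_tss(y, 10 .* w; is_probability=true)
tss_raw = weighted_tss(y, w)
```

Multiplying all probability weights by a constant leaves the rescaled value unchanged, whereas the unscaled sum grows with the weights.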

nalimilan · 2026-01-03T16:07:20Z

Something else I noticed: we should probably skip observations with zero weights when using probability weights. For other types of weights the definitions are mathematically correct with zero weights, but for probability weights we shouldn't count them in nobs at least, and maybe in other places. When in doubt, we could throw an error when we find zero weights for now.

gragusa added 20 commits June 10, 2022 20:53

WIP

1754cbd

WIP

1d778a5

WIP

12121a3

Taking weights seriously

4363ba4

WIP

ca702dc

Taking weights seriously

e2b2d12

Merge branch 'master' of https://github.com/JuliaStats/GLM.jl into JuliaStats-master

bc8709a

Add depwarn for passing wts with Vector

84cd990

Cosmettic changes

cbc329f

WIP

23d67f5

Fix loglik for weighted models

f4d90a9

Fix remaining issues

6b7d95c

Final commit

c236b82

Merge branch 'master'

d4bd0c2

Fix merge

8bdfb55

Fix nulldeviance

3eb2ca4

Bypass crossmodelmatrix drom StatsAPI

63c8358

Delete momentmatrix.jl

e93a919

Delete scratch.jl

7bb0959

Delete settings.json

ded17a8

ararslan requested review from andreasnoack and nalimilan August 15, 2022 19:54

nalimilan mentioned this pull request Aug 28, 2022

Fixed linear model with perfectly collinear rhs variables and weights #432

Closed

nalimilan reviewed Aug 31, 2022

gragusa added 4 commits December 18, 2025 12:35

Fix testing work warning about weights cohercion.

0e5f8b0

Merge branch 'JuliaStats-master' of github.com:gragusa/GLM.jl into JuliaStats-master

5ed9a1b

Fix handling of numerical noise in loglik evaluation

e65b5ff

Fix loglikelihood expected values in docs

01cedb6

nalimilan reviewed Dec 18, 2025

gragusa and others added 4 commits December 19, 2025 21:44

Update test/runtests.jl

4504088

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Fix import/using of weights from StatsBase

897b72f

Merge branch 'JuliaStats-master' of github.com:gragusa/GLM.jl into JuliaStats-master

154149b

Merge branch 'master' into JuliaStats-master

cd4c3cf

This was referenced Dec 22, 2025

Cleanup fit method arguments gragusa/GLM.jl#7

Closed

Cleanup fit method arguments #622

Merged

nalimilan added 5 commits December 22, 2025 23:33

Fix doctest

0321bc2

Run JuliaFormatter

9aee0ae

Do not export weights constructors

3160985

Clean up wts argument type

7a3bb3d

Fix doctest

66d6342

nalimilan approved these changes Dec 22, 2025

Merge remote-tracking branch 'origin/master' into JuliaStats-master

e3660d9

nalimilan merged commit e26c5d5 into JuliaStats:master Dec 23, 2025
10 checks passed

nalimilan reviewed Dec 24, 2025

nalimilan reviewed Dec 26, 2025

14 participants
