ENH: add parameter finder for degrees or freedom for students_t distribution#1385
ENH: add parameter finder for degrees or freedom for students_t distribution#1385dschmitz89 wants to merge 10 commits intoboostorg:developfrom
Conversation
|
CC @JacobHass8 would you have time to help out with a cursory initial review? |
| BOOST_CHECK_CLOSE( | ||
| students_t_distribution<RealType>::invert_probability_with_respect_to_degrees_of_freedom( | ||
| static_cast<RealType>(-6.96455673428326), | ||
| static_cast<RealType>(0.01)), | ||
| static_cast<RealType>(2), | ||
| tol_inv); |
There was a problem hiding this comment.
I see these test values are used in quite a few places. It would be nice to combine all these into a single function which checks every method. This ensures we only have to define the values for the spot check in one place. It also cuts down on the amount of code. Maybe this would be for a separate PR though?
| // | ||
| // Step 1: Edgeworth warm start. | ||
| // F(x; nu) ~ Phi(x) - phi(x)(x + x^3)/(4*nu) + phi(x)(3x + 5x^3 + 7x^5 - 3x^7)/(96*nu^2) | ||
| // Substituting u = 1/nu and setting F(x; nu) = p gives: | ||
| // B*u^2 - A*u + (Phi(x) - p) = 0 | ||
| // where: | ||
| // A = phi(x) * (x + x^3) / 4 | ||
| // B = phi(x) * (3x + 5x^3 + 7x^5 - 3x^7) / 96 | ||
| // | ||
| normal_distribution<RealType, Policy> std_normal(0, 1); | ||
| RealType phi = pdf(std_normal, x_abs); | ||
| RealType Phi = cdf(std_normal, x_abs); | ||
|
|
||
| RealType x2 = x_abs * x_abs; | ||
| RealType x3 = x2 * x_abs; | ||
| RealType x5 = x3 * x2; | ||
| RealType x7 = x5 * x2; | ||
|
|
||
| RealType A = phi * (x_abs + x3) / 4; | ||
| RealType B = phi * (3 * x_abs + 5 * x3 + 7 * x5 - 3 * x7) / 96; | ||
| RealType C = Phi - p_adj; | ||
|
|
||
| RealType hint = static_cast<RealType>(0.01); | ||
| RealType discriminant = A * A - 4 * B * C; | ||
| if (discriminant >= 0 && B != 0) | ||
| { | ||
| RealType sqrt_disc = sqrt(discriminant); | ||
| // Two roots of B*u^2 - A*u + C = 0; pick the smallest positive u (largest nu = 1/u). | ||
| RealType u1 = (A - sqrt_disc) / (2 * B); | ||
| RealType u2 = (A + sqrt_disc) / (2 * B); | ||
| RealType u = -1; | ||
| if (u1 > 0 && u2 > 0) | ||
| u = (u1 < u2) ? u1 : u2; | ||
| else if (u1 > 0) | ||
| u = u1; | ||
| else if (u2 > 0) | ||
| u = u2; | ||
|
|
||
| if (u > 0) | ||
| { | ||
| RealType nu_hat = 1 / u; | ||
| // Step 2: validate by checking relative residual of the exact CDF. | ||
| students_t_distribution<RealType, Policy> t_hat(nu_hat); | ||
| RealType exact_cdf = cdf(t_hat, x_abs); | ||
| RealType residual = (exact_cdf > p_adj) ? (exact_cdf - p_adj) : (p_adj - exact_cdf); | ||
| RealType relative_residual = (p_adj != 0) ? residual / p_adj : residual; | ||
| if (relative_residual <= tools::epsilon<RealType>() * 4) | ||
| return nu_hat; // Edgeworth estimate is already exact to machine precision. | ||
| if (relative_residual <= static_cast<RealType>(0.1)) | ||
| hint = nu_hat; | ||
| } | ||
| } |
There was a problem hiding this comment.
With all this in the function, it was initially hard for me to parse what was going on. Could this be put in a separate function called something like boost::math::detail::edgeworth_approximation? Then you could replace all this with hint = boost::math::detail::edgeworth_approximation(x_abs, p).
There was a problem hiding this comment.
Maybe refactor out, and use in the existing function? That would get automatically tested in the existing tests as well.
| } | ||
|
|
||
| template <class RealType, class Policy> | ||
| BOOST_MATH_GPU_ENABLED RealType students_t_distribution<RealType, Policy>::invert_probability_with_respect_to_degrees_of_freedom( |
There was a problem hiding this comment.
We already have a function for this here:
However, you have a better initial guess to the root finder which would be a very welcome addition!
There was a problem hiding this comment.
I found this function but was not sure how to map the inversion problem we have here (finding degrees of freedom given a probability and a quantile) to this one arising from sample size calculation.
The signatures are:
find_degrees_of_freedom(RealType difference_from_mean, RealType alpha, RealType beta, RealType sd, RealType hint)
invert_probability_with_respect_to_degrees_of_freedom(RealType x, RealType p)
| { | ||
| // CDF(0; df) = 0.5 for all df; only solvable when p == 0.5. | ||
| if (p_adj == static_cast<RealType>(0.5)) | ||
| return boost::math::numeric_limits<RealType>::infinity(); |
There was a problem hiding this comment.
Small point, this should return policies::raise_overflow_error for consistency.
| RealType x5 = x3 * x2; | ||
| RealType x7 = x5 * x2; | ||
|
|
||
| RealType A = phi * (x_abs + x3) / 4; |
There was a problem hiding this comment.
If an LLM wrote most of this then colour me quite impressed ;)
A small point of C++ etiquette: variables are normally all_lower_case and upper case is reserved for macros.
There was a problem hiding this comment.
My workflow was basically:
- work out the edgeworth expansion with claude opus iteratively (it got the signs wrong in the beginning, I simply plotted to check the correctness
- asked claude sonnet to invert the expression
- worked out a simple python implementation to see that it works
- asked sonnet again to translate the python implementation to something that follows boost idioms
- iterate on the PR code with manual interventions to get what we have now
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1385 +/- ##
===========================================
- Coverage 95.35% 95.34% -0.01%
===========================================
Files 825 825
Lines 68203 68269 +66
===========================================
+ Hits 65033 65091 +58
- Misses 3170 3178 +8
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Towards #1305
This adds a parameter finder for the student t distribution with respect to the degrees or freedom.
Disclaimer: this PR is heavily LLM supported as C++ is not (yet?) my forte.
I verified the math with a simple Python program before. I also ran a simple sanity script to see that the function actually executes. I am not so confident about the tests though as I cannot really decipher the output of
b2.Approach: We use$df > 1$ where the t distribution behaves similar to the normal distribution but useless for low degrees of freedom (see the plot below). In this case or if it gives a relative error of the CDF worse than 10%, we fall back to an initial guess of 0.01.
bracket_and_solve_rootwith an initial guess from a second order Edgeworth expansion of the CDF. The initial guess is very good forDetailed approach for finding the initial guess
Given a quantile$x$ and a CDF value $p = P(T \leq x)$ , we want to recover the degrees of freedom $\nu$ .
Step 1 — Edgeworth warm start.
We use the 2nd-order Edgeworth expansion of the$t$ CDF in powers of $1/\nu$ :
where
Setting
where
The physically meaningful root (smallest positive$u$ , i.e. largest $\nu$ ) gives a closed-form starting estimate $\hat\nu = 1/u$ .
Step 2 — Validation.
We plug$\hat\nu$ into the exact $t$ CDF. If the relative residual $|F(x;\hat\nu) - p|/|p|$ exceeds 10%, we fall back to a safe low starting value $\nu_0 = 10^{-2}$ .