This issue gives an account of the current state of statistics in Python, summarized by AI based on interviews of Python developers, data scientists, and data analysts. The intent is to serve as a starting point for continued discussion about statistics in Python. A hand-curated list of potential action items is included as a TLDR:
Potential sprintable items:
Write a "Python vs. R for Statistics" translation table (@stefanv)
Create webpage listing libraries and their uses (@stefanv)
Compose educational materials, specifically for stats (@stefanv)
Design a way for programmers to easily publish statistics routines individually (@stefanv / @mdhaber). The idea is to create a hybrid between a free-for-all (users publish their own package) and getting a PR accepted into an existing package (e.g. SciPy, statsmodels, scikit-learn, pingouin). It's a generalization of something @tupui and I have been discussing in the context of statistical distributions (e.g. see Expansion Plan statistical-python/scikit-stats#4 and other issues at that repo)
@stefanv prepared a Perplexity workspace with custom instructions that included the full transcript of 49 interviews with Python developers, data scientists, and data analysts he performed for a recent grant.
Prompt to Perplexity:
Summarize the state of statistics in the Python ecosystem. What does it have, and what does it need? What is it's relationship to other languages.
Response: (reviewed by humans, and it looks quite good)
Core Offerings
Python has several major libraries for statistics:
scipy.stats for basic statistical tests and distributions
statsmodels for econometrics and classical statistics
scikit-learn for machine learning and some statistical modeling[^1][^2][^3].
There are many smaller, specialized packages, but the ecosystem is dominated by these large, general-purpose libraries[^3].
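To make "basic statistical tests and distributions" concrete, here is a minimal sketch of typical `scipy.stats` usage. The two samples are synthetic and invented purely for illustration; they do not come from the interviews.

```python
import numpy as np
from scipy import stats

# Synthetic samples (illustrative assumption, not real data)
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.5, scale=1.0, size=100)

# Welch's two-sample t-test (does not assume equal variances)
result = stats.ttest_ind(a, b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# Distribution objects expose cdf, ppf, pdf, rvs, and so on
z_cdf = stats.norm.cdf(0.0)    # standard normal CDF at 0
q975 = stats.norm.ppf(0.975)   # 97.5th percentile of the standard normal

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"norm.cdf(0) = {z_cdf:.3f}, norm.ppf(0.975) = {q975:.3f}")
```

The test result is a small result object with `statistic` and `pvalue` attributes, while the distribution calls return plain floats or NumPy arrays.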
Strengths
Python is widely used in data science and is increasingly adopted for statistics, especially in education and applied research[^2].
Libraries like scikit-learn offer a clean, consistent interface for modeling, which is appreciated for its engineering rigor and integration with the broader Python data stack[^2].
The core libraries are generally well-tested and reliable, with high software engineering standards[^3].
Weaknesses and Needs
Fragmentation: The ecosystem is fragmented, with different libraries adopting incompatible approaches (e.g., statsmodels vs. scikit-learn), leading to confusion for users and students[^1][^2][^3].
User Experience: There is no central landing place or unified entry point for statistics in Python, unlike R's tidyverse or RStudio. This makes it harder for newcomers to get started[^2].
Interoperability: Data structures (Pandas, NumPy, etc.) do not always work seamlessly across libraries. Conversions are often required, and function outputs are unpredictable compared to R's tidyverse pipelines[^2].
Teaching Resources: Python lacks the abundance of user-friendly, statistics-focused tutorials and case studies found in the R community[^2].
Contributor Barriers: Contributing to core libraries is difficult due to high standards and lack of modularity. Small, specialized packages exist but are less visible and less used than in R[^3].
Statistical Methods Coverage: Some advanced or niche statistical methods are missing or hard to find, especially compared to R's vast CRAN repository[^1][^3].
Community and Culture: The Python statistics community is less cohesive and less connected than R's, which has a strong identity and established events[^2].
Relationship to Other Languages
R: R is still considered the gold standard for statistics, with better branding, a more cohesive ecosystem, and more teaching resources. R's tidyverse and RStudio provide a smoother user experience for statistics[^1][^2].
Interoperability: While some users switch between Python and R in their workflow, true interoperability is limited. Most projects use one language at a time, often using R for data manipulation and Python for modeling or vice versa[^2].
Other Platforms: Tools like GraphPad Prism are still popular among practicing scientists for basic statistical analyses, indicating that neither Python nor R fully dominates in all applied domains[^3].
Summary Table: Python vs. R for Statistics
| Aspect | Python | R |
| --- | --- | --- |
| Core Libraries | scipy.stats, statsmodels, scikit-learn | base R, tidyverse, many CRAN packages |
| User Experience | Fragmented, less cohesive | Cohesive, tidyverse pipelines, RStudio |
| Teaching Resources | Improving, but less abundant | Extensive, beginner-friendly |
| Community | Large, less connected in statistics | Strong, statistics-focused, welcoming |
| Package Development | High barriers, less modularity | Easy, many small packages, dev tools |
| Interoperability | Needs improvement (data structures, APIs) | Strong within tidyverse, RStudio |
| Branding | Data science/machine learning focus | Statistics-focused |
What Python Needs
A more unified, user-friendly interface for statistics, possibly modeled after R's tidyverse[^1][^2].
Better interoperability between core data structures and libraries[^2].
More accessible teaching resources and case studies[^2].
Lower barriers for contributors and more visibility for specialized statistical packages[^3].
Stronger community identity and central organization for statistics in Python[^2].
Conclusion
Python's statistics ecosystem is powerful but fragmented, with significant room for improvement in usability, interoperability, teaching resources, and community cohesion. R remains the default for statistics, but Python is gaining ground, especially as data science and machine learning continue to grow. Stronger integration, better documentation, and a more unified vision could help Python become a true peer to R in the statistics domain[^1][^2][^3].