docs: convert reStructuredText sources to MyST markdown #1579

Open: timsaucer wants to merge 12 commits into apache:main from timsaucer:doc/phase2-rst-to-md

@timsaucer (Member) commented Jun 5, 2026

Which issue does this PR close?

There is no open issue, but this PR continues the work done in #1578.

Rationale for this change

Phase 2 of the documentation-site refresh started in #1578. With the modern pydata-sphinx-theme + navigation in place, this PR moves the content format off .rst and onto MyST .md. The motivation:

  • Markdown is the lingua franca of agent-tuned tooling. LLMs trained on GitHub and modern docs parse Markdown reliably; reStructuredText is a minority dialect that frequently confuses both humans editing via PR review and agents reading the source. The Apache datafusion-comet sibling project completed the same migration recently and reported smoother contributor onboarding.

What changes are included in this PR?

  • Format conversion (mechanical, via rst-to-myst).
  • Manual fixes for cross-references, layered on top of the converter output.
  • AGENTS.md is updated so the two .rst paths called out under "Aggregate and Window Function Documentation" point at the new .md equivalents.
  • Switched from myst-parser to myst-nb, which parses the Markdown and also executes the code cells, so the examples render with their executed output.

Are there any user-facing changes?

No behavioral change to the datafusion package; only the source format of the published documentation changes. Readers of the rendered site should notice little difference: the HTML output changes slightly but still shows all of the relevant content, including the executed code examples.

Follow-ups (out of scope for this PR)

  • Phase 3: multi-version doc publishing (the comet pattern).
  • Phase 4: asf-site publishing workflow.

@timsaucer force-pushed the doc/phase2-rst-to-md branch from a400ec1 to 67c2761 (June 7, 2026 13:20)
@timsaucer marked this pull request as draft (June 7, 2026 13:29)
@timsaucer force-pushed the doc/phase2-rst-to-md branch from 026b9e5 to 30efd76 (June 7, 2026 13:37)
@timsaucer marked this pull request as ready for review (June 13, 2026 11:16)
timsaucer and others added 10 commits June 13, 2026 18:06
Phase 2 of the documentation-site refresh. Run `rst2myst convert` over
every human-authored .rst file under docs/source/ and remove the
originals. The result:

- 33 .rst files become 33 .md files (user guide, contributor guide,
  index, links).
- Headings, paragraphs, hyperlinks, code blocks, admonitions, and
  toctree directives all map cleanly to MyST syntax.
- Cross-reference anchors round-trip through MyST as `(label)=`
  blocks. The converter kebab-cased the labels (e.g. `(io-csv)=`),
  but every `{ref}` target in the corpus still uses the underscore
  form from the original RST (``{ref}`CSV <io_csv>` ``) and so do the
  Python docstrings that AutoAPI pulls in. Rewrite the anchors back
  to the underscore form so the existing references resolve.
- 86 `{eval-rst}` blocks remain — they all wrap `.. ipython::`
  directives, which have no first-class MyST equivalent. They render
  identically and don't block the build.
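
The anchor rewrite described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in this PR: the function name `restore_underscore_labels` and the regular expression are assumptions, and the sketch only touches lines that consist solely of a `(label)=` target.

```python
import re

# Matches a MyST target line such as "(io-csv)=" and captures the label.
# Assumption: each target sits alone on its own line.
_TARGET = re.compile(r"^\(([A-Za-z0-9_-]+)\)=[ \t]*$", re.MULTILINE)


def restore_underscore_labels(markdown: str) -> str:
    """Rewrite kebab-cased `(label)=` targets back to underscore form."""
    return _TARGET.sub(
        lambda m: "(" + m.group(1).replace("-", "_") + ")=",
        markdown,
    )


sample = "(io-csv)=\n\n# CSV\n\nSee {ref}`CSV <io_csv>`.\n"
print(restore_underscore_labels(sample))
```

Only the target lines change; `{ref}` roles and ordinary hyphenated text are left alone, which is why existing references keep resolving.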

conf.py changes:

- Enable `colon_fence` and `deflist` MyST extensions (rst-to-myst
  emits these on a few files, particularly execution-metrics.md).
- Keep `.rst` in `source_suffix` even though no human-authored RST
  remains: sphinx-autoapi generates RST under autoapi/ at build time
  and Sphinx needs the suffix registered to parse it.
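
As a rough sketch, the two conf.py settings named above might look like the fragment below. `myst_enable_extensions` and `source_suffix` are standard MyST-Parser and Sphinx options, but the exact values in this PR's conf.py are not shown here, so treat the mapping as an assumption.

```python
# docs/source/conf.py (fragment): a sketch, with the rest of the config omitted.

# Extra MyST syntax that the converter output relies on.
myst_enable_extensions = [
    "colon_fence",  # ::: fenced directives
    "deflist",      # definition lists
]

# Keep .rst registered: sphinx-autoapi writes RST under autoapi/ at build
# time, and Sphinx needs the suffix to parse those generated files.
# Assumption: .md is handled by myst-parser's "markdown" parser at this commit.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
```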

AGENTS.md: update the two .rst paths called out under "Aggregate and
Window Function Documentation" to point at the .md equivalents.

Verified by building locally — `build succeeded`, no warnings, all
internal cross-references resolve, the ipython examples on the
landing page and basics page still execute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RST-to-MD conversion emitted MyST `%` comment syntax with a blank line
between each header line, which renders as visible text. Replace it with the
canonical `<!--- ... -->` HTML comment block, matching upstream
apache/datafusion and this repo's existing markdown files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RST -> MyST conversion left two intra-page links as undefined
reference-style links, which CommonMark renders as literal bracketed
text (no Sphinx warning, so the --fail-on-warning build still passed).
Point both at the auto-generated heading anchors instead.
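
Because this failure mode produces no Sphinx warning, a small standalone check can catch it. The sketch below is an assumption-laden illustration, not part of this PR: the helper name `undefined_reference_links` is hypothetical, it only handles full (`[text][label]`) and collapsed (`[label][]`) reference links, and it does not skip code blocks.

```python
import re

# [text][label] or [label][]; shortcut links such as [label] are ignored.
_REF_USE = re.compile(r"\[([^\]\[]+)\]\[([^\]\[]*)\]")
# "[label]: destination" at the start of a line (up to three spaces of indent).
_REF_DEF = re.compile(r"^ {0,3}\[([^\]\[]+)\]:\s*\S", re.MULTILINE)


def _norm(label: str) -> str:
    # Reference labels match case-insensitively with whitespace collapsed.
    return " ".join(label.casefold().split())


def undefined_reference_links(markdown: str) -> list[str]:
    """Return labels that are used as reference links but never defined."""
    defined = {_norm(m.group(1)) for m in _REF_DEF.finditer(markdown)}
    missing = []
    for m in _REF_USE.finditer(markdown):
        label = m.group(2) or m.group(1)  # collapsed form reuses the text
        if _norm(label) not in defined:
            missing.append(label)
    return missing


page = "See [the plan][plan] and [setup][].\n\n[plan]: #plan\n"
print(undefined_reference_links(page))
```

Any label it reports would render as literal bracketed text, so a check like this could run alongside the build.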

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removes the last RST-syntax islands from the converted MyST markdown so
the docs are markdown-native for both human and LLM authors.

Executable examples (A): replace IPython.sphinxext.ipython_directive with
myst-nb. The 83 `{eval-rst}` + `.. ipython:: python` blocks become native
`{code-cell} ipython3` blocks, and the 14 pages that carry them gain
jupytext/kernelspec front matter so myst-nb runs them. conf.py routes .md
through myst-nb with nb_execution_mode="force" and
nb_execution_raise_on_error=True, so a failing example now fails the build.

myst-nb gives each page its own kernel instead of the IPython directive's
single namespace shared across all documents in build order. That isolation
surfaced expressions.md, which only ever worked by inheriting `col`/`lit`
from an earlier-built page — it now imports them itself. It also changes the
execution working directory to each page's own folder, so build.sh symlinks
the example data next to every page that reads it by relative name and
registers the python3 kernel; CI now calls build.sh so it matches local.

Tables (B): the 3 `.. list-table::` directives become GFM markdown tables.

Cross-references (C): the two intra-page links in distributing-work.md that
the conversion left as undefined markdown references (and that built green
while rendering literal brackets) become `{ref}` roles backed by explicit
`(label)=` targets, so a future break fails the build instead of shipping
silently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
myst-nb prefers a cell's `_repr_html_` over its text repr. A datafusion
DataFrame's HTML repr is a Jupyter-oriented widget — inline styles plus an
injected <script> — that renders at the wrong width in the docs theme.

Set nb_mime_priority_overrides so the html builder prefers text/plain. The
35 cells that end in a bare DataFrame now show the same readable ASCII
table the old IPython directive produced, with no per-cell `.show()` edits
and no dependence on the package-generated HTML staying theme-compatible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
apache/datafusion#21411 is resolved — `.alias()` now works directly on a
`grouping()` expression. Removed the note describing the limitation and the
with_column_renamed workaround in the rollup and grouping_sets examples,
aliasing the grouping columns inline instead. Verified on the current
branch: the aliased aggregates execute and produce the named columns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The header logo was the same SVG in both color modes; the wordmark drawn for
the light theme was hard to read on the dark theme. Point the theme's image_dark
at a new original_dark.svg whose wordmark uses light strokes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The theme refresh emptied secondary_sidebar_items, dropping the
on-this-page table of contents that the previous site showed. Bring it
back on the right, wrapped in a native <details> so readers can fold it
away on the longer guide pages. Adds a custom page-toc-collapsible
secondary-sidebar template and styles the <summary> toggle (no JS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to restoring the on-this-page TOC: "collapsible" should hide the
entire right-hand frame, not just fold the list. Replace the <details>
wrapper with a floating toggle button (toc-toggle.js) that hides the whole
secondary sidebar via a body class; the flex article container then
reclaims the width (its 60em cap is lifted while hidden). The preference is
remembered across pages in localStorage, and the button is suppressed below
the theme's breakpoint where the sidebar is already collapsed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adding the myst-nb docs stack pulled a newer typing-extensions only on
Python < 3.11, splitting it into two locked versions. Our own
`typing-extensions; python_full_version < '3.13'` dependency then spanned
that split, which uv recorded as a multi-version edge without a `version`
field — a form older uv builds (the one in CI's pinned setup-uv) reject
with "missing source field but has more than one matching package".

Add a [tool.uv] constraint-dependencies pin of typing-extensions>=4.15.0
so it resolves to a single version across all supported Pythons, removing
the fork and the under-specified edge. Relocked; uv lock --locked is clean
and no multi-version package has a marker-only edge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timsaucer force-pushed the doc/phase2-rst-to-md branch from c0671bf to 922c6e8 (June 15, 2026 11:29)

@timsaucer (Member, Author) left a comment

I've tried to review and flag every change that goes beyond the formatting or reference-style changes caused by switching to Markdown.

Comment on lines +1 to +9
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  name: python3
  display_name: Python 3
---

This front matter tells the build (myst-nb) to execute the page's code cells with a Jupyter kernel when rendering the examples.
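
For illustration, a page can be checked for this front matter with a short helper. This is a hypothetical sketch, not tooling from this PR: `needs_front_matter` and `add_front_matter` are made-up names, and the test for "has executable cells" is a plain substring match on `{code-cell}`.

```python
# Front matter as in the snippet above, with YAML indentation restored.
FRONT_MATTER = """\
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  name: python3
  display_name: Python 3
---
"""


def needs_front_matter(page_text: str) -> bool:
    """True if the page has executable cells but no YAML front matter."""
    has_cells = "{code-cell}" in page_text
    has_header = page_text.startswith("---\n")
    return has_cells and not has_header


def add_front_matter(page_text: str) -> str:
    """Prepend the front matter only to pages that need it."""
    if needs_front_matter(page_text):
        return FRONT_MATTER + "\n" + page_text
    return page_text


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
page = "# Title\n\n" + fence + "{code-cell} ipython3\n1 + 1\n" + fence + "\n"
print(add_front_matter(page))
```

Pages without code cells are returned unchanged, and running the helper twice has no further effect.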

Comment thread docs/source/conf.py
Comment on lines +74 to +76
nb_execution_mode = "force"
nb_execution_timeout = 120
nb_execution_raise_on_error = True

With these settings, every example is re-executed on each build, and an example that raises an error fails the documentation build.

Comment thread docs/source/conf.py
# styles + an injected <script>) built for Jupyter; in the docs theme it
# renders at the wrong width. The text repr is the readable table the old
# IPython directive showed and is stable across datafusion versions.
nb_mime_priority_overrides = [("html", "text/plain", 0)]

This was necessary to force text output for DataFrames instead of rendered HTML. IMO it gives a more consistent experience, especially since the HTML rendering code has changed in the past few releases.

Comment on lines +265 to +275
Apply `.alias()` to the `grouping()` expression to give the column a readable name:

```{code-cell} ipython3
result = df.aggregate(
    [GroupingSet.rollup(col_type_1)],
    [f.count(col_speed).alias("Count"),
     f.avg(col_speed).alias("Avg Speed"),
     f.grouping(col_type_1).alias("Is Total")]
)
result.sort(col_type_1.sort(ascending=True, nulls_first=True))
```

This is a substantive difference from the prior work. The issue in apache/datafusion#21411 has been resolved and verified in 54.0.0, so I removed the old warning about grouping sets with aliases. You can see the old text in this section: https://datafusion.apache.org/python/user-guide/common-operations/aggregations.html#rollup

Comment on lines +68 to +74
/* Hideable right-hand "On this page" sidebar.
 * toc-toggle.js adds the button and toggles `pst-secondary-hidden` on <body>;
 * hiding the sidebar lets the flex article container reclaim the width. */

body.pst-secondary-hidden .bd-sidebar-secondary {
  display: none;
}

The code added in this section lets you hide the right-hand table of contents, giving the page content more width. The corresponding logic is in toc-toggle.js.

Comment on lines +555 to +558
# build.sh downloads the example data, registers the Jupyter kernel
# myst-nb needs, symlinks the data next to each executed page, and
# runs sphinx. Using it here keeps CI identical to a local build.
uv run --no-project bash ./build.sh

Use a single build path, both for local development and CI.

Comment thread docs/build.sh
Comment on lines +45 to +48
for d in temp temp/user-guide temp/user-guide/common-operations; do
  ln -sf "$script_dir/pokemon.csv" "$d/pokemon.csv"
  ln -sf "$script_dir/yellow_tripdata_2021-01.parquet" "$d/yellow_tripdata_2021-01.parquet"
done

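The description above explains why these steps were added to the build: myst-nb executes each page from its own folder, so the example data must sit next to every page that reads it by relative name.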

timsaucer and others added 2 commits June 15, 2026 07:48
# Conflicts:
#	docs/source/user-guide/common-operations/functions.rst
Both explicit dependencies (pickleshare and ipython) were only needed by the old IPython.sphinxext.ipython_directive,
which myst-nb replaced. pickleshare (IPython %store, abandoned 2018) has
no remaining consumer. ipython is now pulled transitively by ipykernel
and myst-nb, so the explicit floor is redundant. Relocked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timsaucer (Member, Author) commented

Here's a view showing that dark mode renders the DataFusion logo correctly with the updated graphic, along with the collapsible right-hand table of contents.

Screenshot 2026-06-15 at 7 52 51 AM

collapsed:

Screenshot 2026-06-15 at 7 52 58 AM
