46 changes: 10 additions & 36 deletions .agent/skills/beam-concepts/SKILL.md
@@ -17,18 +17,14 @@
# under the License.

name: beam-concepts
description: Explains core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, and Runners. Use when learning Beam fundamentals or explaining pipeline concepts.
description: Explains, demonstrates, and troubleshoots core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, Runners, windowing, and triggers. Use when learning Beam fundamentals, writing transforms, debugging pipeline errors, or comparing runner options.
---

# Apache Beam Core Concepts

## The Beam Model
Evolved from Google's MapReduce, FlumeJava, and MillWheel projects. Originally called the "Dataflow Model."

## Key Abstractions

### Pipeline
A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data.

```java
// Java
```
@@ -48,17 +44,9 @@
```python
with beam.Pipeline(options=options) as p:
```

### PCollection
A distributed dataset that can be bounded (batch) or unbounded (streaming).

#### Properties
- **Immutable** - Once created, cannot be modified
- **Distributed** - Elements processed in parallel
- **May be bounded or unbounded**
- **Timestamped** - Each element has an event timestamp
- **Windowed** - Elements assigned to windows
Distributed dataset, bounded (batch) or unbounded (streaming). Immutable, timestamped, windowed.
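These properties can be modeled in plain Python as a conceptual sketch (not Beam's actual element representation; the `Element` class and its fields are invented for illustration):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)  # frozen mirrors "immutable once created"
class Element:
    value: Any
    timestamp: float             # event-time timestamp (seconds since epoch)
    window: Tuple[float, float]  # (start, end) of the window it is assigned to

e = Element(value="click", timestamp=1700000000.0,
            window=(1700000000.0, 1700000300.0))
```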

### PTransform
A data processing operation that transforms PCollections.

```java
// Java
```
@@ -73,7 +61,6 @@
```python
output = input | 'Name' >> beam.ParDo(MyDoFn())
```
## Core Transforms

### ParDo
General-purpose parallel processing.

```java
// Java
```
@@ -97,18 +84,13 @@
```python
input | beam.Map(len)
```

### GroupByKey
Groups elements by key.

```java
PCollection<KV<String, Integer>> input = ...;
PCollection<KV<String, Iterable<Integer>>> grouped = input.apply(GroupByKey.create());
```

### CoGroupByKey
Joins multiple PCollections by key.

### Combine
Combines elements (sum, mean, etc.).
### CoGroupByKey / Combine

```java
// Global combine
```
@@ -118,24 +100,16 @@
```java
input.apply(Combine.globally(Sum.ofIntegers()));
input.apply(Combine.perKey(Sum.ofIntegers()));
```
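CoGroupByKey is named in the heading above without a snippet; its per-key join semantics can be modeled in plain Python (no Beam dependency; `cogroup_by_key` and the sample data are invented for illustration):

```python
from collections import defaultdict

def cogroup_by_key(*inputs):
    """For each key, collect one list of values per input collection.

    Each input is a list of (key, value) pairs, standing in for a keyed
    PCollection.
    """
    grouped, keys = [], set()
    for pairs in inputs:
        d = defaultdict(list)
        for k, v in pairs:
            d[k].append(v)
        grouped.append(d)
        keys.update(d)
    return {k: tuple(g[k] for g in grouped) for k in keys}

emails = [("amy", "amy@example.com")]
phones = [("amy", "555-1234"), ("bob", "555-0000")]
joined = cogroup_by_key(emails, phones)
# joined == {"amy": (["amy@example.com"], ["555-1234"]),
#            "bob": ([], ["555-0000"])}
```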

### Flatten
Merges multiple PCollections.
### Flatten / Partition

```java
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> merged = collections.apply(Flatten.pCollections());
```

### Partition
Splits a PCollection into multiple PCollections.
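The split semantics can be sketched in plain Python (a model of the behavior, not Beam's API; `partition` here mimics Beam's `(element, num_partitions)` partition-fn signature):

```python
def partition(elements, num_partitions, partition_fn):
    """Route every element to exactly one of num_partitions outputs."""
    outputs = [[] for _ in range(num_partitions)]
    for element in elements:
        outputs[partition_fn(element, num_partitions)].append(element)
    return outputs

short, long_ = partition(["a", "ab", "abcd"], 2,
                         lambda s, n: 0 if len(s) < 3 else 1)
```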

## Windowing

### Types
- **Fixed Windows** - Regular, non-overlapping intervals
- **Sliding Windows** - Overlapping intervals
- **Session Windows** - Gaps of inactivity define boundaries
- **Global Window** - All elements in one window (default)
Types: Fixed (non-overlapping), Sliding (overlapping), Session (gap-based), Global (default).

```java
input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
```
@@ -146,7 +120,6 @@
```python
input | beam.WindowInto(beam.window.FixedWindows(300))
```
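The assignment rules behind fixed and sliding windows can be sketched in plain Python (a model of the semantics, not Beam's implementation; function names are invented):

```python
def fixed_window(ts, size):
    """The single fixed window of length `size` containing event time `ts`."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """All sliding windows of length `size`, started every `period`,
    that contain event time `ts` (each element lands in size/period windows)."""
    windows = []
    start = ts - (ts % period)
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return list(reversed(windows))
```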

## Triggers
Control when results are emitted.

```java
input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
```
@@ -158,7 +131,6 @@

## Side Inputs
Additional inputs to ParDo.

```java
PCollectionView<Map<String, String>> sideInput =
```
@@ -174,7 +146,6 @@
```java
mainInput.apply(ParDo.of(new DoFn<String, String>() {
```

## Pipeline Options
Configure pipeline execution.

```java
public interface MyOptions extends PipelineOptions {
```
@@ -188,7 +159,6 @@
```java
MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
```

## Schema
Strongly-typed access to structured data.

```java
@DefaultSchema(AutoValueSchema.class)
```
@@ -222,10 +192,14 @@
```java
PCollectionTuple results = input.apply(ParDo.of(new DoFn<String, String>() {

results.get(successTag).apply(WriteToSuccess());
results.get(failureTag).apply(WriteToDeadLetter());

// Verify: check dead letter output is non-empty in tests
PAssert.that(results.get(failureTag)).satisfies(dlq -> {
assert dlq.iterator().hasNext(); return null;
});
```

## Cross-Language Pipelines
Use transforms from other SDKs.

```python
# Use Java Kafka connector from Python
```
95 changes: 43 additions & 52 deletions .agent/skills/ci-cd/SKILL.md
@@ -17,7 +17,7 @@
# under the License.

name: ci-cd
description: Guides understanding and working with Apache Beam's CI/CD system using GitHub Actions. Use when debugging CI failures, understanding test workflows, or modifying CI configuration.
description: Debugs CI failures, analyzes workflow logs, configures test matrices, and troubleshoots flaky tests in Apache Beam's GitHub Actions CI/CD system. Use when debugging CI failures, understanding test workflows, re-running failed checks, or modifying CI configuration.
---

# CI/CD in Apache Beam
@@ -44,46 +44,26 @@
Apache Beam uses GitHub Actions for CI/CD. Workflows are located in `.github/workflows/`.

## Key Workflows

### PreCommit
| Workflow | Description |
|----------|-------------|
| `beam_PreCommit_Java.yml` | Java build and tests |
| `beam_PreCommit_Python.yml` | Python tests |
| `beam_PreCommit_Go.yml` | Go tests |
| `beam_PreCommit_RAT.yml` | License header checks |
| `beam_PreCommit_Spotless.yml` | Code formatting |

### PostCommit - Java
| Workflow | Description |
|----------|-------------|
| `beam_PostCommit_Java.yml` | Full Java test suite |
| `beam_PostCommit_Java_ValidatesRunner_*.yml` | Runner validation tests |
| `beam_PostCommit_Java_Examples_*.yml` | Example pipeline tests |

### PostCommit - Python
| Workflow | Description |
|----------|-------------|
| `beam_PostCommit_Python.yml` | Full Python test suite |
| `beam_PostCommit_Python_ValidatesRunner_*.yml` | Runner validation |
| `beam_PostCommit_Python_Examples_*.yml` | Examples |

### Load & Performance Tests
| Workflow | Description |
|----------|-------------|
| `beam_LoadTests_*.yml` | Load testing |
| `beam_PerformanceTests_*.yml` | I/O performance |
Naming convention: `beam_{PreCommit,PostCommit}_{Language}[_Variant].yml`

- **PreCommit**: `Java`, `Python`, `Go`, `RAT` (license), `Spotless` (formatting)
- **PostCommit**: full test suites, `ValidatesRunner_*`, `Examples_*`
- **Performance**: `LoadTests_*`, `PerformanceTests_*`
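As a rough sketch, the convention can be expressed as a regular expression (my own approximation, not an official pattern):

```python
import re

# Rough approximation of the workflow-file naming convention described above.
WORKFLOW_NAME = re.compile(
    r"^beam_(PreCommit|PostCommit|LoadTests|PerformanceTests)(_[A-Za-z0-9]+)*\.yml$"
)

assert WORKFLOW_NAME.match("beam_PreCommit_Java.yml")
assert WORKFLOW_NAME.match("beam_PostCommit_Java_ValidatesRunner_Dataflow.yml")
assert not WORKFLOW_NAME.match("build.yml")
```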

## Triggering Tests

### Automatic
- PRs trigger PreCommit tests
- Merges trigger PostCommit tests

### Triggering Specific Workflows
Use [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) to run specific workflows.
### Re-running Specific Workflows
```bash
# Via GitHub CLI
gh workflow run beam_PreCommit_Java.yml --ref your-branch

# Or use trigger files per the workflows README
```

### Workflow Dispatch
Most workflows support manual triggering via GitHub UI.
See [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) for the full trigger phrase catalog.

## Understanding Test Results

@@ -95,17 +75,25 @@

### Common Failure Patterns

#### Flaky Tests
- Random failures unrelated to change
- Solution: Use [trigger files](https://github.com/apache/beam/blob/master/.github/workflows/README.md#running-workflows-manually) to re-run the specific workflow.

#### Timeout
- Increase timeout in workflow if justified
- Or optimize test

#### Resource Exhaustion
- GCP quota issues
- Check project settings
#### Debugging Workflow
1. **Check if flaky**: review the workflow's recent runs in the Actions tab for the same test
```bash
gh run list --workflow=beam_PreCommit_Java.yml --limit=10
```
2. **If flaky**: re-run the workflow
```bash
gh run rerun <run-id> --failed
```
3. **If consistent**: reproduce locally using the same command from the workflow's `run` step
```bash
# Example: find the failing command in the workflow file
grep -A5 'run:' .github/workflows/beam_PreCommit_Java.yml
# Then run it locally
./gradlew :sdks:java:core:test --info 2>&1 | tail -50
```
4. **If timeout**: check test runtime with `--info` flag; increase timeout only if justified
5. **If resource exhaustion**: check GCP quotas in project settings
6. **Verify fix**: push and confirm the workflow passes in the PR checks tab

## GCP Credentials

@@ -159,17 +147,20 @@

## Local Debugging

### Run Same Commands as CI
Check workflow file's `run` commands:
Reproduce CI commands locally by reading the workflow's `run` step:
```bash
./gradlew :sdks:java:core:test
# Java PreCommit equivalent
./gradlew :sdks:java:core:test --info

# Python PreCommit equivalent
./gradlew :sdks:python:test

# If build state is stale
rm -rf ~/.gradle/caches .gradle build && ./gradlew clean

# Verify Java version matches CI
java -version # CI typically uses JDK 11
```

### Common Issues
- Clean gradle cache: `rm -rf ~/.gradle .gradle`
- Remove build directory: `rm -rf build`
- Check Java version matches CI

## Snapshot Builds

6 changes: 4 additions & 2 deletions .agent/skills/contributing/SKILL.md
@@ -17,7 +17,7 @@
# under the License.

name: contributing
description: Guides the contribution workflow for Apache Beam, including creating PRs, issue management, code review process, and release cycles. Use when contributing code, creating PRs, or understanding the contribution process.
description: Guides the Apache Beam contribution workflow including creating pull requests, signing the CLA, running precommit checks, labeling issues, and following commit conventions. Use when contributing code, submitting a pull request, understanding how to contribute, or navigating the code review process.
---

# Contributing to Apache Beam
@@ -69,8 +69,10 @@
- Use descriptive commit messages

### 5. Create Pull Request
- Run pre-commit tests locally before pushing: `./gradlew javaPreCommit` (Java), `./gradlew :sdks:python:test` (Python)
- Verify tests pass, then push and open PR
- Link to the issue in PR description
- Pre-commit tests run automatically
- Pre-commit tests run automatically on the PR
- If tests fail unrelated to your change, comment: `retest this please`

### 6. Code Review
20 changes: 7 additions & 13 deletions .agent/skills/gradle-build/SKILL.md
@@ -17,7 +17,7 @@
# under the License.

name: gradle-build
description: Guides understanding and using the Gradle build system in Apache Beam. Use when building projects, understanding dependencies, or troubleshooting build issues.
description: Configures build.gradle files, runs Gradle tasks, resolves dependency conflicts, and debugs compilation errors in Apache Beam's mono-repo build system. Use when running ./gradlew commands, troubleshooting build failures, managing dependencies, or understanding the BeamModulePlugin.
---

# Gradle Build System in Apache Beam
@@ -207,18 +207,12 @@
```bash
rm -rf .gradle
rm -rf build
```

### Common Errors

#### NoClassDefFoundError
- Run `./gradlew clean`
- Delete gradle cache

#### Proto-related Errors
- Regenerate protos: `./gradlew generateProtos`

#### Dependency Conflicts
- Check dependencies: `./gradlew dependencies`
- Use `--scan` for detailed analysis
### Troubleshooting Workflow
1. If build fails, check error type in output
2. **NoClassDefFoundError**: run `./gradlew clean` then retry; if persists, delete `~/.gradle/caches`
3. **Proto-related errors**: run `./gradlew generateProtos` then retry build
4. **Dependency conflicts**: run `./gradlew :module:dependencies --configuration runtimeClasspath` to inspect, use `--scan` for detailed analysis
5. Verify fix: re-run the original build command to confirm success
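The decision table in steps 2-4 can be condensed into a small lookup sketch (the error signatures and the `suggest` helper are illustrative assumptions, not actual Beam tooling):

```python
# Illustrative mapping from an error signature in Gradle output to the
# remediation steps listed above.
REMEDIES = {
    "NoClassDefFoundError": ["./gradlew clean", "rm -rf ~/.gradle/caches"],
    "protoc": ["./gradlew generateProtos"],
    "Could not resolve": [
        "./gradlew :module:dependencies --configuration runtimeClasspath",
        "re-run with --scan",
    ],
}

def suggest(build_output):
    """Return remediation steps for the first matching signature."""
    for signature, steps in REMEDIES.items():
        if signature in build_output:
            return steps
    return ["inspect the full error output"]
```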
> **Contributor:** tbh, I prefer previous structure as some troubleshooting playbooks will be longer than 3 steps.


### Useful Tasks

14 changes: 9 additions & 5 deletions .agent/skills/io-connectors/SKILL.md
@@ -17,7 +17,7 @@
# under the License.

name: io-connectors
description: Guides development and usage of I/O connectors in Apache Beam. Use when working with I/O connectors, creating new connectors, or debugging data source/sink issues.
description: Implements read/write transforms, configures connection parameters, and tests I/O connectors (BigQuery, Kafka, JDBC, Pub/Sub, GCS) in Apache Beam. Use when reading from or writing to external data sources, creating new connectors, or debugging data source/sink issues.
---

# I/O Connectors in Apache Beam
@@ -191,7 +191,11 @@
Beam supports using I/O connectors from one SDK in another via the expansion service.
## Creating New Connectors
See [Developing I/O connectors](https://beam.apache.org/documentation/io/developing-io-overview)

Key components:
1. **Source** - Reads data (bounded or unbounded)
2. **Sink** - Writes data
3. **Read/Write transforms** - User-facing API
### Workflow
1. Implement Source (bounded or unbounded read)
2. Test Source with DirectRunner: `./gradlew :sdks:java:io:myio:test`
3. Implement Sink (write)
4. Create user-facing Read/Write transforms
5. Write integration tests using `TestPipeline`
6. Run integration tests: `./gradlew :sdks:java:io:myio:integrationTest`
7. Verify both read and write paths produce expected results
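Step 1's split/read contract can be illustrated with a toy bounded source in plain Python (simplified names, not the real `BoundedSource` API):

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class RangeSource:
    """Toy bounded source over [start, end)."""
    start: int
    end: int

    def split(self, desired_bundle_size: int) -> List["RangeSource"]:
        # A runner asks the source to split itself into bundles
        # so bundles can be read in parallel.
        return [RangeSource(s, min(s + desired_bundle_size, self.end))
                for s in range(self.start, self.end, desired_bundle_size)]

    def read(self) -> Iterator[int]:
        return iter(range(self.start, self.end))
```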
5 changes: 4 additions & 1 deletion .agent/skills/java-development/SKILL.md
@@ -17,7 +17,7 @@
# under the License.

name: java-development
description: Guides Java SDK development in Apache Beam, including building, testing, running examples, and understanding the project structure. Use when working with Java code in sdks/java/, runners/, or examples/java/.
description: Guides Java SDK development in Apache Beam including compiling, running unit and integration tests, building SDK containers, and publishing artifacts. Use when working with Java code in sdks/java/, runners/, or examples/java/, or running ./gradlew Java tasks.
---

# Java Development in Apache Beam
@@ -119,6 +119,9 @@
Set pipeline options via `-DbeamTestPipelineOptions='[...]'`:

```bash
# Publish a specific module
./gradlew -Ppublishing -p sdks/java/io/kafka publishToMavenLocal

# Verify: check artifact exists
ls ~/.m2/repository/org/apache/beam/beam-sdks-java-io-kafka/
# Reviewer note (Contributor): "this is gradle so I'm not sure if this verify step will work here."
# Publish all modules
./gradlew -Ppublishing publishToMavenLocal
```