Fix amd64 AMI Builds#2229
Conversation
PostgreSQL Extension Dependency Analysis: PR #2229
SummaryNo extensions had dependencies with MAJOR version updates. Full Analysis ResultsPostgreSQL 15 Extension DependenciesExtension: wrappers
Raw Dependency TreePostgreSQL 17 Extension DependenciesExtension: wrappers
Raw Dependency TreeOrioleDB 17 Extension DependenciesExtension: wrappers
Raw Dependency Tree |
PostgreSQL Package Dependency Analysis: PR #2229
SummaryNo packages had MAJOR version updates. Full Analysis ResultsPostgreSQL 15 Dependency ChangesExtracting PostgreSQL 15 dependencies...
Runtime Closure Size
Raw Dependency ClosurePostgreSQL 17 Dependency ChangesExtracting PostgreSQL 17 dependencies...
Runtime Closure Size
Raw Dependency Closure |
945edec to
7d7bc35
Compare
Doesn't make sense to have non-breaking space so looks fish to some tools, rightfully so.
No need to set EXECUTION_ID var and then use it for a differently named var later, just export the expected name from the beginning. I also sorted the nix/build-ami args for an easier time comparing to Release AMI Nix workflow.
Release AMI Nix ends up always building stage1 AMI so lets take in a force input that will pass it along as BUILD_AMI_NIX_FORCE_BUILD. Setting the env is behind an `if` since nix/build-ami will actually treat it as a tri-state. Passing in GIT_SHA too so that nix/build-ami can use it for finding the AMI just like Release AMI Nix does.
These are especially helpful for local actions because GHA does not show the jobs/steps within, it all looks like one "Build AMI" step.
This way when we get a hit we know its for exact same inputs. Changes to
build-ami script can cause a different AMI to have been built so should
be part of the INPUT_HASH calculation. Using $0 takes into account any
${packerSources} changes due to normal nix behavior.
Release AMI Nix *always* builds stage1 so lets make sure we have a way to behave similarly. Also need to make sure we find the correct AMI via architecture and sourceSha. Also sync'ing up the filters used to find stage-1, added architecture and sourceSha. The architecture tag filter should be obviously ok/no-op and sourceSha was already always being passed via build-ami action so all the AMIs should have it as long as it was built in CI. It seems redundant to have both sourceSha and inputHash filters since inputHash should always be better than sourceSha, when content changes inputHash changes and sourceSha is obviously different but a different sourceSha could be from changes to unrelated files. I'll take extra builds for now though just to get some clarity here, will make it better in the near future.
| if ''${CI:-false}; then | ||
| echo "::notice::Setting packer build -on-error=abort since this is CI, this is different than non-CI runs!" | ||
| on_error=abort | ||
| elif ! [[ -t 1 ]]; then |
There was a problem hiding this comment.
This should be -t 0 actually
samrose
left a comment
There was a problem hiding this comment.
Don't know for sure but Potential issue: the release workflow’s path filter no longer covers all files used by the workflow, so some future AMI-build changes may not trigger this release workflow automatically.
Should be no worse than today, I only dropped the common- packer file which isn't in git so would never be part of a push anyway. I'll be back here soonish anyway taking a look at all of these paths and workflows. |
In CI it will abort so that the workflow can grab ec2 console output, hopefully catching any kernel messages there. Otherwise set to ask if stdout is a tty so that the caller can poke around if they want to debug. If stdout is not a tty then we will go with the default.
This is mostly just clean ups, it seems better to add things to the env pre/post build instead of having to use the annoyingly long output syntax or re-derive/parse from matrix values or packer files. I renamed the arch specific matrix values to something that reads better, arch makes more sense as a single value representing the architecture and target is a common name for gha matrix grouping. I changed most uses of the various postgres versions from using matrix value or getting from packer file to env vars. This way we use the same name consistently throughout the file and the names are more explicit. Its easy to guess the difference between POSTGRES_MAJOR_VERSION and POSTGRES_SUPABASE_VERSION compared to postgres_version and PG_VERSION.
No need to repeat ourselves when we have a perfectly good abstraction already.
This way testinfra and release-ami workflow/logs are more comparable.
Order things the same, add Debug AWS role secret step, use AWS_REGION env var instead of hard coding the region everywhere... etc just so we can get a better looking diff between them.
This wasn't really necessary so lets revert.
This is causing the plugin's cpu usage to go from ~.6% -> ~600%! See hashicorp/packer-plugin-amazon#676
What kind of change does this PR introduce?
Bug fix and refactors.
What is the current behavior?
Flaky AMI builds.
What is the new behavior?
No more flaky AMI builds.
Additional context
We've been seeing issues with building the AMI, specifically the
Release AMI Nixworkflow. Took quite some debugging to figure out that we were hit with hashicorp/packer-plugin-amazon#676 ultimately due to packer not supporting lock files and always chasing the newest release possible.The issue boiled down to packer_plugin_amazon causing ~600% cpu load (6 cpus @ 100%) on the blacksmith machines which ended up causing the network to backup until the connection is forced closed by a middleware/hv and the build fail. Looks like arm machines were managing the load well enough to continue passing but we were also affected by the bug there too.
I'm still not sure why
Release AMI Nixseemed to trigger this bug more reliably even after re-working it to use the samebuid-amiaction and thusbuild-aminix package but in the meantime I was able to get a reproducer and narrow down on the issue. I blamed the blacksmith networking (because if its not DNS, its the network) but that was only another symptom (would be nice to have a better indicator/behavior though). Luckily support pointed out that the cpus were at 100% and that caused the network to drop. I don't know how I didn't notice that myself on the runner's metrics page. I recall focusing on network io instead. Once I saw the cpu usage I tried to replicate on my own machine to factor out blacksmith possibly throttling/slow cpu but saw the same behavior which finally led me the bug in question.Fix is to pin to 1.8.0 for now. I'll be working on the CI in general next and will enable renovate once I get it so we aren't running long builds for unrelated changes and are sure that testinfra and co are required to pass for PRs to be mergeable. I am keeping all the "unrelated" cleanups though because they are good and kaizen.