Skip to content

feat(dns-strict-resolver): add unconnected-UDP strict DNS sample#223

Open
officialasishkumar wants to merge 3 commits intomainfrom
feat/dns-strict-resolver-sample
Open

feat(dns-strict-resolver): add unconnected-UDP strict DNS sample#223
officialasishkumar wants to merge 3 commits intomainfrom
feat/dns-strict-resolver-sample

Conversation

@officialasishkumar
Copy link
Copy Markdown
Member

What this adds

A new dns-strict-resolver/ sample — a minimal self-contained Go HTTP server (Go stdlib only, no third-party deps) that exercises the unconnected-UDP + RFC 5452 strict-source-validation DNS client path.

The server:

  • Opens an unconnected UDP socket.
  • Sends a raw DNS A query to the system resolver (/etc/resolv.conf) or an explicit ?nameserver=ip:port.
  • Discards any reply whose source does not match the nameserver it queried (RFC 5452 §9.1 anti-spoofing — the same check dnspython, raw recvfrom-based clients, and glibc's res_send unconnected path perform).

If the strict check rejects every reply, /resolve returns HTTP 502 with source_mismatches > 0 and error: "no accepted reply ..." — the userspace equivalent of the production java.net.UnknownHostException: Temporary failure in name resolution / EAI_AGAIN symptom we chase in keploy/keploy#4092.

Why the sample exists

Keploy has existing DNS samples (dns-dedup) but they use net.LookupHost, which on glibc (cgo) predominantly uses connected UDP. Connected-UDP clients are rescued by Keploy's existing cgroup/getpeername4 hook and therefore never exposed the bug this sample regresses against. A raw UDP client is the smallest deterministic reproducer of the production failure.

Regression target

The companion CI job in keploy/keploydns_strict_resolver_test in .github/workflows/golang_linux.yml — builds and runs this sample under keploy record and keploy test and fails if any /resolve call reports source_mismatches > 0 or returns no IPs.

Files

  • dns-strict-resolver/main.go — ~230 LOC, stdlib-only.
  • dns-strict-resolver/curl.sh — hits /resolve for google.com, cloudflare.com, example.com.
  • dns-strict-resolver/go.mod — no dependencies.
  • dns-strict-resolver/README.md — why and how, with a working example.

All public domains — no internal hostnames.

How to try it

cd dns-strict-resolver
go run . &
curl -sS "http://localhost:8086/resolve?domain=google.com"

Expected (post-fix or no Keploy):

{"domain":"google.com","nameserver":"127.0.0.11:53","rcode":0,"ips":["142.250.x.x"],"source_mismatches":0,"attempts":1,"elapsed_ms":6}

Under a buggy (pre-fix) Keploy recording:

{"domain":"google.com","nameserver":"127.0.0.11:53","rcode":0,"ips":null,"source_mismatches":3,"attempts":3,"elapsed_ms":3000,"error":"no accepted reply from 127.0.0.11:53 after 3 attempts"}

Signed-off-by: Asish Kumar officialasishkumar@gmail.com

Adds a minimal self-contained Go HTTP server that exercises the
unconnected-UDP + RFC 5452 strict-source-validation DNS client path —
the exact path that exposes the bug fixed in keploy/keploy#4093 and
keploy/ebpf#97 (tracking issue keploy/keploy#4092).

The server opens a UDP socket, sends a raw DNS A query to the system
resolver, and accepts the reply only if its source address matches the
nameserver it queried. Any mismatch is counted and discarded. Under the
buggy Keploy build the reply arrived from <agent_ip>:<keploy_dns_port>
instead of the real nameserver, the source check rejected it, and the
endpoint eventually returned HTTP 502 with source_mismatches > 0 — the
userspace equivalent of the production "Temporary failure in name
resolution" / EAI_AGAIN symptom. After the fix, every client passes
with source_mismatches == 0.

net.LookupHost is not used here on purpose: on glibc (cgo) it
predominantly uses connected UDP, which was already rescued by the
existing cgroup/getpeername4 hook and therefore never exposed the bug.
A raw UDP client is the smallest deterministic repro of the production
failure mode.

Public domains (google.com, cloudflare.com, example.com) only; no
internal hostnames.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new dns-strict-resolver/ Go sample to reproduce and guard against unconnected-UDP DNS reply source-mismatch behavior (RFC 5452 strict source validation) relevant to keploy/keploy#4092 regression testing.

Changes:

  • Introduces a stdlib-only Go HTTP server with /resolve performing raw UDP DNS A queries and strict reply source validation.
  • Adds a simple curl traffic generator script for CI/e2e usage.
  • Documents purpose, usage, and Keploy record/test instructions for the sample.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File Description
dns-strict-resolver/main.go Implements raw DNS query construction, strict UDP reply acceptance, and HTTP endpoints.
dns-strict-resolver/go.mod Declares a standalone Go module for the new sample.
dns-strict-resolver/curl.sh Provides traffic generation for exercising /resolve in e2e flows.
dns-strict-resolver/README.md Explains rationale, running instructions, and Keploy usage for the regression sample.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

r.Error = err.Error()
return r
}
defer conn.Close()
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer conn.Close() drops the returned error, which errcheck will report in this repo. Please either handle the close error in a deferred func or explicitly suppress it with //nolint:errcheck (as done elsewhere in the repo).

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +92
off := 12
for q := uint16(0); q < qd && off < len(reply); q++ {
off = skipName(reply, off)
off += 4
}
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseReply can return nil error for truncated/malformed replies: in the question section, it does off += 4 without checking off+4 <= len(reply), so malformed packets can silently produce empty results. Consider validating bounds and returning an error when the message is structurally invalid.

Copilot uses AI. Check for mistakes.
Comment on lines +191 to +193
line = strings.TrimSpace(line)
if strings.HasPrefix(line, "nameserver ") {
return net.JoinHostPort(strings.TrimPrefix(line, "nameserver "), "53")
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultNameserver only matches lines starting with the literal prefix "nameserver ", so it will miss valid resolv.conf entries that use tabs/multiple spaces, and it may also break if the line has an inline comment (e.g. nameserver 1.1.1.1 # ...). Consider parsing with strings.Fields and stripping # comments before extracting the IP.

Suggested change
line = strings.TrimSpace(line)
if strings.HasPrefix(line, "nameserver ") {
return net.JoinHostPort(strings.TrimPrefix(line, "nameserver "), "53")
line, _, _ = strings.Cut(line, "#")
fields := strings.Fields(line)
if len(fields) >= 2 && fields[0] == "nameserver" {
return net.JoinHostPort(fields[1], "53")

Copilot uses AI. Check for mistakes.
Comment on lines +54 to +59
```bash
sudo -E env PATH=$PATH keploy record -c "./dns-strict-resolver"
# hit /resolve endpoints, then stop keploy

sudo -E env PATH=$PATH keploy test -c "./dns-strict-resolver" --delay 10
```
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “Under Keploy” section runs keploy record -c "./dns-strict-resolver" / keploy test -c "./dns-strict-resolver" but doesn’t mention building the binary first. As written, these commands will fail unless the user has already run go build . (producing ./dns-strict-resolver). Consider adding an explicit build step (or use -c "go run .").

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +42
func buildQuery(domain string, txid uint16) ([]byte, error) {
var b bytes.Buffer
binary.Write(&b, binary.BigEndian, txid)
binary.Write(&b, binary.BigEndian, uint16(0x0100)) // RD=1
binary.Write(&b, binary.BigEndian, uint16(1)) // QDCOUNT
binary.Write(&b, binary.BigEndian, uint16(0)) // ANCOUNT
binary.Write(&b, binary.BigEndian, uint16(0)) // NSCOUNT
binary.Write(&b, binary.BigEndian, uint16(0)) // ARCOUNT
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildQuery ignores the returned errors from binary.Write. With errcheck enabled in this repo, this will be reported (and it also makes error propagation inconsistent if the writer ever changes). Please handle the errors (and return them) or build the header using byte-slice appends that don’t involve error-returning writes.

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +55
b.WriteByte(byte(len(label)))
b.WriteString(label)
}
b.WriteByte(0)
binary.Write(&b, binary.BigEndian, uint16(1)) // QTYPE A
binary.Write(&b, binary.BigEndian, uint16(1)) // QCLASS IN
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildQuery also ignores errors from WriteByte / WriteString (and later binary.Write calls). With errcheck enabled, this will be flagged. Please handle these errors (or restructure to avoid error-returning writes).

Copilot uses AI. Check for mistakes.

http.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
fmt.Fprint(w, "ok")
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /health handler ignores the error return from fmt.Fprint. With errcheck enabled, this will be flagged. Please check the returned error (or explicitly suppress with a justified //nolint:errcheck).

Suggested change
fmt.Fprint(w, "ok")
if _, err := fmt.Fprint(w, "ok"); err != nil {
fmt.Fprintf(os.Stderr, "failed to write /health response: %v; check whether the client disconnected before retrying the request\n", err)
}

Copilot uses AI. Check for mistakes.
return r
}
for time.Now().Before(deadline) {
_ = conn.SetReadDeadline(time.Now().Add(800 * time.Millisecond))
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error from SetReadDeadline is intentionally ignored (_ = ...), but with errcheck enabled this will be flagged, and if setting the deadline fails the read loop behavior becomes unpredictable. Please handle the error (e.g., return it via res.Error) or explicitly justify/suppress it.

Suggested change
_ = conn.SetReadDeadline(time.Now().Add(800 * time.Millisecond))
if err := conn.SetReadDeadline(time.Now().Add(800 * time.Millisecond)); err != nil {
r.Error = err.Error()
r.ElapsedMS = time.Since(start).Milliseconds()
return r
}

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +135
ns, err := net.ResolveUDPAddr("udp", nsAddr)
if err != nil {
r.Error = err.Error()
return r
}
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveStrict returns early on several error paths without setting ElapsedMS, so the JSON response can show elapsed_ms: 0 even though work was done (and it differs from later error paths where ElapsedMS is set). For consistency and easier CI diagnostics, consider setting ElapsedMS on all returns (e.g., via a defer that updates it).

Copilot uses AI. Check for mistakes.
r.Error = err.Error()
return r
}
conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: 0})
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

net.ListenUDP is currently bound to 0.0.0.0 (IPv4 only). If /etc/resolv.conf (or the nameserver query param) provides an IPv6 nameserver, the lookup will fail due to address-family mismatch. Consider selecting udp4 vs udp6 and binding to the matching unspecified address based on ns.IP.To4() (or use a dual-stack approach).

Suggested change
conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4zero, Port: 0})
listenNetwork := "udp4"
listenAddr := &net.UDPAddr{IP: net.IPv4zero, Port: 0}
if ns.IP.To4() == nil {
listenNetwork = "udp6"
listenAddr = &net.UDPAddr{IP: net.IPv6unspecified, Port: 0}
}
conn, err := net.ListenUDP(listenNetwork, listenAddr)

Copilot uses AI. Check for mistakes.
The companion CI job moved from non-docker loopback mode to docker
mode (keploy PR #4093). Bare-Linux loopback on ubuntu-latest turned
out to not invoke cgroup/recvmsg4 for the sample's unconnected-UDP
reads, which made the fix look like a no-op even when the build
binary had it — docker mode (sample container → CoreDNS container
over a bridge network) reliably exercises recvmsg4 and mirrors the
original Flipkart production topology.

Adds:

- Dockerfile — stdlib-only Go build on golang:1.22-alpine, runtime
  on alpine:3.20, exposes :8086.
- coredns/Corefile + coredns/zone — minimal CoreDNS config with
  hardcoded A records for google.com, cloudflare.com, example.com so
  the test does not depend on external resolvers.

README stays accurate — strict-source validation is still exercised
end-to-end; only the CI topology changed.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Adds an optional NAMESERVER env var (ip:port) that, when set, appends
`&nameserver=<value>` to each /resolve call. Lets the CI harness aim
queries at a known CoreDNS fixture container even when keploy rewrites
the sample's `--net` into `--network=container:<keploy-v3>`, in which
case the sample's own /etc/resolv.conf (e.g. 127.0.0.11) doesn't
point anywhere reachable from inside the shared netns.

Unchanged by default — runs without NAMESERVER still use whatever
/etc/resolv.conf resolves to, which is what you want for local
standalone testing outside keploy.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants