
Stale OCSP response and "unable to write OCSP staple file" error #6524

Open
WGH- opened this issue Aug 17, 2024 · 12 comments
Labels
bug 🐞 Something isn't working

Comments

@WGH-

WGH- commented Aug 17, 2024

I'm running Caddy with TLS by reusing a certificate obtained externally with acmetool:

{
        auto_https off
}
https://:8443 {
        tls /var/lib/acme/live/example.com/fullchain /var/lib/acme/live/example.com/privkey {
        }
        # ...
}

After a while, Firefox stops opening the site, citing stale OCSP response. Restarting Caddy fixes the problem. Reloading doesn't help.

I noticed the following error in the log file, which might be related to the problem:

{"level":"error","ts":1723840048.373416,"logger":"tls.cache.maintenance","msg":"stapling OCSP","identifiers":["example.com"],"error":"unable to write OCSP staple file for [example.com]: mkdir caddy: not a directory"}
@mholt
Member

mholt commented Aug 17, 2024

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible using the latest version of Caddy. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍 I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

This should be the latest version of Caddy:

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Instructions -- please heed otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials. Please enable debug and access logs.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

@mholt mholt added the needs info 📭 Requires more information label Aug 17, 2024
@WGH-
Author

WGH- commented Aug 18, 2024

I have a theory. The bug is here, in this very line: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364

When stapleOCSP fails to write the staple file (not actually a hard error, since cert.ocsp is still updated in memory), RenewManagedCertificates skips the rest of the steps and doesn't update the staple (see the `for certKey, update := range updated` loop).

When the server is started, the error is logged, but otherwise ignored: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334. That explains why Caddy works fine after restarting, but breaks down much later.
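The pattern described above can be illustrated with a minimal Go sketch. The types and function names here are simplified stand-ins, not certmagic's actual structures: the point is that the staple-file write failure is a soft error, because the in-memory staple is refreshed regardless, so aborting the caller's update loop on that error leaves a stale staple being served.

```go
package main

import (
	"errors"
	"fmt"
)

// cert is a stand-in for a cached certificate whose in-memory
// OCSP staple is refreshed even when persisting it to disk fails.
type cert struct {
	name string
	ocsp []byte
}

var errWrite = errors.New("unable to write OCSP staple file")

// stapleOCSP refreshes c.ocsp first, then tries to persist it;
// a persistence failure is returned but the staple is still usable.
func stapleOCSP(c *cert, persistFails bool) error {
	c.ocsp = []byte("fresh-staple") // in-memory update happens regardless
	if persistFails {
		return errWrite // soft error: the fresh staple is already in memory
	}
	return nil
}

func main() {
	updated := map[string]*cert{"example.com": {name: "example.com"}}
	for key, c := range updated {
		if err := stapleOCSP(c, true); err != nil {
			// A `continue` here would skip swapping the refreshed cert
			// into the cache, which is the stale-staple bug described above.
			fmt.Printf("stapling OCSP for %s: %v (continuing anyway)\n", key, err)
		}
		// Proceed with the cache update so the fresh staple is served.
		fmt.Printf("%s staple: %s\n", key, c.ocsp)
	}
}
```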

@WGH-
Author

WGH- commented Aug 18, 2024

I think the fix is to set the HOME environment variable to a writable directory (requiring services to have $HOME is weird, but I dunno). I'll report back in a week or so with whether this helped.
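For reference, one way to do that with a unit like the one later in this thread is a systemd drop-in. This is a sketch, not a verified fix: the directory name foobar is hypothetical, StateDirectory works together with DynamicUser=true to provide a writable /var/lib/foobar, and Caddy resolves its data/config directories from $XDG_DATA_HOME and $XDG_CONFIG_HOME (falling back to $HOME):

```ini
# /etc/systemd/system/foobar.service.d/override.conf (hypothetical path)
[Service]
# Creates a writable /var/lib/foobar even with DynamicUser=true
StateDirectory=foobar
# Point Caddy's data and config directories somewhere writable
Environment=XDG_DATA_HOME=/var/lib/foobar/data
Environment=XDG_CONFIG_HOME=/var/lib/foobar/config
```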

@francislavoie
Member

Please fill out the template as asked. How are you running Caddy, etc.? It matters, and we can save time that way.

@WGH-
Copy link
Author

WGH- commented Aug 18, 2024

1. Environment

1a. Operating system and version

Debian 11

1b. Caddy version (run caddy version or paste commit SHA)

This should be the latest version of Caddy:

v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk=

2. Description

2a. What happens (briefly explain what is wrong)

Under unknown circumstances (theory in 2e below), after some uptime,
Caddy is unable to serve an up-to-date OCSP staple.

2b. Why it's a bug (if it's not obvious)

Caddy should always serve an up-to-date OCSP staple if OCSP stapling is enabled.

2c. Log output

Aug 17 17:00:07 subdomain.example.com systemd[1]: Starting foobar.service...
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7447097,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7498074,"msg":"using adjacent Caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.754352,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7640152,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7667134,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0000e9a80"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7858684,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7866526,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7873237,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7880044,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.788521,"msg":"failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details."}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7890544,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.7896967,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:00:08 subdomain.example.com systemd[1]: Started foobar.service.
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.7882998,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723914008.800019,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.800352,"logger":"tls","msg":"finished cleaning storage units"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723914008.7997715,"msg":"serving initial configuration"}
Aug 17 17:00:08 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723914008.8008697,"msg":"$HOME environment variable is empty - please fix; some assets might be stored in ./caddy"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloading foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"warn","ts":1723915243.7769516,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7871993,"msg":"using adjacent Caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3488281]: {"level":"info","ts":1723915243.7911592,"msg":"adapted config to JSON","adapter":"caddyfile"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8025868,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/load","remote_ip":"127.0.0.1","remote_port":"50174","headers":{"Accept-Encoding":["gzip"],"Cache-Control":["must-revalidate"],"Content-Length":["818"],"Content-Type":["application/json"],"Origin":["http://localhost:2019"],"User-Agent":["Go-http-client/1.1"]}}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.80679,"logger":"admin","msg":"admin endpoint started","address":"localhost:2019","enforce_origin":false,"origins":["//localhost:2019","//[::1]:2019","//127.0.0.1:2019"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1723915243.869151,"logger":"tls","msg":"stapling OCSP","error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8700619,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8708005,"logger":"http.handlers.forward_proxy","msg":"Secret domain used to connect to proxy: subdomain.example.com"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8716269,"logger":"http","msg":"enabling HTTP/3 listener","addr":":8443"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8720572,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1","h2","h3"]}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.872618,"logger":"http","msg":"servers shutting down with eternal grace period"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1723915243.8733037,"msg":"unable to create folder for config autosave","dir":"caddy","error":"mkdir caddy: not a directory"}
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.8747997,"logger":"admin.api","msg":"load complete"}
Aug 17 17:20:43 subdomain.example.com systemd[1]: Reloaded foobar.service.
Aug 17 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1723915243.9524853,"logger":"admin","msg":"stopped previous server","address":"localhost:2019"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"warn","ts":1724001643.881339,"logger":"tls","msg":"unable to get instance ID; storage clean stamps will be incomplete","error":"open caddy/instance.uuid: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724001643.895825,"logger":"tls","msg":"could not clean default/global storage","error":"unable to acquire storage_clean lock: creating lock file: open caddy/locks/storage_clean.lock: not a directory"}
Aug 18 17:20:43 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724001643.8962796,"logger":"tls","msg":"finished cleaning storage units"}
Aug 18 19:00:08 subdomain.example.com caddy[3487890]: {"level":"error","ts":1724007608.9948008,"logger":"tls.cache.maintenance","msg":"stapling OCSP","identifiers":["example.com","subdomain.example.com","wgh.example.com"],"error":"unable to write OCSP staple file for [example.com subdomain.example.com wgh.example.com]: mkdir caddy: not a directory"}
Aug 18 19:38:30 subdomain.example.com caddy[3487890]: {"level":"info","ts":1724009910.9073286,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}

2d. Workaround(s)

Ensure Caddy has a writable data directory (probably).

2e. Relevant links

I have a theory. The bug is here, in this very line: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/maintain.go#L364

When stapleOCSP fails to write the staple file (not actually a hard error, since cert.ocsp is still updated in memory), RenewManagedCertificates skips the rest of the steps and doesn't update the staple (see the `for certKey, update := range updated` loop).

When the server is started, the error is logged, but otherwise ignored: https://github.com/caddyserver/certmagic/blob/3bad5b6bb595b09c14bd86ff0b365d302faaf5e2/certificates.go#L331-L334. That explains why Caddy works fine after restarting, but breaks down much later.

3. Tutorial (minimal steps to reproduce the bug)

This tutorial might not be minimal.

Caddyfile:

{
	auto_https off
}

https://:8443 {
	tls /var/lib/acme/live/example.com/fullchain /var/lib/acme/live/example.com/privkey {
	}
	log {
		output stderr
	}
	route {
		forward_proxy {
			probe_resistance subdomain.example.com
			basic_auth <redacted>
			hide_ip
			hide_via
		}
		file_server
	}
}

Having an externally managed certificate might not be a required step, but
that's the configuration I have. The certificate is managed by acmetool and
obtained via Let's Encrypt.

Both the Caddyfile and the caddy executable are in /opt/foobar. All of them,
including the directory, are owned by root.

systemd unit:

# /etc/systemd/system/foobar.service
[Unit]
After=network.target

[Service]
DynamicUser=true
SupplementaryGroups=ssl-cert
ExecStart=/opt/foobar/caddy run
ExecReload=/opt/foobar/caddy reload --force
WorkingDirectory=/opt/foobar/
Type=notify
PrivateTmp=true
ProtectSystem=full

[Install]
WantedBy=multi-user.target

With DynamicUser=true, this service has no $HOME set. You could probably
create a system user with a non-writable home (like / or /nonexistent) to
the same effect. Note that, as mentioned above, the current directory is
also non-writable, and it already contains a file named caddy (Caddy tries
to use ./caddy/ as a fallback data directory).

Now, start Caddy. It will work fine.

Then check when the OCSP staple will expire: `openssl s_client -connect subdomain.example.com:8443 -status`.
This will likely be a week from now, so you have to wait.

After that, try to open the site in Firefox, which is known to verify OCSP staples.
It will return an error.

@francislavoie
Member

Why are you using DynamicUser=true? Our recommendation has always been to run with a caddy user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service

@WGH-
Author

WGH- commented Aug 18, 2024

> Why are you using DynamicUser=true? Our recommendation has always been to run with a caddy user. See https://caddyserver.com/docs/install#debian-ubuntu-raspbian and https://caddyserver.com/docs/running#linux-service

I know. I just don't want to create a user upfront. (In any case, that's likely not the problem; the lack of a writable home directory is.)

@francislavoie
Member

I'm inclined to close this. It's not really a Caddy issue; it's more an invalid system configuration.

@WGH-
Author

WGH- commented Aug 18, 2024

Perhaps a non-writable data dir should be a hard error on start? Otherwise we get confusing failure modes, especially of the "this appears to work, but will fail after a week" kind.

@mholt
Member

mholt commented Aug 19, 2024

@WGH-
Author

WGH- commented Aug 19, 2024

Aha, I see it's called by obtainCert and renewCert, but in my configuration, Caddy doesn't obtain certificates.

@mholt
Member

mholt commented Aug 20, 2024

Oh... that's a good point. And an unusual use case.

I think in order to properly fix this we should implement the TODO:

// TODO: this is not necessary every time; should only perform check once every so often for each storage, which may require some global state...

Because OCSP ops are a lot more common than cert renewals.

Basically, we should cache/remember the storage checks for... the lifetime of the process, I guess? Or at least a day or something like that.

Then this problem would be solved.

@mholt mholt added bug 🐞 Something isn't working and removed needs info 📭 Requires more information labels Aug 20, 2024