Date: 2024-03-13
Context: https://github.com/neondatabase/neon/issues/7124
This is a copy of my post on Slack, for posterity and for publishing it to the outside world.
Christian Schwarz 15 days ago
I did more regression testing to make sure the issue is properly understood. Commits & experiment results are pushed to the branch
(neon-kPa6L9nl-py3.9) cs@devvm-mbp:[~/src/neon]: git log --reverse --no-decorate HEAD~3..HEAD
commit 72a8e090ddd3a5cf322b2e8174da2828a725a0c6
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 12:49:45 2024 +0000
experiment: for create_delta_layer, use global io_engine, but inside a spawn_blocking single-threaded runtime
This makes things worse with tokio-epoll-uring
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 19.574 s
This is a partial revert of 3da410c8fee05b0cd65a5c0b83fffa3d5680cd77
commit 4a8e7f87168e5f1836cfffa2aa1e426347b035a1
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 13:41:38 2024 +0000
experiment: for create_delta_layer _write path_, use StdFs io engine in a spawn_blocking thread single-threaded runtime
builds on top of the previous commit
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 13.153 s
commit c8c04c0db8ec58d07d63b9f7f240650627ad09ad
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 14:06:33 2024 +0000
experiment: for EphemeralFile write path, use StdFs io engine
together with previous commits, this brings us back down to
pre-regression
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.991 s
In current main, we're burning significantly less CPU with tokio-epoll-uring, but we take longer for the same amount of work.

What's happening is that the delta layer writer / EphemeralFile writers have small 8k buffers internally and don't do double-buffering. So, every time the 8k buffer is full, we do a VirtualFile::write_*().await, which we know has higher baseline latency than std-fs on systems that aren't under memory pressure.

Why is that? We don't do direct IO, so std-fs writes only go to the kernel page cache without waiting for any kind of kernel page eviction / swapping / write-back of pages.

If we were using direct IO, std-fs would still be marginally (but not meaningfully) faster in this benchmark, because there's only 1 stream of WAL being ingested and hence only 1 executor thread busy. It wouldn't be meaningfully faster because the direct IO latency would dominate for both std-fs and tokio-epoll-uring.

We can demonstrate the effect of direct IO without actually implementing it by opening VirtualFile with O_SYNC (for O_DIRECT, we'd need to actually align our buffers first).
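To make the flush-on-full pattern above concrete, here is a minimal sketch of it — a hypothetical simplification with illustrative names, not the actual EphemeralFile / delta layer writer code:

```rust
// Minimal sketch of the flush-on-full pattern described above. Hypothetical
// simplification -- NOT the actual EphemeralFile / delta layer writer code.
use tokio::io::{AsyncWrite, AsyncWriteExt};

const BUF_SIZE: usize = 8192; // the small 8k internal buffer

struct SmallBufWriter<W> {
    inner: W,     // stands in for VirtualFile
    buf: Vec<u8>, // single buffer, i.e. no double-buffering
}

impl<W: AsyncWrite + Unpin> SmallBufWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, buf: Vec::with_capacity(BUF_SIZE) }
    }

    async fn write(&mut self, mut data: &[u8]) -> std::io::Result<()> {
        while !data.is_empty() {
            let n = (BUF_SIZE - self.buf.len()).min(data.len());
            self.buf.extend_from_slice(&data[..n]);
            data = &data[n..];
            if self.buf.len() == BUF_SIZE {
                // Every 8k, ingest stalls for the full latency of one awaited
                // write (VirtualFile::write_*().await in the real code). std-fs
                // returns as soon as the kernel page cache is dirtied, so its
                // baseline latency here is lower than tokio-epoll-uring's.
                self.inner.write_all(&self.buf).await?;
                self.buf.clear();
            }
        }
        Ok(())
    }
}
```

With a bigger buffer or double-buffering, the ingest path could keep filling one buffer while the previous one is being written out, hiding most of that per-flush latency.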
So, what does this mean with regard to revert / fixing? Ideas:
last_record_lsn metrics.
I'll scope out the work required for (3) increased buffer sizes, and edit this message / post an update.
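For reference, the O_SYNC trick mentioned above could be prototyped with something like this — a rough sketch, not the actual VirtualFile change, and it assumes the libc crate:

```rust
// Sketch of the O_SYNC experiment: every write(2) on this file blocks until the
// data is durable, which approximates direct-IO latency without having to align
// buffers for O_DIRECT. Assumes the `libc` crate; the real change would go into
// VirtualFile's open path.
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

fn open_o_sync(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)
        .custom_flags(libc::O_SYNC)
        .open(path)
}
```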