Date: 2024-03-13
Context: https://github.com/neondatabase/neon/issues/7124
This is a copy of my post on Slack, for posterity and for publishing it to the outside world.
Christian Schwarz 15 days ago
I did more regression testing to make sure the issue is properly understood. Commits & experiment results are pushed to the branch
(neon-kPa6L9nl-py3.9) cs@devvm-mbp:[~/src/neon]: git log --reverse --no-decorate HEAD~3..HEAD
commit 72a8e090ddd3a5cf322b2e8174da2828a725a0c6
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 12:49:45 2024 +0000
experiment: for create_delta_layer, use global io_engine, but inside a spawn_blocking single-threaded runtime
This makes things worse with tokio-epoll-uring
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 19.574 s
This is a partial revert of 3da410c8fee05b0cd65a5c0b83fffa3d5680cd77
commit 4a8e7f87168e5f1836cfffa2aa1e426347b035a1
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 13:41:38 2024 +0000
experiment: for create_delta_layer _write path_, use StdFs io engine in a spawn_blocking thread single-threaded runtime
builds on top of the previous commit
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 13.153 s
commit c8c04c0db8ec58d07d63b9f7f240650627ad09ad
Author: Christian Schwarz <[email protected]>
Date: Wed Mar 13 14:06:33 2024 +0000
experiment: for EphemeralFile write path, use StdFs io engine
together with previous commits, this brings us back down to
pre-regression
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.991 s
In current main, we're burning significantly less CPU with tokio-epoll-uring, but we take longer for the same amount of work.

What's happening is that the delta layer writer / EphemeralFile writers have small 8k buffers internally and don't do double-buffering. So, every time the 8k buffer is full, we do a VirtualFile::write_*().await, which we know has higher baseline latency than std-fs on systems that aren't under memory pressure.

Why is that? We don't do direct IO, so std-fs writes only go to the kernel page cache without waiting for any kind of kernel page eviction / swapping / write-back of pages.

If we were using direct IO, std-fs would still be marginally (but not meaningfully) faster in this benchmark, because there's only 1 stream of WAL being ingested and hence only 1 executor thread busy. It wouldn't be meaningfully faster because the direct IO latency would dominate for both std-fs and tokio-epoll-uring.

We can demonstrate the effect of direct IO without actually implementing it by opening VirtualFile with O_SYNC (for O_DIRECT, we'd need to actually align our buffers first).
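To make the flush-on-full pattern above concrete, here is a minimal sketch of it — a hypothetical simplification with illustrative names, not the actual EphemeralFile / delta layer writer code:

```rust
// Minimal sketch of the flush-on-full pattern described above. Hypothetical
// simplification -- NOT the actual EphemeralFile / delta layer writer code.
use tokio::io::{AsyncWrite, AsyncWriteExt};

const BUF_SIZE: usize = 8192; // the small 8k internal buffer

struct SmallBufWriter<W> {
    inner: W,     // stands in for VirtualFile
    buf: Vec<u8>, // single buffer, i.e. no double-buffering
}

impl<W: AsyncWrite + Unpin> SmallBufWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, buf: Vec::with_capacity(BUF_SIZE) }
    }

    async fn write(&mut self, mut data: &[u8]) -> std::io::Result<()> {
        while !data.is_empty() {
            let n = (BUF_SIZE - self.buf.len()).min(data.len());
            self.buf.extend_from_slice(&data[..n]);
            data = &data[n..];
            if self.buf.len() == BUF_SIZE {
                // Every 8k, ingest stalls for the full latency of one awaited
                // write (VirtualFile::write_*().await in the real code). std-fs
                // returns as soon as the kernel page cache is dirtied, so its
                // baseline latency here is lower than tokio-epoll-uring's.
                self.inner.write_all(&self.buf).await?;
                self.buf.clear();
            }
        }
        Ok(())
    }
}
```

With a bigger buffer or double-buffering, the ingest path could keep filling one buffer while the previous one is being written out, hiding most of that per-flush latency.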
So, what does this mean with regard to revert / fixing? Ideas:
last_record_lsn metrics.
I'll scope out the work required for (3) increased buffer sizes, and edit this message / post an update.
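For reference, the O_SYNC trick mentioned above could be prototyped with something like this — a rough sketch, not the actual VirtualFile change, and it assumes the libc crate:

```rust
// Sketch of the O_SYNC experiment: every write(2) on this file blocks until the
// data is durable, which approximates direct-IO latency without having to align
// buffers for O_DIRECT. Assumes the `libc` crate; the real change would go into
// VirtualFile's open path.
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

fn open_o_sync(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create(true)
        .custom_flags(libc::O_SYNC)
        .open(path)
}
```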