Why use fd_write with smaller chunks than the maximum chunk we want to write?
Because the write(2) system call /copies/ the data to kernel space. (1)
It is _far_ faster to write small chunks (4k with page alignment would be optimal, really). (2)
Let's say that we have a 1 GB buffer. It's fairly likely that we cannot write the whole thing at once to a nonblocking file, right? So, this way, at most 4k is copied needlessly.
1) Not true in all cases, but it holds as a simplification.
2) If the buffer were page aligned at all times, (1) could be made untrue by using the nice async_io Linux extensions.
You can see a small discussion of this carried out in the comments in file.c.
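
As a rough illustration of the idea above, here is a minimal sketch of a chunked write loop. It is not the actual fd_write from this code base: write_chunked and CHUNK_SIZE are hypothetical names, and 4096 is only an assumed page size. It calls write(2) directly and stops as soon as the descriptor refuses more data.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed chunk size: 4k, which is one page on many systems. */
#define CHUNK_SIZE 4096

/*
 * Hypothetical helper (not the project's fd_write).
 * Hands buf to write(2) in pieces of at most CHUNK_SIZE bytes.
 * Returns the number of bytes accepted so far (possibly 0 if a
 * nonblocking descriptor is full), or -1 if a real error occurred
 * before anything was written.
 */
static ssize_t write_chunked(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t want = len - done;
        if (want > CHUNK_SIZE)
            want = CHUNK_SIZE;

        ssize_t n = write(fd, buf + done, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry this chunk */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;      /* nonblocking fd is full: stop for now */
            return done > 0 ? (ssize_t)done : -1;
        }

        done += (size_t)n;
        if ((size_t)n < want)
            break;          /* short write: no point in pushing further */
    }
    return (ssize_t)done;
}
```

The caller keeps the return value as an offset and calls again with buf + offset once the descriptor becomes writable. Each attempt offers the kernel at most CHUNK_SIZE bytes, never the whole buffer.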