The actual reason I beefed up Buffer a bit lately is *because* I need to do some protocol decoding of a byte stream.
Now I see IOBuffer arrive. In order to avoid code bloat, wouldn't it be a better idea to integrate the functionality of IOBuffer into Buffer and just keep one? Or is the performance difference so strikingly great that this sounds like a bad idea?
Stephen R. van den Berg wrote:
The actual reason I beefed up Buffer a bit lately is *because* I need to do some protocol decoding of a byte stream.
Now I see IOBuffer arrive. In order to avoid code bloat, wouldn't it be a better idea to integrate the functionality of IOBuffer into Buffer and just keep one? Or is the performance difference so strikingly great that this sounds like a bad idea?
Ideally, IOBuffer should inherit from String.Buffer, for maximum code reuse.
Stephen R. van den Berg wrote:
Now I see IOBuffer arrive. In order to avoid code bloat, wouldn't it be a better idea to integrate the functionality of IOBuffer into Buffer and just keep one? Or is the performance difference so strikingly great that this sounds like a bad idea?
I've been tracking IOBuffer extensions back to String.Buffer, I'll present what I have shortly. I suspect (but benchmarks will have to tell) that the String.Buffer implementation is not significantly slower than the current IOBuffer one (whilst supporting the full range of character widths).
However, reviewing the IOBuffer interface, I wonder about the following issues:
- Isn't it prudent to drop set_error_mode() and simply implement this functionality (the throw()) using a custom range_error() override?
- Why insist on lock()ing the buffer when subbuffers are active? Couldn't the code figure out by itself when a subbuffer exists and then decide on-demand and automatically when a copy needs to be made to transparently support the desired operation? And, in order to make the code throw errors when unintended copies are being made, we could implement a prohibit_copy() method (like read_only()) which makes the object throw errors as soon as it would need to make a copy to support some requested action.
I've been tracking IOBuffer extensions back to String.Buffer, I'll present what I have shortly.
I suspect (but benchmarks will have to tell) that the String.Buffer implementation is not significantly slower than the current IOBuffer one (whilst supporting the full range of character widths).
Well. You have some minor optimizations to do:
| int perf(object buffer)
| {
|   buffer->add("11");
|   for(int i=0;i<10000; i++ )
|   {
|     int l;
|     if( buffer->cut )
|     {
|       // String.Buffer: decode the 2-byte length by hand, then drop it
|       l = (buffer[0]<<8) | buffer[1];
|       buffer->cut(0,2,1);
|     }
|     else
|       l = buffer->read_int(2);
|     buffer->add(random_string(l));
|   }
|   return sizeof(buffer);
| }
perf( String.Buffer() );
Result 2: 325250971 Compilation: 624ns, Execution: 144.86s
perf( Stdio.IOBuffer() );
Result 3: 328787331 Compilation: 639ns, Execution: 194.06ms
(note that the length differs due to random_string)
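For reference, both branches of perf() decode the same 2-byte big-endian length prefix. The sketch below works through that arithmetic for the initial "11" header. It assumes a Pike where the class is available as Stdio.Buffer (the name suggested elsewhere in this thread); on the tree discussed here, substitute Stdio.IOBuffer.

```pike
int main()
{
  string header = "11";   // two bytes, each 0x31 (49)

  // Manual decode, as in the String.Buffer branch of perf():
  // (49 << 8) | 49 = 12544 + 49 = 12593
  int manual = (header[0] << 8) | header[1];

  // The same decode done by the buffer itself.
  // Assumption: Stdio.Buffer is the available name for IOBuffer.
  int via_buffer = Stdio.Buffer(header)->read_int(2);

  write("%d %d\n", manual, via_buffer);
  return 0;
}
```

So the first iteration appends 12593 random bytes, and every later iteration reads its length from random data, which is why the final sizes of the two runs differ.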
However, reviewing the IOBuffer interface, I wonder about the following issues:
- Isn't it prudent to drop set_error_mode() and simply implement this functionality (the throw()) using a custom range_error() override?
Well. That would work, yes. I simply did not remove the old way of throwing errors, since it is most often used as a plain buff->set_error_mode(1) when doing sub-parsing, as I showed in the documentation for set_error_mode.
Requiring rather complex sub-classing for that common use case seemed somewhat pointless.
- Why insist on lock()ing the buffer when subbuffers are active? Couldn't the code figure out by itself when a subbuffer exists and then decide on-demand and automatically when a copy needs to be made to transparently support the desired operation?
Not really. The subbuffer only contains a pointer directly into the memory area of the main buffer, so if the main buffer changes that area using realloc or malloc, the pointer becomes invalid. This could of course be fixed by adding a list of subbuffers to the main buffer, but then you run into issues with refcounting and such. Since the use case where you have a subbuffer active and want to modify the main buffer is rather uncommon, I thought it was OK that you have to call trim() on the subbuffer to do that.
Why not return IOBuffers practically everywhere, and then let the caller decide when and if to cast them to a string? It gets rid of excessive method diversification due to there needing to be a string and a buffer returning one. Returning a buffer is cheap, it doesn't copy the content.
Well, there is about a factor of 3 performance difference:
string perf2(object b) { while( b->read(1) ); }
string perf3(object b) { while( b->read_buffer(1) ); }
perf2(Stdio.IOBuffer(mb100));
Result 5: 0 Compilation: 664ns, Execution: 92.19ms
perf3(Stdio.IOBuffer(mb100));
Result 6: 0 Compilation: 680ns, Execution: 173.68ms
The perf3 figure does not include the cast to string, which should cost about as much as the plain read in perf2, so getting a string via a buffer ends up about 3x slower overall.
And most of the time you actually want the string version, not the buffer version.
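To make the two code paths being compared concrete, here is a minimal sketch. It assumes the class is available as Stdio.Buffer (substitute Stdio.IOBuffer on the tree discussed here) and that casting a buffer to string is supported; the variable names are illustrative only.

```pike
int main()
{
  // Assumption: Stdio.Buffer is the available name for IOBuffer.
  Stdio.Buffer b = Stdio.Buffer("abcd");

  // Path 1: read() hands back a Pike string directly.
  string s = b->read(2);

  // Path 2: read_buffer() hands back a sub-buffer object; a string
  // only comes into existence once the caller casts it.
  Stdio.Buffer sub = b->read_buffer(2);
  string t = (string)sub;

  write("%s %s\n", s, t);
  return 0;
}
```

The second path pays for one object creation plus the later cast, which is the extra cost the perf2/perf3 numbers above are measuring.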
One other thing:
Why not return IOBuffers practically everywhere, and then let the caller decide when and if to cast them to a string? It gets rid of excessive method diversification due to there needing to be a string and a buffer returning one. Returning a buffer is cheap, it doesn't copy the content.
Returning the buffer is cheap assuming that you have one already. Otherwise you have the cost of object creation, which - depending on the length of the buffer content - will be more expensive than the potential memcpy.
In what places do you think it would make sense to return a buffer object instead of a string?
arne
On 09/03/14 11:10, Stephen R. van den Berg wrote:
One other thing:
Why not return IOBuffers practically everywhere, and then let the caller decide when and if to cast them to a string? It gets rid of excessive method diversification due to there needing to be a string and a buffer returning one. Returning a buffer is cheap, it doesn't copy the content.
Arne Goedeke wrote:
Returning the buffer is cheap assuming that you have one already. Otherwise you have the cost of object creation, which - depending on the length of the buffer content - will be more expensive than the potential memcpy.
Erm... We are *in* a Buffer object, so by definition we have one. So returning a readonly-copy with zero-copy effort is easy. It basically delays the creation of the shared string as long as possible.
In what places do you think it would make sense to return a buffer object instead of a string?
As long as one is doing string operations (adding/subtracting/matching) Buffer objects are better. Once done with that, the final "result" can/should be a shared string.
Erm... We are *in* a Buffer object, so by definition we have one. So returning a readonly-copy with zero-copy effort is easy. It basically delays the creation of the shared string as long as possible.
Not really, we are in /a/ buffer object, not the subsection of it that should be returned. You have to create a new one to return a subsection. read_buffer is about as fast as it gets, it does the minimal amount of work.
As long as one is doing string operations (adding/subtracting/matching) Buffer objects are better. Once done with that, the final "result" can/should be a shared string.
Nothing that adds data to the buffer returns a string in the current buffer code.
The only thing that returns a string is if you call read() or read_hstring() on it.
An additional comment: By definition you are almost never actually "done" with an IOBuffer.
They are designed to be input and output buffers for IO.
And now the basic support is there to use them for Stdio.File objects.
Stdio.File now has a new nonblocking mode: Buffered I/O
In this mode the file object maintains two buffers, one for input and one for output.
The read callback will get the buffer as the second argument, and data that the user does not read from that buffer is kept until the next time data arrives from the file (this means you do not have to do your own buffering of input)
The output buffer is, unsurprisingly, used to output data from.
This has at least three somewhat convenient effects:
o The write callback will now receive that buffer as a second argument. You just add data to it to write it.
o Adding data to the buffer when /not/ in the write callback will still trigger sending of data if no write callback is pending.
o Your write callback will not be called until the buffer is actually empty.
An extremely small demo:
| void null() {}
|
| int main()
| {
|   Stdio.IOBuffer output = Stdio.IOBuffer();
|   Stdio.File fd = Stdio.File(0);
|
|   fd->set_buffer_mode( 0, output );
|
|   fd->set_nonblocking( Function.uncurry(output->add), null, null );
|
|   return -1;
| }
This will cause all data received on stdin to be echoed to .. stdin using buffered output.
Not the most useful application, but it does show how easy it is. Buffered mode in general is mainly useful because it removes the need for you to handle the buffers manually in your code.
Currently read() and write() are not in any way modified by having buffered output enabled, if you interact directly with the file object it will bypass the buffer. I am unsure if this is a good idea or not.
I haven't been keeping up with the *Buffer stuff. Would it be possible to make a line iterator on top of this, and would it make any difference to performance?
I have a sticky note on my desk that says I should check why Python is faster on some basic log parsing of huge ASCII logs I do.
Well, feel free to use the tokenization capabilities of Stdio.IOBuffer to see if that works better. Who knows?
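| class MyBuffer(Stdio.File huge_text_file)
| {
|   inherit Stdio.IOBuffer;
|
|   // Called when a read runs out of data: refill from the file.
|   int range_error( int bytes )
|   {
|     string s = huge_text_file->read(8192);
|     if( s && strlen(s) )
|     {
|       add(s);
|       return 1;  // more data was added, retry the read
|     }
|     return 0;    // nothing more to read, let the read fail
|   }
| }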
| MyBuffer buf = MyBuffer(whatever_fd);
| while( array line = buf->sscanf( "%[^\n]\n" ) )
|   .. process line[0]
The current buffered I/O mode for Stdio.File does not work unless the file is in non-blocking mode due to how the reading in the buffer works.
It can be fixed in at least two ways (one which is fairly obvious from the code above) but it is not yet done.
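The refill-on-demand idea can be tried without a file by feeding the buffer from an array of chunks. This is only a sketch under assumptions: the class is taken to be available as Stdio.Buffer (substitute Stdio.IOBuffer on the tree discussed here), range_error() is assumed to follow the "return true to retry" convention, and Refilling and chunks are hypothetical names made up for the illustration.

```pike
// Hypothetical helper: a buffer that refills itself from a list of chunks.
class Refilling(array(string) chunks)
{
  // Assumption: Stdio.Buffer is the available name for IOBuffer.
  inherit Stdio.Buffer;

  // Called when a read asks for more data than the buffer holds.
  int range_error(int howmuch)
  {
    if (!sizeof(chunks)) return 0;  // nothing left, let the read fail
    add(chunks[0]);                 // append the next chunk
    chunks = chunks[1..];
    return 1;                       // ask the buffer to retry the read
  }
}

int main()
{
  Refilling b = Refilling(({ "ab", "cd" }));
  // A read that spans two chunks; the caller never sees the refills.
  write("%O\n", b->read(3));
  return 0;
}
```

The caller just reads; whether the data was already buffered or had to be pulled in is hidden inside range_error().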
And now the basic support is there to use them for Stdio.File objects.
Stdio.File now has a new nonblocking mode: Buffered I/O
In this mode the file object maintains two buffers, one for input and one for output.
The read callback will get the buffer as the second argument, and data that the user does not read from that buffer is kept until the next time data arrives from the file (this means you do not have to do your own buffering of input)
The output buffer is, unsurprisingly, used to output data from.
This has at least three somewhat convenient effects:
o The write callback will now receive that buffer as a second argument. You just add data to it to write it.
o Adding data to the buffer when /not/ in the write callback will still trigger sending of data if no write callback is pending.
o Your write callback will not be called until the buffer is actually empty.
[...]
Great addition, although I'm a bit uncertain about whether it would be more suitable to instead add it to Stdio.FILE.
One thing that I'm missing is a way to enqueue a close on write done.
Hmm... Maybe something like the following would work?
fd->set_write_callback(Function.uncurry(Function.uncurry( Function.curry(fd->close, "w"))));
But I don't like the circular reference...
Currently read() and write() are not in any way modified by having buffered output enabled, if you interact directly with the file object it will bypass the buffer. I am unsure if this is a good idea or not.
It doesn't sound like a good idea, since it would cause interleaving between the manual and automatic calls at random places in the stream, and it would be complicated to recover in a deterministic way.
As a comparison SSL.File also has buffers on both read and write (I hope to be able to replace these with IOBuffers soon), and there read(), write() and close() all go via the buffers.
Then there's of course the separate problem with out of band data...
/grubba
Great addition, although I'm a bit uncertain about whether it would be more suitable to instead add it to Stdio.FILE.
Well, that is not usually what you get from accept() and such. Also, it's not all that usable in Stdio.FILE, since that one can be widestring based.
One thing that I'm missing is a way to enqueue a close on write done.
Hmm... Maybe something like the following would work?
fd->set_write_callback(Function.uncurry(Function.uncurry( Function.curry(fd->close, "w"))));
In your write callback, add fd->close()? It will only be called when the buffer is empty.
That is basically what the ->set_write_callback above does, but it's probably easier to simply set a normal write callback. :)
Currently read() and write() are not in any way modified by having buffered output enabled, if you interact directly with the file object it will bypass the buffer. I am unsure if this is a good idea or not.
It doesn't sound like a good idea, since it would cause interleaving between the manual and automatic calls at random places in the stream, and it would be complicated to recover in a deterministic way.
I sort of agree. It's easy enough to make them redirect data to and from the buffers instead. It was just that that would be an actually noticeable change to the file object. Not a huge one, really, but somewhat sizable.
Having close buffered is however probably not needed for most buffered protocols. If you do want a delayed close, simply set the write callback to one that calls fd->close.
Ideally, IOBuffer should inherit from String.Buffer, for maximum code reuse.
For the reasons I mentioned in the last mail that is not actually possible.
Or, well, for sure, I could inherit String.Buffer, but _all_ methods would have to be overridden, and some do not really make sense, like `+. The main point of IOBuffer is to avoid copying of memory.
The actual reason I beefed up Buffer a bit lately is *because* I need to do some protocol decoding of a byte stream.
Now I see IOBuffer arrive. In order to avoid code bloat, wouldn't it be a better idea to integrate the functionality of IOBuffer into Buffer and just keep one?
Sorry about the timing. I have had IOBuffer on the way for some time. (I am still wondering where to put it, however; that has, believe it or not, been a blocker for me. Perhaps Stdio.Buffer? I will create a buffered stream that reads to and writes from said object, without creating pike strings.)
Unfortunately it is not possible to make String.Buffer even close to as efficient as long as it uses a stringbuilder. And not using a stringbuilder slows some things down (sprintf comes to mind) and makes others more or less impossible (wide strings) without excessive code duplication.
The whole reason for IOBuffer is that it uses pointer arithmetic on guaranteed 8bit strings to be fast at both reading from the beginning and writing to the end at the same time (I am, by the way, considering converting it to a circular buffer to avoid the one memmove it does at times). It is also efficient at creating sub-buffers.
The fact that it is guaranteed to only contain 8bit characters helps a lot too.
Or is the performance difference so strikingly great that this sounds like a bad idea?
As things stand now, yes.
Things like subbuffers are unfortunately actually impossible when using a stringbuilder.
I guess I might outline the plan for IOBuffer some more (I actually did this during the last pike conference, but it has been a while. :))
o Add support for reading to and writing from file objects to it. Either add support in Stdio.File (also add System.Memory + String.Buffer?) or do it the other way around (that way lies madness, however, see also: Shuffler)
The main goal here is to do one copy less and avoid the pike string creation
o Add support for System.Memory & String.Buffer to add()
o Add support for reading things other than "binary Hollerith" strings and integers: line, word, JSON object, encode_value?
o Add support for "throw mode".
It is rather useful to be able to change what happens when you try to read data that is not in the buffer.
| void read_callback()
| {
|   inbuffer->add( new_data );
|   // Read next packet
|   while( Buffer packet = inbuffer->read_hbuffer(2) )
|   {
|     packet->set_error_mode(Buffer.THROW_ERROR);
|     if( mixed e = catch(handle_packet( packet ) ) )
|       if( e->buffer_error )
|         protocol_error(); // illegal data in packet
|       else
|         throw(e); // the other code did something bad
|   }
| }
The handle_packet function then just assumes the packet is well formed, and reads without checking whether each read succeeds (since it will, or an error will be thrown).
This code snippet also demonstrates why the subbuffers are handy: read_hbuffer does not actually copy any data, the returned buffer just keeps a reference to the data in the original buffer.
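A self-contained variant of the same pattern, runnable without a socket, might look like the sketch below. It assumes the class is available as Stdio.Buffer (substitute Stdio.IOBuffer on the tree discussed here) and uses set_error_mode(1) rather than the Buffer.THROW_ERROR constant from the outline above; the two-field packet layout is invented for the illustration.

```pike
int main()
{
  // Assumption: Stdio.Buffer is the available name for IOBuffer.
  Stdio.Buffer inbuffer = Stdio.Buffer();

  // One framed packet: a 2-byte length header followed by 2 payload bytes.
  inbuffer->add_hstring("\1\2", 2);

  // The sub-buffer covers exactly the payload of that packet.
  Stdio.Buffer packet = inbuffer->read_hbuffer(2);
  packet->set_error_mode(1);   // out-of-bounds reads now throw

  mixed err = catch {
    // Hypothetical layout: two 1-byte fields.
    write("fields: %d %d\n", packet->read_int(1), packet->read_int(1));
    packet->read_int(1);       // a third field that is not there
  };
  write(err ? "short packet\n" : "ok\n");
  return 0;
}
```

The parsing code inside the catch never checks a return value; running off the end of the packet surfaces as a single error at the framing level.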
pike-devel@lists.lysator.liu.se