Does anyone know how often in the code we actually depend on the fact that the same string will be at the same address in memory?
Often, but it's probably not hard to find a set of gatekeeper functions that cover all the cases.
Because I'm contemplating an optimisation which would involve making the string duplication avoidance opportunistic instead of mandatory.
I.e. something along the lines of: all strings shorter than stringmin are always optimised down to a single reference, while strings above that *might* have more than one reference, but not necessarily (i.e. they're not fully hashed all the time, to avoid the overhead of repeatedly rehashing large strings when juggling lots of strings around).
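To make the idea concrete, here's a minimal Python sketch of opportunistic interning under a threshold. The names (`STRINGMIN`, `make_string`, the dict-based table) are purely illustrative, not Pike's actual stralloc internals:

```python
STRINGMIN = 16       # hypothetical threshold; the real value would come from profiling
_intern_table = {}   # short strings: always exactly one shared instance

def make_string(s):
    """Return a canonical instance for short strings; pass long ones through."""
    if len(s) < STRINGMIN:
        # setdefault returns the already-stored instance on a hit,
        # so all short equal strings share one object.
        return _intern_table.setdefault(s, s)
    # Long strings are not (always) deduplicated: equal contents
    # may live at different addresses.
    return s

a = make_string("foo")
b = make_string("foo")
assert a is b            # short strings: identity comparison stays valid

x = make_string("x" * 100)
y = make_string("x" * 100)
assert x == y            # long strings: only equality is guaranteed
```

The point of the sketch is the invariant split: below the threshold, `same string => same address` still holds; above it, only content equality can be relied on, which is exactly why the gatekeeper functions would need patching.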
I did some preparations for this ~6 years ago (cf. commit 3788c640).
Note that it typically isn't the calculation of the hash that is expensive, but the comparison on a hash hit.
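A quick sketch of why that is (hypothetical `intern_lookup`, not Pike's actual code): on a hash hit the candidate string must still be compared in full, which is O(n) in the string length, whereas the hash is computed once:

```python
def intern_lookup(table, s):
    """Return the canonical copy of s, inserting it if absent.

    table maps hash values to buckets (lists) of strings.
    """
    bucket = table.setdefault(hash(s), [])
    for candidate in bucket:
        # Hash hit: a full O(len(s)) comparison is still required, and
        # this -- not computing the hash -- dominates for large strings.
        if candidate == s:
            return candidate
    bucket.append(s)
    return s

table = {}
big = "x" * 10_000
canonical = intern_lookup(table, big)
# A second lookup with equal contents hits the hash, does the full
# comparison, and hands back the canonical instance.
assert intern_lookup(table, "x" * 10_000) is canonical
```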
All the places that depend on same string = same address would need to be patched. Also, to determine stringmin, some profiling of existing apps would be interesting. Is that statistic available for, say, Roxen, so we can see the distribution of string length and reference count in a running application?
Note also that in Pike the string hash function is parameterized, and the parameters are changed depending on how the hashtable is balanced. None of the statistics used for balancing the hashtable are currently visible at the Pike level, and they differ somewhat between different versions of Pike.
/grubba