It is a basic fact of .NET that strings are immutable. Every time you “mutate” a string, a new one is created.
If, for example, you concatenate two strings, then the first string, the second string, and the resulting string will all exist in memory:
    var a = "Hello ";
    var b = "World!";
    var c = a + b;
Even if you don’t need the original, and you have something more like
    var a = "Hello ";
    a += "World!";
within memory, you will still have all 3 strings.
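To make that visible, here is a minimal sketch, assuming a .NET Core (or later) project that allows top-level statements. The variable name `original` is mine; it just keeps a second reference to the first string so we can look at it after the `+=`.

```csharp
using System;

var a = "Hello ";
var original = a;   // second reference to the same string object

a += "World!";      // builds a brand new string; the old one is untouched

Console.WriteLine(original);                            // Hello
Console.WriteLine(a);                                   // Hello World!
Console.WriteLine(object.ReferenceEquals(original, a)); // False
```

The old string only goes away once nothing references it and the garbage collector gets around to it.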
As a note, += on a string actually compiles into a call to string.Concat, which doesn’t change anything, but I just think it’s really neat. (Sharplab.io example)
So if you were to do something like
    var s = "The quick brown fox jumps over the lazy dog";
    var words = s.Split(" ");
you will end up with the string in its original form, and then each word will be allocated in memory once more, roughly doubling your memory footprint.
It’s made even worse when you start splitting splits:
    var s = "a,b|c,d|e,f";
    var pairs = s.Split("|");
    foreach (var pair in pairs)
    {
        var split = pair.Split(",");
        var key = split[0];
        var value = split[1];
        // stuff
    }
Within this example, each value ends up existing as a string in the original s, then once again in the array pairs (sans |), and then once again in split (sans ,).
This means that if you want to write really good code, you have to do things like read character by character, and write your own state machine to parse out the string.
And ultimately, there’s no reason anyone would want to write fancy fast code when they can instead write something they can come back to 6 months later and understand.
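For a sense of what the character-by-character approach looks like, here is a rough sketch of a tiny state machine for the `"a,b|c,d|e,f"` example above. It is not the code from our project; the `ParsePairs` name and the single `readingKey` flag are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: walks the input once and emits key/value pairs.
static List<KeyValuePair<string, string>> ParsePairs(string input)
{
    var result = new List<KeyValuePair<string, string>>();
    var key = new StringBuilder();
    var value = new StringBuilder();
    var readingKey = true; // the only state: are we before or after the ','?

    foreach (var c in input)
    {
        if (c == ',' && readingKey)
        {
            readingKey = false; // key finished, value starts
        }
        else if (c == '|')
        {
            // pair finished: store it and reset for the next one
            result.Add(new KeyValuePair<string, string>(key.ToString(), value.ToString()));
            key.Clear();
            value.Clear();
            readingKey = true;
        }
        else if (readingKey)
        {
            key.Append(c);
        }
        else
        {
            value.Append(c);
        }
    }

    // flush the last pair, which has no trailing '|'
    if (key.Length > 0 || value.Length > 0)
    {
        result.Add(new KeyValuePair<string, string>(key.ToString(), value.ToString()));
    }

    return result;
}

foreach (var pair in ParsePairs("a,b|c,d|e,f"))
{
    Console.WriteLine($"{pair.Key} => {pair.Value}");
}
```

Even for a format this small, that is a lot more to read and get wrong than two calls to Split, which is exactly the maintainability trade-off in question.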
Enter .NET Core string improvements.
In .NET Core, there have been some amazing improvements where you can use Span<T> in order to have a sort of safe way of referring to memory.
Ultimately, a Span<T> is essentially like having a pointer to a chunk of memory, plus a length. It’s a lot like in C++, where you could just have a pointer to somewhere in memory and then interpret that memory. But instead it’s safe! Sort of. You always have to accept the risks associated with memory access. Something else could change the memory out from under you. You could assume something is 4 bytes long, but something changes and now it’s 8, etc.
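Here is a small sketch of both halves of that, the "safe" part and the "sort of" part, using a span over a plain int array. The `SpanDemo` helper and its tuple field names are made up for this example; `AsSpan` and the span indexer are the real APIs.

```csharp
using System;

// Returns the array after the edits, the value seen through the span, and
// whether reading past the end of the span was rejected.
static (int[] Numbers, int SeenThroughSpan, bool OutOfRangeRejected) SpanDemo()
{
    int[] numbers = { 1, 2, 3, 4, 5 };
    Span<int> middle = numbers.AsSpan(1, 3); // a view over { 2, 3, 4 }, no copy

    middle[0] = 20;   // writing through the span changes the array
    numbers[2] = 30;  // changing the array changes what the span sees

    var rejected = false;
    try
    {
        _ = middle[3]; // one past the end of the 3-element view
    }
    catch (IndexOutOfRangeException)
    {
        rejected = true; // bounds are checked, unlike a raw pointer
    }

    return (numbers, middle[1], rejected);
}

var demo = SpanDemo();
Console.WriteLine(string.Join(", ", demo.Numbers)); // 1, 20, 30, 4, 5
Console.WriteLine(demo.SeenThroughSpan);            // 30
Console.WriteLine(demo.OutOfRangeRejected);         // True
```

The span never lets you read outside its window, but it does nothing to stop someone else from changing the memory inside it.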
However, when you combine the immutable nature of a string with spans, something magical happens.
In .NET Core, Microsoft introduced a string constructor that accepts a ReadOnlySpan<char>, alongside AsSpan() for viewing an existing string as a span.
This means that when you split strings, or do certain other string operations, the intermediate work can reference the original immutable string’s memory instead of copying it, and characters are only copied when a new string is actually created.
Because the string is immutable and the span is read-only, the operation remains safe, and quite a bit faster.
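As a sketch of what that looks like from the outside (this is not a look at the framework's internals), the snippet below slices a word out of a string without allocating, then turns the slice into a real string. The `SliceDemo` helper is hypothetical; `AsSpan`, `Slice`, `Overlaps`, and the `string(ReadOnlySpan<char>)` constructor are the real APIs.

```csharp
using System;

// Returns whether the slice shares memory with the source string, whether a
// string built from the slice shares memory with it, and the copied text.
static (bool SliceShares, bool CopyShares, string Text) SliceDemo(string source)
{
    ReadOnlySpan<char> whole = source.AsSpan();
    ReadOnlySpan<char> word = whole.Slice(4, 5); // a window onto the source, no allocation

    string copy = new string(word);              // allocates and copies the 5 characters

    return (whole.Overlaps(word), whole.Overlaps(copy.AsSpan()), copy);
}

var slice = SliceDemo("The quick brown fox jumps over the lazy dog");
Console.WriteLine(slice.SliceShares); // True
Console.WriteLine(slice.CopyShares);  // False
Console.WriteLine(slice.Text);        // quick
```

So the slice is free, and the copy only happens at the point where you genuinely need a string.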
Within our code, we have a string of the format key1, key2|value1|value2, which would result in a Dictionary<string, List<string>>, or in this case something like
    {key1, [] },
    {key2, [value1, value2] }
but imagine a much much longer string.
Within our code, we had 3 possible ways of parsing:
- String.Split – previously unthinkable because of its memory usage and performance
- Regular Expressions – not very readable, but a lot shorter than a state machine
- StringReader and a manually implemented state machine
Obviously, we went with the StringReader in the past. But when you look at the benchmarks now, it’s a whole new world for code maintainability.
| Method       |      Mean |    Error |   StdDev | Allocated |
|------------- |----------:|---------:|---------:|----------:|
| String.Split |  38.55 us | 0.768 us | 0.822 us |     48 KB |
| Regex        | 825.91 us | 9.690 us | 7.565 us |    317 KB |
| StringReader |  37.68 us | 0.211 us | 0.187 us |     70 KB |
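The regex is clearly awful. It’s more than 20 times slower than the other two, and allocates significantly more memory (317 KB).
String.Split and the StringReader are within a margin of error of each other on time (38.55 us vs 37.68 us), and String.Split allocated significantly less memory (48 KB vs 70 KB).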
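For reference, a String.Split version of the parser might look roughly like the sketch below. This is not the benchmarked code. It assumes that commas separate entries, that within an entry the first pipe-separated piece is the key and the rest are its values, and that whitespace around an entry can be trimmed. The `Parse` name is mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parser for the "key1, key2|value1|value2" format described above.
static Dictionary<string, List<string>> Parse(string input)
{
    var result = new Dictionary<string, List<string>>();

    foreach (var entry in input.Split(','))
    {
        // "key2|value1|value2" -> ["key2", "value1", "value2"]
        var parts = entry.Trim().Split('|');

        // Assumption: a repeated key simply overwrites the earlier entry.
        result[parts[0]] = parts.Skip(1).ToList();
    }

    return result;
}

var parsed = Parse("key1, key2|value1|value2");
foreach (var kv in parsed)
{
    Console.WriteLine($"{kv.Key}: [{string.Join(", ", kv.Value)}]");
}
```

That is the whole parser, and it is the kind of thing you can still understand 6 months later.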