如果字符串在.NET中是不可变的,那么为什么Substring需要O(n)时间?
本文翻译自:If strings are immutable in .NET, then why does Substring take O(n) time?
Given that strings are immutable in .NET, I'm wondering why they have been designed such that string.Substring() takes O( substring.Length ) time, instead of O(1) ? 鉴于字符串在.NET中是不可变的,我想知道为什么它们的设计使得string.Substring()需要O( substring.Length )时间而不是O(1) ?
ie what were the tradeoffs, if any? 即,有什么权衡,如果有的话?
#1楼
参考:https://stackoom.com/question/SI8p/如果字符串在-NET中是不可变的-那么为什么Substring需要O-n-时间
#2楼
Java used to reference larger strings, but: Java用于引用更大的字符串,但是:
Java changed its behavior to copying as well, to avoid leaking memory. Java也将其行为改为复制 ,以避免泄漏内存。
I feel like it can be improved though: why not just do the copying conditionally? 我觉得它可以改进:为什么不只是有条件地进行复制?
If the substring is at least half the size of the parent, one can reference the parent. 如果子字符串至少是父字符串大小的一半,则可以引用父字符串。 Otherwise one can just make a copy. 否则就可以复制一份。 This avoids leaking a lot of memory while still providing a significant benefit. 这样可以避免泄漏大量内存,同时仍能提供显着的优势。
#3楼
None of the answers here addressed "the bracketing problem", which is to say that strings in .NET are represented as a combination of a BStr (the length stored in memory "before" the pointer) and a CStr (the string ends in a '\\0'). 这里的答案都没有解决“包围问题”,也就是说.NET中的字符串表示为BStr(存储在“指针之前”的内存中的长度)和CStr(字符串以字符串结尾)的组合'\\ 0')。
The string "Hello there" is thus represented as 字符串“Hello there”因此表示为
0B 00 00 00 48 00 65 00 6C 00 6F 00 20 00 74 00 68 00 65 00 72 00 65 00 00 00
(if assigned to a char* in a fixed -statement the pointer would point to the 0x48.) (如果在fixed语句中分配给char* ,则指针将指向0x48。)
This structure allows for fast lookup of the length of a string (useful in many contexts) and allows for the pointer to be passed in a P/Invoke to Win32 (or other) APIs which expect a null-terminated string. 此结构允许快速查找字符串的长度(在许多上下文中有用),并允许指针在P / Invoke中传递到Win32(或其他)API,这些API期望以null结尾的字符串。
When you do Substring(0, 5) the "oh, but I promised there'd be a null-character after the last character" rule says you need to make a copy. 当你做Substring(0, 5) ,“哦,但我承诺在最后一个字符之后会有一个空字符”规则说你需要复制一份。 Even if you got the substring at the end then there'd be no place to put the length without corrupting the other variables. 即使你在最后得到了子串,那么在没有破坏其他变量的情况下也没有地方可以放置长度。
Sometimes, though, you really do want to talk about "the middle of the string", and you don't necessarily care about the P/Invoke behavior. 但有时候,你真的想谈谈“字符串的中间部分”,而你并不一定关心P / Invoke的行为。 The recently added ReadOnlySpan structure can be used to get a no-copy substring: 最近添加的ReadOnlySpan结构可用于获取无副本子字符串:
string s = "Hello there";
ReadOnlySpan hello = s.AsSpan(0, 5);
ReadOnlySpan ell = hello.Slice(1, 3);
The ReadOnlySpan "substring" stores the length independently, and it does not guarantee that there's a '\\0' after the end of the value. ReadOnlySpan “substring”独立存储长度,并不保证值结束后有'\\ 0'。 It can be used in many ways "like a string", but it is not "a string" since it doesn't have either BStr or CStr characteristics (much less both of them). 它可以在很多方面“像字符串”一样使用,但它不是“字符串”,因为它没有BStr或CStr特性(更不用说它们两者)。 If you never (directly) P/Invoke then there's not much of a difference (unless the API you want to call doesn't have a ReadOnlySpan overload). 如果您从未(直接)P / Invoke那么没有太大的区别(除非您要调用的API没有ReadOnlySpan重载)。
ReadOnlySpan cannot be used as the field of a reference type, so there's also ReadOnlyMemory ( s.AsMemory(0, 5) ), which is an indirect way of having a ReadOnlySpan , so the same differences-from- string exist. ReadOnlySpan不能用作引用类型的字段,因此还有ReadOnlyMemory ( s.AsMemory(0, 5) ),这是一个具有ReadOnlySpan的间接方式,所以相同的差异-from- string存在。
Some of the answers/comments on previous answers talked about it being wasteful to have the garbage collector have to keep a million-character string around while you continue to talk about 5 characters. 以前的答案的一些答案/评论谈到,当你继续谈论5个字符时,让垃圾收集器必须保留一百万个字符的字符串是浪费的。 That is precisely the behavior you can get with the ReadOnlySpan approach. 这正是您使用ReadOnlySpan方法可以获得的行为。 If you're just doing short computations, the ReadOnlySpan approach is probably better. 如果您只是进行简短的计算,ReadOnlySpan方法可能会更好。 If you need to persist it for a while and you're going to keep only a small percentage of the original string, doing a proper substring (to trim off the excess data) is probably better. 如果你需要坚持一段时间并且你只保留原始字符串的一小部分,那么做一个合适的子字符串(以删除多余的数据)可能会更好。 There's a transition point somewhere in the middle, but it depends on your specific usage. 中间位置有一个转换点,但这取决于您的具体用法。
#4楼
Precisely because Strings are immutable, .Substring must make a copy of at least a portion of the original string. 正因为字符串是不可变的,。子.Substring必须复制原始字符串的至少一部分。 Making a copy of n bytes should take O(n) time. 制作n个字节的副本应该花费O(n)时间。
How do you think you would copy a bunch of bytes in constant time? 您认为如何在固定时间内复制一堆字节?
EDIT: Mehrdad suggests not copying the string at all, but keeping a reference to a piece of it. 编辑:Mehrdad建议不要复制字符串,而是保留对它的一部分的引用。
Consider in .Net, a multi-megabyte string, on which someone calls .SubString(n, n+3) (for any n in the middle of the string). 在.Net中考虑一个多兆字节的字符串,有人调用.SubString(n, n+3) (对于字符串中间的任何n)。
Now, the ENTIRE string cannot be Garbage Collected just because one reference is holding on to 4 characters? 现在,ENTIRE字符串不能仅仅因为一个引用持有4个字符而被收集垃圾? That seems like a ridiculous waste of space. 这似乎是一种荒谬的浪费空间。
Further, tracking references to substrings (which may even be inside substrings), and trying to copy at optimal times to avoid defeating the GC (as described above), makes the concept a nightmare. 此外,跟踪对子串的引用(甚至可能在子串内),并尝试在最佳时间复制以避免击败GC(如上所述),使得该概念成为一场噩梦。 It is far simpler, and more reliable, to copy on .SubString , and maintain the straightforward immutable model. 复制.SubString并且维护简单的不可变模型要简单得多,也更可靠。
EDIT: Here's a good little read about the danger of keeping references to substrings within larger strings. 编辑:这是一个很好的小读物,关于在较大的字符串中保持对子串的引用的危险。
#5楼
Java (as opposed to .NET) provides two ways of doing Substring() , you can consider whether you want to keep just a reference or copy a whole substring to a new memory location. Java(而不是.NET)提供了两种方式来执行Substring() ,您可以考虑是仅要保留引用还是将整个子字符串复制到新的内存位置。
The simple .substring(...) shares the internally used char array with the original String object, which you then with new String(...) can copy to a new array, if needed (to avoid hindering garbage collection of the original one). 简单的.substring(...)与原始的String对象共享内部使用的char数组,然后你可以根据需要将new String(...)复制到一个新数组(以避免阻碍原始的垃圾收集)一)。
I think this kind of flexibility is a best option for a developer. 我认为这种灵活性是开发人员的最佳选择。
#6楼
UPDATE: I liked this question so much, I just blogged it. 更新:我非常喜欢这个问题,我只是写了博客。 See Strings, immutability and persistence 请参阅字符串,不变性和持久性
The short answer is: O(n) is O(1) if n does not grow large. 简短的回答是: 如果n不变大,则O(n)为O(1)。 Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant . 大多数人从微小的字符串中提取微小的子串,因此渐进式渐近增长的方式完全无关紧要 。
The long answer is: 答案很长的答案是:
An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. 构建一个不可变的数据结构,使得实例上的操作允许仅使用少量(通常为O(1)或O(lg n))复制或新分配来重复使用原始内存,称为“持久”不可变数据结构。 Strings in .NET are immutable; .NET中的字符串是不可变的; your question is essentially "why are they not persistent"? 你的问题基本上是“为什么他们不坚持”?
Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. 因为当你看看那些在.NET程序字符串通常进行的操作,它是在每一个相关的方式都很难糟糕简单地做出一个全新的字符串。 The expense and difficulty of building a complex persistent data structure doesn't pay for itself. 构建复杂的持久数据结构的费用和难度并不能为其付出代价。
People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. 人们通常使用“substring”来提取一个短字符串 - 比如十个或二十个字符 - 用一个稍长的字符串 - 可能是几百个字符。 You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. 您在逗号分隔文件中有一行文本,并且您想要提取第三个字段,这是一个姓氏。 The line will be maybe a couple hundred characters long, the name will be a couple dozen. 该行可能是几百个字符长,名称将是几十个。 String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. 在现代硬件上,50字节的字符串分配和存储器复制速度惊人地快 。 That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; 这使得一个新的数据结构由一个指向现有字符串中间的指针加上一个长度组成,这也是惊人的快速无关紧要; "fast enough" is by definition fast enough. “足够快”的定义足够快。
The substrings extracted are typically small in size and short in lifetime; 提取的子串通常尺寸小,寿命短; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. 垃圾收集器很快就要收回它们,并且它们首先没有在堆上占用太多空间。 So using a persistent strategy that encourages reuse of most of the memory is also not a win; 因此,使用鼓励重用大部分内存的持久策略也不是一个胜利; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers. 你所做的就是让你的垃圾收集器变慢,因为现在它不得不担心处理内部指针。
If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. 如果人们通常对字符串执行的子字符串操作完全不同,那么使用持久方法是有意义的。 If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; 如果人们通常拥有百万字符的字符串,并且正在提取数千个大小在十万字符范围内的重叠子字符串,并且这些子字符串在堆上存在很长时间,那么使用持久子字符串将是完全合理的办法; it would be wasteful and foolish not to. 不要浪费和愚蠢。 But most line-of-business programmers do not do anything even vaguely like those sorts of things . 但是, 大多数业务线程序员甚至都不会做任何事情 。 .NET is not a platform that is tailored for the needs of the Human Genome Project; .NET不是专为满足人类基因组计划需求而定制的平台; DNA analysis programmers have to solve problems with those string usage characteristics every day; DNA分析程序员必须每天解决这些字符串使用特征的问题; odds are good that you do not. 你不这样做的几率很高。 The few who do build their own persistent data structures that closely match their usage scenarios. 少数人建立自己的持久数据结构,与他们的使用场景紧密匹配。
For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. 例如,我的团队编写的程序可以在您键入时对C#和VB代码进行实时分析。 Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. 其中一些代码文件非常庞大 ,因此我们无法进行O(n)字符串操作来提取子字符串或插入或删除字符。 We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. 我们构建了一组持久的不可变数据结构,用于表示对文本缓冲区的编辑,允许我们快速有效地重复使用大量现有字符串数据以及典型编辑时现有的词法和语法分析。 This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. 这是一个难以解决的问题,其解决方案针对C#和VB代码编辑的特定领域进行了狭窄的定制。 It would be unrealistic to expect the built-in string type to solve this problem for us. 期望内置字符串类型为我们解决这个问题是不现实的。
本文来自互联网用户投稿,文章观点仅代表作者本人,不代表本站立场,不承担相关法律责任。如若转载,请注明出处。 如若内容造成侵权/违法违规/事实不符,请点击【内容举报】进行投诉反馈!
