Space Complexity Example
So I am wondering: when objects (or primitives) are created inside a for loop, how does that contribute to the space complexity? For instance, here's a code example:
public boolean checkUnique(String p) {
    int term = -1;
    int len = p.length();
    for (int i = 0; i < len; i++) {
        char c = p.charAt(i);
        StringBuilder sb = new StringBuilder(p.substring(0, i));
        sb.append(p.substring(i + 1, len));
        String str = sb.toString();
        if (str.indexOf(c) != term) {
            return false;
        }
    }
    return true;
}
So I'm trying to analyze the space complexity of this algorithm. It looks like it is O(n). Here's my reasoning: the number of iterations is equivalent to the input size, and in each iteration we are creating a StringBuilder object, hence we are creating StringBuilder objects in proportion to the input size. The same reasoning applies to the String objects and chars we create at each iteration. Is this reasoning correct?
The reason I ask is because I've encountered an algorithm where the following assignment is made after each iteration:
int val = str.charAt(i);
Yet this algorithm has O(1) space complexity. So my understanding must be incorrect then? In which case, the checkUnique algorithm has a space complexity of O(1) as well.
I'm going to go over the bad design decisions in this algorithm, and suggest a better one.
The existing answers answer the complexity-class question very well, but don't point out the error in your question: Your space complexity is O(N), because you make a copy of the entire input (minus one char).
If your loop held onto the temp copy from every iteration, the space complexity would match the time complexity: O(N*M), where M is the length of the shortest prefix that contains a duplicate. (Or M = N if there are no duplicates.)
The pigeonhole principle ensures M <= 2^16 (the number of unique values a char can have).
An optimization for any algorithm is to always return true if input.length() > Character.MAX_VALUE. (Or for codepoints, if input.length() > Character.MAX_CODE_POINT, which is 1114111.)
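For example (a sketch; strictly speaking there are Character.MAX_VALUE + 1 == 2^16 distinct char values, so the guaranteed-repeat threshold is one higher than MAX_VALUE itself):

if (p.length() > Character.MAX_VALUE + 1) {
    return true; // pigeonhole: more chars than distinct char values, so some char must repeat
}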
If much of your input is ASCII, M will be more like 80 at most. Actually, most languages don't use many different codepoints, even if the range doesn't start at 0. Non-alphabetic languages can have a couple thousand glyphs, I think. But anyway, the point is, it's useful to check for repeats in the early part of the string, rather than doing anything that scans the entire potentially-huge string if the first character happens to be unique.
Spoiler: adding characters to a Set is the best way by far to find repeats. O(M) time and space, with low constant-factor overhead.
Besides performance, keep in mind that Java chars are utf16, but some Unicode codepoints have a multi-char encoding. It's really unfortunate for Java that it got the worst of both worlds: doubled space usage compared to utf8 for ASCII, but still having to deal with multi-element "characters". When Java was designed, 16 bits was enough to hold any Unicode character, so utf16 did avoid the difficulties of multi-byte character encodings like utf8. Wide characters were popular for a while, and maybe still are on Windows, IDK. Unix / POSIX / Internet protocols have pretty much standardized on utf8 for everything.
It looks like the best way to iterate over codepoints is with:

int cp = str.codePointAt(pos);
pos += Character.charCount(cp);
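Put together as a complete loop (a sketch, assuming str is the input String):

// iterate over whole codepoints, advancing 1 or 2 chars at a time
for (int pos = 0; pos < str.length(); ) {
    int cp = str.codePointAt(pos);  // the full codepoint, even across a surrogate pair
    // ... process cp here ...
    pos += Character.charCount(cp); // 1 for BMP codepoints, 2 for supplementary ones
}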
Looping i=0..N and doing codePointAt(i) would probably have to scan from the beginning of the string every iteration to find surrogate pairs. A smart JVM might notice the redundancy and optimize, but I wouldn't count on it.
Analysis of the original algo
This basic design, of looping over every character and checking all the other characters, makes sense and is easy to think of. There's a lot of redundant work (checking both a==c and b==c when we already know a!=b), so it's going to be O(N^2) (see below for a diff algo), but we can implement it with a lot less constant overhead than your version.
Skipping i is very cheap. There are so many ways to avoid re-copying the string every time:
1. Copy it once to a char[] array, and take the current char out of consideration by modifying it (e.g. c = tmpbuf[i]++; search tmpbuf for c; tmpbuf[i]--).
2. Copy it once into a StringBuffer, and modify the current position like in step 1 with buf.setCharAt(int, char). Then you can use StringBuffer.indexOf() like you were doing before. Hrm, only String has specializations of indexOf for single chars, so .toString().indexOf() might be better. Same goes for StringBuilder: only indexOf(String). (Don't be tempted to use deleteCharAt() and insert(), because they're probably implemented by shuffling the remaining elements.)
3. Since the main library-function suggestion for array searching doesn't work on arrays of primitive types (like char), you could just loop manually and skip i. Depending on the JVM, looping over the input string manually with charAt might be just as fast.
4. Use one of the multi-arg versions of indexOf: String.indexOf(int ch, int fromIndex) to search from the beginning (expecting the search to stop at position i), and then from i+1 (expecting a not-found).
5. Use String.lastIndexOf(int ch) to search backwards, expecting it to return i. Or use lastIndexOf(ch, i-1) to search backwards towards the beginning of the string, and indexOf(ch, i+1) to search forwards.
p.indexOf(c, i+1) is the most obvious choice (see the sketch below). It doesn't stop very quickly if there is a duplicate pair early in the string. If performance for large N and small M matters, searching only the characters from 0 .. i-1 won't ever touch the memory holding the input characters beyond the first repeat. In the no-match case, you're just doing at the start what happens at the end of the other algorithm.
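As a concrete sketch of that indexOf(c, i+1) variant (using this answer's convention of returning true when a repeat exists, the opposite of the question's checkUnique):

// O(1) space, O(N^2) time: for each char, look for another occurrence later on
static boolean hasRepeat(String p) {
    for (int i = 0; i < p.length(); i++) {
        if (p.indexOf(p.charAt(i), i + 1) != -1) {
            return true; // this char occurs again later in the string
        }
    }
    return false;
}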
ASCII text input is probably going to be very common, and there are only about 100 different ascii characters, so there will very often be a repeat somewhere in the first 100 chars. However, the first few characters may not be one of those duplicates.
An even better quickstart might be to pick a tunable parameter like 256 or something (much smaller than CPU cache, but plenty big enough to have repeats), and search out to that point before starting to look at the whole array.
final int fastlen = Math.min(256, p.length()); // tunable; clamped so charAt stays in bounds
int i = 0;
for ( ; i < fastlen; i++) {
    char c = p.charAt(i);
    // any other occurrence of c within the first block means a repeat
    if (p.lastIndexOf(c, fastlen - 1) != i) return true;
    // maybe lastIndexOf(c, i + fastlen)? We're going to repeat work anyway, so what's a little more?
}
// i == fastlen if we haven't returned yet
for ( ; i < p.length(); i++) {
    char c = p.charAt(i);
    if (p.lastIndexOf(c, fastlen - 1) != -1 ||  // c also appears in the first block
        p.indexOf(c, i + 1) != -1)              // or later in the string
        return true;
}
return false;
You could probably get even more sophisticated, and keep working in blocks, but let's stop optimizing there because the whole algorithm is fundamentally slower than it needs to be.
A better algo
We can actually do it using O(M) temporary storage and O(M) time:

java.util.Set<Character> seen = new java.util.HashSet<>();
for (final char c : p.toCharArray()) { // loop over chars
    if (!seen.add(c)) return true;     // add() returns false if c was already present
}
return false;
You can speed this up even more by using a simple array for the ASCII range, and a HashMap only for c > 127. See my answer on a question about finding the most frequent character. This will work with Unicode codepoints, too. (String.indexOf(int ch) does accept a full codepoint, including supplementary ones, so the search-based methods can also be made codepoint-aware.)
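A sketch of that hybrid (using a HashSet for membership rather than a HashMap, since we only need presence; names are illustrative):

static boolean hasRepeatHybrid(String p) {
    boolean[] seenAscii = new boolean[128];  // direct-indexed for the common ASCII case
    java.util.Set<Character> seenOther = new java.util.HashSet<>(); // fallback for c > 127
    for (int i = 0; i < p.length(); i++) {
        char c = p.charAt(i);
        if (c < 128) {
            if (seenAscii[c]) return true;   // already saw this ASCII char
            seenAscii[c] = true;
        } else if (!seenOther.add(c)) {      // add() returns false on a duplicate
            return true;
        }
    }
    return false;
}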
A bitmap is also an option for implementing a Set that detects repeats. 2^16 bits is only 2^13 bytes, so you could test and set bits in an 8k bitmap until you found one that was already set. (This approach isn't good for codepoints, though.)
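A sketch of the bitmap version (chars only, as noted; 1024 longs = 2^16 bits = 8 KiB):

static boolean hasRepeatBitmap(String p) {
    long[] bitmap = new long[1 << 10];      // one bit per possible char value
    for (int i = 0; i < p.length(); i++) {
        char c = p.charAt(i);
        long mask = 1L << (c & 63);         // bit position within the 64-bit word
        if ((bitmap[c >> 6] & mask) != 0) return true; // bit already set: repeat found
        bitmap[c >> 6] |= mask;
    }
    return false;
}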
Alternative: Get the char array out of the string. Sort it. Then loop through, looking for buf[i] == buf[i-1]. This is O(N log N), though, and will be much slower than using a hashset. (The hashset method is like doing a RadixSort or BucketSort and detecting duplicates on the fly.) This has problems with utf16 multi-char codepoints, though. IDK how to efficiently sort an array of chars while preserving them.
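For completeness, a sketch of that sort-based version (char-level only, with the utf16 caveat just mentioned):

static boolean hasRepeatSorted(String p) {
    char[] buf = p.toCharArray();
    java.util.Arrays.sort(buf);                // O(N log N)
    for (int i = 1; i < buf.length; i++) {
        if (buf[i] == buf[i - 1]) return true; // equal neighbours after sorting = repeat
    }
    return false;
}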
In order to do complexity analysis, you have to be very clear about how your machine works. How is the machine going to run your code? What are the machine's capabilities?
There are at least two very similar ways a machine running that code could work, each of which would lead to a distinct answer to your question.
Suppose every new variable declaration results in a unique bit of memory being assigned and, once assigned, that memory cannot be reused. This might be like a tape memory, or like you writing down the steps in ink on paper. If you're doing it this way, the space complexity will indeed be proportional to the number of loop iterations, since you allocate memory in the body of the loop.
Suppose, instead, that new variable declarations use the first available bit of memory being assigned, and that this memory is released and free to be reassigned as soon as the variable goes out of scope. In this case, by the end of the function, all but a constant number of variables have gone out of scope, so the space complexity is constant.
Java has automatic garbage collection, so we might reasonably say we're in the second scenario, even with respect to heap-allocated memory (stack-allocated memory, like primitives, definitely works in the second way). In reality, garbage collection may not happen instantaneously in all cases, so we might be somewhere between the two cases. But in the happy case, we can probably safely say that in Java, this is O(1).
In C++, the story would be different. There, we'd need to new and delete (or equivalent) our heap-allocated memory to be in the second scenario; otherwise, we'd be in the first!
As you can see, a great deal depends on what the code really means, which can only be fully understood in terms of the system on which it executes.
I'll put aside the fact that this implementation of the algorithm is very poor.
You say:
The number of iterations is equivalent to the input size, and in each iteration we are creating a StringBuilder object...
So far so good.
... hence we are creating StringBuilder objects proportional to the input size.
Yes, that's also true. BUT. You don't keep those created objects from one iteration to the next. They disappear gradually.
In fact the compiler probably detects that the object's scope is limited to the body of the loop and optimizes memory usage so that the same storage is reused every time (it could even be a register for a small object like c in your code).
In conclusion, if the compiler works well, your algorithm is O(1).
If you had, at each iteration, put c or str in a list, things would have been different.
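For instance (a sketch), this variant really would keep every iteration's object reachable, making the space grow with the input:

java.util.List<String> kept = new java.util.ArrayList<>();
for (int i = 0; i < p.length(); i++) {
    String str = p.substring(0, i) + p.substring(i + 1); // the per-iteration copy
    kept.add(str); // each String stays reachable: O(N) objects (O(N^2) chars) live at once
}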