Adding Utf8String class to directly hold utf8 strings passed via JSRT#5348
Adding Utf8String class to directly hold utf8 strings passed via JSRT#5348MSLaguana wants to merge 2 commits intochakra-core:masterfrom
Conversation
|
Is this the same as the previous PR, or slightly reduced? |
|
This is the same as the previous PR except I split out the |
d5ac264 to
f22a24b
Compare
| } | ||
| const bool isUtf8 = !isString && !(parseAttributes & JsParseScriptAttributeArrayBufferIsUtf16Encoded); | ||
|
|
||
| *script = isExternalArray ? |
There was a problem hiding this comment.
@boingoing I've refactored this method and added a use in CompileRun. As part of that I noticed that this method previously didn't cater for the case of an ExternalArrayBuffer with utf16 encoding; my change supports that, but I'm not sure if that was intentional or not.
There was a problem hiding this comment.
I think that case was an oversight. Thanks for cleaning this up, looks good.
Also refactoring JSRT script handling behavior to detect Utf8Strings and pass them through without conversion to functions which can already deal with utf8 sources. Ensuring that OOM handling is present when calling GetSz which may allocate
5e6374f to
7715a6a
Compare
|
|
||
| private: | ||
|
|
||
| void SetUtf8Buffer(_In_reads_(utf8Length) char* buffer, size_t utf8Length) |
There was a problem hiding this comment.
I very much prefer reading the code with it all in the header like this, but I think we're meant to put most function definitions in cpp files unless they're templates
| FieldNoBarrier(size_t) length; | ||
| Field(char*) buffer; | ||
| } PrefixedUtf8String; | ||
| Field(PrefixedUtf8String*) utf8String; |
There was a problem hiding this comment.
If PrefixedUtf8String is only two pointers big, could we just put it directly in the object rather than requiring a separate allocation (and a separate pointer dereference every time we use it)?
There was a problem hiding this comment.
Part of the original motivation of this was to allow other kinds of strings to convert themselves into a Utf8String, and when I was doing that one of the candidate string types only had 1 pointer's worth of space available, which is why I went down this path (see the ConvertString method which implicitly assumes that there is space for the one pointer available). However right now that's not happening, so I could undo that for now and re-do it later on if necessary.
There was a problem hiding this comment.
What had only one pointer's worth of space? All allocations are bumped up to the nearest 16 bytes anyway, so a 40-byte object actually gets allocated as 48
In reply to: 205622702 [](ancestors = 205622702)
There was a problem hiding this comment.
I believe it was specifically on 32bit one of the types was otherwise full... I don't recall which at the moment though.
| { | ||
| private: | ||
| typedef struct { | ||
| FieldNoBarrier(size_t) length; |
There was a problem hiding this comment.
Why size_t not charcount_t?
There was a problem hiding this comment.
This is the length of the utf8 string, which can be up to 3 times as long as the same string in utf16. Since a utf16 string can be up to ~2^31 in chakra, the corresponding utf8 string may be more than 2^32 in length, so utf8 lengths need to be size_t
There was a problem hiding this comment.
| return this->UnsafeGetBuffer(); | ||
| } | ||
|
|
||
| // TODO: This is currently wrong in the presence of unmatched surrogate pairs. |
There was a problem hiding this comment.
Nit: it might be nice to move this comment down to right before DecodeUnitsIntoAndNullTerminateNoAdvance, because in this position I originally though it was talking about the allocation size being wrong (which would be Very Bad).
There was a problem hiding this comment.
I think that's also not the 'correct' place for this comment; been a while since I wrote it. The actual issue is with the encoding into utf8, and determining how to decode it. As things stand I'm using 'actual' utf8 which doesn't allow lone surrogate halves, but we could instead use 'cesu-8' which does allow that.
There was a problem hiding this comment.
As far as I know canonical utf-8 doesn’t encode surrogate halves at all (paired or otherwise)—the character is encoded directly. If you need to preserve the surrogates independently I think cesu-8 is your only option.
|
|
||
| Assert(decodeLength == this->GetLength()); | ||
|
|
||
| buffer[this->GetLength()] = 0; |
There was a problem hiding this comment.
Seems to already be done by the AndNullTerminate portion of the call above; maybe just assert that it is zero?
| SetUtf8Buffer(buffer, utf8Length); | ||
|
|
||
| this->SetLength(originalString->GetLength()); | ||
| this->SetBuffer(originalString->UnsafeGetBuffer()); |
There was a problem hiding this comment.
This worries me. Maybe originalString->GetSz() instead?
- SubString's buffer is not guaranteed to be null-terminated
- SubString and PropertyString both rely on holding other pointers to keep their buffers alive, because the buffer itself is not recycler allocated
There was a problem hiding this comment.
My thinking was that this should keep the finalized-ness of the source string: if the original string isn't finalized, it won't have a buffer so we would be setting it to nullptr (and then if someone asks for it, we generate it from the utf8 version), and otherwise it was finalized so we can remain finalized. I hadn't considered the case when a string would have an invalid buffer like a substring would.
Using getSz would be safe, but could maybe flatten more strings than expected... but it avoids having to decode the utf8 in future anyway. I think I will go with that.
There was a problem hiding this comment.
I'd be fine with something that said explicitly "if finalized, GetSz, else null", but UnsafeGetBuffer is marked unsafe for a reason :)
In reply to: 205623819 [](ancestors = 205623819)
There was a problem hiding this comment.
Actually even with GetSz the PropertyString problem persists. It returns an interior pointer to a PropertyRecord, which is kept alive by the string itself. We risk it getting collected from under us with this setup.
In reply to: 205624092 [](ancestors = 205624092,205623819)
There was a problem hiding this comment.
Buffer ownership is also a very real problem with the conversion code: we swap out a vtable, overwrite our owning pointer with some other data, and then our data might get collected.
In reply to: 205624399 [](ancestors = 205624399,205624092,205623819)
| template <typename StringType> | ||
| static Utf8String * ConvertString(StringType * originalString, _In_reads_(utf8Length) char* buffer, size_t utf8Length) | ||
| { | ||
| VirtualTableInfo<Utf8String>::SetVirtualTable(originalString); |
There was a problem hiding this comment.
I kind of expected SetVirtualTable to be a template that static-asserted that the source type was big enough for the destination type, but I don't see any such assertion. Could you please add one? Otherwise this is concerning because LiteralString is not big enough to convert successfully.
| #include "Library/PropertyString.h" | ||
| #include "Library/SingleCharString.h" | ||
| #include "Library/Utf8String.h" | ||
| #include "Library/LazyJSONString.h" |
There was a problem hiding this comment.
Ah, this is left over from when it was combined with another change. Didn't notice that before.
| #include "Library/SingleCharString.h" | ||
| #include "Library/SubString.h" | ||
| #include "Library/BufferStringBuilder.h" | ||
| #include "Library/Utf8String.h" |
There was a problem hiding this comment.
This is already included in Runtime.h above, can we remove here?
| { | ||
| if (this->IsFinalized()) | ||
| { | ||
| return this->UnsafeGetBuffer(); |
There was a problem hiding this comment.
If this was the result of ConvertString from a SubString, then this buffer might not be null terminated, violating the rules of GetSz. This part I think is not actually dangerous, because any substring should have a null character eventually in the source string it came from, but we might end up reading a lot more data than intended.
|
|
||
| return ContextAPINoScriptWrapper([&](Js::ScriptContext *scriptContext, TTDRecorder& _actionEntryPopper) -> JsErrorCode { | ||
|
|
||
| Js::JavascriptString *stringValue = Js::LiteralStringWithPropertyStringPtr:: |
There was a problem hiding this comment.
Do we have any data on how many strings from JsCreateString end up used as property keys? We might possibly be regressing some scenarios by changing what is essentially a heuristic here.
| LoadScriptFlag_ExternalArrayBuffer); | ||
| } | ||
| else | ||
| if (error != JsNoError) |
There was a problem hiding this comment.
Surely we have some fancy macro for this
| { | ||
| PARAM_NOT_NULL(bufferVal); | ||
| const WCHAR *url; | ||
| JsErrorCode errorCode = ContextAPINoScriptWrapper_NoRecord([&](Js::ScriptContext *scriptContext) -> JsErrorCode { |
There was a problem hiding this comment.
Does our usual rule about { on new line not apply in lambdas? Just curious, I see some of both in this file so either way can be argued to be consistent with the surrounding style.
Last I looked this had a minor perf impact, but it lays some of the groundwork which could be expanded in future to handle more cases.