Adding Utf8String class to directly hold utf8 strings passed via JSRT by MSLaguana · Pull Request #5348 · chakra-core/ChakraCore

MSLaguana · 2018-06-22T01:13:36Z

Last I looked this had a minor perf impact, but it lays some of the groundwork which could be expanded in future to handle more cases.

dilijev · 2018-06-22T02:11:38Z

Is this the same as the previous PR, or slightly reduced?

MSLaguana · 2018-06-22T02:42:42Z

This is the same as the previous PR, except that I split out the JSONString changes into another PR. Looks like I'll need to fix up a couple of test failures too.

MSLaguana · 2018-07-25T18:28:09Z

lib/Jsrt/Jsrt.cpp

    }
    const bool isUtf8 = !isString && !(parseAttributes & JsParseScriptAttributeArrayBufferIsUtf16Encoded);

-    *script = isExternalArray ?


@boingoing I've refactored this method and added a use in CompileRun. As part of that I noticed that this method previously didn't cater for the case of an ExternalArrayBuffer with utf16 encoding; my change supports that, but I'm not sure if that was intentional or not.

I think that case was an oversight. Thanks for cleaning this up, looks good.
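For readers following along, the encoding decision in the `isUtf8` line above can be sketched in isolation. This is only an illustration: `ParseAttributes`, `SourceEncoding`, and `ClassifySource` are hypothetical names invented here, and the flag value is a stand-in rather than the real JSRT definition.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; the real
// JsParseScriptAttributes enum lives in the JSRT headers.
enum ParseAttributes : unsigned
{
    ParseAttributeNone = 0x0,
    ParseAttributeArrayBufferIsUtf16Encoded = 0x2
};

enum class SourceEncoding { Utf8, Utf16 };

// Mirrors the condition in the diff: a source is treated as UTF-8 only when
// it is not a JS string and the caller did not flag the buffer as UTF-16.
inline SourceEncoding ClassifySource(bool isString, unsigned parseAttributes)
{
    const bool isUtf8 = !isString
        && !(parseAttributes & ParseAttributeArrayBufferIsUtf16Encoded);
    return isUtf8 ? SourceEncoding::Utf8 : SourceEncoding::Utf16;
}
```

Under these assumptions, an array buffer with no attributes classifies as UTF-8, while a JS string or a buffer flagged as UTF-16 classifies as UTF-16, regardless of whether the buffer is external.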

Also refactoring JSRT script handling behavior to detect Utf8Strings and pass them through, without conversion, to functions which can already deal with utf8 sources. Ensuring that OOM handling is present when calling GetSz, which may allocate.

sethbrenith · 2018-07-26T22:12:35Z

lib/Runtime/Library/Utf8String.h

+
+    private:
+
+        void SetUtf8Buffer(_In_reads_(utf8Length) char* buffer, size_t utf8Length)


I very much prefer reading the code with it all in the header like this, but I think we're meant to put most function definitions in cpp files unless they're templates.

sethbrenith · 2018-07-26T22:15:07Z

lib/Runtime/Library/Utf8String.h

+            FieldNoBarrier(size_t) length;
+            Field(char*) buffer;
+        } PrefixedUtf8String;
+        Field(PrefixedUtf8String*) utf8String;


If PrefixedUtf8String is only two pointers big, could we just put it directly in the object rather than requiring a separate allocation (and a separate pointer dereference every time we use it)?

Part of the original motivation of this was to allow other kinds of strings to convert themselves into a Utf8String, and when I was doing that one of the candidate string types only had 1 pointer's worth of space available, which is why I went down this path (see the ConvertString method which implicitly assumes that there is space for the one pointer available). However right now that's not happening, so I could undo that for now and re-do it later on if necessary.

What had only one pointer's worth of space? All allocations are bumped up to the nearest 16 bytes anyway, so a 40-byte object actually gets allocated as 48

In reply to: 205622702 [](ancestors = 205622702)

I believe it was specifically on 32bit one of the types was otherwise full... I don't recall which at the moment though.
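The rounding argument above can be checked with a little arithmetic. A minimal sketch, assuming the 16-byte granularity stated in the comment; `kAllocationGranularity`, `RoundUpToGranularity`, and `SlackBytes` are hypothetical names, not recycler APIs.

```cpp
#include <cassert>
#include <cstddef>

// Assumption taken from the review comment: allocations are rounded up to
// the next multiple of 16 bytes.
constexpr std::size_t kAllocationGranularity = 16;

// Round a requested size up to the allocation granularity (a power of two).
constexpr std::size_t RoundUpToGranularity(std::size_t size)
{
    return (size + kAllocationGranularity - 1) & ~(kAllocationGranularity - 1);
}

// Bytes that are allocated but unused by an object of the given size.
inline std::size_t SlackBytes(std::size_t size)
{
    assert(size > 0);
    return RoundUpToGranularity(size) - size;
}
```

With these definitions a 40-byte object rounds to 48 and has 8 bytes of slack, which is one pointer on a 64-bit build. That is the basis for the suggestion that an extra inline field may cost nothing in practice.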

sethbrenith · 2018-07-26T22:15:25Z

lib/Runtime/Library/Utf8String.h

+    {
+    private:
+        typedef struct {
+            FieldNoBarrier(size_t) length;


Why size_t not charcount_t?

This is the length of the utf8 string, which can be up to 3 times as long as the same string in utf16. Since a utf16 string can be up to ~2^31 in chakra, the corresponding utf8 string may be more than 2^32 in length, so utf8 lengths need to be size_t

good point, thanks

In reply to: 205622942 [](ancestors = 205622942)
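To make the size argument concrete, here is a small sketch of the byte-count arithmetic. `Utf8ByteCount` and `WorstCaseUtf8Bytes` are hypothetical helpers written for this note, not ChakraCore functions, and lone surrogates are simply counted as 3 bytes each.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed for a UTF-16 string.
inline std::size_t Utf8ByteCount(const std::u16string& s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t c = s[i];
        if (c < 0x80)
        {
            bytes += 1;                     // ASCII
        }
        else if (c < 0x800)
        {
            bytes += 2;
        }
        else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
                 && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
            bytes += 4;                     // surrogate pair: 2 units, 4 bytes
            ++i;
        }
        else
        {
            bytes += 3;                     // rest of the BMP (and lone surrogates here)
        }
    }
    // Invariant behind the review comment: never more than 3 bytes per unit.
    assert(bytes <= 3 * s.size());
    return bytes;
}

// Worst-case byte count for a given number of UTF-16 code units,
// computed in 64 bits so the multiplication cannot wrap.
inline std::uint64_t WorstCaseUtf8Bytes(std::uint32_t utf16Length)
{
    return 3ull * utf16Length;
}
```

For a string of about 2^31 code units the worst case is about 3 * 2^31 bytes, which does not fit in a 32-bit count.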

sethbrenith · 2018-07-26T22:23:49Z

lib/Runtime/Library/Utf8String.h

+                return this->UnsafeGetBuffer();
+            }
+
+            // TODO: This is currently wrong in the presence of unmatched surrogate pairs.


Nit: it might be nice to move this comment down to right before DecodeUnitsIntoAndNullTerminateNoAdvance, because in this position I originally thought it was talking about the allocation size being wrong (which would be Very Bad).

I think that's also not the 'correct' place for this comment; been a while since I wrote it. The actual issue is with the encoding into utf8, and determining how to decode it. As things stand I'm using 'actual' utf8 which doesn't allow lone surrogate halves, but we could instead use 'cesu-8' which does allow that.

As far as I know canonical utf-8 doesn’t encode surrogate halves at all (paired or otherwise)—the character is encoded directly. If you need to preserve the surrogates independently I think cesu-8 is your only option.
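The difference between the two encodings is easiest to see on a single supplementary character. The sketch below is illustrative only; the function names are made up for this note and are not part of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append the 3-byte form used for values in [0x800, 0xFFFF].
inline void AppendThreeBytes(std::string& out, std::uint32_t v)
{
    out += static_cast<char>(0xE0 | (v >> 12));
    out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (v & 0x3F));
}

// Standard UTF-8: a supplementary code point (U+10000..U+10FFFF) is encoded
// directly as one 4-byte sequence; no surrogates appear in the output.
inline std::string EncodeSupplementaryUtf8(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}

// CESU-8: split the code point into its UTF-16 surrogate pair first, then
// encode each surrogate separately as 3 bytes (6 bytes in total).
inline std::string EncodeSupplementaryCesu8(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;
    const std::uint32_t high = 0xD800 + (v >> 10);
    const std::uint32_t low = 0xDC00 + (v & 0x3FF);
    std::string out;
    AppendThreeBytes(out, high);
    AppendThreeBytes(out, low);
    return out;
}
```

For U+1F600 the first function yields `F0 9F 98 80` and the second yields `ED A0 BD ED B8 80`. Because CESU-8 encodes each surrogate on its own, the same 3-byte form can also represent a lone surrogate half, which standard UTF-8 does not permit.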

sethbrenith · 2018-07-26T22:25:24Z

lib/Runtime/Library/Utf8String.h

+
+            Assert(decodeLength == this->GetLength());
+
+            buffer[this->GetLength()] = 0;


Seems to already be done by the AndNullTerminate portion of the call above; maybe just assert that it is zero?

sethbrenith · 2018-07-26T22:28:57Z

lib/Runtime/Library/Utf8String.h

+            SetUtf8Buffer(buffer, utf8Length);
+
+            this->SetLength(originalString->GetLength());
+            this->SetBuffer(originalString->UnsafeGetBuffer());


This worries me. Maybe originalString->GetSz() instead?

SubString's buffer is not guaranteed to be null-terminated

SubString and PropertyString both rely on holding other pointers to keep their buffers alive, because the buffer itself is not recycler allocated

My thinking was that this should keep the finalized-ness of the source string: if the original string isn't finalized, it won't have a buffer so we would be setting it to nullptr (and then if someone asks for it, we generate it from the utf8 version), and otherwise it was finalized so we can remain finalized. I hadn't considered the case when a string would have an invalid buffer like a substring would.

Using GetSz would be safe, but could maybe flatten more strings than expected... but it avoids having to decode the utf8 in future anyway. I think I will go with that.

I'd be fine with something that said explicitly "if finalized, GetSz, else null", but UnsafeGetBuffer is marked unsafe for a reason :)

In reply to: 205623819 [](ancestors = 205623819)

Actually even with GetSz the PropertyString problem persists. It returns an interior pointer to a PropertyRecord, which is kept alive by the string itself. We risk it getting collected from under us with this setup.

In reply to: 205624092 [](ancestors = 205624092,205623819)

Buffer ownership is also a very real problem with the conversion code: we swap out a vtable, overwrite our owning pointer with some other data, and then our data might get collected.

In reply to: 205624399 [](ancestors = 205624399,205624092,205623819)

sethbrenith · 2018-07-26T22:32:35Z

lib/Runtime/Library/Utf8String.h

+        template <typename StringType>
+        static Utf8String * ConvertString(StringType * originalString, _In_reads_(utf8Length) char* buffer, size_t utf8Length)
+        {
+            VirtualTableInfo<Utf8String>::SetVirtualTable(originalString);


I kind of expected SetVirtualTable to be a template that static-asserted that the source type was big enough for the destination type, but I don't see any such assertion. Could you please add one? Otherwise this is concerning because LiteralString is not big enough to convert successfully.
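One possible shape for the requested check, as a rough sketch: `SmallString`, `BigString`, `CanConvertInPlace`, and `ConvertInPlaceChecked` are all hypothetical stand-ins, and the real `VirtualTableInfo<T>::SetVirtualTable` may need a different signature.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for illustration; the real Js:: string types differ.
struct SmallString { void* vtable; void* buffer; };
struct BigString   { void* vtable; void* buffer; void* extra; std::size_t length; };

// True when an object of type Src has enough room to be reused as a Dest.
template <typename Dest, typename Src>
struct CanConvertInPlace
    : std::integral_constant<bool, (sizeof(Src) >= sizeof(Dest))>
{
};

// In-place conversion that refuses to compile when the source is too small.
template <typename Dest, typename Src>
Dest* ConvertInPlaceChecked(Src* source)
{
    static_assert(CanConvertInPlace<Dest, Src>::value,
        "source object is too small to be reinterpreted as the destination type");
    assert(source != nullptr);
    // Real code would swap the vtable and initialize Dest's fields here.
    return reinterpret_cast<Dest*>(source);
}
```

With this in place, `ConvertInPlaceChecked<SmallString>(&big)` compiles for a `BigString big`, while `ConvertInPlaceChecked<BigString>(&small)` fails at compile time instead of silently writing past the end of the object.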

sethbrenith · 2018-07-26T22:33:44Z

lib/Runtime/Runtime.h

 #include "Library/PropertyString.h"
 #include "Library/SingleCharString.h"
+#include "Library/Utf8String.h"
+#include "Library/LazyJSONString.h"


Ah, this is left over from when it was combined with another change. Didn't notice that before.

sethbrenith · 2018-07-26T22:34:09Z

lib/Runtime/Library/RuntimeLibraryPch.h

 #include "Library/SingleCharString.h"
 #include "Library/SubString.h"
 #include "Library/BufferStringBuilder.h"
+#include "Library/Utf8String.h"


This is already included in Runtime.h above, can we remove here?

sethbrenith · 2018-07-26T22:48:47Z

lib/Runtime/Library/Utf8String.h

+        {
+            if (this->IsFinalized())
+            {
+                return this->UnsafeGetBuffer();


If this was the result of ConvertString from a SubString, then this buffer might not be null terminated, violating the rules of GetSz. This part I think is not actually dangerous, because any substring should have a null character eventually in the source string it came from, but we might end up reading a lot more data than intended.

sethbrenith · 2018-07-26T22:55:35Z

lib/Jsrt/Jsrt.cpp


    return ContextAPINoScriptWrapper([&](Js::ScriptContext *scriptContext, TTDRecorder& _actionEntryPopper) -> JsErrorCode {

-        Js::JavascriptString *stringValue = Js::LiteralStringWithPropertyStringPtr::


Do we have any data on how many strings from JsCreateString end up used as property keys? We might possibly be regressing some scenarios by changing what is essentially a heuristic here.

sethbrenith · 2018-07-26T22:57:37Z

lib/Jsrt/Jsrt.cpp

-            LoadScriptFlag_ExternalArrayBuffer);
-    }
-    else
+    if (error != JsNoError)


Surely we have some fancy macro for this

sethbrenith · 2018-07-26T22:59:25Z

lib/Jsrt/Jsrt.cpp

 {
    PARAM_NOT_NULL(bufferVal);
    const WCHAR *url;
+    JsErrorCode errorCode = ContextAPINoScriptWrapper_NoRecord([&](Js::ScriptContext *scriptContext) -> JsErrorCode {


Does our usual rule about { on new line not apply in lambdas? Just curious, I see some of both in this file so either way can be argued to be consistent with the surrounding style.

MSLaguana force-pushed the jsUtf8String_rebase branch from d5ac264 to f22a24b Compare July 25, 2018 16:22

MSLaguana requested a review from boingoing July 25, 2018 18:26

MSLaguana added 2 commits July 26, 2018 09:19

Adding Js::Utf8String as a string type which maintains a UTF8 buffer.

191fd92

MSLaguana force-pushed the jsUtf8String_rebase branch from 5e6374f to 7715a6a Compare July 26, 2018 16:23

MSLaguana requested a review from digitalinfinity July 26, 2018 20:22

sethbrenith reviewed Jul 26, 2018

sethbrenith suggested changes Jul 26, 2018

rhuanjl mentioned this pull request Jan 9, 2021

Consider UTF-8? #6570

Open


5 participants