Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAST(VARCHAR as JSON) should escape Unicode characters #7118

Closed
mbasmanova opened this issue Oct 18, 2023 · 0 comments
Closed

CAST(VARCHAR as JSON) should escape Unicode characters #7118

mbasmanova opened this issue Oct 18, 2023 · 0 comments
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments

@mbasmanova
Copy link
Contributor

Bug description

Velox currently doesn't escape Unicode characters when casting to JSON.

presto> select U&'\+01F64F';
 _col0
-------
 🙏
(1 row)

presto> select cast(U&'\+01F64F' as json);
     _col0
----------------
 "\uD83D\uDE4F"
(1 row)

Velox:

testCastToJson<StringView>(VARCHAR(), {"\U0001F64F"}, {"\"\\ud83d\\ude4f\""});

at 0: expected "\ud83d\ude4f", but got "🙏"

Velox uses folly::json::escapeString to case string to json. This function allows to specify configuration options including

  // If true, non-ASCII utf8 characters would be encoded as \uXXXX:
  // - if the code point is in [U+0000..U+FFFF] => encode as a single \uXXXX
  // - if the code point is > U+FFFF => encode as 2 UTF-16 surrogate pairs.
  bool encode_non_ascii{false};

The only difference with Presto is that folly::json::escapeString uses lowercase hex digits, while Presto uses uppercase.

CC: @zacw7 @aditi-pandit @amitkdutta @kagamiori @kevinwilfong

Related: FasterXML/jackson-core#717

System information

n/a

Relevant logs

No response

@mbasmanova mbasmanova added bug Something isn't working triage Newly created issue that needs attention. labels Oct 18, 2023
mbasmanova added a commit to mbasmanova/velox-1 that referenced this issue Oct 19, 2023
)

Summary:
Partially fixes facebookincubator#7118

Folly uses lowercase hex digits, but Jackson library used in Presto uses uppercase hex digits.


Reviewed By: xiaoxmeng

Differential Revision: D50406129

Pulled By: mbasmanova
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Newly created issue that needs attention.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant