CAN A GREEK QUESTION MARK CAUSE A BUG IN CODE?
Right-to-left text in itself isn't the vulnerability: Arabic and Hebrew text is automatically laid out right-to-left because that is the default line ordering for those codes, and Latin text (including the keywords of programming languages) within larger blocks of Arabic and Hebrew is always laid out left-to-right by default. However, there are codepoints in Unicode which tell the text renderer to override that native ordering and instead render LTR as RTL and vice versa. This is the mechanism by which the vulnerability allows human reviewers to be fooled.

The example in the paper is this line of Python:

    ''' Subtract funds from account then ''' return
    ''' Subtract funds from account then ''' return

That second line is the exact sequence of codepoints they talk about in the paper, with the Right-to-Left Isolate code (U+2067) in place.

As you can see, the codes themselves are invisible, but they affect the display of the text following them, and that is how this vulnerability works: by making compilable code appear to be within a comment. But if you drag your selection cursor over the second line, you'll see something isn't quite what it seems. Most syntax highlighters will catch this (I just checked, and VS Code does), and you'd see a "code-coloured" word inside the comment, but it can be subtle, especially in editors that style the text within comments (e.g., for docstrings). If you use a terminal, your results are compounded further by how the terminal deals with bidirectional text.

The RLI code changes the layout default of the following characters to right-to-left, but it doesn't override the behaviour of characters that want to associate left-to-right. There's a different code (Right-to-Left Override, U+202E) which would force the renderer to ignore the character's native bidi mode and treat it as right-to-left regardless.

The idea of line layout defaults is tricky to get your head around if you don't already know the rules for typesetting text in a mix of right-to-left and left-to-right scripts. Basically, an English word within Arabic will be shown in its proper order: letters of the word running from left to right, so you see "mouse mat" not "tam esuom", for example.

To properly display text, the text renderer needs to figure out the line direction, but it only has a stream of characters to work with. To help, Unicode assigns each character a bidirectional class. Latin letters are "Strong Left-to-Right", but punctuation is not strongly ordered: it follows the rule of the surrounding text. An exclamation mark in Arabic would be to the left of the word it followed, while in English it would be to the right, but both symbols are coded as U+0021. To allow this to work right, Unicode includes a bidirectional classification of "neutral", which means that the character follows the already established line ordering.

In the example, the RLI control code changes the current layout intent of the line to "right-to-left" from that point onward in the code stream. Thus, the following characters, which are punctuation, adopt right-to-left ordering because they are all in the "neutral" bidi category and the renderer has been told that the layout is now right-to-left dominant. However, the strong left-to-right characters 'r e t u r n' are still rendered as left-to-right, because that's how they naturally associate even in text that is predominantly right-to-left.

The full rules are in the documentation for the Unicode Bidirectional Algorithm (UAX #9).

Unicode was created to prevent this thing from occurring: characters that look the same should encode to the same value irrespective of the language. The unification of similar-looking characters only happened when their functions were similar and when backwards compatibility with other character encodings was not a significant issue. For example, German „Ö“ and Swedish ”Ö” were unified into a single code point (U+00D6), despite being located in different places in their respective alphabets.
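If you want to poke at this yourself, here is a minimal Python sketch (my own illustration, not code from the paper). It uses the standard library's unicodedata.bidirectional() to show the bidi class of each character, and scans text for the explicit bidi control codes. The helper names bidi_classes and find_bidi_controls, and the sample string, are hypothetical and exist only for this example.

```python
import unicodedata

# Explicit bidi formatting characters defined by the Unicode Bidirectional
# Algorithm: embeddings, overrides, isolates and their terminators.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def bidi_classes(text):
    """Return (character, bidi class) pairs as reported by unicodedata."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def find_bidi_controls(source):
    """Return (line, column, name) for every bidi control in the source text.

    Line and column numbers are 1-based and count code points, not bytes.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits


if __name__ == "__main__":
    # Latin letters are strong left-to-right ("L"); punctuation is neutral ("ON").
    print(bidi_classes("r!;'"))

    # Hypothetical sample: an invisible RLI (U+2067) hidden in a line of text.
    sample = "x = 1\nthen \u2067''' ;return"
    print(find_bidi_controls(sample))
```

A scan like this only flags that a control code is present; it does not decide whether the use is legitimate, since these codes do have proper uses in genuinely mixed-direction text.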