Josh Goldberg

TypeScript Contribution Diary: Identifiers after Numeric Literals

Dec 7, 201820 minute read

How I made TypeScript's parsing of name characters after numbers conform to the ECMAScript specification just a bit more strictly.

Problem Statement It’s been a while since I poked into TypeScript and I had some free time, so I wanted to challenge myself. This contribution involved going a bit deeper into TypeScript parsing and scanning. To find the issue, I had searched for issues with both the Effort: Moderate and help wanted tags, then sorted by oldest (issues query link). One from 2015 seemed interesting:

#4702: IdentifierStart cannot occur following a NumericLiteral

Rephrasing the issue post: according to the ECMA specification, section 11.8.3 — NumericLiterals, the SourceCharacter after a NumericLiteral must not be an IdentifierStart. For example:

tl;dr

You need a space between number literals (3) and identifiers or keywords (a, in). 3in is bad; 3 in is allowed.

Terminology

What on Earth do those terms mean?

For NumericLiteral and IdentifierStart, you can see the equivalent NumericLiteral and Identifier nodes by copy & pasting into ts-ast-viewer:

let value = 7;

It shows that value is an Identifier node (in this case, the name of a variable) and 7 is a NumericLiteral.

If you copy a simple example like 3in[null] into ts-ast-explorer, it shows a tree with nodes (and syntax highlighting) equivalent to 3 in [null]. Looks like the goal for this issue is to add a check after the numeric literal to make sure there’s not what could be the start of an identifier.

Enter the Compiler

How does TypeScript parse in a source file and convert that into an AST?

According to the Basarat Gitbook, that’s the well-named src/parser.ts and src/scanner.ts in the TypeScript source.

Starting Point

How does TypeScript launch the parsing and scanning of a file?

Skimming through parser.ts in order, we see…

  1. A few type declarations, such as enum Signature Flags
  2. Public node utilities such as visitNodes and forEachChild
  3. A createSourceFile function that takes file name, text, and some other things to create a SourceFile

…that seems promising! It also has some performance.markers around parsing, which we can assume help log timing around how long important operations take. Seems like a good place to look into. In order, this function:

  1. Calls to Parser.parseSourceFile (that’s a good sign too!), which…
  2. Calls parseSourceFileWorker for non-JSON files, which…
  3. Creates a new blank sourceFile with createSourceFile, processes pragmas, creates sourceFile.statements with parseList, and sets some metadata on the source file.

I don’t know or care about pragmas or miscellaneous source file metadata, but based on AST viewing websites, the statements property contains the root-level nodes of the file. parseList must therefore be the place that creates these nodes, so it must call things that parse nodes.

parseList takes in a parseElement lambda that it passes to parseListElement, which will either call some consumeNode function or parseElement.

Question: which is relevant? Do we care about consumeNode or parseElement?

Aside: Debugging

At this point, someone capable with JavaScript tooling might attach debugger breakpoints and debug through TypeScript to answer that question.

Not me!

I just wanted to quickly modify the compiled TypeScript code and re-run it to print out which things are being used without having to worry about setting up debugging (about which I am clueless). My setup instead consisted of:

index.ts contained a simple test case:

let hasFour = 3in[null]

With this setup, running tsc index.ts in C:/Code/tsdevtest uses C:/Code/typescript/lib/tsc.js.

I also added the following line at the beginning of createSourceFile:

global.wat = fileName === "index.ts";

That line creates a global variable named wat which is only true for files named index.ts. Later on in the code, running wat && console.log(":)") would only print while parsing index.ts.

This is useful because TypeScript will also compile files containing built-in types, such as lib.d.ts. Running with --noLib removes them but results in “Cannot find global type Array” and similar errors if parsing succeeds.

Back to the Hunt

…anyway, we wanted to see how nodes are parsed out of the source file, and were looking at parseListElement to see which sub-function it calls. I put a wat && console.log("consumeNode") before return consumeNode(node); and a wat && console.log("parseElement") before return parseElement();.

😲 Only parseElement was being printed, so only parseStatement was being called to. It contains a switch statement that, depending on the value of token(), can call a bunch of different parse* functions, or default to parseExpressionOrLabeledStatement.

The token function returns a variable named currentToken, of type SyntaxKind, so it seems safe to assume this is the current syntax kind being parsed. It’s parsed with scanner.scan, which we know from the Basarat Gitbook Scanner page starts at the starting index of a node and finds the end of it. scan contains another switch statement over a ch variable, which is the character code of the SourceCharacter being looked at.

Given that the relevant contents for parsing start with 3in, ch would have to be "3".charCodeAt(0), or 51, for the start of parsing 3 as a NumericLiteral. tsc.js showed a case 51: as compiled from an equivalent line in scanner.ts: case CharacterCodes._3:. That line calls to scanNumber.

…so the function that scans a number to be parsed is scanNumber. 💪

Number Scanning

How does scanNumber scan source code into NumericLiterals?

scanNumber was about 50 lines of real code and kind of hard to understand. My first guesses squinting through its first three root-level if statements were that they, in order:

  1. Check if the number starts with a ., such as with .5
  2. Check if the number is a “scientific” number, meaning it starts with E or e, such as with 1e3 (1000)
  3. Check if the number has a _ separator

The last if/else pair checks whether the number is a decimal or scientific:

Awesome! Here is the one place where a NumericLiteral may be created. Just be sure, I added a wat && console.log("Result:", result) at the end:

Confirmed. Great.

Checking for Identifiers

How can we know whether the next token is an identifier?

In theory, we could check if the character code of the next thing is an a-Z letter, but there are almost certainly esoteric rules around that. How does the scanner scan in an identifier, rather than a numeric literal? Presumably there must be some internal TypeScript logic that checks whether we’re starting with an identifier?

Going back to scan, the default case for any character code not handled as a special character immediately calls to an isIdentifierStart function. isIdentifierStart checks if the character code is within A-Z or a-z, is $ or _ (which are valid name starters), is a high-valued ASCII character, or is a unicode identifier. I don’t know what those last two mean but that looks like the right place!

Reinforced Number Scanning

Per a previous post of mine that included how to add diagnostics, I made a new diagnostic:

"An identifier cannot follow a numeric literal.": {
	"category": "Error",
	"code": 1351
},

…and in scanNumber, before the if/else returns, added:

if (isIdentifierStart(text.charCodeAt(pos), languageVersion)) {
	error(Diagnostics.An_identifier_cannot_follow_a_numeric_literal, pos, 1);
}

Upon running tsc index.ts, voila! The error showed!

Test Time

Per my previous post on verifying parser changes, TypeScript has a “baselines” system where you check in source files with expected types and errors. I added a tests/cases/compiler/identifierStartAfterNumericLiteral.ts to explicitly test these changes:

let valueIn = 3in[null]

…but after re-running jake baseline, there were new errors found in a bunch of files, such as bigIntIndex.errors.txt. 😢

…which reminded me, BigInts such as 1n are valid numbers I should verify still work. Adding a 3in[null], 3nin[null], 1e9, and 1n to index.ts:

😕. It looks like TypeScript thought in the second and third cases that n was starting a new identifier instead of whatever was after the number.

A few minutes of cursing and poking around scanNumber later, I noticed that checkBigIntSuffix increments pos for the case of it finding an n suffix. Brain blast! 💡 Aha! We can’t know what the end position of the number is until after we’ve scanned past any last n; checking for a subsequent IdentifierStart therefore has to come after.

I extracted the new logic into a checkForIdentifierAfterNumericLiteral method and called it immediately before each return in scanNumber.

The Pull Request

Added error for IdentifierStart immediately after a NumericLiteral by Josh Goldberg · Pull Request #28857:

A couple pieces of feedback were given on the pull request:

Sounds reasonable. I updated the error message pretty simply. Expanding the red squiggly was a little less straightforward…

Full Identifier Complaints

In order to have the error messages’ span length contain the full identifier, it’s necessary to know how long the identifier is. That necessitates reading in the identifier using whatever logic TypeScript would normally use.

I didn’t already know where to find this, but numbers are scanned in with scanNumber, it made sense to first check for scanIdentifier.

…which doesn’t exist…

…but a scanIdentifierParts does exist, which consists of a while loop that continuously advances as long as it finds characters that match a function named isIdentifierPart or some complicated things following a character matching CharacterCodes.backslash. Seems legit.

One issue with scanning for an identifier is that the scanIdentifierParts function updates pos normally, even though we didn’t want to actually skip parsing the new identifier. I solved this by keeping track of pos’ original value and restoring it after the function was called.

I added a few test cases for identifiers of longer lengths and updated the PR.

The PR was merged in and a followup issue filed to help improve the diagnostic messages even more. Hooray! 🎉

Maybe I’ll tackle that other issue another day…

Key Takeaways

It’s worth it to help your users understand their exact errors. Thanks danielrossenwaser and sheetalkamat for the rapid reviews!

A Note on Streamlining

It was commented to me after my last two posts that the progression of steps documented in them could make it appear that I effortlessly glided through the investigations linearly, without any real “I’m stumped.” moments. Not so! I intentionally omit dead ends in the from these blog posts because they would be seven times as long. I have little to no idea what I’m doing. 💯