Amit's Thoughts: Emacs Tree-sitter custom highlighting, part 3

Wednesday, March 05, 2025

In the last post I described how, for a particular project with certain naming conventions, I wanted to use emacs tree-sitter font-locking to highlight variables that might be used in the wrong context:

Screenshot showing two subscript expressions, one that seems suspect because it's an array indexed by t values that has an index that is named r

I found a way to make it work for simple expressions but it was relying on regular expression matching rather than tree-sitter parse trees. I set a goal of handling these types of expressions:

elevation_t[t]           // good
elevation_t[r]           // bug
elevation_t[t_from_r[r]] // good
elevation_t[r_from_t[t]] // bug
elevation_t[obj.t]       // good
elevation_t[obj.r]       // bug
elevation_t[t_from_r(r)] // good
elevation_t[r_from_t(t)] // bug
elevation_t[x.t_fn(x)]   // good
elevation_t[x.r_fn(x)]   // bug
elevation_t[x.t_arr[i]]  // good
elevation_t[x.r_arr[i]]  // bug
elevation_t[(t)]         // good
elevation_t[(r)]         // bug

What are my options for using tree-sitter to handle more cases? I decided to try the :pred feature of tree-sitter font lock. Using M-x treesit-explore-mode, I learned that the subscript expression looks like:

(subscript_expression 
   object: (_) 
   [
   index: (_)
   ]
)

That means I can match it with a rule that captures both the array and index as separate nodes:

(defun amitp-check-subscript (array index)
  (message "debug: %S %S" array index))

(defvar amitp/treesit-font-lock-typescript-indexing
  (treesit-font-lock-rules
   :language 'typescript
   :feature 'semantic-identifier
   :override t
   `(((subscript_expression
       object: (identifier) @array 
       index: (expression_statement) @index 
       (:pred amitp-check-subscript @array @index))))))

Those two arguments are treesit parse nodes. What can I do with them? The emacs manual has a page about how to navigate the parse tree and another page about querying the parse tree. In the docs found a function treesit-query-capture that can pattern match on a node and extract the parts.

I used that function to look for the identifier that most likely tells me what geometry type is being returned (r for region, s for side, t for triangle). Examples:

in t_foo_r(r), the identifier t_foo_r tells me this is likely a t (triangle)
in t_obj[i].r_elevation, the identifier r_elevation tells me this is likely an r (region)
in map[i].s_inner_t(t_queue[0]), the identifier s_inner_t tells me this is likely an s (side)

I think this is where tree-sitter can do better than regular expression based matching. I wrote a function to find the key identifier:

(defun amitp/treesit-top-typescript-node (node)
  "Try to find the identifier that represents a node's return value"
  (let ((results
         (treesit-query-capture
          node
          '(((identifier) @identifier)
            ((member_expression property: (property_identifier) @property))
            ((subscript_expression object: (_) @expr))
            ((call_expression function: (_) @expr))
            ((parenthesized_expression (_) @expr))))))
    (if (assoc 'expr results)
        (amitp/treesit-top-typescript-node (cdr (assoc 'expr results)))
      ;; prefer the property if tree-sitter matches both
      (or (cdr (assoc 'property results)) (cdr (assoc 'identifier results))))))

I wrote a function detect a mismatch:

(defun amitp-check-subscript-mismatch (array index)
  ;; note: my usual naming convention is amitp/foo but tree-sitter won't accept that
  (let* ((array-name (buffer-substring (treesit-node-start array) (treesit-node-end array)))
         (top (amitp/treesit-top-typescript-node index))
         (index-name (buffer-substring (treesit-node-start top) (treesit-node-end top))))
    (and
     (or (= (length array-name) 1) (equal "_" (substring array-name -2 -1)))
     (or (= (length index-name) 1) (equal "_" (substring index-name 1 2)))
     (not (equal (substring array-name -1) (substring index-name 0 1))))))

Then I used the tree-sitter :pred feature to only highlight if there's a mismatch:

(defface array-error)
(defface index-error)

(defvar amitp/treesit-font-lock-typescript-indexing
  (treesit-font-lock-rules
   :language 'typescript
   :feature 'semantic-identifier
   :override t
   '(((subscript_expression
       object: (_) @array-error
       index: (_) @index-error
       (:pred amitp-check-subscript-mismatch @array-error @index-error))))))
     ;; todo: add call_expression

Does it work? Yes!

Screenshot showing subscript expressions flagged if they seem like they might have a bug

But there's something a little weird here. Why do r+1 and 1+r get flagged even though I never actually handle binary operators in my amitp/treesit-top-typescript-node function? Something to investigate later.

The next feature I wanted was to highlight only the "type" character instead of the entire identifier. I couldn't figure out how, so I asked on Reddit, where /u/eleven_cupfuls gave me a solution: the @face-name can be a @function-name instead. Thank you!

Instead of using :pred to tell font-lock whether to highlight something, I need to highlight it myself. I changed the treesit font lock rules to call my function instead of applying a face:

(defvar amitp/treesit-font-lock-typescript-indexing
  (treesit-font-lock-rules
   :language 'typescript
   :feature 'semantic-identifier
   :override t
   '(((subscript_expression) @amitp-highlight-subscript-mismatch)
     ((call_expression) @amitp-highlight-subscript-mismatch))))

My function now receives the entire expression instead of receiving the array and index parts separately like the :pred version. So I need to find a way to extract the right information. I can use treesit-query-capture like before, right?

It was at this point that I learned treesit-query-capture pattern matches against the entire tree and not only the top node. That explains why it was finding r+1 and 1+r. It was looking for any identifier anywhere in the subtree. Oops.

I rewrote amitp/treesit-top-typescript-node to use treesit-node-type instead of treesit-query-capture:

(defun amitp/treesit-top-typescript-node (node)
  "Try to find the identifier that represents a node's return value"
  (pcase (treesit-node-type node)
    ("identifier" node)
    ("member_expression" (treesit-node-child-by-field-name node "property"))
    ("subscript_expression" (amitp/treesit-top-typescript-node (treesit-node-child-by-field-name node "object")))
    ("call_expression" (amitp/treesit-top-typescript-node (treesit-node-child-by-field-name node "function")))
    ("parenthesized_expression" (amitp/treesit-top-typescript-node (treesit-node-child node 1)))
    (_ nil)))

It would've been cleaner if pcase could pattern match against these nodes, but it's not too bad.

And then I wrote a new amitp-highlight-subscript-mismatch function to analyze subscript nodes (and function calls too) to make sure the names match:

(defun amitp-highlight-subscript-mismatch (node override start end &rest _)
  "Highlight mismatching variable names in amitp's geometry mesh code"
  (let* ((parts (pcase (treesit-node-type node)
                  ("subscript_expression" (list (treesit-node-child-by-field-name node "object")
                                               (treesit-node-child-by-field-name node "index")))
                  ("call_expression" (list (treesit-node-child-by-field-name node "function")
                                           (treesit-node-child
                                            (treesit-node-child-by-field-name node "arguments")
                                            1))))))
    (when parts
      (let* ((array (nth 0 parts))
             (index (nth 1 parts))
             (array-name (treesit-node-text array))
             (top (amitp/treesit-top-typescript-node index))
             (index-name (or (treesit-node-text array) ""))
             (face (if (and ;; todo also check (>= (length index-name) 1)
                        (or (= (length array-name) 1) (equal "_" (substring array-name -2 -1)))
                        (or (= (length index-name) 1) (s-match "[_0-9]" (substring index-name 1 2))))
                       (if (equal (substring array-name -1) (substring index-name 0 1))
                           'index-good
                         'index-bad)
                     nil)))
        (when face
          (treesit-fontify-with-override (- (treesit-node-end array) 2) (treesit-node-end array) face override)
          (treesit-fontify-with-override (treesit-node-start top) (min (treesit-node-end top) (+ 2 (treesit-node-start top))) face override))))))

It's a little bit long but it seems to work:

Screenshot showing subscript expressions in green or red depending on whether the names match

However, there are things I haven't figured out, like why certain expressions don't trigger the (subscript_expression) font lock rule:

r => mesh.foo_r[r]);     // matches
(r => mesh.foo_r[r]);    // matches
foo(mesh.foo_r[r]);      // matches
foo(r => mesh.foo_r[r]); // doesn't match

That's a problem for another day. I also want to:

apply these rules only to files in which I use that naming convention
tighten the matching to apply only to the letters r, s, t
remove the green "it's ok" color and only leave the red "warning" color

This was a fun adventure. I'm not sure this is the best approach but it was a nice excuse to play with emacs tree-sitter.

Labels: emacs

– Amit – Wednesday, March 05, 2025

Emacs Tree-sitter custom highlighting, part 3
From Amit’s Thoughts

0 comments: