Update on tree-sitter structure navigation

#307817

Author: Yuan Fu
Date: Fri, 01 Sep 2023 22:01

71 lines
3483 bytes

Hey guys,

In the months after wrapping up the tree-sitter work in emacs-29, I was thinking about how to implement structural navigation and how to extract information from the parser with tree-sitter. In emacs-29 we have things like treesit-beginning/end-of-defun and treesit-defun-name. I was thinking maybe we can generalize this to support getting an arbitrary “thing” at point, moving between things, and getting information like the name of a defun, its arglist, the parent of a class, the type of a variable declaration, etc., in a language-agnostic way.

Also, at the time, we only supported defining things by a regexp matching a node’s type, which is often not enough.

And it would be nice to somehow take advantage of tree-sitter queries for the features I mentioned above. Tree-sitter queries are what every other editor uses for virtually all tree-sitter-related features. But in Emacs, we mostly use them only for font-lock.

Here’s the progress as of now:

- Functions like treesit-search-forward, treesit-induce-sparse-tree, treesit-thing-at-point, treesit--navigate-thing, etc., support a richer set of predicates now. Besides a regexp matching the node type, the predicate can also be a predicate function, or (REGEXP . FUNC), or a compound predicate like (or PRED PRED) or (not PRED).

- There’s now a variable treesit-thing-settings, which holds the definitions of things. Instead of passing the predicate to the functions I mentioned above, you can save the predicate in treesit-thing-settings under a symbol, say ‘sexp’, and pass the symbol instead, just like thing-at-point.el. (We’ll work on integrating with thing-at-point.el later.)

- I can’t think of a good way to integrate tree-sitter queries with the navigation functions we have right now. Most importantly, a tree-sitter query always searches top-down, and you can’t limit the depth it searches. OTOH, our navigation functions work by traversing the tree node-to-node.

- There’s no progress on getting information like name and type, etc., in a language-agnostic way. I haven’t come up with a good interface and/or implementation. I encourage interested folks to give it some thought. Bonus points for reusing the query files the neovim folks have accumulated :-)
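
To make the first two items concrete, here is a minimal sketch of thing definitions that use each predicate form. It assumes the (LANGUAGE (THING PRED) ...) layout documented for treesit-thing-settings in later Emacs development versions; the layout at the time of this message may have differed. The node type names and the helper my-node-multiline-p are hypothetical and stand in for whatever a real grammar provides.

```elisp
;; Sketch only: node types such as "function_definition" are assumptions
;; about a C-like grammar, not taken from this thread.
(defun my-node-multiline-p (node)
  "Return non-nil if NODE spans more than one line (hypothetical helper)."
  (> (count-lines (treesit-node-start node) (treesit-node-end node)) 1))

(setq-local treesit-thing-settings
            `((c
               ;; Plain regexp matched against the node type.
               (defun "function_definition")
               ;; Compound predicate: anything that is not punctuation.
               (sexp (not ,(rx bos (or "{" "}" "," ";") eos)))
               ;; (REGEXP . FUNC): type must match and FUNC must return non-nil.
               (block ("compound_statement" . my-node-multiline-p))
               ;; (or PRED PRED): either node type qualifies.
               (text (or "comment" "string_literal")))))
```

With definitions like these in place, a call such as (treesit-node-match-p NODE 'sexp) can take the symbol instead of a literal predicate.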

Some other things on the TODO list that people can take a stab at:

- Query-based indentation (neovim’s implementation can be a source of inspiration)
- Improve c-ts-mode (indentation styles, other cc-mode features, etc) and other tree-sitter modes
- Solve the grammar versioning/breaking-change problem: tree-sitter grammars don’t have version numbers, so every time the author changes the grammar, our queries break, and loading the mode only produces a giant error.
- Major mode fallback/inheritance: this has been discussed many times, but no good solution has emerged.
- Isolated ranges. For many embedded languages, each block should be independent of the others, but currently all the embedded blocks are connected together and parsed by a single parser. We probably need to spawn a parser for each block. I’ll probably work on this one next.

Finally, feel free to send me an email, or send to emacs-devel and CC me, if there are things treesit.c and treesit.el can do better, or if there are nice things in neovim and other editors that Emacs ought to have, too.

Yuan

Re: Update on tree-sitter structure navigation

#307820

Author: Ihor Radchenko
Date: Sat, 02 Sep 2023 06:52

114 lines
4946 bytes

Yuan Fu <casouri@gmail.com> writes:

> In the months after wrapping up tree-sitter stuff in emacs-29, I was
> thinking about how to implement structural navigation and extracting
> information from the parser with tree-sitter. In emacs-29 we have
> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
> was thinking maybe we can generalize this to support getting arbitrary
> “thing” at point, move around them, and getting information like the
> name of a defun, its arglist, parent of a class, type of an variable
> declaration, etc, in a language-agnostic way.

Note that Org mode also does all of these using
https://orgmode.org/worg/dev/org-element-api.html

It would be nice if we could converge on a more consistent interface
across all the modes. For example, by extending `thing-at-point' to handle
parsed elements, not just the simplistic regexp-based "thing" boundaries
exposed by `thing-at-point' now.

Org approaches getting name/begin/end/arguments using a common API:

(org-element-property :begin NODE)
(org-element-property :end NODE)
(org-element-property :contents-begin NODE)
(org-element-property :contents-end NODE)
(org-element-property :name NODE)
(org-element-property :args NODE)

Language-agnostic "thing"s will certainly be welcome, especially given
that tree-sitter grammars use inconsistent naming schemes, which have to
be learned separately, and may even change with grammar versions.

I think that both NODE types and attributes can be standardized.

> Also, at the time, we only support defining things by a regexp
> matching a node’s type, which is often not enough.
>
> And it would be nice to somehow take advantage of the tree-sitter
> queries for the features I mentioned above. Tree-sitter query is what
> every other editor are using for virtually all tree-sitter related
> features. But in Emacs, we mostly only use it for font-lock.

I recall one user asking about something like VIM's textobjects via
tree-sitter queries. Example:
https://github.com/nvim-treesitter/nvim-treesitter-textobjects/blob/master/queries/cpp/textobjects.scm

> Here’s the progress as of now:
>
> - Functions like treesit-search-forward, treesit-induce-sparse-tree,
> treesit-thing-at-point, treesit--navigate-thing, etc, support a richer
> set of predicates now. Besides regexp matching the type, the predicate
> can also be a predication function, or (REGEP . FUNC), or compound
> predicates like (or PRED PRED) or (not PRED).

Slightly unrelated, but do you have any idea if it can be faster to use
Emacs' regexp search combined with treesit-thing-at-point vs. pure
tree-sitter query?

> - There’s now a variable treesit-thing-settings, which holds
> definition for things. Then, instead of passing the predicate to the
> functions I mentioned above, you can save the predicate in
> treesit-thing-settings under a symbol, say ‘sexp', and pass the symbol
> instead, just like thing-at-point.el. (We’ll work on integrating with
> thing-at-point.el later.)

This sounds similar to textobjects I linked above.
One question: how will it integrate with multiple parsers in one buffer?

> - I can’t think of a good way to integrate tree-sitter queries with
> the navigation functions we have right now. Most importantly,
> tree-sitter query always search top-down, and you can’t limit the
> depth it searches. OTOH, our navigation functions work by traversing
> the tree node-to-node.

Could you elaborate on the difficulties you encountered?

> Some other things on the TODO list that people can take a jab at:
>
> - Solve the grammar versioning/breaking-change problem: tree-sitter grammar don’t have a version number, so every time the author changes the grammar, our queries break, and loading the mode only produces a giant error.

Could we somehow get a hash of the library? That way, we can at least
detect if something has changed.

> - Major mode fallback/inheritance, this has been discussed many times, no good solution emerged.

I think that integration of tree-sitter with navigation functions might
be a step towards solving this problem. If common Emacs commands can
automatically choose between tree-sitter and classic implementations, it
might become easier to unify foo-ts-mode with foo-mode.

> - Isolated ranges. For many embedded languages, each blocks should be independent from another, but currently all the embedded blocks are connected together and parsed by a single parser. We probably need to spawn a parser for each block. I’ll probably work on this one next.

Do you mean that a single parser sees each subsequent block as a
continuation of the previous one?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

Re: Update on tree-sitter structure navigation

#307831

Author: Hugo Thunnissen
Date: Sat, 02 Sep 2023 10:50

54 lines
2260 bytes

Ihor Radchenko <yantar92@posteo.net> writes:

> Yuan Fu <casouri@gmail.com> writes:
>
>> In the months after wrapping up tree-sitter stuff in emacs-29, I was
>> thinking about how to implement structural navigation and extracting
>> information from the parser with tree-sitter. In emacs-29 we have
>> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
>> was thinking maybe we can generalize this to support getting arbitrary
>> “thing” at point, move around them, and getting information like the
>> name of a defun, its arglist, parent of a class, type of an variable
>> declaration, etc, in a language-agnostic way.
>
> Note that Org mode also does all of these using
> https://orgmode.org/worg/dev/org-element-api.html
>
> It would be nice if we could converge to more consistent interface
> across all the modes. For example, by extending `thing-at-point' to handle
> parsed elements, not just simplistic regexp-based "thing" boundaries
> exposed by `thing-at-point' now.
>
> Org approaches getting name/begin/end/arguments using a common API:
>
> (org-element-property :begin NODE)
> (org-element-property :end NODE)
> (org-element-property :contents-begin NODE)
> (org-element-property :contents-end NODE)
> (org-element-property :name NODE)
> (org-element-property :args NODE)
>
> Language-agnostic "thing"s will certainly be welcome, especially given
> that tree-sitter grammars use inconsistent naming schemes, which have to
> be learned separately, and may even change with grammar versions.
>
> I think that both NODE types and attributes can be standardized.
>

It would be great to see standardization that can work with more than
just tree-sitter.  Depending on how extensive such a generic NODE type
and accompanying API are, I could see standardization of a lot of things
that are currently implemented in major modes, to name a few:

- indentation
- fontification
- thing-at-point
- imenu
- simple forms of completion (variables, function names in buffer)

I have some idea of the underpinnings, but I have never implemented a
full major mode so it is hard for me to judge the practicality of
this. How much would be practical to standardize, without needlessly
complicated/resource-heavy abstractions?

Re: Update on tree-sitter structure navigation

#307887

Author: Yuan Fu
Date: Sat, 02 Sep 2023 15:09

167 lines
6459 bytes


> On Sep 1, 2023, at 11:52 PM, Ihor Radchenko <yantar92@posteo.net> wrote:
> 
> Yuan Fu <casouri@gmail.com> writes:
> 
>> In the months after wrapping up tree-sitter stuff in emacs-29, I was
>> thinking about how to implement structural navigation and extracting
>> information from the parser with tree-sitter. In emacs-29 we have
>> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
>> was thinking maybe we can generalize this to support getting arbitrary
>> “thing” at point, move around them, and getting information like the
>> name of a defun, its arglist, parent of a class, type of an variable
>> declaration, etc, in a language-agnostic way.
> 
> Note that Org mode also does all of these using
> https://orgmode.org/worg/dev/org-element-api.html
> 
> It would be nice if we could converge to more consistent interface
> across all the modes. For example, by extending `thing-at-point' to handle
> parsed elements, not just simplistic regexp-based "thing" boundaries
> exposed by `thing-at-point' now.
> 
> Org approaches getting name/begin/end/arguments using a common API:
> 
> (org-element-property :begin NODE)
> (org-element-property :end NODE)
> (org-element-property :contents-begin NODE)
> (org-element-property :contents-end NODE)
> (org-element-property :name NODE)
> (org-element-property :args NODE)
> 
> Language-agnostic "thing"s will certainly be welcome, especially given
> that tree-sitter grammars use inconsistent naming schemes, which have to
> be learned separately, and may even change with grammar versions.
> 
> I think that both NODE types and attributes can be standardized.

If we come up with a thing-at-point interface that provides more information than the current (BEG . END), tree-sitter surely can support it as a backend. Just need SomeOne to come up with it :-) But I don’t see how this interface can support semantic information like arglist of a defun, or type of a declaration—these things are not universal to all “nodes”.

> 
>> Also, at the time, we only support defining things by a regexp
>> matching a node’s type, which is often not enough.
>> 
>> And it would be nice to somehow take advantage of the tree-sitter
>> queries for the features I mentioned above. Tree-sitter query is what
>> every other editor are using for virtually all tree-sitter related
>> features. But in Emacs, we mostly only use it for font-lock.
> 
> I recall one user asking about something like VIM's textobjects via
> tree-sitter queries. Example:
> https://github.com/nvim-treesitter/nvim-treesitter-textobjects/blob/master/queries/cpp/textobjects.scm

I think that’s something that can be implemented with thing definitions.


>> Here’s the progress as of now:
>> 
>> - Functions like treesit-search-forward, treesit-induce-sparse-tree,
>> treesit-thing-at-point, treesit--navigate-thing, etc, support a richer
>> set of predicates now. Besides regexp matching the type, the predicate
>> can also be a predication function, or (REGEP . FUNC), or compound
>> predicates like (or PRED PRED) or (not PRED).
> 
> Slightly unrelated, but do you have any idea if it can be faster to use
> Emacs' regexp search combined with treesit-thing-at-point vs. pure
> tree-sitter query?

Not really.

> 
>> - There’s now a variable treesit-thing-settings, which holds
>> definition for things. Then, instead of passing the predicate to the
>> functions I mentioned above, you can save the predicate in
>> treesit-thing-settings under a symbol, say ‘sexp', and pass the symbol
>> instead, just like thing-at-point.el. (We’ll work on integrating with
>> thing-at-point.el later.)
> 
> This sounds similar to textobjects I linked above.
> One question: how will it integrate with multiple parsers in one buffer?

This is only concerned with checking whether a node satisfies the definition of a “thing”; it doesn’t care how you get the node. Retrieving a node through treesit-node-at or other functions already works with multiple parsers.

Also the “thing” definition is language-specific.

> 
>> - I can’t think of a good way to integrate tree-sitter queries with
>> the navigation functions we have right now. Most importantly,
>> tree-sitter query always search top-down, and you can’t limit the
>> depth it searches. OTOH, our navigation functions work by traversing
>> the tree node-to-node.
> 
> May you elaborate about the difficulties you encountered?

Ideally I’d like to pass a query and a node to treesit-node-match-p, which returns t if the query matches the node. But queries don’t work like that. A query searches the node and returns all the matches within that node, which is potentially wasteful.
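
As a rough illustration of that cost, the sketch below emulates "does this query match this node?" with the existing treesit-query-capture. The function my-node-matches-query-p and its CAPTURE argument are hypothetical names for this example, not part of treesit.el. The query still runs over the entire subtree under NODE, and every capture except the one on NODE itself is thrown away.

```elisp
(require 'seq)

(defun my-node-matches-query-p (node query capture)
  "Return non-nil if QUERY captures NODE itself under the name CAPTURE.
Hypothetical helper: it collects every match inside NODE, then keeps
only a capture that lands on NODE, so deep subtrees are searched for
nothing."
  (seq-some (lambda (pair)
              ;; Each PAIR is (CAPTURE-NAME . CAPTURED-NODE).
              (and (eq (car pair) capture)
                   (treesit-node-eq (cdr pair) node)))
            (treesit-query-capture node query)))
```

For a large node such as a whole function body, the work is proportional to the size of the subtree, whereas a type-based predicate only inspects the node itself.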

> 
>> Some other things on the TODO list that people can take a jab at:
>> 
>> - Solve the grammar versioning/breaking-change problem: tree-sitter grammar don’t have a version number, so every time the author changes the grammar, our queries break, and loading the mode only produces a giant error.
> 
> May we somehow get a hash of the library? That way, we can at least
> detect if something has changed.

All we get is a binary dynamic library. So I don’t think so.

> 
>> - Major mode fallback/inheritance, this has been discussed many times, no good solution emerged.
> 
> I think that integration of tree-sitter with navigation functions might
> be a step towards solving this problem. If common Emacs commands can
> automatically choose between tree-sitter and classic implementations, it
> might become easier to unify foo-ts-mode with foo-mode.

Unifying tree-sitter and non-tree-sitter modes creates many problems. I’m rather thinking about some way to share some configuration between two modes. We’ve had many discussions before with no fruitful conclusion.

> 
>> - Isolated ranges. For many embedded languages, each blocks should be independent from another, but currently all the embedded blocks are connected together and parsed by a single parser. We probably need to spawn a parser for each block. I’ll probably work on this one next.
> 
> Do you mean that a single parser sees subsequent block as a continuation
> of the previous?

Exactly.

Yuan

Re: Update on tree-sitter structure navigation

#307889

Author: Yuan Fu
Date: Sat, 02 Sep 2023 15:12

75 lines
2785 bytes


> On Sep 2, 2023, at 1:50 AM, Hugo Thunnissen <devel@hugot.nl> wrote:
> 
> Ihor Radchenko <yantar92@posteo.net> writes:
> 
>> Yuan Fu <casouri@gmail.com> writes:
>> 
>>> In the months after wrapping up tree-sitter stuff in emacs-29, I was
>>> thinking about how to implement structural navigation and extracting
>>> information from the parser with tree-sitter. In emacs-29 we have
>>> things like treesit-beginning/end-of-defun, and treesit-defun-name. I
>>> was thinking maybe we can generalize this to support getting arbitrary
>>> “thing” at point, move around them, and getting information like the
>>> name of a defun, its arglist, parent of a class, type of an variable
>>> declaration, etc, in a language-agnostic way.
>> 
>> Note that Org mode also does all of these using
>> https://orgmode.org/worg/dev/org-element-api.html
>> 
>> It would be nice if we could converge to more consistent interface
>> across all the modes. For example, by extending `thing-at-point' to handle
>> parsed elements, not just simplistic regexp-based "thing" boundaries
>> exposed by `thing-at-point' now.
>> 
>> Org approaches getting name/begin/end/arguments using a common API:
>> 
>> (org-element-property :begin NODE)
>> (org-element-property :end NODE)
>> (org-element-property :contents-begin NODE)
>> (org-element-property :contents-end NODE)
>> (org-element-property :name NODE)
>> (org-element-property :args NODE)
>> 
>> Language-agnostic "thing"s will certainly be welcome, especially given
>> that tree-sitter grammars use inconsistent naming schemes, which have to
>> be learned separately, and may even change with grammar versions.
>> 
>> I think that both NODE types and attributes can be standardized.
>> 
> 
> It would be great to see standardization that can work with more than
> just tree-sitter.  Depending on how extensive such a generic NODE type
> and accompanying API are, I could see standardization of a lot of things
> that are currently implemented in major modes, to name a few:
> 
> - indentation
> - fontification
> - thing-at-point
> - imenu
> - simple forms of completion (variables, function names in buffer)
> 
> I have some idea of the underpinnings, but I have never implemented a
> full major mode so it is hard for me to judge the practicality of
> this. How much would be practical to standardize, without needlessly
> complicated/resource-heavy abstractions?

I don’t know which level of standardization you are thinking about, but aren’t they already standardized?

- indentation: indent-line-function / indent-region-function
- fontification: font-lock-defaults
- thing-at-point: thing-at-point function
- imenu: imenu-create-index-function
- completion: completion-at-point-functions
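
For readers who have not written a major mode, here is a minimal sketch of how a mode plugs into these hooks. The mode my-demo-mode and its helper functions are hypothetical placeholders with trivial bodies; only the variables being set are standard Emacs ones.

```elisp
(defun my-demo-indent-line ()
  "Indent the current line to column 0 (placeholder logic)."
  (indent-line-to 0))

(defun my-demo-imenu-index ()
  "Return an empty imenu index (placeholder)."
  nil)

(defun my-demo-capf ()
  "Offer no completions (placeholder)."
  nil)

(define-derived-mode my-demo-mode prog-mode "Demo"
  "Hypothetical mode showing the standard extension points."
  ;; Indentation: called by `indent-for-tab-command' and friends.
  (setq-local indent-line-function #'my-demo-indent-line)
  ;; Fontification: no keywords here, just the default machinery.
  (setq-local font-lock-defaults '(nil))
  ;; Imenu: the function that builds the index for this buffer.
  (setq-local imenu-create-index-function #'my-demo-imenu-index)
  ;; Completion: buffer-local entry on the standard hook.
  (add-hook 'completion-at-point-functions #'my-demo-capf nil t))
```

Each hook takes an opaque function, so a tree-sitter mode and a classic mode can implement them very differently while generic commands keep working.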

Yuan

Re: Update on tree-sitter structure navigation

#307895

Author: Dmitry Gutov
Date: Sun, 03 Sep 2023 03:56

23 lines
1276 bytes

Hi Yuan,

On 02/09/2023 08:01, Yuan Fu wrote:
> - Solve the grammar versioning/breaking-change problem: tree-sitter grammar don’t have a version number, so every time the author changes the grammar, our queries break, and loading the mode only produces a giant error.

I don't have a better idea than basically copying NeoVim and others: to
maintain the urls to parser repositories and the ref of the latest known
good revision, for the current version of the major mode. That info
could be filled in by major modes themselves, e.g. in an autoload block
(similarly to how auto-mode-alist is appended to).
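
One possible shape for this, sketched against the existing treesit-language-source-alist variable from Emacs 29, whose entries are (LANG . (URL REVISION SOURCE-DIR CC C++)). The string "KNOWN-GOOD-REF" is a placeholder, not a real tag, and whether major modes should populate this variable themselves is exactly the open question here.

```elisp
;; Sketch only: "KNOWN-GOOD-REF" stands for a tag or branch that the
;; mode author has tested against; it is not a real revision name.
(with-eval-after-load 'treesit
  (add-to-list 'treesit-language-source-alist
               '(ruby "https://github.com/tree-sitter/tree-sitter-ruby"
                      "KNOWN-GOOD-REF")))
```

With an entry like this, M-x treesit-install-language-grammar can fetch and build that revision instead of whatever the repository's default branch currently holds.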

> Finally, feel free to send me an email or send to emacs-devel and CC me, if there are things treesit.c and treesit.el can do better, or when there are nice things in neovim and other editors and Emacs ought to have, too.

Something I mentioned previously: there is a notion of scopes in the
tree-sitter docs; see the Local Variables section here:
https://tree-sitter.github.io/tree-sitter/syntax-highlighting#local-variables

Basically to know which symbols are defined and for how long, the parser
needs additional help from the major mode author.

Neovim's definition here:
https://github.com/nvim-treesitter/nvim-treesitter/blob/master/queries/ruby/locals.scm

Thread View: gmane.emacs.devel