Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add start and end properties to WikiText AST nodes for all elements. #7866

Closed
wants to merge 6 commits into from

Conversation

Gk0Wk
Copy link
Contributor

@Gk0Wk Gk0Wk commented Nov 29, 2023

Enhancements to WikiText Parser AST

Summary

This pull request introduces a significant enhancement to the WikiText parser by adding start and end properties to the AST nodes that did not previously have them. This update includes nodes such as lists, codeblocks, and various other elements that were lacking these properties.

Background

The ability to identify the exact substring range of a syntax subtree within the source file is crucial for advanced functionalities like section editing. While some progress had been made in the past by adding start and end properties to paragraph elements, other nodes were still missing this information.

Changes

With this PR, I've implemented a uniform check across all syntax rules during the AST generation process. Now, every node produced by the parser will have its corresponding start and end properties, marking the range of the substring it represents in the source text.

Benefits

  • Enables the implementation of more complex features such as segment editing.
  • Improves the accuracy and usefulness of the AST for further processing and manipulation.
  • Provides consistency across different types of nodes in the AST.

Testing

Before:

image

After:

image

I look forward to your feedback on these enhancements!

Copy link

vercel bot commented Nov 29, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
tiddlywiki5 ✅ Ready (Inspect) Visit Preview Dec 6, 2023 3:37am

@saqimtiaz
Copy link
Member

@Gk0Wk I have not reviewed the code yet but this is a long overdue improvement, thank you.

I note that the tests are failing and need to be updated.

@Jermolene
Copy link
Member

Thanks @Gk0Wk this would indeed be a welcome improvement.

Given the scale of the changes needed to the tests, I wonder if we might also think about a further modification that has been discussed in the past: adding a new parse tree node property that (for example) distinguishes a <$list> widget that was encountered in its usual widget syntax from one generated by the conditional shortcut syntax.

Without the ability to distinguish them, it would remain impossible to reconstruct the original wikitext from the parse tree. The property could be origin, variant, form, or perhaps parserule.

Would that be useful in the applications you have in mind?

@Gk0Wk
Copy link
Contributor Author

Gk0Wk commented Nov 29, 2023

@Jermolene Yes, I completely agree. In fact, today I also wanted to do the same thing, that is, to consider the rule as an attribute of the node:

image

However, I'm not sure if this alone is sufficient to describe everything completely. What I mean is, can we reconstruct the original wiki string without any discrepancies solely based on these? Is it possible that we still need more context?

Nevertheless, this would indeed be better than before. The previous AST was geared towards widget generation, or rather, it was oriented towards rendering, and not for describing all the information.

I've also noticed that some rules do not generate any nodes, such as comment and parserule, whose syntactic information is discarded. styleblock does not contain information about itself but only returns the block it is wrapped in, while quoteblock, prettylink are also missing some information about the substring part.

Meanwhile, plain text is treated as a paragraph element and does not correspond to any rule.

image

@Gk0Wk
Copy link
Contributor Author

Gk0Wk commented Nov 29, 2023

@Gk0Wk I have not reviewed the code yet but this is a long overdue improvement, thank you.

I note that the tests are failing and need to be updated.

I've looked at the error messages from the testing process, but I can't understand what they mean. For example:

1) WikiText parser tests should parse inline macro calls
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:208:56)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:213:1[24](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:25))
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:218:48)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:223:35)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:2[28](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:29):37)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:234:[30](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:31))
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:2[39](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:40):[55](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:56))
        at <Jasmine>

Can someone help explain them to me?

Is this because I need to further modify the expected results in test-wikitext-parser.js? It looks like there's a significant amount of work.

@linonetwo
Copy link
Contributor

linonetwo commented Nov 30, 2023

distinguishes a <$list> widget

We can do that in another PR, I think that is difficult.

Also, I've been long waiting for macro call, distinguish from transclusion #7646

@Jermolene
Copy link
Member

@Jermolene Yes, I completely agree. In fact, today I also wanted to do the same thing, that is, to consider the rule as an attribute of the node:

I think it would be more useful if the "rule" property contained a string. It's very useful to be able to save parse trees in JSON.

However, I'm not sure if this alone is sufficient to describe everything completely. What I mean is, can we reconstruct the original wiki string without any discrepancies solely based on these? Is it possible that we still need more context?

I think we would indeed still have some more work to do, but it would be a huge improvement.

Nevertheless, this would indeed be better than before. The previous AST was geared towards widget generation, or rather, it was oriented towards rendering, and not for describing all the information.

Yes, exactly.

I've also noticed that some rules do not generate any nodes, such as comment and parserule, whose syntactic information is discarded. styleblock does not contain information about itself but only returns the block it is wrapped in, while quoteblock, prettylink are also missing some information about the substring part.

Yes, we would need to retain some ore information in the parse tree. That may well have performance implications of course.

Meanwhile, plain text is treated as a paragraph element and does not correspond to any rule.

I think a reasonable initial goal might be to be able to convert parse trees to a kind of normalised wikitext translation, where we don't have 100% fidelity of the source, but the rendered output is identical.

Is this because I need to further modify the expected results in test-wikitext-parser.js? It looks like there's a significant amount of work.

Yes, I am afraid so. It is a lot of work. It may well be that some significant refactoring might be needed.

For example, the quickest way to get the existing tests running again might be to hack the parse function that it uses to remove the new attributes. Of course then we'd need another set of tests for the new attributes, but we could use the newer wikitext test format for those.

@pmario
Copy link
Member

pmario commented Nov 30, 2023

For example, the quickest way to get the existing tests running again might be to hack the parse function that it uses to remove the new attributes. Of course then we'd need another set of tests for the new attributes, but we could use the newer wikitext test format for those.

I would be not happy with a test-hack. We do not have the best test coverage for a project that size and hacking tests IMO makes the unreliable.

I did not run the tests with this PR yet, but I think basically all of the parser related tests fail now. -> Because they should.

I'll have a closer look

@pmario
Copy link
Member

pmario commented Nov 30, 2023

distinguishes a <$list> widget

We can do that in another PR, [. . .]

I also think so. IMO this PR should be merged as is after the tests have been fixed and in a second step we can add more info to the AST

@linonetwo
Copy link
Contributor

linonetwo commented Dec 1, 2023

convert parse trees to a kind of normalised wikitext translation

This is already done in https://github.com/tiddly-gittly/wikiast , should I PR it to the core? But it rely on unist library, and due to complicity of AST operation, I can't do this without TS or JSDoc.

@Gk0Wk
Copy link
Contributor Author

Gk0Wk commented Dec 2, 2023

I have modified all the expected results in test-wikitext-parser.js and verified the correctness of the expected results; the tests have passed.

@Jermolene
Copy link
Member

Thanks @Gk0Wk. Looks good, I'm going to add this PR to the list to be merged after v5.3.2. See #7856

@Gk0Wk
Copy link
Contributor Author

Gk0Wk commented Dec 5, 2023

@Jermolene @pmario New changes for html aprser:

Overview

Fixing a parsing issue within the WikiText HTML syntax parser. The original implementation incorrectly set the start and end properties of an HTML AST node to only cover the opening tag's substring, neglecting the full extent of the HTML element.

Changes Made

  • The start and end properties have been redefined to represent the boundaries of the entire HTML node substring, from the start of the opening tag to the end of the closing tag.
  • Introduced four new properties to accurately track positions of opening and closing tags:
    • openTagStart: The index where the opening tag begins.
    • openTagEnd: The index where the opening tag ends.
    • closeTagStart: The index where the closing tag begins.
    • closeTagEnd: The index where the closing tag ends.

These modifications ensure that the AST node for HTML elements now correctly encapsulates the complete range of the HTML snippet, allowing for more precise manipulation and analysis.

@Jermolene
Copy link
Member

Great stuff thank you @Gk0Wk

@Jermolene
Copy link
Member

Tagging @flibbles – would these changes to the parse tree be helpful and/or problematic for Uglify/Relink?

@flibbles
Copy link
Contributor

Tests all pass for this. It's additive rather than changing, so shouldn't be a problem.

I actually considered proposing something like this myself a while back. It would have made it much easier for my plugins to figure things out instead of having to count out start and stop for themselves after-the-fact.

@kookma
Copy link
Contributor

kookma commented Jan 17, 2024

Hi @Gk0Wk
This feature lets have a more powerful yet simpler section editor.

@Jermolene
Is there any plan to be merged in 5.3.4?

@saqimtiaz
Copy link
Member

Fixing a parsing issue within the WikiText HTML syntax parser. The original implementation incorrectly set the start and end properties of an HTML AST node to only cover the opening tag's substring, neglecting the full extent of the HTML element.

@Gk0Wk can we please add some tests for these cases?

@saqimtiaz
Copy link
Member

saqimtiaz commented Feb 28, 2024

@Gk0Wk I would love to see this get merged and I think all that is missing now is some more tests for the issues that you fixed.

Copy link
Member

@saqimtiaz saqimtiaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed this PR, and the tests look sufficient however we should make sure that there is not any significant performance regression introduced here.

There are a few code style issues that need attention.

var styleStart = templateEnd + 2;
var styleEnd = styleStart + (this.match[4] ? this.match[4].length : 0);
var classesStart = styleEnd + 1;
var classesEnd = classesStart + (this.match[5] ? this.match[5].length : 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always collapse multiple var statements into one:

var filterStart = this.parser.pos + 3,
	filterEnd = .....

This applies to all changes throughout this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply while there is assignment?

Hope this can be done by eslint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would need to be resolved manually.

@@ -42,21 +52,21 @@ exports.parse = function() {
var node = {
type: "list",
attributes: {
filter: {type: "string", value: filter}
filter: {type: "string", value: filter, start: filterStart, end: filterEnd},

This comment was marked as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean use {type: "string", value: filter, filterStart, filterEnd} here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linonetwo I must have misread the code this morning, this code is fine and needs no further changes, and this comment can be ignored.

}
// Treat it as a paragraph if we didn't find a block rule
var start = this.pos;
var children = this.parseInlineRun(terminatorRegExp);
var end = this.pos;
return [{type: "element", tag: "p", children: children, start: start, end: end }];
return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jermolene flagging for your review whether there might be a more appropriate value than null for the rule property here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct would be to omit the "rule" property entirely in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem to be a bad idea now, because I need this info to restore \n\n in #8258

if (tag.closeTagStart < closeTagMinPos) {
tag.closeTagStart = tag.end;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gk0Wk @Jermolene we should make sure that this PR does not introduce any significant performance regression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this won't cost much, because it just while from > to the < in <div>, won't step more than <blockquote>

But why would we need open/closeTagStart/End ? Can this feature be turn on when needed?

Copy link
Member

@Jermolene Jermolene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Gk0Wk much appreciated. There are a few coding style issues that need cleaning up, and then I am keen to merge this as soon as possible.

}
// Treat it as a paragraph if we didn't find a block rule
var start = this.pos;
var children = this.parseInlineRun(terminatorRegExp);
var end = this.pos;
return [{type: "element", tag: "p", children: children, start: start, end: end }];
return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct would be to omit the "rule" property entirely in this case.

var start = this.pos;
var subTree = nextMatch.rule.parse();
// Set the start and end positions of the first and last child if they're not already set
if (subTree.length > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have whitespace between if and (

// Set the start and end positions of the first and last child if they're not already set
if (subTree.length > 0) {
// Set the start and end positions of the first and last child if they're not already set
if (subTree[0].start === undefined) subTree[0].start = start;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always use braces and a new line for the body of if statements.

tag.closeTagStart = -1;
break;
}
if (char === '<') break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always use double quotes for string constants.

Jermolene added a commit that referenced this pull request Jun 8, 2024
commit 5687d9f
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Wed Dec 6 11:33:43 2023 +0800

    Fix for html parser

commit df0a1b1
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Wed Dec 6 02:47:47 2023 +0800

    Fix HTML AST node boundary parsing in WikiText

commit ac8dda0
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Sat Dec 2 13:02:52 2023 +0800

    update test-wikitext-parser.js, change for-const-of -to .utils.each, add more range attributes

commit e2b9a4e
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Wed Nov 29 22:35:39 2023 +0800

    Add more start-end range attributes for AST

commit d3e62ec
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Wed Nov 29 20:45:00 2023 +0800

    Add rule attribute for WikiText AST nodes

commit 4200495
Author: Gk0Wk <nmg_wk@yeah.net>
Date:   Wed Nov 29 15:48:38 2023 +0800

    Add start and end properties to AST nodes for list, codeblock, and all other elements
Jermolene added a commit that referenced this pull request Jun 8, 2024
@Jermolene
Copy link
Member

Merged in e4c682d, with some minor tweaks in 1a57d08

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants