Add start and end properties to WikiText AST nodes for all elements. #7866

Gk0Wk · 2023-11-29T07:55:56Z

Enhancements to WikiText Parser AST

Summary

This pull request introduces a significant enhancement to the WikiText parser by adding start and end properties to the AST nodes that did not previously have them. This update includes nodes such as lists, codeblocks, and various other elements that were lacking these properties.

Background

The ability to identify the exact substring range of a syntax subtree within the source file is crucial for advanced functionalities like section editing. While some progress had been made in the past by adding start and end properties to paragraph elements, other nodes were still missing this information.

Changes

With this PR, I've implemented a uniform check across all syntax rules during the AST generation process. Now, every node produced by the parser will have its corresponding start and end properties, marking the range of the substring it represents in the source text.

Benefits

Enables the implementation of more complex features such as segment editing.
Improves the accuracy and usefulness of the AST for further processing and manipulation.
Provides consistency across different types of nodes in the AST.

Testing

Before:

After:

I look forward to your feedback on these enhancements!

…l other elements

vercel · 2023-11-29T07:55:59Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
tiddlywiki5	✅ Ready (Inspect)	Visit Preview	Dec 6, 2023 3:37am

saqimtiaz · 2023-11-29T08:00:03Z

@Gk0Wk I have not reviewed the code yet but this is a long overdue improvement, thank you.

I note that the tests are failing and need to be updated.

Jermolene · 2023-11-29T08:59:15Z

Thanks @Gk0Wk this would indeed be a welcome improvement.

Given the scale of the changes needed to the tests, I wonder if we might also think about a further modification that has been discussed in the past: adding a new parse tree node property that (for example) distinguishes a <$list> widget that was encountered in its usual widget syntax from one generated by the conditional shortcut syntax.

Without the ability to distinguish them, it would remain impossible to reconstruct the original wikitext from the parse tree. The property could be origin, variant, form, or perhaps parserule.

Would that be useful in the applications you have in mind?

Gk0Wk · 2023-11-29T11:39:19Z

@Jermolene Yes, I completely agree. In fact, today I also wanted to do the same thing, that is, to consider the rule as an attribute of the node:

However, I'm not sure if this alone is sufficient to describe everything completely. What I mean is, can we reconstruct the original wiki string without any discrepancies solely based on these? Is it possible that we still need more context?

Nevertheless, this would indeed be better than before. The previous AST was geared towards widget generation, or rather, it was oriented towards rendering, and not for describing all the information.

I've also noticed that some rules do not generate any nodes, such as comment and parserule, whose syntactic information is discarded. styleblock does not contain information about itself but only returns the block it is wrapped in, while quoteblock, prettylink are also missing some information about the substring part.

Meanwhile, plain text is treated as a paragraph element and does not correspond to any rule.

core/modules/parsers/wikiparser/wikiparser.js

Gk0Wk · 2023-11-29T14:31:45Z

@Gk0Wk I have not reviewed the code yet but this is a long overdue improvement, thank you.

I note that the tests are failing and need to be updated.

I've looked at the error messages from the testing process, but I can't understand what they mean. For example:

1) WikiText parser tests should parse inline macro calls
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:208:56)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:213:1[24](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:25))
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:218:48)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:223:35)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:2[28](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:29):37)
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:234:[30](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:31))
        at <Jasmine>
  Message:
    Expected $[0] not to have properties
        parseRule: null
  Stack:
    Error: Expected $[0] not to have properties
        parseRule: null
        at <Jasmine>
        at UserContext.<anonymous> (test-wikitext-parser.js:2[39](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:40):[55](https://github.com/Jermolene/TiddlyWiki5/actions/runs/7032732431/job/19137077975?pr=7866#step:4:56))
        at <Jasmine>

Can someone help explain them to me?

Is this because I need to further modify the expected results in test-wikitext-parser.js? It looks like there's a significant amount of work.

linonetwo · 2023-11-30T09:41:51Z

distinguishes a <$list> widget

We can do that in another PR, I think that is difficult.

Also, I've been long waiting for macro call, distinguish from transclusion #7646

Jermolene · 2023-11-30T09:52:49Z

@Jermolene Yes, I completely agree. In fact, today I also wanted to do the same thing, that is, to consider the rule as an attribute of the node:

I think it would be more useful if the "rule" property contained a string. It's very useful to be able to save parse trees in JSON.

However, I'm not sure if this alone is sufficient to describe everything completely. What I mean is, can we reconstruct the original wiki string without any discrepancies solely based on these? Is it possible that we still need more context?

I think we would indeed still have some more work to do, but it would be a huge improvement.

Nevertheless, this would indeed be better than before. The previous AST was geared towards widget generation, or rather, it was oriented towards rendering, and not for describing all the information.

Yes, exactly.

I've also noticed that some rules do not generate any nodes, such as comment and parserule, whose syntactic information is discarded. styleblock does not contain information about itself but only returns the block it is wrapped in, while quoteblock, prettylink are also missing some information about the substring part.

Yes, we would need to retain some ore information in the parse tree. That may well have performance implications of course.

Meanwhile, plain text is treated as a paragraph element and does not correspond to any rule.

I think a reasonable initial goal might be to be able to convert parse trees to a kind of normalised wikitext translation, where we don't have 100% fidelity of the source, but the rendered output is identical.

Is this because I need to further modify the expected results in test-wikitext-parser.js? It looks like there's a significant amount of work.

Yes, I am afraid so. It is a lot of work. It may well be that some significant refactoring might be needed.

For example, the quickest way to get the existing tests running again might be to hack the parse function that it uses to remove the new attributes. Of course then we'd need another set of tests for the new attributes, but we could use the newer wikitext test format for those.

pmario · 2023-11-30T14:21:00Z

For example, the quickest way to get the existing tests running again might be to hack the parse function that it uses to remove the new attributes. Of course then we'd need another set of tests for the new attributes, but we could use the newer wikitext test format for those.

I would be not happy with a test-hack. We do not have the best test coverage for a project that size and hacking tests IMO makes the unreliable.

I did not run the tests with this PR yet, but I think basically all of the parser related tests fail now. -> Because they should.

I'll have a closer look

pmario · 2023-11-30T14:23:06Z

distinguishes a <$list> widget

We can do that in another PR, [. . .]

I also think so. IMO this PR should be merged as is after the tests have been fixed and in a second step we can add more info to the AST

linonetwo · 2023-12-01T05:40:42Z

convert parse trees to a kind of normalised wikitext translation

This is already done in https://github.com/tiddly-gittly/wikiast , should I PR it to the core? But it rely on unist library, and due to complicity of AST operation, I can't do this without TS or JSDoc.

…add more range attributes

Gk0Wk · 2023-12-02T05:07:53Z

I have modified all the expected results in test-wikitext-parser.js and verified the correctness of the expected results; the tests have passed.

Jermolene · 2023-12-02T12:24:34Z

Thanks @Gk0Wk. Looks good, I'm going to add this PR to the list to be merged after v5.3.2. See #7856

Gk0Wk · 2023-12-05T18:50:15Z

@Jermolene @pmario New changes for html aprser:

Overview

Fixing a parsing issue within the WikiText HTML syntax parser. The original implementation incorrectly set the start and end properties of an HTML AST node to only cover the opening tag's substring, neglecting the full extent of the HTML element.

Changes Made

The start and end properties have been redefined to represent the boundaries of the entire HTML node substring, from the start of the opening tag to the end of the closing tag.
Introduced four new properties to accurately track positions of opening and closing tags:
- openTagStart: The index where the opening tag begins.
- openTagEnd: The index where the opening tag ends.
- closeTagStart: The index where the closing tag begins.
- closeTagEnd: The index where the closing tag ends.

These modifications ensure that the AST node for HTML elements now correctly encapsulates the complete range of the HTML snippet, allowing for more precise manipulation and analysis.

core/modules/parsers/wikiparser/rules/html.js

Jermolene · 2023-12-06T18:04:05Z

Great stuff thank you @Gk0Wk

Jermolene · 2023-12-10T22:10:07Z

Tagging @flibbles – would these changes to the parse tree be helpful and/or problematic for Uglify/Relink?

flibbles · 2023-12-10T23:10:02Z

Tests all pass for this. It's additive rather than changing, so shouldn't be a problem.

I actually considered proposing something like this myself a while back. It would have made it much easier for my plugins to figure things out instead of having to count out start and stop for themselves after-the-fact.

kookma · 2024-01-17T04:06:11Z

Hi @Gk0Wk
This feature lets have a more powerful yet simpler section editor.

@Jermolene
Is there any plan to be merged in 5.3.4?

saqimtiaz · 2024-01-17T08:22:05Z

Fixing a parsing issue within the WikiText HTML syntax parser. The original implementation incorrectly set the start and end properties of an HTML AST node to only cover the opening tag's substring, neglecting the full extent of the HTML element.

@Gk0Wk can we please add some tests for these cases?

saqimtiaz · 2024-02-28T11:34:56Z

~~@Gk0Wk I would love to see this get merged and I think all that is missing now is some more tests for the issues that you fixed.~~

saqimtiaz

I have reviewed this PR, and the tests look sufficient however we should make sure that there is not any significant performance regression introduced here.

There are a few code style issues that need attention.

saqimtiaz · 2024-06-05T07:23:39Z

core/modules/parsers/wikiparser/rules/filteredtranscludeblock.js

+	var styleStart = templateEnd + 2;
+	var styleEnd = styleStart + (this.match[4] ? this.match[4].length : 0);
+	var classesStart = styleEnd + 1;
+	var classesEnd = classesStart + (this.match[5] ? this.match[5].length : 0);


We always collapse multiple var statements into one:

var filterStart = this.parser.pos + 3, filterEnd = .....

This applies to all changes throughout this PR

Does this apply while there is assignment?

Hope this can be done by eslint.

I think this would need to be resolved manually.

linonetwo · 2024-06-05T12:37:42Z

core/modules/parsers/wikiparser/rules/filteredtranscludeblock.js

@@ -42,21 +52,21 @@ exports.parse = function() {
 	var node = {
 		type: "list",
 		attributes: {
-			filter: {type: "string", value: filter}
+			filter: {type: "string", value: filter, start: filterStart, end: filterEnd},


You mean use {type: "string", value: filter, filterStart, filterEnd} here?

@linonetwo I must have misread the code this morning, this code is fine and needs no further changes, and this comment can be ignored.

saqimtiaz · 2024-06-05T07:28:37Z

core/modules/parsers/wikiparser/wikiparser.js

 	}
 	// Treat it as a paragraph if we didn't find a block rule
 	var start = this.pos;
 	var children = this.parseInlineRun(terminatorRegExp);
 	var end = this.pos;
-	return [{type: "element", tag: "p", children: children, start: start, end: end }];
+	return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }];


@Jermolene flagging for your review whether there might be a more appropriate value than null for the rule property here.

My instinct would be to omit the "rule" property entirely in this case.

This seem to be a bad idea now, because I need this info to restore \n\n in #8258

saqimtiaz · 2024-06-05T07:29:48Z

core/modules/parsers/wikiparser/rules/html.js

+			if (tag.closeTagStart < closeTagMinPos) {
+				tag.closeTagStart = tag.end;
+			}
+		}


@Gk0Wk @Jermolene we should make sure that this PR does not introduce any significant performance regression.

I think this won't cost much, because it just while from > to the < in <div>, won't step more than <blockquote>

But why would we need open/closeTagStart/End ? Can this feature be turn on when needed?

Jermolene

Thanks @Gk0Wk much appreciated. There are a few coding style issues that need cleaning up, and then I am keen to merge this as soon as possible.

Jermolene · 2024-06-06T10:10:57Z

core/modules/parsers/wikiparser/wikiparser.js

 	}
 	// Treat it as a paragraph if we didn't find a block rule
 	var start = this.pos;
 	var children = this.parseInlineRun(terminatorRegExp);
 	var end = this.pos;
-	return [{type: "element", tag: "p", children: children, start: start, end: end }];
+	return [{type: "element", tag: "p", children: children, start: start, end: end, rule: null }];


My instinct would be to omit the "rule" property entirely in this case.

Jermolene · 2024-06-06T10:11:26Z

core/modules/parsers/wikiparser/wikiparser.js

+		var start = this.pos;
+		var subTree = nextMatch.rule.parse();
+		// Set the start and end positions of the first and last child if they're not already set
+		if (subTree.length > 0) {


We don't have whitespace between if and (

Jermolene · 2024-06-06T10:11:52Z

core/modules/parsers/wikiparser/wikiparser.js

+		// Set the start and end positions of the first and last child if they're not already set
+		if (subTree.length > 0) {
+			// Set the start and end positions of the first and last child if they're not already set
+			if (subTree[0].start === undefined) subTree[0].start = start;


We always use braces and a new line for the body of if statements.

Jermolene · 2024-06-06T10:12:43Z

core/modules/parsers/wikiparser/rules/html.js

+					tag.closeTagStart = -1;
+					break;
+				}
+				if (char === '<') break;


We always use double quotes for string constants.

commit 5687d9f Author: Gk0Wk <nmg_wk@yeah.net> Date: Wed Dec 6 11:33:43 2023 +0800 Fix for html parser commit df0a1b1 Author: Gk0Wk <nmg_wk@yeah.net> Date: Wed Dec 6 02:47:47 2023 +0800 Fix HTML AST node boundary parsing in WikiText commit ac8dda0 Author: Gk0Wk <nmg_wk@yeah.net> Date: Sat Dec 2 13:02:52 2023 +0800 update test-wikitext-parser.js, change for-const-of -to .utils.each, add more range attributes commit e2b9a4e Author: Gk0Wk <nmg_wk@yeah.net> Date: Wed Nov 29 22:35:39 2023 +0800 Add more start-end range attributes for AST commit d3e62ec Author: Gk0Wk <nmg_wk@yeah.net> Date: Wed Nov 29 20:45:00 2023 +0800 Add rule attribute for WikiText AST nodes commit 4200495 Author: Gk0Wk <nmg_wk@yeah.net> Date: Wed Nov 29 15:48:38 2023 +0800 Add start and end properties to AST nodes for list, codeblock, and all other elements

… null Also make tests pass

Jermolene · 2024-06-08T15:54:01Z

Merged in e4c682d, with some minor tweaks in 1a57d08

Add start and end properties to AST nodes for list, codeblock, and al…

4200495

…l other elements

vercel bot had a problem deploying to Preview November 29, 2023 07:56 Failure

Add rule attribute for WikiText AST nodes

d3e62ec

vercel bot had a problem deploying to Preview November 29, 2023 12:45 Failure

pmario reviewed Nov 29, 2023

View reviewed changes

core/modules/parsers/wikiparser/wikiparser.js Outdated Show resolved Hide resolved

Add more start-end range attributes for AST

e2b9a4e

vercel bot had a problem deploying to Preview November 29, 2023 14:36 Failure

update test-wikitext-parser.js, change for-const-of -to .utils.each, …

ac8dda0

…add more range attributes

vercel bot deployed to Preview December 2, 2023 05:05 View deployment

Jermolene mentioned this pull request Dec 2, 2023

Tickets for consideration after release of v5.3.2 and v5.3.3 #7856

Open

Fix HTML AST node boundary parsing in WikiText

df0a1b1

vercel bot deployed to Preview December 5, 2023 18:51 View deployment

pmario reviewed Dec 5, 2023

View reviewed changes

core/modules/parsers/wikiparser/rules/html.js Outdated Show resolved Hide resolved

Fix for html parser

5687d9f

vercel bot deployed to Preview December 6, 2023 03:37 View deployment

pmario mentioned this pull request Dec 10, 2023

Check missing start/end parser ranges for "element, set, entity" #5739

Open

This was referenced Dec 10, 2023

pragmas don't produce "start / end" markers in the parse-tree #6822

Open

wikitext transclusions don't produce "start / stop" info in parse-tree #7259

Open

flibbles mentioned this pull request Dec 10, 2023

Ensure {{}} doesn't cause a recursion error #7768

Merged

Jermolene mentioned this pull request May 2, 2024

Planning for release of v5.3.4 #8174

Closed

saqimtiaz reviewed Jun 5, 2024

View reviewed changes

Jermolene reviewed Jun 6, 2024

View reviewed changes

Jermolene added a commit that referenced this pull request Jun 8, 2024

Tweak #7866 to omit "rule" property when unset, instead of leaving it…

1a57d08

… null Also make tests pass

Jermolene closed this Jun 8, 2024

hoelzro mentioned this pull request Aug 7, 2024

Add start/end properties for table row/cell elements #8494

Merged

Add start and end properties to WikiText AST nodes for all elements. #7866

Add start and end properties to WikiText AST nodes for all elements. #7866

Conversation

Gk0Wk commented Nov 29, 2023

Enhancements to WikiText Parser AST

Summary

Background

Changes

Benefits

Testing

vercel bot commented Nov 29, 2023 • edited Loading

saqimtiaz commented Nov 29, 2023

Jermolene commented Nov 29, 2023

Gk0Wk commented Nov 29, 2023 • edited Loading

Gk0Wk commented Nov 29, 2023 • edited Loading

linonetwo commented Nov 30, 2023 • edited Loading

Jermolene commented Nov 30, 2023

pmario commented Nov 30, 2023

pmario commented Nov 30, 2023

linonetwo commented Dec 1, 2023 • edited Loading

Gk0Wk commented Dec 2, 2023

Jermolene commented Dec 2, 2023

Gk0Wk commented Dec 5, 2023

Overview

Changes Made

Jermolene commented Dec 6, 2023

Jermolene commented Dec 10, 2023

flibbles commented Dec 10, 2023

kookma commented Jan 17, 2024

saqimtiaz commented Jan 17, 2024

saqimtiaz commented Feb 28, 2024 • edited Loading

saqimtiaz left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment was marked as resolved.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jermolene left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Jermolene commented Jun 8, 2024

vercel bot commented Nov 29, 2023 •

edited

Loading

Gk0Wk commented Nov 29, 2023 •

edited

Loading

Gk0Wk commented Nov 29, 2023 •

edited

Loading

linonetwo commented Nov 30, 2023 •

edited

Loading

linonetwo commented Dec 1, 2023 •

edited

Loading

saqimtiaz commented Feb 28, 2024 •

edited

Loading

saqimtiaz left a comment •

edited

Loading