Adding v5 support #2

Closed
petdance opened this issue Nov 18, 2019 · 41 comments

Comments

@petdance
Collaborator

I've created a new v5 branch. Let's work against that, and use this ticket for discussion.

Why do you want to hide HTML::Tagset::v[45] from PAUSE? I don't think we do.

As to tests:

  • Make a list of tags that are in v4 and not v5. Verify that. (font, i, center, etc)
  • Make a list of tags that are in v5 and not v4. Verify that. (audio, video, mark, etc)
  • Make a list of tags that are in both. Verify that. (table, div, etc)

We should probably test the differences in attributes. https://www.w3.org/TR/html5-diff/ notes, for example, "A new placeholder attribute can be specified on the input and textarea elements." I think it should be pretty exhaustive. We only have to do this once.
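The three tag checks listed above amount to set differences and an intersection. A minimal sketch, assuming two plain hashes keyed by tag name; `%v4`, `%v5`, `only_in` and `in_both` are hypothetical names with tiny sample data, not the module's real tables:

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only. The real lists would come
# from whatever the v4 and v5 tag tables end up being called.
my %v4 = map { $_ => 1 } qw( font center table div );
my %v5 = map { $_ => 1 } qw( audio video mark table div );

# Return the sorted keys of %$left that are absent from %$right.
sub only_in {
    my ($left, $right) = @_;
    my @names = sort grep { !exists $right->{$_} } keys %$left;
    return @names;
}

# Return the sorted keys present in both hashes.
sub in_both {
    my ($left, $right) = @_;
    my @names = sort grep { exists $right->{$_} } keys %$left;
    return @names;
}

my @v4_only = only_in( \%v4, \%v5 );
my @v5_only = only_in( \%v5, \%v4 );
my @shared  = in_both( \%v4, \%v5 );

print "v4 only: @v4_only\n";
print "v5 only: @v5_only\n";
print "shared:  @shared\n";
```

The same helpers could be pointed at per-tag attribute lists to cover the attribute differences, such as `placeholder` on `input` and `textarea`.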

Finally, as I read it, you have v5 as the default, which is what we should do. We just need to make it dead simple, one line of code ideally, for someone to change back to v4 in their existing code.

@castaway

castaway commented Nov 18, 2019

I was hiding v4 / v5 just so we didn't have 3x lots of essentially the same docs, and was being lazy about just removing the POD. I would class those modules as internal (though I guess folks might want to load them instead of the top layer?)

Sounds good re test ideas, I'll have a poke tomorrow unless someone else gets to it.

Current default is v4, unless v5 is requested. I'm happy with either variant.

@petdance
Collaborator Author

I think v5 should be the default. If we're going to change the default, then the v5.0.0 release of the module is the time to do it. I think it's OK to break things in a big version leap like that, and HTML5 is what folks want in this day & age.

@castaway

Makes sense to me

@castaway

Btw, in case it wasn't clear (the labelling is somewhat misleading; I borrowed it from DBIx::Class): the technique doesn't hide the modules from PAUSE, only from the metacpan / cpan websites.

@castaway

Looking at writing the tests of course makes me think of more things.. HTML5 has been through several iterations, some of which have added, then later removed, elements (grrr).. What approach should we take?

My initial thought would be:
a) Include all the HTML5 elements we can find, whether later removed or not (as I doubt writers of sites are updating everything as new changes come out)
b) Add another %isObsoleted or similar, which we attempt to keep up to date with the removed ones? (would also include the removed v4 ones etc)

@petdance
Collaborator Author

I guess I'm not seeing it as a problem to have the v4/v5 searchable in metacpan et al. I've removed them: 9d62e33

@petdance
Collaborator Author

Do you have examples of things that were added to HTML5 and then removed? I didn't think there were any.

@castaway

Do you have examples of things that were added to HTML5 and then removed? I didn't think there were any.

So far I've found "keygen" and "menuitem", from the Mozilla list of obsolete elements here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element (bottom of page)
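The idea of keeping every element in the main set and tracking removals in a separate hash could look roughly like this. It is a sketch only: `%isKnownElement`, `%isObsolete` and `tag_status` are hypothetical names, and the entries are a small illustrative sample rather than a complete list.

```perl
use strict;
use warnings;

# Illustrative entries only. 'keygen' and 'menuitem' were added during HTML5's
# life and later dropped; 'font' and 'center' are HTML 4 tags obsolete in HTML5.
my %isKnownElement = map { $_ => 1 }
    qw( div table video keygen menuitem font center );
my %isObsolete = map { $_ => 1 } qw( keygen menuitem font center );

# Classify a tag name (case-insensitively) as 'current', 'obsolete' or 'unknown'.
sub tag_status {
    my ($tag) = @_;
    $tag = lc $tag;
    return 'unknown'  unless $isKnownElement{$tag};
    return 'obsolete' if $isObsolete{$tag};
    return 'current';
}

print "$_: ", tag_status($_), "\n" for qw( video KEYGEN blorp );
```

With this layout a parser can still recognise a removed tag on an old page, while a stricter caller can consult the second hash and reject it.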

@castaway

castaway commented Dec 3, 2019

Anything else we need to be doing?

@petdance
Collaborator Author

petdance commented Dec 3, 2019

I was under the impression you were still working on things. Did I misunderstand?

Right now I've got a v5 branch from what you've submitted so far, and I thought you were going to be adding more to it.

@castaway

castaway commented Dec 3, 2019

Oh, oops, miscommunication then. I was done after I submitted the tests update.

@petdance
Collaborator Author

petdance commented Jan 3, 2020

@castaway have you used this code anywhere? Have you used it with existing code that needs HTML::Tagset and tried it with the new 5.0.0?

@petdance
Collaborator Author

petdance commented Jan 3, 2020

I'm making the minimum version for HTML::Tagset be Perl 5.10.1, which came out in 2009.

@petdance
Collaborator Author

petdance commented Jan 3, 2020

I've cleaned up a bunch of stuff on formatting in the new files. All my changes have been pushed back to github on the v5 branch.

What else do we need to do in order to release it?

@castaway

castaway commented Jan 5, 2020

I did run it yes, I can retry the current content

@petdance
Collaborator Author

petdance commented Jan 5, 2020

There are some places that we need more documentation. I've marked them with XXX in the code and Changes. Can you please fill in some text and examples that will be meaningful to the reader?

@castaway

Added docs: https://github.com/castaway/html-tagset-1/tree/v5

I also ran brewbuild --revdeps (tests reverse dependencies) - and got several which have tests containing html4 style html, so I'm about to write them patches and link back here.

@PhilterPaper

PhilterPaper commented Feb 27, 2024

I also forked HTML::Tagset to Github (PhilterPaper/HTML-Tagset) and have made a bunch of changes to clean up the code, add all the HTML 5 tags and their attributes list, and fixed a few bugs. It's still waiting for a consensus on how to deal with different HTML versions (see my issues PhilterPaper/HTML-Tagset/issues/1 and PhilterPaper/HTML-Tagset/issues/2).

I would have no problem with someone merging my changes into some "official" repository and then into the CPAN release. I have my hands full with PDF::Builder, which uses HTML::TreeBuilder, which uses HTML::Tagset; I could take on managing HTML::Tagset if no one else wants the job, but would rather that someone else do it. Somebody, please?

@petdance
Collaborator Author

petdance commented Mar 1, 2024

I would be glad to fold in any changes that add v5 support without breaking existing behavior. Does your fork do that?

@PhilterPaper

PhilterPaper commented Mar 2, 2024

I added all the tags and attributes I could find up through HTML 5, and very limited testing suggests nothing broke. One thing needed is some thorough testing. I also added to the "phrase" (inline) tags list and added the "block" tags list, but it's not clear exactly what the criteria are for inclusion in either list (discussion needed!). The POD should probably list everything available.

My changes don't address v4 versus XHTML versus v5 (see my issues 1 and 2), which might break existing usage. That needs to be settled, not necessarily in the next release. New methods that take the HTML version (and whether to discard removed or deprecated tags) as input may be the answer. We should consolidate discussion of all these issues in one Github repository. If you want to pull my changes over to your repository, and conduct discussions there, I would be willing to erase my repository at Github as redundant.

At the very minimum <ins> and <del> tags need to be added immediately. I think you (Andy at petdance) own the CPAN entry, so you should be able to issue a new release.

@petdance
Collaborator Author

petdance commented Mar 6, 2024

At the very minimum <ins> and <del> tags need to be added immediately

I don't know what you're asking for there. Please make a separate ticket for it if something is wrong with the HTML 4 tags that can be updated, separate from any potential HTML 5 overhaul.

@PhilterPaper

PhilterPaper commented Mar 6, 2024

https://rt.cpan.org/Public/Bug/Display.html?id=151970

Yes, these are missing HTML 4 tags.

Also (should all be HTML 4):

  • 'ol' in %boolean_attr add 'reversed' attribute. Note that (per suggested TODO) I updated the entire list to be formatted for consistency in the manner of 'input'... no idea who uses this list or if that will break someone's code.
  • Added 'svg' to %isPhraseMarkup, in addition to 'ins', 'del', 'bdi', 'button', 'mark', 'meter', 'progress', 'iframe', 'object'; I think the other new ones are HTML 5.
  • Added 'basefont' and 'noscript' to %isHeadElement, and removed 'bgsound'.
  • Added 'fieldset', 'legend', 'datalist', 'output', 'keygen' to %isFormElement.
  • Corrected POD %isBodyMarkup s/b %isBodyElement.

Some stuff in the lists are deprecated/removed after HTML 3 and should be taken care of with "HTML version" control.

@PhilterPaper

I see that you have released 3.22, with some of my requested changes (ins and del, POD typo fixed). Thank you -- PDF::Builder now runs correctly when it encounters <ins> and <del> tags. I now only have to remind users to check if their HTML::Tagset has been updated to at least 3.22.

I guess that besides the HTML 5 handling, there are a number of HTML 3 (only ?) tags such as bgsound and plaintext, that we have to decide what to do with. There are also some HTML 4 tags and attributes that you probably want to default to, and clean up %isPhraseMarkup and %isHeadElement. I see also that Strawberry Perl now installs HTML::Tagset in /Strawberry/perl/site instead of /Strawberry/perl/vendor (if that was intentional -- it didn't seem to break anything).

Throughout the life of HTML, there have been a number of tags (elements) added, and some removed. For a given HTML version, we would need a switch to specify "everything that has accumulated to this point" versus "just the official, supported items" (and maybe separate deprecated and removed switches for those). In order not to break compatibility with existing applications using HTML::Tagset, I think you should add accessor methods that return about the same thing (hash) as the raw variable access, but permit input switches to detail exactly what level of HTML to provide. HTML 4.01 seems to be approximately what the current product is, so that would make a good default. So, in addition to %isPhraseMarkup, which would return a 4.01 level HTML list, there would be an isPhraseElement($HTML_level, %opts) method (function) that by default returns HTML 5 full (hash), and permits a user to detail exactly what level they want. I don't know if you could call the method isPhraseMarkup(), or if that would collide with the variable %isPhraseMarkup. It might be nice to have some consistency in names: some are Markup and some are Element.
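One possible shape for such an accessor, as a sketch only: `phrase_elements`, the `include_removed` option, the `%PHRASE_INFO` table and the use of plain numbers for HTML levels are all assumptions made for illustration, and the four entries are sample data, not a real tag list.

```perl
use strict;
use warnings;

# Illustrative sample data only, not a complete or authoritative list.
# 'since' is the HTML level that introduced the tag; 'removed' is the
# level that dropped it, if any.
my %PHRASE_INFO = (
    b    => { since => 3.2 },
    tt   => { since => 3.2, removed => 5 },
    ins  => { since => 4 },
    mark => { since => 5 },
);

# Hypothetical accessor: returns a hash reference of tag => 1 for the
# requested HTML level (default 5). With include_removed (on by default),
# tags dropped at or before that level are still returned, i.e.
# "everything that has accumulated to this point".
sub phrase_elements {
    my ($level, %opts) = @_;
    $level = 5 unless defined $level;
    my $include_removed
        = exists $opts{include_removed} ? $opts{include_removed} : 1;

    my %set;
    for my $tag (keys %PHRASE_INFO) {
        my $info = $PHRASE_INFO{$tag};
        next if $info->{since} > $level;
        next if !$include_removed
            && defined $info->{removed}
            && $info->{removed} <= $level;
        $set{$tag} = 1;
    }
    return \%set;
}

my $v4     = phrase_elements(4);
my $strict = phrase_elements( 5, include_removed => 0 );
print join( ' ', sort keys %$v4 ),     "\n";
print join( ' ', sort keys %$strict ), "\n";
```

On the naming question: in Perl a sub and a hash of the same name occupy different slots in the symbol table, so a function could share the name of the existing variable without clashing, although a distinct name may be less confusing for readers.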

I'm not sure what to do about the <svg> tag -- it has its own complex set of tags, (children of <svg>) that might call for an SVG::Tagset package all by itself.

@petdance
Collaborator Author

there are a number of HTML 3 (only ?) tags such as bgsound and plaintext,

Ooops, I missed those. Sorry.

Please make an RT ticket that includes any changes that need to make HTML::Tagset correct for HTML 4, and I will update for it. Let's keep that separate from anything to be done to be able to handle HTML5.

Also, please comment on https://rt.cpan.org/Public/Bug/Display.html?id=74627 if you want. I don't see any reason to NOT include it.

@castaway

castaway commented Mar 11, 2024

hey both, reading all the notes and wondering if this chatter has gotten a bit off topic. where are we on the actual PR i submitted to handle HTML5 explicitly, while not confusing folks expecting 4?

Looks like i volunteered to write test patches for a bunch of other dists, that was daft of me..

@PhilterPaper

Looks like Andy closed it years ago, with the request that you migrate it to the v5 branch. Is having a separate v5 branch a good idea at this point? It seems to be inactive. Ultimately there should be a single product release, with a way to switch among desired HTML levels.

@petdance
Collaborator Author

First, thank you for your work with getting HTML::Tagset to handle HTML5. That said...

where are we on the actual PR i submitted to handle HTML5 explicitly, while not confusing folks expecting 4?

We are at zero right now. I ask for your patience.

I have handed off HTML::Tagset to the libwww-perl group for stewardship. (Yes, I'm part of the group, but for this discussion pretend I'm not) A big part of that is that I think other folks besides only me should steer the future of HTML::Tagset.

I don't know what libwww-perl will want to do as far as HTML5. Three options immediately come to mind:

  • Abandon the idea of back-compatibility with existing HTML::Tagset users, because it's been 16 years and HTML4 is dead, and so just overhaul it for HTML5 tags.
  • Come up with a compatibility system (perhaps based on the work @castaway has done) to handle both.
  • Start a new HTML::Tagset like HTML::Tagset5 or something.
  • There are undoubtedly others as well.

I'm not the best person to steer this, and this past week or two has made this clear. I'm not doing any work any more that relies on HTML::Tagset. Clearly, other people will have more informed opinions. HTML::Tagset is really part of an ecosystem, and so I'm very happy that it's moved under the libwww-perl umbrella.

@petdance
Collaborator Author

Looks like Andy closed it years ago, with the request that you migrate it to the v5 branch. Is having a separate v5 branch a good idea at this point?

Yes, I started that branch and no, it hasn't had anything done on it in years.

Ultimately there should be a single product release, with a way to switch among desired HTML levels.

As I said up above, that's an option, and the one that I initially was hoping for, but at this point I don't know that it's the best, and am leaving it to others to steer.

@PhilterPaper

...and am leaving it to others to steer.

Hmm. If libwww-perl is now running the show, whose hand is on the steering wheel? It's got to be someone. I don't think you need to get permission to get v4 up to date with a full set of tags, cleaned-up formatting, etc., but we should reach a consensus on how to handle v5 (as well as 3.2 and XHTML) before adding in v5 tags/attributes and doing anything else in that area.

I think there's general agreement that whatever is done, it should not break existing applications using HTML::Tagset (especially HTML::TreeBuilder, but there are others). Beyond that, what sort of compatibility should be maintained? Should v4 continue to be the default tag set, as everyone uses v5 now? What is the best way to introduce HTML level switches, and possibly [switches for] removal of deprecated and withdrawn tags? There are a number of tags which are perfectly functional, but whose function should better be done with CSS (e.g., <tt>).

@petdance
Collaborator Author

If libwww-perl is now running the show, whose hand is on the steering wheel? It's got to be someone.

Nobody right now. I just handed it over 12 hours ago. Patience, please.

I think there's general agreement that ...

I don't think there's any general agreement of anything yet.

@oalders
Member

oalders commented Mar 11, 2024

On behalf of the libwww-perl org, I'm happy to release new versions etc, but since I don't really use this module directly, my main concern is not having it break anything that depends on it. I'd like to defer to people who are familiar with the internals, but I'm happy to help keep things moving along.

@castaway

Looks like Andy closed it years ago, with the request that you migrate it to the v5 branch. Is having a separate v5 branch a good idea at this point? It seems to be inactive. Ultimately there should be a single product release, with a way to switch among desired HTML levels.

Oops: I meant my copy of the v5 one (there's a link around here somewhere..) ah here: https://github.com/castaway/html-tagset-1/tree/v5

@castaway

(currently re-running brewbuild to see where I got to re rev deps, man installing that was fun)

Personally: My interest is in parsing HTML, generally from whole pages from active websites, so they are 99% likely to be v5. I also use TreeBuilder (which I sent a test fix for, see above.. untouched so we may need to prod em /release one), TreeBuilder uses HTML::Parser and TagSet. For a whole page TreeBuilder could catch the Parser event "declaration", figure out which html version is being stated, load the correct part of TagSet, and bob's yer uncle. ( Patch for TreeBuilder goes here? https://metacpan.org/module/HTML::TreeBuilder/source#L1371 )
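The "figure out which html version is being stated" step could be sketched as a small function over the raw declaration text. This is an assumption-laden illustration: `html_level_from_doctype` is a hypothetical helper, the returned labels ('5', '4.01', 'xhtml1.0' and so on) are made up for the example, and how TreeBuilder would pass the declaration in is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: map a doctype declaration string to an HTML level
# label. Returns undef (in scalar context) for anything it does not recognise.
sub html_level_from_doctype {
    my ($decl) = @_;
    return unless defined $decl;
    return unless $decl =~ /\A\s*<!DOCTYPE\s+html\b/i;

    # No public identifier: treat it as the short HTML5 doctype.
    return '5' unless $decl =~ /\bPUBLIC\b/i;

    # Otherwise read the version out of the public identifier.
    return "xhtml$1" if $decl =~ m{//DTD XHTML (\d+(?:\.\d+)?)}i;
    return $1        if $decl =~ m{//DTD HTML (\d+(?:\.\d+)?)}i;
    return;
}

my @samples = (
    '<!DOCTYPE html>',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
);
for my $decl (@samples) {
    my $level = html_level_from_doctype($decl);
    print defined $level ? $level : 'unknown', "\n";
}
```

A caller would still need a fallback for the undef case, which is where the question of the default level comes back in.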

Somewhat more tricky is when we parse chunks of html without declarations, my suggestion would be: v5 has existed 13+ years, default to HTML5, "ignoring" tags not in v4 (like we do now for v5 tags), document well in Parser and co how to enforce use of v4 if required.

FWIW that's what the above linked v5 branch of mine does, so imo "rebase that on current main branch, retest, call done". :)

@castaway

Mmmm, only 25 reverse dep test fails.. (out of 30)

@petdance
Collaborator Author

For posterity, here are the original tickets requesting HTML5 support.

https://rt.cpan.org/Ticket/Display.html?id=67299

https://rt.cpan.org/Ticket/Display.html?id=63059

@PhilterPaper

Items for discussion:

  1. We should avoid breaking any existing application using HTML::Tagset. We need to keep the current set of variables around permanently, although their content may change a bit (see number 2 and 5.).
  2. tag lists (and attribute lists) need to be brought fully up to date for a given HTML level. Note that online lists of both tend to vary a bit in what they contain, so search multiple listings!
  3. How to offer different levels of HTML: 3.2, 4.01, XHTML 1.0, 5, etc.:
    1. inserting a 'V3', 'V4', 'X', 'V5' in the name sequence,
    2. setting some sort of global version number that the variables can then use (possible?),
    3. using methods with the HTML level as the parameter,
    4. something else?
  4. How to offer "variations" within an HTML level, such as including/excluding withdrawn tags, deprecated tags, niche tags (Netscape Navigator only, Internet Explorer only, etc.). Some browsers include every tag ever released, and others don't. Some tags are deprecated or withdrawn because they were truly awful (e.g., <blink>, <bgsound>), while others because there are "better" ways to do it (e.g., <tt>).
  5. What should the default level be? V4, as the current HTML::Tagset, or V5, since practically everyone supports that now? It's possible that someone may have an application where they want a specific set of tags to be allowed, and not later ones.

PhilterPaper/HTML-Tagset should be pretty up to date, as far as v4 goes, but will need to accommodate different levels (v5 are commented out).

Finally, we need to address having a consistent architecture of the lists regarding permissible/required children/parents and ensuring that a tag or attribute gets listed only once in a given list. For example, <tr> appears where?

@castaway

Hi Phil,

I feel like we're talking past each other a bit, so I've created a draft PR of the current work state, as last seen in 2022 or so. I hadn't realised the local v5 branch had been removed, which makes it difficult to refer to!

See #12

Items for discussion:

  1. We should avoid breaking any existing application using HTML::Tagset. We need to keep the current set of variables around permanently, although their content may change a bit (see number 2 and 5.).

Agreed. Mostly the plan here was to a) patch any "uses Tagset" CPAN modules to not use non-v5 tags in their tests. b) release this as a shiny new (version 5.0.0) module, and document breaking changes. (see POD)

  2. tag lists (and attribute lists) need to be brought fully up to date for a given HTML level. Note that online lists of both tend to vary a bit in what they contain, so search multiple listings!

W3C / WHATWG lists should be used in my opinion.

There's been a lot of change over the lifetime of v5 already. Do we put in all the tags etc. that existed over its lifetime? (probably easier than making users try subsets ad infinitum)

  3. How to offer different levels of HTML: 3.2, 4.01, XHTML 1.0, 5, etc.:
    1. inserting a 'V3', 'V4', 'X', 'V5' in the name sequence,
    2. setting some sort of global version number that the variables can then use (possible?),
    3. using methods with the HTML level as the parameter,
    4. something else?

We picked a way to do this, see PR.

  4. How to offer "variations" within an HTML level, such as including/excluding withdrawn tags, deprecated tags, niche tags (Netscape Navigator only, Internet Explorer only, etc.). Some browsers include every tag ever released, and others don't. Some tags are deprecated or withdrawn because they were truly awful (e.g., <blink>, <bgsound>), while others because there are "better" ways to do it (e.g., <tt>).

Belongs in a separate issue, I think. Aka nice to have, but not directly related to "make v5 work in general". Personally I don't need this, I guess it depends on your use of TreeBuilder et al. I want to parse existing pages, so for me "all ever existing v5 tags in the v5 set" will do.

We can ponder how to make subsets of v5 work as well as "the whole thing"

  5. What should the default level be? V4, as the current HTML::Tagset, or V5, since practically everyone supports that now? It's possible that someone may have an application where they want a specific set of tags to be allowed, and not later ones.

Twas decided to be v5, any strong arguments for it not being?

Finally, we need to address having a consistent architecture of the lists regarding permissible/required children/parents and ensuring that a tag or attribute gets listed only once in a given list. For example, <tr> appears where?

In another issue? This also doesn't feel like a direct dependency of "make v5 work".

@castaway

Mmmm, only 25 reverse dep test fails.. (out of 30)

NB most of these are "uses HTML::Tree, which fails"

@PhilterPaper

Maybe I missed something going by, but I was not aware that the structure had been definitely decided upon. Once Andy either says, "this is the way we'll do it" or accepts your PR, I'll believe that it's been settled once and for all. Personally, it doesn't matter all that much how it's done, so long as HTML::TreeBuilder works properly. If it were up to me, I would probably go with methods (in addition to the existing variables), but whatever works...

As far as being side issues to "make v5 work", I feel that it's all part of the whole architecture and needs to be addressed holistically. Splitting them out to separate issues raises the possibility of their getting lost and not addressed.

Some tags aren't all that supported by W3C documentation, for instance <blink> and <bgsound>, which are barely mentioned. That's why I suggested that many sources be consulted for a complete list of tags and attributes.

@petdance
Collaborator Author

I think it's too early to talk about code. I want to get some high-level design and interface ironed out first. I've started issue #13 to discuss it.

@petdance
Collaborator Author

Given the lack of any actual use case where someone needs to be able to handle both HTML4 and HTML5 using the same module, I don't see any reason to add HTML5 (or XHTML or HTML3) functionality to a 25-year-old module.

If there is such a case, I'm glad to hear it and we can discuss strategy from that point of view.
