GetInnerText is behaving different from HTML innerText for tables #174

Open
5 tasks done
timothy3001 opened this issue Jul 29, 2024 · 2 comments

timothy3001 commented Jul 29, 2024

Prerequisites

  • Can you reproduce the problem in a MWE?
  • Are you running the latest version of AngleSharp.Css?
  • Did you check the FAQs to see if that helps you?
  • Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Xml for Xml support)
  • Did you perform a search in the issues?

Description

When using GetInnerText, the returned text is missing the line breaks between table rows.

With the browser's "innerText", the line breaks after each table row are correct.

I also tried to add "<br>" after a "" element, but it is ignored. Everything between "" and "" seems to be ignored.

Thanks a lot for the awesome project!

Steps to Reproduce

Set up a simple AngleSharp example with a configuration like the following:

IConfiguration config = Configuration
    .Default
    .WithCss(new CssParserOptions
    {
        IsToleratingInvalidSelectors = true,
        IsIncludingUnknownDeclarations = true,
        IsIncludingUnknownRules = true,
    })
    .WithRenderDevice(new DefaultRenderDevice
    {
        DeviceHeight = 768,
        DeviceWidth = 1024,
    });

Then parse the following HTML:

<html>
	<head>
	</head>
	<body>
		<h2>Test</h2>
		<table>
			<tbody>
				<tr>
				</tr>
				<tr>
					<td>Titel: </td>
					<td>Herr</td>
				</tr>
				<tr>
					<td>Vorname: </td>
					<td>Horst</td>
				</tr>
				<tr>
					<td>Nachname: </td>
					<td>Hammer</td>
				</tr>
			</tbody>
		</table>
	</body>
</html>
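
For completeness, the two steps above can be combined into one small program. This is only a sketch under assumptions: the `using` directives reflect where these types live in recent AngleSharp and AngleSharp.Css releases as far as I know, and the `Repro` class and `GetBodyInnerTextAsync` method are hypothetical names introduced for illustration.

```csharp
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Css;
using AngleSharp.Css.Parser;
using AngleSharp.Dom;

public static class Repro
{
    // Hypothetical helper: parses the given HTML with the configuration
    // from this issue and returns GetInnerText() of the <body> element.
    public static async Task<string> GetBodyInnerTextAsync(string html)
    {
        IConfiguration config = Configuration.Default
            .WithCss(new CssParserOptions
            {
                IsToleratingInvalidSelectors = true,
                IsIncludingUnknownDeclarations = true,
                IsIncludingUnknownRules = true,
            })
            .WithRenderDevice(new DefaultRenderDevice
            {
                DeviceHeight = 768,
                DeviceWidth = 1024,
            });

        // Load the HTML from a string instead of a URL.
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync(req => req.Content(html));

        // Body can be null for unusual documents; return an empty string then.
        return document.Body?.GetInnerText() ?? string.Empty;
    }
}
```

Calling `await Repro.GetBodyInnerTextAsync(html)` with the HTML above should yield the text shown under "Actual Behavior".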

Expected Behavior

The result of document.body.innerText in the Chrome DevTools console:

Test

Titel:	Herr
Vorname:	Horst
Nachname:	Hammer

Actual Behavior

The result from AngleSharp's GetInnerText:

Test





Titel: Herr Vorname: Horst Nachname: Hammer 

Possible Solution / Known Workarounds

No response

@FlorianRappl
Contributor

Preserving the table structure in the output would definitely be nice. I don't think we currently respect display being set to table.

This could certainly be improved, but I am not sure whether this is (or should be) classified as a bug. IIRC we pretty much follow the spec.
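
Until table display is taken into account, a caller could approximate the browser output by walking the table through the DOM API. The following is a minimal sketch of such a workaround, not how AngleSharp implements GetInnerText: `TableText` and `GetTableText` are hypothetical names, and the choice to skip empty rows and trim each cell is an assumption made to match the Chrome output in this issue.

```csharp
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;

public static class TableText
{
    // Hypothetical helper (not part of AngleSharp): joins the cells of each
    // row with tabs and the rows with newlines.
    public static string GetTableText(IHtmlTableElement table)
    {
        var lines = table.Rows
            // Rows without cells (like the empty <tr> in the example) add no line.
            .Where(row => row.Cells.Length > 0)
            .Select(row => string.Join("\t",
                row.Cells.Select(cell => cell.TextContent.Trim())));

        return string.Join("\n", lines);
    }
}
```

With a parsed document, this could be used as `TableText.GetTableText(document.QuerySelector("table") as IHtmlTableElement)` after checking the cast result for null. Text outside the table would still need to come from GetInnerText or TextContent.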

@timothy3001
Author

Oh ok, since other browsers deal differently with tables, I thought it was out of spec.

But of course, feel free to change this to improvement or feature request or something.
