get_encoding_from_headers fails if charset name not specified #6646

Open
batterseapower opened this issue Feb 22, 2024 · 2 comments

@batterseapower

requests.utils.get_encoding_from_headers assumes that the charset parameter always specifies a name. In very rare cases a server can send a malformed content-type header which does not specify a name. In these cases, requests should probably just treat it as if no charset had been specified.

Expected Result

requests.utils.get_encoding_from_headers({'content-type': 'text/html; charset'}) == 'ISO-8859-1'

Actual Result

File ~/opt/anaconda3/2023.03/envs/mamba/envs/py3/lib/python3.9/site-packages/requests/utils.py:553, in get_encoding_from_headers(headers)
    550 content_type, params = _parse_content_type_header(content_type)
    552 if "charset" in params:
--> 553     return params["charset"].strip("'\"")
    555 if "text" in content_type:
    556     return "ISO-8859-1"

AttributeError: 'bool' object has no attribute 'strip'
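
For illustration, here is a minimal standalone sketch of the failure mode and one possible guard. It does not import requests: `parse_params` and `encoding_from_headers` are hypothetical stand-ins written only from what the traceback shows (a bare parameter ends up as a bool, and text types fall back to ISO-8859-1), so treat it as an assumption about the shape of a fix rather than the library's actual code.

```python
def parse_params(content_type_header):
    """Simplified stand-in for the header parsing; a bare parameter maps to True."""
    tokens = content_type_header.split(";")
    content_type = tokens[0].strip()
    params = {}
    for raw in tokens[1:]:
        raw = raw.strip()
        if not raw:
            continue
        key, sep, value = raw.partition("=")
        # No "=" present: mirror the behaviour seen in the traceback and store a bool.
        params[key.strip().lower()] = value.strip("'\" ") if sep else True
    return content_type, params


def encoding_from_headers(headers):
    """Hypothetical guarded lookup: ignore a charset that is not a non-empty string."""
    content_type = headers.get("content-type")
    if not content_type:
        return None
    content_type, params = parse_params(content_type)
    charset = params.get("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    # Only the fallback branch visible in the traceback is reproduced here.
    if "text" in content_type:
        return "ISO-8859-1"
    return None


print(encoding_from_headers({"content-type": "text/html; charset"}))
```

With the `isinstance` check in place, a valueless `charset` is treated the same as a missing one, which matches the expected result above. Treating an empty value (`charset=`) the same way is an extra choice made in this sketch.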

System Information

{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.4"
  },
  "cryptography": {
    "version": "41.0.3"
  },
  "idna": {
    "version": "3.4"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.15"
  },
  "platform": {
    "release": "5.14.0-284.11.1.el9_2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010116f",
    "version": "23.2.0"
  },
  "requests": {
    "version": "2.31.0"
  },
  "system_ssl": {
    "version": "1010117f"
  },
  "urllib3": {
    "version": "1.26.18"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}
alain-khalil pushed a commit to alain-khalil/requests that referenced this issue Mar 8, 2024
@alain-khalil

Hello @batterseapower
I just pushed a PR to fix this issue. It is my first PR in this project. Let's wait for a project maintainer to validate my fix.

Best Regards

@x11x

x11x commented Sep 1, 2024

I wonder if _parse_content_type_header should be changed so that it ignores parameters with no equals sign after them, or sets them to the empty string or None. Setting parameter values to a bool is clearly wrong.
src/requests/utils.py#L533
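
A rough sketch of what such a parser could look like, assuming the same split-on-";" approach. The function name, the `keep_bare_params` flag, and the set of stripped characters are hypothetical choices for illustration and are not taken from the requests source.

```python
def parse_content_type_header(header, keep_bare_params=False):
    """Split a Content-Type value into (content_type, params).

    Hypothetical variant: a parameter without "=" is dropped by default,
    or stored as an empty string when keep_bare_params is True.
    """
    tokens = header.split(";")
    content_type = tokens[0].strip()
    params = {}
    strip_chars = "\"' "
    for token in tokens[1:]:
        token = token.strip()
        if not token:
            continue
        key, sep, value = token.partition("=")
        if not sep and not keep_bare_params:
            continue  # bare parameter such as "charset": ignore it
        # When there is no "=", partition() leaves value as "", never a bool.
        params[key.strip(strip_chars).lower()] = value.strip(strip_chars)
    return content_type, params


print(parse_content_type_header("application/json; charset"))
print(parse_content_type_header("application/json; charset", keep_bare_params=True))
```

Either way every parameter value is a `str`, so callers can use `.strip()` on it without checking the type first.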

I have checked RFC 2045, RFC 2616, RFC 7231, RFC 9110 and they all define a parameter as essentially parameter = parameter-name "=" parameter-value, so a parameter with no equals character is technically invalid (I think?).

Comparing what some other implementations do:
mimeparse (their implementation is taken directly from deprecated/removed built-in cgi module, so should match what built-in cgi module used to do):

>>> from mimeparse import parse_mime_type
>>> parse_mime_type("application/json; charset")
('application', 'json', {})

stdlib email.policy.EmailPolicy (tested using code from this SO answer):

>>> def parse_content_type(content_type):
...     from email.policy import EmailPolicy
...     header = EmailPolicy.header_factory('content-type', content_type)
...     return (header.content_type, dict(header.params))
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})

stdlib email.message.Message (tested using code from this SO answer):

>>> from email.message import Message
>>>
>>> _CONTENT_TYPE = "content-type"
>>>
>>> def parse_content_type(content_type: str) -> tuple[str, dict[str,str]]:
...     email = Message()
...     email[_CONTENT_TYPE] = content_type
...     params = email.get_params()
...     # The first param is the mime-type the later ones are the attributes like "charset"
...     return params[0][0], dict(params[1:])
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})

(Also, checking those implementations, you can see that they are more correct about quoted strings -- matching quotes/unquoting -- but requests' simpler version of just splitting on ";" and stripping any quote characters has been around for a long time and apparently not caused problems, so...)
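
To make the quoted-string difference concrete, here is a small comparison sketch. `naive_params` is a hypothetical simplification of the split-and-strip approach, not the requests code, and the header value is a made-up example with a ";" inside a quoted string.

```python
from email.message import Message


def naive_params(header):
    """Split on ";" and strip quote characters, ignoring quoting rules."""
    params = {}
    for token in header.split(";")[1:]:
        key, sep, value = token.strip().partition("=")
        if sep:
            params[key.strip().lower()] = value.strip("\"' ")
    return params


def stdlib_params(header):
    """Parse with email.message.Message, which honours quoted strings."""
    msg = Message()
    msg["content-type"] = header
    # The first entry is the MIME type itself; the rest are the parameters.
    return dict(msg.get_params()[1:])


header = 'text/plain; title="a;b"; charset=utf-8'
print(naive_params(header))   # the quoted value is cut at the inner ";"
print(stdlib_params(header))  # the quoted value survives intact
```

For the common `charset=...` case the two approaches agree, which is probably why the simpler version has held up.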
