Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAK file reading: don't assume UTF-8 encoding #718

Merged

Conversation

jonathanperret
Copy link
Contributor

@jonathanperret jonathanperret commented Sep 30, 2024

Problem

Fixes #716.

Proposed solution

When extracting strings from a DAK (.stp or .pat) format file, we shouldn't assume that these strings are encoded using UTF-8.

In particular, it appears localized color names are not encoded using UTF-8, and trying to decode them as UTF-8 causes pattern loading to fail.

Another place where strings are read from the files is as part of the .stp decryption process. In that case, it seems safer to not try to interpret those byte sequences at all.

This PR changes the strings to Python bytestrings, so we don't need to guess at the encoding that DAK actually uses for color names, and we can be sure the bytes in the encryption keys are unmodified.

How to test

Try to load the file attached in #716.

Summary by CodeRabbit

  • New Features

    • Enhanced data handling by updating string types to bytes for improved compatibility and performance.
  • Bug Fixes

    • Adjusted methods to ensure proper functioning with the new byte data types, enhancing reliability.
  • Refactor

    • Improved code readability and consistency through minor formatting adjustments.

When extracting strings from a DAK (.stp or .pat) format file, we shouldn't assume that
these strings are encoded using UTF-8.
In particular, it appears localized color names are not encoded using UTF-8, and
trying to decode them as UTF-8 causes pattern loading to fail.
Another place where strings are read from the files is as part of the .stp decryption
process. In that case, it seems safer to not try to interpret those byte sequences at all.

This commit changes the strings to Python bytestrings, so we don't need to guess at the
encoding that DAK actually uses for color names, and we can be sure the bytes in the
encryption keys are unmodified.
@jonathanperret jonathanperret self-assigned this Sep 30, 2024
Copy link
Contributor

coderabbitai bot commented Sep 30, 2024

Walkthrough

The changes in the pull request primarily involve updates to the pattern_import.py file, focusing on data type modifications from str to bytes for various methods and class parameters. This includes changes to the getStringAt function, the Color class constructor, and methods in the DAKPatternConverter, PatPatternConverter, and StpPatternConverter classes. Additionally, there are formatting adjustments for improved code readability.

Changes

File Path Change Summary
src/main/python/main/ayab/pattern_import.py Updated return types and parameter types from str to bytes in several methods and class constructors. Adjusted formatting for debug print statements.

Assessment against linked issues

Objective Addressed Explanation
Error loading .stp file (Issue #716)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

sonarcloud bot commented Sep 30, 2024

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
src/main/python/main/ayab/pattern_import.py (3)

54-54: LGTM! Consistent byte string handling in the Color class.

The changes to the Color class constructor and string method are consistent with the PR objective. Using bytes for the name parameter and the !r format specifier ensures proper handling of potentially non-UTF-8 encoded color names.

For improved clarity, consider adding a comment explaining the use of !r for the name attribute:

def string(self) -> str:
    return (
        f"{hex(self.code)}, {str(self.n)}, '{chr(cast(int, self.symbol))}',"
        + f" {self.name!r}, {hex(int.from_bytes(self.rgb, 'big'))}"  # !r ensures proper representation of bytes
    )

This comment would help future maintainers understand the reasoning behind using !r.

Also applies to: 78-78


462-470: LGTM! Consistent byte string handling in key calculation.

The changes in the __calc_key method to use byte strings are consistent with the PR objective. This ensures that all string manipulations are performed on raw bytes, preventing any potential encoding issues.

Consider optimizing the repeated str(...).encode() calls by using byte string literals or f-strings with byte conversion:

keystring = b'%d' % getWordAt(data, 0x3D)
# or
keystring = f'{getWordAt(data, 0x3D)}'.encode()

This approach might be more efficient and cleaner, especially for the repeated conversions of numeric values to byte strings.

Also applies to: 488-497


566-596: Improve f-string usage in error messages.

The use of f-strings for error messages is a good improvement for readability. However, there's a minor issue with the f-string usage. When the entire string is an f-string, you don't need to use str() inside it. Consider updating the error messages as follows:

self.exit(f".cut file misspecified at row {row}", -5)  # FIXME translate
self.exit(f"row {row} is too long", -5)  # FIXME translate

This change makes the code slightly cleaner and more idiomatic.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4dab861 and 6179116.

📒 Files selected for processing (1)
  • src/main/python/main/ayab/pattern_import.py (9 hunks)
🧰 Additional context used
🪛 Ruff
src/main/python/main/ayab/pattern_import.py

386-386: Yoda condition detected

Rewrite as i == 0

(SIM300)

🔇 Additional comments (4)
src/main/python/main/ayab/pattern_import.py (4)

35-37: LGTM! Improved handling of potentially non-UTF-8 encoded data.

The change from returning str to bytes aligns well with the PR objective. By avoiding implicit decoding, this modification prevents potential issues with non-UTF-8 encoded data in DAK files. This is a good step towards more robust file handling.


386-394: LGTM! Consistent use of byte strings for color names.

The change to use b"" instead of an empty string for the name parameter in the Color constructor call is consistent with the PR objective. This ensures that color names are always stored as byte strings, preventing any potential encoding issues.

🧰 Tools
🪛 Ruff

386-386: Yoda condition detected

Rewrite as i == 0

(SIM300)


Line range hint 1-596: Overall, excellent changes addressing the UTF-8 encoding assumption issue.

The modifications throughout this file consistently move towards using byte strings instead of Unicode strings, which effectively addresses the PR objective of not assuming UTF-8 encoding. These changes should resolve the issues with loading patterns that have non-UTF-8 encoded color names.

Key improvements:

  1. getStringAt function now returns bytes, avoiding implicit decoding.
  2. Color class now handles color names as byte strings.
  3. Key calculation in StpPatternConverter uses byte string operations.
  4. Consistent use of byte strings in various methods across different converter classes.

These changes significantly improve the robustness of the DAK file handling, especially for files with localized color names. The code is now better equipped to handle various encodings without making assumptions about UTF-8.

To further improve the code:

  1. Consider adding comments explaining the byte string usage where it might not be immediately obvious.
  2. Look for opportunities to optimize repeated string-to-bytes conversions, especially in the __calc_key method.
  3. Ensure that any code that consumes the color names is updated to handle byte strings appropriately.

Great work on addressing this important issue!


450-450: LGTM! Consistent use of byte strings in __appendKeystring method.

The change in the method signature from str to bytes is consistent with the PR objective of handling non-UTF-8 encoded data. This modification ensures that the key string manipulation is performed on raw bytes rather than decoded strings.

To ensure the implementation is correct, please run the following script to check the __appendKeystring method:

✅ Verification successful

Verification Successful: __appendKeystring Method Implementation Confirmed

The __appendKeystring method is correctly implemented to handle byte strings, ensuring that key string manipulation is performed on raw bytes as intended.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of __appendKeystring method

# Test: Search for the __appendKeystring method implementation
ast-grep --lang python --pattern $'def __appendKeystring(next_string: bytes, max_size: int) -> bytes:
  $$$
'

Length of output: 338

@mathPi
Copy link
Contributor

mathPi commented Oct 3, 2024

I tested using one of my files and running a simulation and it is working.

@dl1com dl1com merged commit 205d651 into AllYarnsAreBeautiful:main Oct 3, 2024
2 checks passed
@jonathanperret jonathanperret deleted the fix-stp-string-decoding branch October 3, 2024 19:43
@dl1com
Copy link
Contributor

dl1com commented Oct 3, 2024

I tested using one of my files and running a simulation and it is working.

Thanks for your feedback.
Plus the additional feedback in Discord, this looks good for now and ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Error when loading .stp files with non-ASCII colors
3 participants