Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of FileSystemFontProvider.scanFonts() #189

Open
wants to merge 2 commits into
base: 3.0
Choose a base branch
from

Conversation

bogdiuk
Copy link

@bogdiuk bogdiuk commented Apr 10, 2024

On one machine with 792 fonts (~800MB) scanning takes around 5.5s, after these changes time is down to 0.7s (without checksumming), or 2.2s (with checksums).

In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:

  • in this mode, whole file is not read to byte[], parser uses RandomAccessRead directly, because most of the file is skipped
  • only read 5 tables needed for FSFontInfo (name, head, OS/2, CFF , gcid)
  • table parsers finish as soon as they have all needed data
  • skip checksumming because it is now faster to simply re-parse the file (gated with pdfbox.fontcache.skipchecksums system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7s

Additional fixes:

  • fixed a memory leak: RandomAccessRead passed to TrueTypeCollection constructor was never freed.
  • NamingTable: use sorted list instead of multilevel HashMap, delay-load Strings (for non-"only headers" mode)
  • TTFSubsetter: avoid bytes->string->bytes conversion
  • streamline I/O: replace readByte() with read(array)
  • consolidate read(buf, offset, len) loops into readNBytes() (allows underflow) and readExact() (throwing)

Breaking changes:

  • if pdfbox.fontcache.skipchecksums property is set, .pdfbox.cache file has dashes instead of checksums
  • .pdfbox.cache file differs for corrupted fonts (like LastResort.otf that throws IOException: Invalid character code 0xD800):
    • original cache lists it as *skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000
    • after this change it's listed as LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000 because we simply do not parse it as extensively to recognize that it's corrupted
  • NameRecord.getString() is now package-private and lazy, renamed to getStringLazy()
  • new abstract method TTFDataStream.createSubView() (used a default implementation instead)

Only tested with 3.0 branch, all tests pass, resulting file is the same (except for checksums field, of course, and some *skipexception*s).

Should I break it into smaller commits?

@lehmi
Copy link
Contributor

lehmi commented Apr 10, 2024

Thanks for the contribution, looks promising at least at a first sight.

And yes, please split up the patch im smaller parts so that it is easier to check the changes. We won't add breaking changes to 3.x, those are limited to the trunk. But maybe we find a way to backport at least some of them without breaking the api.

@PascalSchumacher
Copy link

Hi, any update on this? Looks like a worthwhile enhancement (as far as I can tell).

@bogdiuk
Copy link
Author

bogdiuk commented May 28, 2024

I'm sorry, I didn't have much free time, but refactoring is in progress. I am shrinking changes to only apply to LoadOnlyHeaders case. As for those changes that affect both LoadOnlyHeaders and non-LoadOnlyHeaders parsers - I will push them separately, for easier testing. When done, I will post performance numbers for this subset of changes.

@bogdiuk
Copy link
Author

bogdiuk commented Jun 18, 2024

Force-pushed the new implementation, based on 3.0 branch (rev 5c4a09a). Also I edited the 1st post to reflect changes

Performance numbers: first number is a re-run (font files are cached by the OS), second is a cold run

Windows (296 fonts, Java 8 x32):

  • 2.9s..4.0s - 3.0 branch
  • 0.8s..1.9s - performance-scanfonts branch
  • 0.3s..0.6s - performance-scanfonts +"pdfbox.fontcache.skipchecksums=true"

macOS (792 fonts, Java 22 arm64):

  • 2.2..3.6s - 3.0 branch
  • 0.4..0.5s - performance-scanfonts branch
  • 0.16s..0.2s performance-scanfonts +"pdfbox.fontcache.skipchecksums=true"

@bogdiuk
Copy link
Author

bogdiuk commented Jun 18, 2024

Reasoning behind using a separate class LoadOnlyHeaders instead of extracting information directly from TrueTypeFont:

  • easier to debug - single place to put breakpoints
  • easier to refactor:
    • for example, simple "Find All Usages" can reveal that setSomething() is not used
    • the alternative is having boolean isLoadOnlyHeaders in TTFParser, CFFParser and TrueTypeFont: it would be harder to collect usages of all 3
  • in the future, parsing some tables can be skipped completely, only setting respective fields of LoadOnlyHeaders

@THausherr
Copy link
Contributor

Could you please rename "LoadOnlyHeaders" to something that isn't a verb? E.g. "FontHeaders" or whatever.

Logging should be done like it's done in the code, not with "Logger.getLogger".

I'd also like to separate refactorings and bug fixes from the main PR, I'll do some small fixes. Please rebase your PR when you notice it.

@THausherr
Copy link
Contributor

Re "fixed a memory leak", this is tricky: your change makes sense for the TrueTypeCollection(File file) constructor only. Not for the one with InputStream. That one (which isn't used by PDFBox itself) also reads the font into memory twice as it is now.

@THausherr
Copy link
Contributor

I've done your refactorings of NamingTable. I've kept the same sequence so I hope you don't have too much work. Ideally you could just keep your own code and the diff to the existing would now be much smaller.

@THausherr
Copy link
Contributor

Why is readExact() needed under the hood? Assuming that your changes in this PR read a selection of the data that is read in the production code, the error handling on truncated / corrupted files can stay the same. Unless you think that the error handling is poor, then it should be changed everywhere. The reason I ask this is because your read() has a different behavior than other read() methods, and because I'd like to keep the code as small as possible.

@bogdiuk
Copy link
Author

bogdiuk commented Jun 25, 2024

Thank you, I will fix the issues today.

Re readExact()/readUpTo(): I will remove them for simplicity, they remained from my initial PR.

I initially saw that RandomAccessRead.read() was called in a loop and assumed that it behaves like InputStream.read() (i.e. can return <N bytes and still have more coming), so it needs to be wrapped into readUpTo().
But now I see that all implementations of RandomAccessRead.read() have loops of their own, so they already behave like InputStream.readNBytes() (can only read less than N bytes in case of EOF)? But this is not specified in Javadoc.

For the future refactoring, here are the places where read() is called in a loop (either shrink them to 1 call or add readUpTo() that guarantees ==N bytes || EOF):

  • RandomAccessReadDataStream.cctor()
  • COSParser.getStartxrefOffset() (but also throws if (readBytes < 1))
  • PDFXrefStreamParser.readNextValue()
  • PDCIDFontType2.getParser()
  • PDTrueTypeFont.getParser()
  • PDFontFactory.getFontHeader()
  • CFFParser::parse

No-loop calls (may be wrong if implementations are not required to behave like InputStream.readNBytes()):

  • BaseParser.checkForEndOfString()
  • PdfStreamParser.hasNoFollowingBinData()

@THausherr
Copy link
Contributor

Thanks; The memory leak part has been fixed in https://issues.apache.org/jira/browse/PDFBOX-5845 . I've removed one of the constructors because it's no longer needed, please update your PR.

@THausherr
Copy link
Contributor

I applied the patch for the first time, I do get differences in the pdfbox.cache file:
ARBONNIE becomes ARBON
ComicSansMS-BoldItalic becomes Com
GeorgiaPro-CondSemiboldItalic becomes Georgi
NotoNaskhArabicUI-Bold becomes NotoNaskhArabicUI-
KodchiangUPCBoldItalic becomes Kodchi

Here's the comicz file:
comicz.zip

However it might also be my fault due to my own changes, so I'll investigate.

@bogdiuk
Copy link
Author

bogdiuk commented Jun 27, 2024

Yes, I have the same Com, will investigate. Thanks for noticing
I compared caches only on macOS initially

@bogdiuk
Copy link
Author

bogdiuk commented Jun 27, 2024

Found it: RandomAccessReadUnbufferedDataStream.read() in my initial code was using readExact() and thus returned length. After reverting to read(), I forgot to return the correct read amount

@bogdiuk bogdiuk force-pushed the performance-scanfonts branch 2 times, most recently from 25f858d to 0a6ae75 Compare June 27, 2024 11:34
@THausherr
Copy link
Contributor

Thanks that works now.

@THausherr
Copy link
Contributor

Why the setException() / getException() thing? It feels weird to me because parse() might still throw an exception so why swallow and store it if it happens later? Same weird feeling that an exception for "font.getNaming()" is completely ignored, I don't understand the comment.
(Sorry if you did answer this before, I did reread the conversation, I hope I didn't miss something)

@bogdiuk
Copy link
Author

bogdiuk commented Jun 27, 2024

Re exception: 2 calls should've been setError(String) to avoid slow fillInStackTrace() in the exception constructor. Third call removed because existing exceptions should be forwarded indeed.

Re "font.getNaming()" - don't remember, looks weird, removed.

Found a potential bug in TrueTypeCollection.processAllFontHeaders() and processAllFonts(): if some font in TTC file is corrupted, an exception is thrown and all remaining fonts in TTC are skipped. Didn't fix because it's debatable (maybe whole TTC is corrupted, there's no need to waste time on parsing it).

Also, simplified addTrueTypeCollection() because all 3 methods it calls do not effectively throw.

@THausherr
Copy link
Contributor

THausherr commented Jun 29, 2024

Thanks; IOUtils.closeQuietly(parser.parse(new RandomAccessReadBufferedFile(ttfFile))); caught my attention because parser.parse() returns null in "header" mode. So the closeQuietly() isn't needed. And returning sometimes a font and sometimes null kindof contradicts the "do one thing" ideal. How about making a different method for the headers instead, which returns a FontHeader object, and refactor the common code of the two methods into a third one? Same for getFontAtIndex, IMHO this should be split in getFontAtIndex and getFontHeaderAtIndex and the common part refactored into a method accessed by both, this would be more clear.

@bogdiuk
Copy link
Author

bogdiuk commented Jun 30, 2024

While I'm at it, I think it's better to replace "calling getter just for side-effects"

font.getNaming(); // calls NamingTable.readTable();
font.getHeader(); // calls HeaderTable.readTable();
((OpenTypeFont) font).getCFF(); // calls CFFTable.readTable();

with explicit calls

font.readTableHeaders(*.TAG, outHeaders); // -> TTFTable.readHeaders(..., FontHeaders)

Then TTFParser.loadOnlyHeaders field can be removed.

The changeset will become a bit larger (like, duplicated TrueTypeFont.readTable()), but codepaths will be more clear

…ng arrays

[ADD] RandomAccessReadUncachedDataStream that doesn't read input stream to byte[]
…by adding 'only headers' mode to TTF parser:

* only read tables needed for FSFontInfo ('name', 'head', 'OS/2', 'CFF ', 'gcid')
* 'CFF ' and 'head' table parsers finish as soon as it has all needed headers
@bogdiuk
Copy link
Author

bogdiuk commented Jun 30, 2024

Or did I over-do it with CFFParser.parseFirstSubFontROS()?

@THausherr
Copy link
Contributor

I'm wondering if the onlyHeaders boolean parameter in NamingTable.read() is needed. At a first look, the only thing it does is to decide whether nameRecords is filled. When having a second look, I think it will result in more computing at the end of the method, but not more file reading, isn't it? (And we could more the 3 lines into read(TrueTypeFont ttf, TTFDataStream data) )

@bogdiuk
Copy link
Author

bogdiuk commented Jun 30, 2024

When onlyHeaders==true, it discards useless entries, making nameRecords smaller and thus readString() is called fewer times in the following loop, reducing both computing and file reading.

So it's slightly more computing in the 1st loop (isUsefulForOnlyHeaders() is just 5 comparisons), but much less computing in the second (seek > allocate+read > detect charset > new String) and fillLookupTable() that traverse nameRecords

Of course, to please branch predictor, we can extract if from the loop, but it looks much more verbose and it should be easy to predict for CPU anyway:

    if (onlyHeaders)
    {
        for (int i=0; i< numberOfNameRecords; i++)
        {
            NameRecord nr = new NameRecord();
            nr.initData(ttf, data);
            if (isUsefulForOnlyHeaders(nr))
            {
                nameRecords.add(nr);
            }
        }
    } else {
        for (int i=0; i< numberOfNameRecords; i++)
        {
            NameRecord nr = new NameRecord();
            nr.initData(ttf, data);
            nameRecords.add(nr);
        }
    }

Which 3 lines do you mean?

@bogdiuk
Copy link
Author

bogdiuk commented Jun 30, 2024

As for lookupTable, I am going to remove in the the future PR (the change was in my initial Request).
By sorting nameRecords, we can use a simple binary search in getName(), which is fast enough, and doesn't need additional memory, like current 4-level HashMap cache. Especially considering that each name is searched only once

@THausherr
Copy link
Contributor

THausherr commented Jun 30, 2024

Please ignore all I wrote today for now, I overlooked the readstring part. Don't change anything.

@THausherr
Copy link
Contributor

THausherr commented Jun 30, 2024

Thank you for your work and your patience. I'm planning to commit all next week, as soon as time permits. Ideally I'll commit to the trunk and all branches. From what I see you didn't break the API.

However there's one more thing I have to investigate, I'm getting sporadic parsing errors with a specific file, when the cache is getting rebuilt. Most likely this isn't because of the changes but I'll check that first.

Update: not because of the changes, problem happens in 3.0.2 release too.

@THausherr
Copy link
Contributor

THausherr commented Jul 1, 2024

trunk and 3.0 have been committed. 2.0 is still todo. Builds will freeze, this is a problem with the OWASP plugin, search for "owasp" in the top pom.xml, then below "" add "true". The people maintaining this plugin are aware of the problem and I'll update the version as soon as their release is done.

https://issues.apache.org/jira/browse/PDFBOX-5847

update: OWASP update has been released, change has been committed to the pom.xml

@THausherr
Copy link
Contributor

I worked on 2.0 for a few hours and I was only able to do 4 files in that time, and not the big ones. In 2.0 we read the fonts into memory using the MemoryTTFDataStream class. RandomAccessRead isn't available in fontbox. So unless we would add double code that inflates the jar file, our advantage wouldn't be as much as in 3.0 and higher, we'd only save some reading from memory. Even if we did have some extra io classes, the advantage wouldn't be as much as in 3.0 because in 2.0 we don't read so many GSUB tables.

Opinions?

@bogdiuk
Copy link
Author

bogdiuk commented Jul 2, 2024

Thank you. I will look at 2.0 on the weekend or maybe next week

@THausherr
Copy link
Contributor

THausherr commented Jul 5, 2024

Thank you; here's the work I did, maybe it's useful.
ttf.zip
cff.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants