Improve performance of FileSystemFontProvider.scanFonts() #189

bogdiuk · 2024-04-10T01:08:52Z

On one machine with 792 fonts (~800MB) scanning takes around 5.5s, after these changes time is down to 0.7s (without checksumming), or 2.2s (with checksums).

In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:

in this mode, whole file is not read to byte[], parser uses RandomAccessRead directly, because most of the file is skipped
only read 5 tables needed for FSFontInfo (name, head, OS/2, CFF , gcid)
table parsers finish as soon as they have all needed data
skip checksumming because it is now faster to simply re-parse the file (gated with pdfbox.fontcache.skipchecksums system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7s

Additional fixes:

fixed a memory leak: RandomAccessRead passed to TrueTypeCollection constructor was never freed.
~~NamingTable: use sorted list instead of multilevel HashMap, delay-load Strings (for non-"only headers" mode)~~
~~TTFSubsetter: avoid bytes->string->bytes conversion~~
~~streamline I/O: replace readByte() with read(array)~~
~~consolidate read(buf, offset, len) loops into readNBytes() (allows underflow) and readExact() (throwing)~~

Breaking changes:

if pdfbox.fontcache.skipchecksums property is set, .pdfbox.cache file has dashes instead of checksums
.pdfbox.cache file differs for corrupted fonts (like LastResort.otf that throws IOException: Invalid character code 0xD800):
- original cache lists it as *skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000
- after this change it's listed as LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000 because we simply do not parse it as extensively to recognize that it's corrupted
~~NameRecord.getString() is now package-private and lazy, renamed to getStringLazy()~~
~~new abstract method TTFDataStream.createSubView()~~ (used a default implementation instead)

Only tested with 3.0 branch, all tests pass, resulting file is the same (except for checksums field, of course, and some *skipexception*s).

Should I break it into smaller commits?

lehmi · 2024-04-10T07:03:16Z

Thanks for the contribution, looks promising at least at a first sight.

And yes, please split up the patch im smaller parts so that it is easier to check the changes. We won't add breaking changes to 3.x, those are limited to the trunk. But maybe we find a way to backport at least some of them without breaking the api.

PascalSchumacher · 2024-05-21T16:08:53Z

Hi, any update on this? Looks like a worthwhile enhancement (as far as I can tell).

bogdiuk · 2024-05-28T18:58:10Z

I'm sorry, I didn't have much free time, but refactoring is in progress. I am shrinking changes to only apply to LoadOnlyHeaders case. As for those changes that affect both LoadOnlyHeaders and non-LoadOnlyHeaders parsers - I will push them separately, for easier testing. When done, I will post performance numbers for this subset of changes.

bogdiuk · 2024-06-18T09:24:35Z

Force-pushed the new implementation, based on 3.0 branch (rev 5c4a09a). Also I edited the 1st post to reflect changes

Performance numbers: first number is a re-run (font files are cached by the OS), second is a cold run

Windows (296 fonts, Java 8 x32):

2.9s..4.0s - 3.0 branch
0.8s..1.9s - performance-scanfonts branch
0.3s..0.6s - performance-scanfonts +"pdfbox.fontcache.skipchecksums=true"

macOS (792 fonts, Java 22 arm64):

2.2..3.6s - 3.0 branch
0.4..0.5s - performance-scanfonts branch
0.16s..0.2s performance-scanfonts +"pdfbox.fontcache.skipchecksums=true"

bogdiuk · 2024-06-18T10:18:49Z

Reasoning behind using a separate class LoadOnlyHeaders instead of extracting information directly from TrueTypeFont:

easier to debug - single place to put breakpoints
easier to refactor:
- for example, simple "Find All Usages" can reveal that setSomething() is not used
- the alternative is having boolean isLoadOnlyHeaders in TTFParser, CFFParser and TrueTypeFont: it would be harder to collect usages of all 3
in the future, parsing some tables can be skipped completely, only setting respective fields of LoadOnlyHeaders

THausherr · 2024-06-25T10:24:02Z

Could you please rename "LoadOnlyHeaders" to something that isn't a verb? E.g. "FontHeaders" or whatever.

Logging should be done like it's done in the code, not with "Logger.getLogger".

I'd also like to separate refactorings and bug fixes from the main PR, I'll do some small fixes. Please rebase your PR when you notice it.

THausherr · 2024-06-25T10:58:19Z

Re "fixed a memory leak", this is tricky: your change makes sense for the TrueTypeCollection(File file) constructor only. Not for the one with InputStream. That one (which isn't used by PDFBox itself) also reads the font into memory twice as it is now.

THausherr · 2024-06-25T12:29:04Z

I've done your refactorings of NamingTable. I've kept the same sequence so I hope you don't have too much work. Ideally you could just keep your own code and the diff to the existing would now be much smaller.

THausherr · 2024-06-25T12:54:38Z

Why is readExact() needed under the hood? Assuming that your changes in this PR read a selection of the data that is read in the production code, the error handling on truncated / corrupted files can stay the same. Unless you think that the error handling is poor, then it should be changed everywhere. The reason I ask this is because your read() has a different behavior than other read() methods, and because I'd like to keep the code as small as possible.

bogdiuk · 2024-06-25T14:05:40Z

Thank you, I will fix the issues today.

Re readExact()/readUpTo(): I will remove them for simplicity, they remained from my initial PR.

I initially saw that RandomAccessRead.read() was called in a loop and assumed that it behaves like InputStream.read() (i.e. can return <N bytes and still have more coming), so it needs to be wrapped into readUpTo().
But now I see that all implementations of RandomAccessRead.read() have loops of their own, so they already behave like InputStream.readNBytes() (can only read less than N bytes in case of EOF)? But this is not specified in Javadoc.

For the future refactoring, here are the places where read() is called in a loop (either shrink them to 1 call or add readUpTo() that guarantees ==N bytes || EOF):

RandomAccessReadDataStream.cctor()
COSParser.getStartxrefOffset() (but also throws if (readBytes < 1))
PDFXrefStreamParser.readNextValue()
PDCIDFontType2.getParser()
PDTrueTypeFont.getParser()
PDFontFactory.getFontHeader()
CFFParser::parse

No-loop calls (may be wrong if implementations are not required to behave like InputStream.readNBytes()):

BaseParser.checkForEndOfString()
PdfStreamParser.hasNoFollowingBinData()

THausherr · 2024-06-26T10:05:38Z

Thanks; The memory leak part has been fixed in https://issues.apache.org/jira/browse/PDFBOX-5845 . I've removed one of the constructors because it's no longer needed, please update your PR.

fontbox/src/main/java/org/apache/fontbox/ttf/CFFTable.java

THausherr · 2024-06-27T10:39:05Z

I applied the patch for the first time, I do get differences in the pdfbox.cache file:
ARBONNIE becomes ARBON
ComicSansMS-BoldItalic becomes Com
GeorgiaPro-CondSemiboldItalic becomes Georgi
NotoNaskhArabicUI-Bold becomes NotoNaskhArabicUI-
KodchiangUPCBoldItalic becomes Kodchi

Here's the comicz file:
comicz.zip

However it might also be my fault due to my own changes, so I'll investigate.

bogdiuk · 2024-06-27T10:52:32Z

Yes, I have the same Com, will investigate. Thanks for noticing
I compared caches only on macOS initially

bogdiuk · 2024-06-27T11:20:52Z

Found it: RandomAccessReadUnbufferedDataStream.read() in my initial code was using readExact() and thus returned length. After reverting to read(), I forgot to return the correct read amount

THausherr · 2024-06-27T11:50:04Z

Thanks that works now.

THausherr · 2024-06-27T13:29:16Z

Why the setException() / getException() thing? It feels weird to me because parse() might still throw an exception so why swallow and store it if it happens later? Same weird feeling that an exception for "font.getNaming()" is completely ignored, I don't understand the comment.
(Sorry if you did answer this before, I did reread the conversation, I hope I didn't miss something)

bogdiuk · 2024-06-27T20:54:09Z

Re exception: 2 calls should've been setError(String) to avoid slow fillInStackTrace() in the exception constructor. Third call removed because existing exceptions should be forwarded indeed.

Re "font.getNaming()" - don't remember, looks weird, removed.

Found a potential bug in TrueTypeCollection.processAllFontHeaders() and processAllFonts(): if some font in TTC file is corrupted, an exception is thrown and all remaining fonts in TTC are skipped. Didn't fix because it's debatable (maybe whole TTC is corrupted, there's no need to waste time on parsing it).

Also, simplified addTrueTypeCollection() because all 3 methods it calls do not effectively throw.

THausherr · 2024-06-29T10:57:49Z

Thanks; IOUtils.closeQuietly(parser.parse(new RandomAccessReadBufferedFile(ttfFile))); caught my attention because parser.parse() returns null in "header" mode. So the closeQuietly() isn't needed. And returning sometimes a font and sometimes null kindof contradicts the "do one thing" ideal. How about making a different method for the headers instead, which returns a FontHeader object, and refactor the common code of the two methods into a third one? Same for getFontAtIndex, IMHO this should be split in getFontAtIndex and getFontHeaderAtIndex and the common part refactored into a method accessed by both, this would be more clear.

bogdiuk · 2024-06-30T01:00:31Z

While I'm at it, I think it's better to replace "calling getter just for side-effects"

font.getNaming(); // calls NamingTable.readTable();
font.getHeader(); // calls HeaderTable.readTable();
((OpenTypeFont) font).getCFF(); // calls CFFTable.readTable();

with explicit calls

font.readTableHeaders(*.TAG, outHeaders); // -> TTFTable.readHeaders(..., FontHeaders)

Then TTFParser.loadOnlyHeaders field can be removed.

The changeset will become a bit larger (like, duplicated TrueTypeFont.readTable()), but codepaths will be more clear

…ng arrays [ADD] RandomAccessReadUncachedDataStream that doesn't read input stream to byte[]

…by adding 'only headers' mode to TTF parser: * only read tables needed for FSFontInfo ('name', 'head', 'OS/2', 'CFF ', 'gcid') * 'CFF ' and 'head' table parsers finish as soon as it has all needed headers

bogdiuk · 2024-06-30T01:32:36Z

Or did I over-do it with CFFParser.parseFirstSubFontROS()?

THausherr · 2024-06-30T12:56:53Z

I'm wondering if the onlyHeaders boolean parameter in NamingTable.read() is needed. At a first look, the only thing it does is to decide whether nameRecords is filled. When having a second look, I think it will result in more computing at the end of the method, but not more file reading, isn't it? (And we could more the 3 lines into read(TrueTypeFont ttf, TTFDataStream data) )

bogdiuk · 2024-06-30T13:30:55Z

When onlyHeaders==true, it discards useless entries, making nameRecords smaller and thus readString() is called fewer times in the following loop, reducing both computing and file reading.

So it's slightly more computing in the 1st loop (isUsefulForOnlyHeaders() is just 5 comparisons), but much less computing in the second (seek > allocate+read > detect charset > new String) and fillLookupTable() that traverse nameRecords

Of course, to please branch predictor, we can extract if from the loop, but it looks much more verbose and it should be easy to predict for CPU anyway:

    if (onlyHeaders)
    {
        for (int i=0; i< numberOfNameRecords; i++)
        {
            NameRecord nr = new NameRecord();
            nr.initData(ttf, data);
            if (isUsefulForOnlyHeaders(nr))
            {
                nameRecords.add(nr);
            }
        }
    } else {
        for (int i=0; i< numberOfNameRecords; i++)
        {
            NameRecord nr = new NameRecord();
            nr.initData(ttf, data);
            nameRecords.add(nr);
        }
    }

Which 3 lines do you mean?

bogdiuk · 2024-06-30T13:49:22Z

As for lookupTable, I am going to remove in the the future PR (the change was in my initial Request).
By sorting nameRecords, we can use a simple binary search in getName(), which is fast enough, and doesn't need additional memory, like current 4-level HashMap cache. Especially considering that each name is searched only once

THausherr · 2024-06-30T14:32:40Z

Please ignore all I wrote today for now, I overlooked the readstring part. Don't change anything.

THausherr · 2024-06-30T14:39:27Z

Thank you for your work and your patience. I'm planning to commit all next week, as soon as time permits. Ideally I'll commit to the trunk and all branches. From what I see you didn't break the API.

However there's one more thing I have to investigate, I'm getting sporadic parsing errors with a specific file, when the cache is getting rebuilt. Most likely this isn't because of the changes but I'll check that first.

Update: not because of the changes, problem happens in 3.0.2 release too.

THausherr · 2024-07-01T11:08:12Z

trunk and 3.0 have been committed. 2.0 is still todo. Builds will freeze, this is a problem with the OWASP plugin, search for "owasp" in the top pom.xml, then below "" add "true". The people maintaining this plugin are aware of the problem and I'll update the version as soon as their release is done.

https://issues.apache.org/jira/browse/PDFBOX-5847

update: OWASP update has been released, change has been committed to the pom.xml

THausherr · 2024-07-01T12:37:02Z

I worked on 2.0 for a few hours and I was only able to do 4 files in that time, and not the big ones. In 2.0 we read the fonts into memory using the MemoryTTFDataStream class. RandomAccessRead isn't available in fontbox. So unless we would add double code that inflates the jar file, our advantage wouldn't be as much as in 3.0 and higher, we'd only save some reading from memory. Even if we did have some extra io classes, the advantage wouldn't be as much as in 3.0 because in 2.0 we don't read so many GSUB tables.

Opinions?

bogdiuk · 2024-07-02T04:45:38Z

Thank you. I will look at 2.0 on the weekend or maybe next week

THausherr · 2024-07-05T11:34:16Z

Thank you; here's the work I did, maybe it's useful.
ttf.zip
cff.zip

bogdiuk force-pushed the performance-scanfonts branch from 5ec36b8 to 10d0913 Compare June 18, 2024 09:04

bogdiuk force-pushed the performance-scanfonts branch from 10d0913 to 836a39e Compare June 18, 2024 14:13

bogdiuk force-pushed the performance-scanfonts branch from 836a39e to e390fe1 Compare June 25, 2024 17:30

THausherr reviewed Jun 26, 2024

View reviewed changes

fontbox/src/main/java/org/apache/fontbox/ttf/CFFTable.java Show resolved Hide resolved

THausherr reviewed Jun 26, 2024

View reviewed changes

fontbox/src/main/java/org/apache/fontbox/ttf/CFFTable.java Show resolved Hide resolved

bogdiuk force-pushed the performance-scanfonts branch from e390fe1 to a29a003 Compare June 26, 2024 11:05

bogdiuk force-pushed the performance-scanfonts branch 2 times, most recently from 25f858d to 0a6ae75 Compare June 27, 2024 11:34

bogdiuk force-pushed the performance-scanfonts branch from 0a6ae75 to 4be17cf Compare June 27, 2024 20:54

bogdiuk added 2 commits June 30, 2024 04:15

[ADD] TTFDataStream.createSubView() to create a subview without copyi…

cb29a4a

…ng arrays [ADD] RandomAccessReadUncachedDataStream that doesn't read input stream to byte[]

[PERFORMANCE] Improve FileSystemFontProvider.scanFonts() performance …

54c4eb0

…by adding 'only headers' mode to TTF parser: * only read tables needed for FSFontInfo ('name', 'head', 'OS/2', 'CFF ', 'gcid') * 'CFF ' and 'head' table parsers finish as soon as it has all needed headers

bogdiuk force-pushed the performance-scanfonts branch from 4be17cf to 54c4eb0 Compare June 30, 2024 01:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of FileSystemFontProvider.scanFonts() #189

Improve performance of FileSystemFontProvider.scanFonts() #189

bogdiuk commented Apr 10, 2024 •

edited

Loading

lehmi commented Apr 10, 2024

PascalSchumacher commented May 21, 2024

bogdiuk commented May 28, 2024

bogdiuk commented Jun 18, 2024 •

edited

Loading

bogdiuk commented Jun 18, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

bogdiuk commented Jun 25, 2024 •

edited

Loading

THausherr commented Jun 26, 2024

THausherr commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

THausherr commented Jun 27, 2024

THausherr commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

THausherr commented Jun 29, 2024 •

edited

Loading

bogdiuk commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

THausherr commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

THausherr commented Jun 30, 2024 •

edited

Loading

THausherr commented Jun 30, 2024 •

edited

Loading

THausherr commented Jul 1, 2024 •

edited

Loading

THausherr commented Jul 1, 2024

bogdiuk commented Jul 2, 2024

THausherr commented Jul 5, 2024 •

edited

Loading

Improve performance of FileSystemFontProvider.scanFonts() #189

Are you sure you want to change the base?

Improve performance of FileSystemFontProvider.scanFonts() #189

Conversation

bogdiuk commented Apr 10, 2024 • edited Loading

lehmi commented Apr 10, 2024

PascalSchumacher commented May 21, 2024

bogdiuk commented May 28, 2024

bogdiuk commented Jun 18, 2024 • edited Loading

bogdiuk commented Jun 18, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

THausherr commented Jun 25, 2024

bogdiuk commented Jun 25, 2024 • edited Loading

THausherr commented Jun 26, 2024

THausherr commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

THausherr commented Jun 27, 2024

THausherr commented Jun 27, 2024

bogdiuk commented Jun 27, 2024

THausherr commented Jun 29, 2024 • edited Loading

bogdiuk commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

THausherr commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

bogdiuk commented Jun 30, 2024

THausherr commented Jun 30, 2024 • edited Loading

THausherr commented Jun 30, 2024 • edited Loading

THausherr commented Jul 1, 2024 • edited Loading

THausherr commented Jul 1, 2024

bogdiuk commented Jul 2, 2024

THausherr commented Jul 5, 2024 • edited Loading

bogdiuk commented Apr 10, 2024 •

edited

Loading

bogdiuk commented Jun 18, 2024 •

edited

Loading

bogdiuk commented Jun 25, 2024 •

edited

Loading

THausherr commented Jun 29, 2024 •

edited

Loading

THausherr commented Jun 30, 2024 •

edited

Loading

THausherr commented Jun 30, 2024 •

edited

Loading

THausherr commented Jul 1, 2024 •

edited

Loading

THausherr commented Jul 5, 2024 •

edited

Loading