Text support

gal kahana edited this page May 12, 2023 · 13 revisions

Text support in the PDF writer library tries to let you enjoy the flexibility of the PDF text operators while keeping the simplicity of Unicode strings. To handle non-Unicode scenarios, the library also lets you place glyphs directly by providing their index. The library, obviously, takes care of embedding the fonts used throughout the document creation process.

The main features of the library’s text support are:

  1. Supported Font Types:
    1. TrueType (TTF).
    2. Open Type CFF (OTF), including CID.
    3. Type 1 (PFB/PFM), non-CID only. Multiple master fonts are not supported either.
    4. DFont (Mac resource packages).
    5. TrueType collections (TTC).
  2. All PDF operators for text may be used.
  3. Multi-language, Unicode text (UTF-8 encoded) may be placed using PDF text operators, and the library will take care of encoding.
  4. When Unicode is not sufficient (or ligatures are involved) it is possible to instruct the library to place glyphs using their index.
  5. Text may be copied and pasted to other applications. Those who know PDF know what I’m talking about.

The following paragraphs go through text usage, both simple and advanced, and finally, for those who are interested, some technical details (or in other words: see how much work I had to do and be in awe). Text support is part of page content placement, so before continuing, if you haven’t gone through Adding Content to PDF Pages, now would be a good time.

Simple text usage

The following example will go through the process of placing simple Unicode text.

Text placement is always done in the context of a content stream, either of a page or of a form XObject. In this example we’ll use a page. So let’s begin by creating a page with a page context:

PDFWriter pdfWriter; 
pdfWriter.StartPDF("C:\\SimpleTextUsage.PDF",ePDFVersion13);
PDFPage* page = new PDFPage();
page->SetMediaBox(PDFRectangle(0,0,595,842));
PageContentContext* contentContext = pdfWriter.StartPageContentContext(page);

OK. We created an A4 portrait page and then a page context for it.
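The media box values are in PDF user space units, which by default are points (1/72 of an inch). As a rough sketch of where 595 and 842 come from, here is a small conversion from millimetres. The helper pageSizeFromMillimetres and the PageSizeInPoints struct are made up for this illustration and are not part of the library:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type, not part of the library.
struct PageSizeInPoints {
    double width;
    double height;
};

// Converts a page size in millimetres to whole PDF points.
// 1 point = 1/72 inch and 1 inch = 25.4 mm.
PageSizeInPoints pageSizeFromMillimetres(double widthMm, double heightMm) {
    assert(widthMm > 0 && heightMm > 0);
    const double pointsPerMm = 72.0 / 25.4;
    // Round to whole points, matching the integer media box used above.
    return { std::round(widthMm * pointsPerMm), std::round(heightMm * pointsPerMm) };
}
```

Calling pageSizeFromMillimetres(210, 297) gives the A4 size used in the example, which could then be fed to PDFRectangle(0,0,width,height).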

Now, in order to place text we first need a font to work with. To get one, use the PDFWriter GetFontForFile method. It receives a font file path and creates a font object that can later be used for placing text. Let’s continue our example and create the font object:

PDFUsedFont* font = pdfWriter.GetFontForFile("C:\\BrushScriptStd.otf");

The PDFUsedFont object is a representation of the font. Other than passing it to later methods for placing text, you shouldn’t be using it much. Specifically, don’t try to delete it: the library owns all the PDFUsedFont objects it creates. One last thing: there is no need to create more than one PDFUsedFont object for each font that you use in the document. While not harmful, creating more than one PDFUsedFont object for the same font will just waste some good space in the PDF file.
Note that GetFontForFile has an overload for Type 1 fonts, where it is possible to provide the PFM file in addition to the PFB file.

Now we can write some text. Here goes:

AbstractContentContext::TextOptions textOptions(font,14,AbstractContentContext::eGray,0);
contentContext->WriteText(10,100,"Hello World",textOptions);

First, we construct a TextOptions structure. It defines the font that we want to use, the size of the text, and the color definition (color space and color value).
In this case it’ll be the BrushScriptStd font, at size 14, with a gray color space value of 0, which means black.
Then we write the text using the WriteText method of the context, providing it with X and Y coordinates, then the text to place (encoded in UTF-8), and then the options structure.

That’s pretty much it. For the sake of completeness, let’s also close this PDF:

pdfWriter.EndPageContentContext(contentContext);
pdfWriter.WritePageAndRelease(page);
pdfWriter.EndPDF();

That’s it: we ended the page content context, wrote the page and released it, and ended the PDF. We should now have a PDF file with a single page stating the much overused statement “Hello World”.

Let’s see the whole code segment together:

PDFWriter pdfWriter; 
pdfWriter.StartPDF("C:\\SimpleTextUsage.PDF",ePDFVersion13);
PDFPage* page = new PDFPage();
page->SetMediaBox(PDFRectangle(0,0,595,842));
PageContentContext* contentContext = pdfWriter.StartPageContentContext(page);
PDFUsedFont* font = pdfWriter.GetFontForFile("C:\\BrushScriptStd.otf");
AbstractContentContext::TextOptions textOptions(font,14,AbstractContentContext::eGray,0);
contentContext->WriteText(10,100,"Hello World",textOptions);
pdfWriter.EndPageContentContext(contentContext);
pdfWriter.WritePageAndRelease(page);
pdfWriter.EndPDF();

You can see more sample code using this method here.

Measuring text

Sometimes you will want to know how much width and height a given text would take when placed with certain settings. The library provides a method to get these easily. See the following code:

PDFUsedFont* font = pdfWriter.GetFontForFile("C:\\BrushScriptStd.otf");
PDFUsedFont::TextMeasures textDimensions = font->CalculateTextDimensions("Hello World",14);

The CalculateTextDimensions method calculates the measures of the provided text at the provided font size, using the font represented by the font object.
The textDimensions variable now holds a TextMeasures structure. This structure has the following useful measures:

  • xMin – leftmost position of the text
  • xMax – rightmost position of the text
  • yMin – lowest position of the text
  • yMax – highest position of the text
  • width – text width
  • height – text height

These measurements can now be used to create a frame for the text or to arrange text in a frame, for example.
To see an example of how this can be used go to here.
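As one possible use of these measures, the sketch below computes where to place text so that it sits centred inside a box. MeasuresSketch is a local stand-in that only mirrors the fields listed above; it is not the library’s TextMeasures type, and the helper name centreTextInBox is made up. The sketch also assumes that the measures are relative to the text origin and are in the same units as the page.

```cpp
#include <cassert>

// Local stand-in mirroring the listed measures; not the library's own type.
struct MeasuresSketch {
    double xMin, xMax, yMin, yMax, width, height;
};

struct TextOrigin {
    double x;
    double y;
};

// Returns the text origin that centres the measured text inside a box
// whose lower-left corner is (boxLeft, boxBottom).
TextOrigin centreTextInBox(const MeasuresSketch& m,
                           double boxLeft, double boxBottom,
                           double boxWidth, double boxHeight) {
    assert(boxWidth >= 0 && boxHeight >= 0);
    TextOrigin origin;
    // Subtract xMin/yMin because the extents are offsets from the origin
    // (assumption stated in the lead-in).
    origin.x = boxLeft + (boxWidth - m.width) / 2.0 - m.xMin;
    origin.y = boxBottom + (boxHeight - m.height) / 2.0 - m.yMin;
    return origin;
}
```

The resulting x and y would be the coordinates to pass to WriteText (or to Tm) for the same text, font and size that were measured.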

Writing text with the PDF operators

Most usages will be perfectly satisfied by the simple WriteText function. However, if you want to get deeper into the PDF operators to achieve some special effects, you will want to know more about the relevant operators.
The above simple example can be rewritten using the following code:

PDFWriter pdfWriter; 
pdfWriter.StartPDF("C:\\SimpleTextUsage.PDF",ePDFVersion13);
PDFPage* page = new PDFPage();
page->SetMediaBox(PDFRectangle(0,0,595,842));
PageContentContext* contentContext = pdfWriter.StartPageContentContext(page);
PDFUsedFont* font = pdfWriter.GetFontForFile("C:\\BrushScriptStd.otf");
contentContext->BT();
contentContext->k(0,0,0,1);
contentContext->Tf(font,14);
contentContext->Tm(1,0,0,1,10,100);
contentContext->Tj("Hello World");
contentContext->ET();
pdfWriter.EndPageContentContext(contentContext);
pdfWriter.WritePageAndRelease(page);
pdfWriter.EndPDF();

Note the part between contentContext->BT(); and contentContext->ET();. If you know PDF this code should make a lot of sense (note though that for Tj you are passing a regular Unicode UTF-8 string, not some encoded thing). If you happen to be among the few who don’t know PDF :), then I’ll explain what went on here.

Note the beginning BT and ending ET calls. They mark the boundaries of a text object. In PDF you have to place them. Just because. Don’t ask. The call to k(0,0,0,1) sets the text color to CMYK black. The Tf(font,14) call sets the current font to the font object we created and sizes it to 14 points.
Now comes Tm(1,0,0,1,10,100), which is a very elaborate way of saying: set the text position to the (10,100) coordinate. You can use the rest of the matrix to rotate, scale etc.
Then the one command we all came here for, Tj("Hello World"), shows the text “Hello World” at the designated position.
So in other words, what we did was place the string “Hello World” at (10,100) using the font Brush Script Standard, in black, at the size of 14 points.
Note that the text is provided in UTF-8 encoding. This means that plain Latin (ASCII) characters need no special encoding. Characters of other languages need to be encoded as UTF-8. You can either do it with your own encoding system, or use the library helper class. Read about it here – Unicode and UnicodeString class.
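If you go the “own encoding system” route, the encoding itself is small. The following is a minimal sketch of turning Unicode code points into a UTF-8 std::string of the kind Tj and WriteText expect. The function appendUtf8 is a hypothetical helper written for this page; it is not the library’s UnicodeString class.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends one Unicode code point to a UTF-8 string.
void appendUtf8(std::string& out, unsigned long cp) {
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

A string built this way (for example from the code points of a Hebrew or Japanese word) can be passed to Tj like any other text, provided the font actually has glyphs for those characters.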

Note that for brevity I didn’t place any checking of return values whatsoever. You shouldn’t skip it in real code. To see a complete example of this usage of text go to here
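Going back to the Tm call in the example: the remark about using the rest of the matrix to rotate can be made concrete. The sketch below builds the six values for a counterclockwise rotation around the text origin, following the standard PDF rotation matrix (cos, sin, -sin, cos). The TextMatrixSketch struct and rotatedTextMatrix helper are illustrative names, not library types.

```cpp
#include <cassert>
#include <cmath>

// Holds the six operands of Tm in order; a local illustration only.
struct TextMatrixSketch {
    double a, b, c, d, e, f;
};

// Builds Tm operands that place text at (x, y), rotated counterclockwise
// by the given angle in degrees.
TextMatrixSketch rotatedTextMatrix(double degrees, double x, double y) {
    assert(std::isfinite(degrees));
    const double pi = std::acos(-1.0);
    const double radians = degrees * pi / 180.0;
    const double cosine = std::cos(radians);
    const double sine = std::sin(radians);
    // PDF rotation matrix: [cos sin -sin cos], followed by the translation.
    return { cosine, sine, -sine, cosine, x, y };
}
```

With m = rotatedTextMatrix(90, 50, 100), passing m.a through m.f to Tm in that order would make the following text run upwards from (50,100).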

Text placement operators

The page content context, whose common implementation is in AbstractContentContext, implements all the PDF text related operators. In this section I will provide information specifically on the text placement operators, like the Tj operator used in the above example.

1. EStatusCode Tj(const string& inText)
The Tj method is an implementation of the PDF Tj operator. It simply shows the text passed as a parameter. Note though, that the library Tj method receives a Unicode UTF-8 encoded string of text. For those familiar with PDF/PS text encoding: don’t be confused, the method implementation does the encoding from Unicode to glyphs, so you don’t have to take care of that. Just provide the string you want to show, plain and simple.

2. EStatusCode Quote(const string& inText)
The Quote method is an implementation of the PDF ' operator. It moves the text position to the next line and shows the string. Here again the input is Unicode UTF-8 text. If you look into the code you won’t find any line movement there; it simply uses the actual PDF ' operator in the emitted PDF code.

3. EStatusCode DoubleQuote(double inWordSpacing, double inCharacterSpacing, const string& inText)
The DoubleQuote method is an implementation of the PDF " operator. It moves the text position to the next line and shows the string. For word spacing it will use inWordSpacing and for character spacing it will use inCharacterSpacing.

4. EStatusCode TJ(const StringOrDoubleList& inStringsAndSpacing)
The TJ method is an implementation of the PDF TJ operator. It accepts a list of strings interleaved with numbers. The numbers are spacing adjustments between the strings. To use this method create a StringOrDoubleList object, which is actually an STD list of structs of the type StringOrDouble. As its name suggests, StringOrDouble can be either a string or a double. If you initialize it with a double, it will represent that double. If you initialize it with a string, it will represent that string. So just build a list of these and pass it to TJ. Each double will be a spacing adjustment, each string will be the next string to place, just like the TJ operator.
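The numbers in a TJ list are easy to get backwards. In the PDF specification they are expressed in thousandths of a unit of text space and are subtracted from the current position, so a positive number pulls the next string closer. The sketch below converts an adjustment to the actual shift; it assumes horizontal writing and the default 100% horizontal scaling, and tjAdjustmentToShift is a made-up helper name.

```cpp
#include <cassert>

// Converts a TJ adjustment number into the horizontal shift it causes,
// in text space units. Positive adjustments give a negative shift
// (text moves closer), negative adjustments add space.
// Assumes horizontal writing and 100% horizontal scaling.
double tjAdjustmentToShift(double adjustment, double fontSize) {
    assert(fontSize > 0);
    return -(adjustment / 1000.0) * fontSize;
}
```

For example, a double of 250 placed between two strings at font size 12 moves the second string 3 units back towards the first, while -500 at size 10 opens a gap of 5 units.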

Other text related operators

The following section lists all the text methods provided with AbstractContentContext (other than the ones reviewed in the previous section). All of them may therefore be used with either a page context or a form XObject context. For those who are familiar with the PDF text operators, it is simply the same list, only with C++ signatures, so it should all look very familiar. Most of these simply emit the relevant operator to the content stream. Nothing too elaborate. So here goes.

Text state operators

PDF has several operators for defining how text will be positioned/rendered, for those of you who use PDF operators for it and don’t calculate the right positions on your own. AbstractContentContext has implementations of them:

1. void Tc(double inCharacterSpace)
Sets inter-character spacing, using the Tc PDF operator.
2. void Tw(double inWordSpace)
Sets inter-word spacing, using the Tw PDF operator.
3. void Tz(int inHorizontalScaling)
Sets the horizontal scale for text, using the Tz PDF operator.
4. void TL(double inTextLeading)
Sets the text leading, which is the line spacing, for the text operators that move between lines (such as Quote and DoubleQuote), using the TL PDF operator.
5. void Tr(int inRenderingMode)
Sets the text rendering mode. I’m sending you to the PDF specification definition of the Tr operator to figure out what numbers may be provided as valid rendering modes.
6. void Ts(double inFontRise)
Sets the text rise. This is just playing with the baseline position, to allow you to do superscripts or subscripts. Uses the PDF Ts operator.
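To save a trip to the specification for Tr, here is a sketch of the eight rendering mode values as the PDF specification lists them, wrapped in an enum. The enum and helper names are made up for this page; the library method itself just takes the int.

```cpp
#include <cassert>

// Text rendering modes as numbered in the PDF specification.
// The enumerator names are illustrative, not library identifiers.
enum class TextRenderingMode {
    Fill = 0,
    Stroke = 1,
    FillThenStroke = 2,
    Invisible = 3,
    FillAndClip = 4,
    StrokeAndClip = 5,
    FillThenStrokeAndClip = 6,
    ClipOnly = 7
};

// Returns the integer to pass as the Tr operand.
int toTrOperand(TextRenderingMode mode) {
    const int value = static_cast<int>(mode);
    assert(value >= 0 && value <= 7);
    return value;
}

// Modes 4 to 7 also add the glyph outlines to the clipping path.
bool addsToClippingPath(TextRenderingMode mode) {
    return toTrOperand(mode) >= 4;
}
```

A call such as Tr(toTrOperand(TextRenderingMode::Stroke)) would then ask for outlined text, and Invisible is the mode typically used for searchable text laid over a scanned image.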

Text object operators

Two methods are implemented for marking the boundaries of a text object (you can read all about them in the PDF specification). Essentially, text objects are simply an encapsulation of some text showing and text positioning operators.
Between different text objects the position movements are not retained.

You must include text showing and positioning operators only in the context of text objects. So before starting to place any of these call:
void BT()

This method will emit the PDF operator BT which will allow you to start using text (see the above example for simple text placement that uses BT).
After you are done with this text object (either done with writing text in this context, or want to reset the text positioning) just call this method:
void ET()

This method will emit the PDF operator ET marking the end of the text object. To place more text use the BT operator again to start a new text object.

Text positioning operators

The following presents the text positioning operators supported by the library (these are all of them, basically).

1. void Td(double inTx, double inTy)
Moves to the next line: the new line start is placed at an offset of (inTx,inTy) from the start of the current line. Uses the PDF operator Td.

2. void TD(double inTx, double inTy)
Does the same thing as Td but also sets the leading to -inTy. Essentially it’s like calling TL(-inTy) and then Td(inTx,inTy).
Uses the PDF operator TD.

3. void Tm(double inA, double inB, double inC, double inD, double inE, double inF)
Sets the current text matrix (position, rotation, whatnot) to the matrix made of the 6 input parameters. The new matrix replaces the current one rather than being concatenated with it. You’ve seen PS/PDF matrices before, no? If not, head your way down to the PDF specification. I’ll just give you the basics: inE and inF designate the translation vector, so Tm(1,0,0,1,50,100) will just place the text at (50,100). inA and inD can be used for simple scaling (put the x factor in inA and the y factor in inD).
Uses the Tm PDF operator.

4. void TStar()
TStar represents the T* PDF operator (and uses it). It moves the text position to the next line (using the leading provided by TL or TD).
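How Td, TD, TL and T* interact is easier to see in a tiny model. The sketch below tracks only the start of the current line and the leading, following the definitions in the PDF specification (TD is TL followed by Td, and T* is Td with an offset of zero and minus the leading). It assumes a text matrix without scaling or rotation, such as the one set by Tm(1,0,0,1,x,y). All names here are local to the sketch.

```cpp
#include <cassert>

// Local model of the text line start and leading; not a library type.
struct TextLineState {
    double lineX;    // x of the current line start
    double lineY;    // y of the current line start
    double leading;  // value last set by TL or TD
};

void applyTL(TextLineState& s, double leading) { s.leading = leading; }

// Td: offset the line start by (tx, ty).
void applyTd(TextLineState& s, double tx, double ty) {
    s.lineX += tx;
    s.lineY += ty;
}

// TD: set the leading to -ty, then behave like Td.
void applyTD(TextLineState& s, double tx, double ty) {
    applyTL(s, -ty);
    applyTd(s, tx, ty);
}

// T*: move down one line using the current leading.
void applyTStar(TextLineState& s) { applyTd(s, 0.0, -s.leading); }

// Returns the state after moving down the given number of lines with T*.
TextLineState lineStartAfter(TextLineState s, int lines) {
    assert(lines >= 0);
    for (int i = 0; i < lines; ++i) applyTStar(s);
    return s;
}
```

Starting at (10,100), a TD(0,-14) lands on (10,86) and remembers a leading of 14, so each T* after that drops another 14 units.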

Font Setting

void Tf(PDFUsedFont* inFontReference,double inFontSize) should be used to set the current font for the following text operators. It slightly resembles the PDF Tf operator, in that it receives a font reference and a font size. And indeed, at some point an actual Tf operator will be emitted, only not immediately. First I want to see that you are actually placing text :).
To acquire a PDFUsedFont object call the PDFWriter class GetFontForFile method:

PDFUsedFont* GetFontForFile(const wstring& inFontFilePath, long inFontIndex=0)
PDFUsedFont* GetFontForFile(const wstring& inFontFilePath,const wstring& inAdditionalMeticsFilePath, long inFontIndex=0)

The two overloads are used for acquiring a PDFUsedFont object for a given file path to a font. The 2nd overload allows you to provide an extra metrics file; this fits Type 1 usage scenarios, where an extra “.pfm” file is involved. Both versions have an optional parameter that indicates a font index in the font file. This is important for font package files (such as DFont and TTC font files) and allows you to choose the specific font that you want to use. If not provided, the font at index 0 will be used.

Extensibility

We now move to some methods of extending text support.
There are two aspects of extending text support:
1. Direct glyphs placement – you may want to place glyphs directly, instead of using the library’s built-in Unicode support. This may happen if you are looking for such features as Gai-Ji support, or simply non-Unicode mapping, or even ligature placement. You can still use the library font embedding capabilities; you’ll just have to provide the glyph index instead of the matching Unicode character. If the glyph has Unicode meaning (even if it represents multiple Unicode codes) you can provide the matching Unicode characters, so that PDF applications (such as the Acrobat Reader) understand it as text, and you’ll be able to use functionalities such as “Find” and “Copy & Paste”.

2. Low level text commands – you can place the PDF text operators directly. It’s just that then there’ll be no font support from the library at all; it is all up to you. This may be fitting in scenarios where you want to add font implementations either for unimplemented font types, or to replace the library implementation in cases where it doesn’t support an important font feature.

The following lists the extensibility methods for each of these two options.

Direct glyphs placement

To place glyphs directly you can use the overloads of the text placement operators that receive glyph index input. In general, glyph index input is provided using the GlyphUnicodeMapping structure.
This structure represents a single glyph. It includes a glyph index and a vector of Unicode values. The glyph index is the index of the glyph in the font. The matching Unicode values are an ordered vector of the Unicode codes that this glyph represents. Most of the time it would be just one. For ligatures it may be more than one code. If the glyph does not have matching Unicode values, or you just don’t care about this glyph being understood as text, just leave this empty.

Each of the operators receives an STD list of GlyphUnicodeMapping structs, which can be understood as a string where each character is simply a glyph plus its Unicode values. Just include GlyphUnicodeMapping.h to get the GlyphUnicodeMappingList list, and it’s just like the regular text operators.
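To make the shape of such a list concrete, here is a self-contained sketch that models it with standard containers. GlyphMappingSketch is a stand-in written for this page (a glyph index plus an ordered vector of Unicode values, as described above); its member names are not taken from GlyphUnicodeMapping.h, and the glyph indices are invented, since real indices depend on the font.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Stand-in for the structure described above; names are illustrative.
struct GlyphMappingSketch {
    unsigned short glyphIndex;                 // index of the glyph in the font
    std::vector<unsigned long> unicodeValues;  // what the glyph means as text
};

typedef std::list<GlyphMappingSketch> GlyphMappingSketchList;

// Builds a three-glyph "string": an fi ligature, an n, and an ornament.
GlyphMappingSketchList makeExampleGlyphs() {
    GlyphMappingSketchList glyphs;
    // Hypothetical index 300: "fi" ligature, standing for U+0066 U+0069.
    glyphs.push_back({300, {0x66, 0x69}});
    // Hypothetical index 81: a plain "n" (U+006E).
    glyphs.push_back({81, {0x6E}});
    // Hypothetical index 512: ornament with no Unicode meaning, left empty.
    glyphs.push_back({512, {}});
    return glyphs;
}

// Counts the characters a viewer could extract from the glyph list.
std::size_t extractedCharacterCount(const GlyphMappingSketchList& glyphs) {
    std::size_t count = 0;
    for (const GlyphMappingSketch& glyph : glyphs) {
        for (unsigned long value : glyph.unicodeValues) {
            assert(value <= 0x10FFFF);  // sanity check on each code point
            ++count;
        }
    }
    return count;
}
```

Three glyphs go in, and copy and paste would give back the three characters “fin”: two from the ligature, one from the n, none from the ornament.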

The following is the list of methods for direct glyph positioning:
1. EStatusCode Tj(const GlyphUnicodeMappingList& inText)
Tj is like the overload with a string, only it receives a GlyphUnicodeMappingList instead of a string. It simply shows the list as a string.

2. EStatusCode Quote(const GlyphUnicodeMappingList& inText)
Quote will place the list of glyphs after moving to the next line.

3. EStatusCode DoubleQuote(double inWordSpacing, double inCharacterSpacing, const GlyphUnicodeMappingList& inText)
DoubleQuote will place the list of glyphs after moving to the next line, with inWordSpacing as the word spacing, and inCharacterSpacing as the character spacing.

4. EStatusCode TJ(const GlyphUnicodeMappingListOrDoubleList& inStringsAndSpacing)
TJ is just like the string based TJ operator. It’s just that instead of a string you provide a GlyphUnicodeMappingList.

Low level text commands

If you really wanna do your own thing with fonts and text, you can still use the library to put the PDF text operators in the content stream for you.

For each low level operator that places text there are twin methods: one for placing text as a PDF literal string, and another for placing text as a PDF hex string. You will normally use the former for ANSI text placement, and the latter for CID text placement.

Note that all string inputs for these methods should already be encoded according to the font that is being used.

1. void TfLow(const string& inFontName,double inFontSize)
TfLow will place a Tf PDF operator. Note that you need the provided name to be mapped to a font object in the page/form xobject resources dictionary.

2. void TjLow(const string& inText)
void TjHexLow(const string& inText)

Places a Tj PDF operator. The TjLow method will place the input string as a literal string. The TjHexLow method will place the input string as a hex string.

3. void QuoteLow(const string& inText)
void QuoteHexLow(const string& inText)

Places a ' PDF operator. The QuoteLow method will place the input string as a literal string. The QuoteHexLow method will place the input string as a hex string.

4. void DoubleQuoteLow(double inWordSpacing, double inCharacterSpacing, const string& inText)
void DoubleQuoteHexLow(double inWordSpacing, double inCharacterSpacing, const string& inText)

Places a " PDF operator. The DoubleQuoteLow method will place the input string as a literal string. The DoubleQuoteHexLow method will place the input string as a hex string.

5. void TJLow(const StringOrDoubleList& inStringsAndSpacing)
void TJHexLow(const StringOrDoubleList& inStringsAndSpacing)

Places a TJ PDF operator. The TJLow method will place the input strings as literal strings. The TJHexLow method will place the input strings as hex strings. Note that you are using the StringOrDouble struct here to represent either a string or a double, as fitting with the TJ operator.
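For readers who have not met the two PDF string notations, the sketch below shows what a literal string and a hex string look like for the same already-encoded bytes. It follows the PDF syntax rules (parentheses with backslash escapes, or hex digits in angle brackets). It is only an illustration of the notations; the library writes them for you in the methods above, and its exact escaping may differ from this sketch. Both function names are made up.

```cpp
#include <cassert>
#include <string>

// Writes bytes as a PDF literal string: ( ... ) with (, ) and \ escaped.
std::string toPdfLiteralString(const std::string& bytes) {
    std::string out = "(";
    for (char ch : bytes) {
        if (ch == '(' || ch == ')' || ch == '\\') out += '\\';
        out += ch;
    }
    out += ')';
    return out;
}

// Writes bytes as a PDF hex string: < ... > with two hex digits per byte.
std::string toPdfHexString(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out = "<";
    for (unsigned char ch : bytes) {
        out += digits[ch >> 4];
        out += digits[ch & 0x0F];
    }
    out += '>';
    // Every input byte must have produced exactly two digits.
    assert(out.size() == 2 * bytes.size() + 2);
    return out;
}
```

A two-byte CID code such as 0x0048 reads much better as <0048> than as a literal string with an embedded zero byte, which is why the hex variants are the usual choice for CID text.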

Technical details

Here I’ll be providing some technical details on the implementation. This is mainly of interest for co-developers who wish to explore the code or extend the library functionality.

Font embedding

With the library, fonts are always embedded as a subset, unless the font is protected, in which case it will not be embedded. I’m following the Adobe applications implementation here, as defined by the FontPolicy.pdf document (just google for it if you are interested).

Embedding for TrueType embeds the font as a TrueType representation.
For both OpenType CFF fonts and Type 1 I’m using a CFF representation when embedding. Yeah, got me a really nice Type 1 converter in there and some nice parsing tools for all fonts. If you are looking for some help in parsing fonts, check out the code in CFFEmbeddedFontWriter for OpenType CFF embedding. Check TrueTypeEmbeddedFontWriter for TrueType (TTF) embedding and Type1ToCFFEmbeddedFontWriter for Type 1 font embedding.

Unicode map

In order for a PDF application to be able to understand the shown text as text, one must implement a Unicode map with each font. So I did. This means that you should be able to search for text when opening the PDF, and also copy text from the PDF and paste it in some other app. That’s not trivial, you guys. You should be thankful.

FreeType usage

In the beginning of the project I started using FreeType in order to get Unicode to glyph mapping and some of the font features. But then I had to parse the fonts in order to actually embed them. Even so, I still use FreeType, just because it’s very comfortable for what I need. It does mean, however, that with a bit more coding (especially the Unicode→glyph mapping) I can probably lose it, if that’s of interest to someone. I’m very happy with it.

CID vs. ANSI

Well, when embedding the fonts I sometimes use CID and sometimes use ANSI. I picked up the rules pretty much from what the PDF specs seem to offer.

Basically, for TrueType fonts I always use CID, unless the used characters in the file are from the AdobeStandardEncoding only, in which case I use ANSI.
For Type 1 I always use ANSI. This is because I’m supporting just non-CID Type 1 fonts, and they always have only up to 256 glyphs, so there is no need to go for CID. Saves some space. [But if I need to, it’s not really a problem to add support for Type 1→CID.]
For OpenType CFF fonts I’ll use ANSI if the font is not originally CID and fewer than 256 characters are used. Otherwise, if the font is originally CID or more than 256 characters are used, it’s gonna be CID.
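The three rules above can be written down as one small decision function. This is only a sketch of the rules as described on this page, not the library’s actual code, and all the names in it are made up. One detail is a guess: the text doesn’t say what happens at exactly 256 used characters, and the sketch treats that case as CID.

```cpp
#include <cassert>
#include <cstddef>

enum class FontKind { TrueType, Type1, OpenTypeCFF };
enum class EmbeddingEncoding { ANSI, CID };

// Inputs to the decision; a local illustration, not a library type.
struct FontUsageSketch {
    FontKind kind;
    bool fontIsOriginallyCID;             // relevant for OpenType CFF
    std::size_t usedCharacterCount;       // characters used in the document
    bool onlyStandardEncodingCharacters;  // relevant for TrueType
};

EmbeddingEncoding chooseEncoding(const FontUsageSketch& usage) {
    switch (usage.kind) {
    case FontKind::TrueType:
        // CID, unless every used character is in the standard encoding.
        return usage.onlyStandardEncodingCharacters ? EmbeddingEncoding::ANSI
                                                    : EmbeddingEncoding::CID;
    case FontKind::Type1:
        // Only non-CID Type 1 fonts are supported, so always ANSI.
        assert(!usage.fontIsOriginallyCID);
        return EmbeddingEncoding::ANSI;
    case FontKind::OpenTypeCFF:
        // ANSI only for non-CID fonts with fewer than 256 used characters.
        // Exactly 256 is treated as CID here, which is an assumption.
        return (!usage.fontIsOriginallyCID && usage.usedCharacterCount < 256)
                   ? EmbeddingEncoding::ANSI
                   : EmbeddingEncoding::CID;
    }
    return EmbeddingEncoding::CID;  // not reached for valid FontKind values
}
```

So a Latin-only document using a non-CID OTF font ends up with the compact ANSI form, while the same font used for a few hundred different characters switches to CID.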
