Latin-1 Text Annotations (tEXt, zTXt)
- Status: PNG Specification
- Location: anywhere
- Multiple: yes
That brings us to PNG's original text chunks, which are perhaps its most
popular nonessential chunks. Regardless of how many words a picture is
worth, it is often useful or necessary to add a few more in order to record
pertinent information like title and author, store requisite legal
notices such as a copyright or disclaimer, or merely to transfer text
from one image to another.
PNG supports two types of Latin-1-based text chunks, uncompressed
(tEXt) and compressed (zTXt). There is also a new Unicode-based chunk
(iTXt) that I'll discuss next. For the first two, the format is
basically the same: an uncompressed keyword or key phrase, a null
(zero) byte, and the actual text. In zTXt the text is compressed; the
first byte after the null indicates the compression method, for which
only deflate is currently defined (method zero). The remainder is the
compressed stream, which for method zero must be in zlib 1.x format,
just as for image data. (The zlib 1.x format is described by revision
3.3 of the zlib specification, which is available from
http://www.zlib.org/zlib_docs.html/.)
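The layout just described can be sketched in a few lines of Python. This is only an illustration, not code from any PNG library: the function names are hypothetical, and the sketch handles just the data field of a chunk, leaving out the length, chunk type, and CRC that surround it in a real PNG file.

```python
import zlib
from typing import Tuple


def build_text_data(keyword: str, text: str) -> bytes:
    """Data field of a tEXt chunk: keyword, null byte, uncompressed text."""
    return keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")


def build_ztxt_data(keyword: str, text: str) -> bytes:
    """Data field of a zTXt chunk: keyword, null byte, method byte, zlib stream."""
    method = b"\x00"  # method zero (deflate) is the only one defined
    compressed = zlib.compress(text.encode("latin-1"))
    return keyword.encode("latin-1") + b"\x00" + method + compressed


def parse_text_data(data: bytes) -> Tuple[str, str]:
    """Split tEXt data at the first null byte into (keyword, text)."""
    keyword, separator, text = data.partition(b"\x00")
    if not separator:
        raise ValueError("no null separator after the keyword")
    return keyword.decode("latin-1"), text.decode("latin-1")


def parse_ztxt_data(data: bytes) -> Tuple[str, str]:
    """Split zTXt data into (keyword, text), inflating the zlib stream."""
    keyword, separator, rest = data.partition(b"\x00")
    if not separator or not rest:
        raise ValueError("missing null separator or compression method byte")
    if rest[0] != 0:
        raise ValueError("unknown compression method %d" % rest[0])
    text = zlib.decompress(rest[1:])
    return keyword.decode("latin-1"), text.decode("latin-1")
```

A round trip such as `parse_ztxt_data(build_ztxt_data("Disclaimer", notice))` should hand back the original keyword and text, which makes a convenient sanity check when experimenting with the format.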
Both keyword and raw text should be encoded with the Latin-1 (ISO/IEC 8859-1)
character set; neither may contain null bytes. Since the keyword is intended
to be recognizable by both humans
and computer programs, additional restrictions are placed on it: it may not
contain leading, trailing, or consecutive spaces, and it is restricted to
characters in the range 32-126 and 161-255 (which, in particular, rules
out both control characters and the nonbreaking space, decimal value 160).
The only other restriction on the main text of the chunk is that newlines
should be in Unix format, i.e., represented by a single line-feed character
(decimal value 10).
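As a rough illustration of these rules, the following Python sketch checks a candidate keyword and converts text to Unix-style newlines. Both function names are hypothetical, and the check covers only the restrictions listed in this section; anything else the specification may require is not verified here.

```python
def keyword_problems(keyword: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for ch in keyword:
        code = ord(ch)
        # Allowed: printable ASCII (32-126) and Latin-1 161-255.
        # This excludes null, control characters, and the nonbreaking space (160).
        if not (32 <= code <= 126 or 161 <= code <= 255):
            problems.append("character value %d is not allowed" % code)
    if keyword.startswith(" "):
        problems.append("leading space")
    if keyword.endswith(" "):
        problems.append("trailing space")
    if "  " in keyword:
        problems.append("consecutive spaces")
    return problems


def to_unix_newlines(text: str) -> str:
    """Convert CR+LF and bare CR line endings to a single line feed."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Returning a list of problems rather than a simple true/false value is a design choice for the sketch: it lets a tool report every violation at once instead of stopping at the first.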
I mentioned in Chapter 7, "History of the Portable Network Graphics Format", that Unicode's UTF-8 encoding was one of the
items in the design of PNG that was voted down. In retrospect this was,
perhaps, a lamentable decision; it was finally addressed early in 1999
with the iTXt chunk. But at the time, UTF-8 was very new and had not been
extensively tested in the field. In particular, it had little or no
operating-system support and no support in standard programming libraries,
either for encoding and decoding or for the translation and display of UTF-8
characters in the native character set(s) of existing systems. Since PNG's
design goals included both the use of well-tested technologies and the
avoidance of undue burdens on developers of PNG applications, support for
UTF-8 was dropped in favor of the more familiar Latin-1 character set.
The following list summarizes all of the keywords that are either
included in the specification itself or officially registered as extensions
to the spec:
- Author
The name of the author of the image. If the original image were a painting or
other nonelectronic medium, both the original artist and the person who
scanned the image might be listed.
- Title
A one-line title or caption. Longer captions should generally
use the Description keyword, but see the end of this section for an unofficial alternative.
- Description
A longer description of or caption for the image, perhaps including details
about the tools and settings used; the name, age, and/or location of the
subject matter; or the mood the artist was trying to convey. See also the
Software and Source keywords.
- Creation Time
The time the image was created, in whatever sense is most appropriate.
The recommended format is that prescribed by Internet RFC 822 (Section 5), as
amended by RFC 1123 (Section 5.2.14); specifically:
day month year hour:minute timezone
where day is either one or two digits; month is a three-letter
English abbreviation such as Jun; year is two or four digits
(though the latter is strongly recommended); hour and minute
are two digits each; and timezone is a three-letter abbreviation
(e.g., PST for Pacific Standard Time), a one-letter U.S. military designation,
or a four-digit number with a leading positive or negative sign indicating the
hour:minute offset from Coordinated Universal Time (e.g., -0800 for Pacific
Standard Time, which is eight hours and zero minutes earlier than UTC). In
addition, the entire string may optionally be preceded by a weekday
field, where weekday is a three-letter English abbreviation (e.g.,
Fri). A colon and two-digit seconds field may also be
appended to the time (that is, hour:minute:second). Note that this
is merely a recommendation; strings such as ``circa 1492'' are allowed, as is
explanatory text following an RFC-style date string.
- Copyright
The legal copyright notice for the image. For example, ``Copyright 1999
by Greg Roelofs. This image may be freely used and distributed provided that
it is not modified in any way and that this notice remains intact.''
- Disclaimer
A legal disclaimer notice for the image. This might include a company's
standard boilerplate on all copyrighted works; in particular, it might be
lengthy enough to store in a compressed (zTXt) chunk, while the copyright
notice remains uncompressed.
- Warning
A warning about the content or effects of the image. For example, certain
types of popular material may not be suitable for minors, or a random-dot
stereogram (``Magic Eye'' 3D image) may induce headaches in some people.
- Software
The name and possibly the version of the software used to create the image. This
is most often generated automatically, but it need not be. More than one
software application may be listed.
- Source
Information about the device used to generate the image, such as a digital
camera or a scanner.
- Comment
A miscellaneous comment, often converted from a GIF comment (which lacks
keywords).
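The recommended Creation Time format is easy to get subtly wrong, so here is a minimal Python sketch of one way to produce it. The function name and its two flags are hypothetical. The sketch always writes the numeric timezone form and a four-digit year, and it builds the month and weekday names from its own English tables so that the result does not depend on the current locale. The comma after the weekday follows RFC 822's syntax.

```python
from datetime import datetime, timedelta, timezone

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def creation_time(when: datetime, with_weekday: bool = False,
                  with_seconds: bool = False) -> str:
    """Format a timezone-aware datetime as: day month year hour:minute timezone."""
    offset = when.utcoffset()
    if offset is None:
        raise ValueError("a timezone-aware datetime is required")

    # Signed four-digit hour:minute offset from UTC, e.g. -0800.
    offset_minutes = int(offset.total_seconds()) // 60
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)

    clock = "%02d:%02d" % (when.hour, when.minute)
    if with_seconds:
        clock += ":%02d" % when.second

    result = "%d %s %04d %s %s%02d%02d" % (
        when.day, MONTHS[when.month - 1], when.year, clock, sign, hours, minutes)
    if with_weekday:
        # RFC 822 separates the optional weekday from the date with a comma.
        result = "%s, %s" % (WEEKDAYS[when.weekday()], result)
    return result
```

For a datetime of 4 June 1999 at 14:30 with an offset of eight hours behind UTC, `creation_time(when)` yields `4 Jun 1999 14:30 -0800`. Since the format is only a recommendation, a reader of such chunks should still be prepared for free-form strings.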
In addition to these official keywords, one of the technical
reviewers of this book and I have been known to make use of a few unofficial
keywords. The Caption keyword is used to provide a brief description
of an image that is more specifically tailored for use as a
publishable caption than the generic Description keyword; it is
also generally lengthier than is appropriate for the Title keyword.
The E-mail keyword stores the email address of the author in standard
Internet format (RFC 822, Section 6, as amended by RFC 1123, Sections
5.2.15 through 5.2.19); for example,
roelofs@pobox.com. And
the URL keyword is for a standard WWW Uniform Resource Locator (RFC
2068, Section 3.2); for example,
http://www.oreilly.com/. If
the URL is reasonably self-explanatory, it is recommended that the
chunk consist of the single URL and nothing else, but this is not a
requirement. Multiple URLs should be separated by newline characters.
Note that spaces and other white space (tabs, newlines, and so forth)
are considered unsafe by the URL standard and therefore must be
escaped within a conforming URL. For example, a space character must
be encoded as %20. This allows easy parsing
of optional explanatory text after a URL:
the URL ends when the first white space
(space, tab, or newline) is encountered.
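Here is a small Python sketch of that parsing rule. Keep in mind that URL is an unofficial keyword, so this is one plausible reading rather than a defined format: the sketch assumes each newline-separated entry holds one URL, optionally followed by explanatory text on the same line. The function name and the example.com address in the usage note are made up for illustration.

```python
import re

# Space or tab ends the URL within an entry; newlines separate entries.
_BLANKS = re.compile(r"[ \t]+")


def split_url_entries(text: str) -> list:
    """Return (url, note) pairs from the text of a URL chunk."""
    entries = []
    for line in text.split("\n"):
        line = line.strip(" \t")
        if not line:
            continue  # skip blank lines between entries
        parts = _BLANKS.split(line, maxsplit=1)
        url = parts[0]
        note = parts[1] if len(parts) > 1 else ""
        entries.append((url, note))
    return entries
```

Given the text `"http://www.oreilly.com/ publisher's home page\nhttp://www.example.com/my%20image.png"`, the function returns two entries: the first with a note, and the second with an empty note and its escaped space left untouched.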