Patty’s recent blog post got me thinking about a number of things. As I sit here watching black and white video of a B-52’s concert from 1980 on a 27″ desktop display, two floors away from the nearest TV, her point hits home: we invent technology to access our old data or to convert it to a more efficient modern-day equivalent.
But what about all the data, books, etc. that are stored as PDF files? PDF is a great format. It allows for vector and raster representations of content. It displays text crisply and so consistently that the font headaches of the pre-PDF days are dim memories.
But what happens when a better format comes along to replace PDF? Will all that content be lost forever? Or worse, will the PDF viewing experience become a 50% solution where it is reliable for some files, but less so for others? PDF has been around for 20 years, but in the computer business, nothing lasts forever.
One possible scenario is that something better than PDF will be invented down the road. In that event, we’ll have to write code to convert PDFs to that new format (in the same way that we digitize old magazines and books today).
If that doesn’t happen, there’s a chance that one or more companies providing PDF viewing applications could go out of business. In that event, there’s PDF/A.
PDF/A (PDF for Archiving) makes it possible for future viewers to have access to PDF files beyond our lifetime. Think of it like cryonics, but for PDF files instead of humans. It’s been around since at least Acrobat 8.
A conforming PDF/A file is just a PDF file, except it’s been constructed in a slightly different way. It contains all the resources it needs in order to guarantee that every page in the file can be viewed tomorrow the same way it is today. This means that all the fonts needed to view and print the text in the file are included. All the images needed are also present.
Conforming files don’t include anything that can make the content variable. They must also include a color profile to ensure that colors display correctly. They must include an embedded copy (a subset is acceptable) of every font used by the text in the file. This eliminates the variability that would occur for fauxed* fonts. So, conforming PDF/A files will be slightly larger than their non-PDF/A equivalents, but this is a small price to pay for the ability to view this content well into the future.
*The challenge of displaying text: For any application to display text, it needs to have access to one or more fonts. These can be installed in the operating system, installed with the application, embedded inside the document, or some combination of these. PDF files displayed by Acrobat rely on all three of these methods. Some PDF files will include all the fonts necessary for their display. Other PDF files will include just the name of a font and perhaps a few metrics. In this case, Acrobat will look for a matching font among the fonts installed on the system. Failing that, it may create a font on the fly from fonts in its installation. This results in text being displayed in a font that exists only while the document is being displayed, hence the term “fauxed font”. Such fonts are fakes in the sense that the font doesn’t exist beyond the display of the file.
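To make the embedded-versus-fauxed distinction concrete, here is a minimal sketch in Python. It assumes the font dictionaries have already been parsed out of a PDF into plain Python dicts by some PDF library; the function names are hypothetical and this is not how Acrobat itself decides. The keys are the ones the PDF format uses: a font counts as embedded when its font descriptor carries a font program under /FontFile, /FontFile2, or /FontFile3.

```python
# Keys under which a PDF font descriptor stores an embedded font program.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font):
    """Return True if this (already parsed) font dictionary carries its own glyphs."""
    subtype = font.get("/Subtype")
    if subtype == "/Type3":
        # Type 3 fonts define their glyphs inline in the PDF itself.
        return True
    if subtype == "/Type0":
        # Composite fonts delegate to their descendant fonts.
        descendants = font.get("/DescendantFonts", [])
        return bool(descendants) and all(is_embedded(d) for d in descendants)
    descriptor = font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


def fonts_needing_substitution(fonts):
    """List the names of fonts a viewer would have to substitute or faux."""
    return [f.get("/BaseFont", "<unnamed>") for f in fonts if not is_embedded(f)]


# Hypothetical sample data: one embedded subset, one font referenced by name only.
fonts = [
    {"/Subtype": "/TrueType", "/BaseFont": "/ABCDEF+Georgia",
     "/FontDescriptor": {"/FontFile2": b"...font program bytes..."}},
    {"/Subtype": "/Type1", "/BaseFont": "/Helvetica"},
]
print(fonts_needing_substitution(fonts))  # ['/Helvetica']
```

Any font this check reports is one that a viewer has to find or fake at display time, which is exactly the variability PDF/A rules out.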
Along with detailed specifications, the ecosystem for PDF/A files includes software to produce valid PDF/A files, converters to produce PDF/A from non-conforming files, and validators to check that a given PDF file conforms to the specific PDF/A standard.
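As an aside on how a tool learns which PDF/A flavor a file claims to be: PDF/A files declare their part and conformance level in their XMP metadata, using the pdfaid namespace. The sketch below reads that declaration with Python’s standard library. It assumes the XMP packet has already been extracted from the PDF as a string, and the function name is hypothetical. A declaration is only a claim; a real validator goes on to check the whole file against the standard.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def declared_pdfa_level(xmp_xml):
    """Return e.g. 'PDF/A-1b' from an XMP packet, or None if nothing is declared."""
    root = ET.fromstring(xmp_xml)
    for desc in root.iter("{%s}Description" % RDF_NS):
        # XMP allows simple properties as attributes or as child elements.
        part = desc.get("{%s}part" % PDFAID_NS) or desc.findtext("{%s}part" % PDFAID_NS)
        conf = (desc.get("{%s}conformance" % PDFAID_NS)
                or desc.findtext("{%s}conformance" % PDFAID_NS))
        if part:
            return "PDF/A-%s%s" % (part.strip(), (conf or "").strip().lower())
    return None


# Hypothetical sample packet, trimmed to the relevant properties.
sample = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""
print(declared_pdfa_level(sample))  # PDF/A-1b
```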
Looking Into The Future
The idea behind this effort is that in the event that Acrobat and Reader go away, we’ll be able to produce applications that can render and print files conforming to the PDF/A specification.
For companies that primarily store their documents in non-PDF formats, there are options for long-term viewing guarantees. One option is to convert the other file formats into PDF/A. Ideally, those formats provide some way to search content. For raster file formats, the OCR path to searchable content is well-traveled.
PDF seems like it should be well-suited for long-term storage because the format lends itself to storing the resources needed for display. But because the PDF format evolved to be everything for everyone, there are a lot of places where variations in viewing content can occur. This reduces the likelihood that a future PDF renderer will produce the same result that occurs today.
Files which meet PDF/A conformance are not allowed to have any of the stuff that makes them variable. They are guaranteed to include all the fonts needed (so there’s no need to faux fonts). They are guaranteed not to have optional content, executable actions, or movie annotations. This removes any ambiguity about what should be displayed, whether it’s today’s PDF renderer, or tomorrow’s.
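Here is a rough sketch of what checking for those variable features might look like. It assumes the document catalog and the annotations have already been parsed into plain Python dicts; the function name is hypothetical, and the action list is illustrative rather than the complete set of restrictions in the PDF/A specifications. The dictionary keys themselves (/OCProperties for optional content, /Movie annotations, /JavaScript and /Launch actions) are standard PDF names.

```python
# Action types treated as "executable" in this sketch (illustrative, not the full PDF/A list).
EXECUTABLE_ACTIONS = {"/JavaScript", "/Launch"}


def find_variable_features(catalog, annotations):
    """Return human-readable reasons a parsed PDF would fail this simplified check."""
    problems = []
    if "/OCProperties" in catalog:
        problems.append("optional content")
    # /OpenAction may be a destination array rather than an action dictionary.
    actions = [catalog.get("/OpenAction")]
    for annot in annotations:
        if annot.get("/Subtype") == "/Movie":
            problems.append("movie annotation")
        actions.append(annot.get("/A"))
    for action in actions:
        if isinstance(action, dict) and action.get("/S") in EXECUTABLE_ACTIONS:
            problems.append("executable action " + action["/S"])
    return problems


# Hypothetical sample data: layers, a script that runs on open, a movie, and a plain link.
catalog = {"/Type": "/Catalog", "/OCProperties": {}, "/OpenAction": {"/S": "/JavaScript"}}
annotations = [
    {"/Subtype": "/Movie"},
    {"/Subtype": "/Link", "/A": {"/S": "/URI"}},
]
print(find_variable_features(catalog, annotations))
# ['optional content', 'movie annotation', 'executable action /JavaScript']
```

A file that comes back with an empty list here is not thereby PDF/A; this only illustrates the kind of thing a validator looks for.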
The process of converting PDF files to PDF/A could be straightforward, or difficult. In a perfect world, you would convert the original source files directly to the PDF archive format you need. (There are a number of flavors beyond the original PDF/A-1a and PDF/A-1b.) If the original source isn’t available, Acrobat Pro includes tools to convert existing PDF files to specific PDF/A variants. It’s hard to predict how simple this process will be; the universe of PDF producers has grown quite large. The higher the quality of the PDF going into the process, the higher the quality of the PDF/A coming out the other side.
It’s pretty clear that as the future unrolls, technology will evolve, making some of today’s file formats obsolete. The architects of PDF/A have attempted to define a standard for PDF files that increases the odds that tomorrow’s viewing applications will be able to render PDF/A files with the same results that we have today.
Mark Donohoe is a Software Engineer and Snowbound Software’s resident PDF expert. He has over 20 years of PDF experience with Adobe Systems.