PDF files - one of the most popular file formats!
PDF stands for Portable Document Format. PDF files contain all the graphics, fonts and text for a document, as well as the logic needed to display them. The most common PDF reader is Adobe Acrobat Reader.
An interesting fact for your next pub outing - PDF readers are the second most popular software in the world. Can you guess what number one is? Attackers know that PDF reader software is installed on a potential victim’s machine. But they wouldn’t try to exploit this, would they… Attackers sending malicious PDF documents is common, very very common in fact. So being in InfoSec you should at least know the basics, and not rely solely on your sandbox(es) x 10.
Exploits for PDFs are very popular - check out Crimepack
A deeper look at the format can be found on Didier Stevens’ blog.
For now let’s get to work. We’ll be using two tools, PDF-ID and PDF-Parser, both written by Didier Stevens.
PDF-ID is not a PDF parser. It scans through a PDF looking for certain PDF keywords and shows you how many times each one appears in the file, which helps to initially triage PDF documents.
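To make the keyword-counting idea concrete, here is a rough Python sketch. It is not PDF-ID's actual code: the keyword list is an illustrative subset, the sample bytes are made up, and `count_keywords` is a hypothetical helper. The real tool does more, such as coping with obfuscated names.

```python
import re

# Illustrative subset of keywords worth counting (an assumption, not PDF-ID's full list)
KEYWORDS = [b"obj", b"endobj", b"stream", b"endstream",
            b"/JS", b"/JavaScript", b"/OpenAction", b"/EmbeddedFile"]

def count_keywords(data):
    """Count how often each keyword appears as a whole token in the raw bytes."""
    counts = {}
    for kw in KEYWORDS:
        # Bare words need a leading boundary so "obj" is not counted inside "endobj".
        # Names starting with "/" carry their own delimiter.
        lead = b"" if kw.startswith(b"/") else rb"(?<![A-Za-z0-9])"
        pattern = lead + re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, data))
    return counts

# Made-up two-object document, just enough to exercise the counter
sample = (b"%PDF-1.1\n"
          b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n")

has_header = sample.startswith(b"%PDF-")
counts = count_keywords(sample)
for name, n in counts.items():
    print(f"{name:15} {n}")
```

A non-zero count for something like /JS or /OpenAction does not prove the file is malicious. It just tells you where to look first.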
The PDF-Parser tool allows you to parse the physical and logical structure of a PDF file.
EX1) Analysis of PDF2.pdf - a PDF with Hello World.
Results of running PDF-ID on PDF2.pdf.
PDF-ID identified it as a PDF from the header. Six objects are present in the file. Most PDF files will contain some binary data, but this one has been designed to be pure ASCII. As you can see, the code is easy to read and it describes a series of objects.
Going further with PDF-Parser. Results of running PDF-Parser on PDF2.pdf.
To parse out individual objects in PDF files we run the following command:
-o 5 selects object 5.
-c displays the content of the object.
And now it only shows the stream I want.
But we get back binary data rather than the simple ASCII we would expect, right? This is because of the filter, FlateDecode. It means the stream has been compressed with zlib, which is very common in normal PDF files. Smaller PDF = faster to download.
Luckily PDF-Parser has a flag to deal with this: -f passes the stream through its filters and shows the decompressed content.
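If you want to see what that decoding step boils down to, here is a minimal Python sketch using the standard zlib module. The object below is built by hand for illustration (it is not the real object 5 from PDF2.pdf), and `inflate_stream` is a hypothetical helper. It assumes /Length is written as a direct number and that "stream" is followed by a single line feed.

```python
import re
import zlib

# Hand-built stand-in for a FlateDecode stream object (illustrative content only)
plain = b"BT /F1 24 Tf 100 700 Td (Hello World) Tj ET"
compressed = zlib.compress(plain)
obj = (b"5 0 obj\n<< /Length " + str(len(compressed)).encode()
       + b" /Filter /FlateDecode >>\nstream\n"
       + compressed + b"\nendstream\nendobj\n")

def inflate_stream(obj_bytes):
    """Slice out the stream body using /Length and zlib-decompress it."""
    length = int(re.search(rb"/Length (\d+)", obj_bytes).group(1))
    start = obj_bytes.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj_bytes[start:start + length])

decoded = inflate_stream(obj)
print(decoded.decode("ascii"))
```

Real-world PDFs can chain several filters or store /Length as an indirect reference, which is exactly why it is nicer to let PDF-Parser handle it.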
EX2) Analysis of java.pdf. PDF-ID output.
Using PDF-Parser’s -s (search) flag to figure out which object it is in.
EX3) Another example involving a PE file. PDF-ID output.
We have one embedded file, i.e. a file embedded in the PDF. That embedded file is essentially added as an attachment. Searching for this reveals it is in Object 8.
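The search itself is nothing magical. Below is a simplified Python sketch of the idea: walk every "N G obj … endobj" block and report the object numbers that contain the keyword. The sample bytes and the `find_objects` helper are made up for illustration, and unlike the real tool this naive version also looks inside stream data.

```python
import re

def find_objects(data, needle):
    """Return the numbers of all indirect objects whose body contains needle."""
    hits = []
    # Group 1 = object number, group 2 = generation, group 3 = object body
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", data, re.DOTALL):
        if needle.lower() in m.group(3).lower():
            hits.append(int(m.group(1)))
    return hits

# Made-up fragment: a file specification pointing at an embedded file stream
sample = (b"7 0 obj\n<< /Type /Filespec /F (file.exe) /EF << /F 8 0 R >> >>\nendobj\n"
          b"8 0 obj\n<< /Type /EmbeddedFile /Length 2 >>\nstream\nMZ\nendstream\nendobj\n")

embedded = find_objects(sample, b"/EmbeddedFile")
print(embedded)
```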
Using -d to dump the file.
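Before throwing the dumped file at a sandbox, a quick sanity check that it really is a PE can save some time. This is a rough sketch: `looks_like_pe` is a hypothetical helper and the test bytes are hand-built. It checks the "MZ" magic at the start and then follows the offset stored at 0x3C to the "PE\0\0" signature.

```python
import struct

def looks_like_pe(data):
    """Check the DOS 'MZ' magic and the 'PE\\0\\0' signature it points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian value at 0x3C holds the offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Hand-built minimal header: "MZ", padding up to 0x3C, offset 0x40, then the signature
fake_pe = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"

print(looks_like_pe(fake_pe))
print(looks_like_pe(b"%PDF-1.1"))
```

For a real dump you would read the bytes from whatever filename you passed to -d and feed them to the same function.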
We can straight up run it in a sandbox at this stage, or reverse it if you know how. We’ll be covering this later.
Hope this was informative :).
When time allows I’ll post another analysis focusing on things like heap spray attacks and obfuscation.
Edit: Joe’s Sandbox used to be free!