[Apache Sanselan] Demystifying how Sanselan determines image format
Apache Sanselan is a pure Java library for reading and writing image formats. It recently graduated from the Apache Incubator and is now a proud member of Apache Commons proper.
In previous articles, we saw how to retrieve image metadata and information. In this post we shall see how Sanselan guesses the image format. We shall take it in two steps:
- First we shall look at the ImageFormat class
- Then, we shall look into the implementation of the guessFormat() API
ImageFormat class
The org.apache.sanselan.ImageFormat class has three members: name, extension and actual. Name and extension hold the same value; however, I am not sure I understand the purpose of the actual variable.
The class holds a list of the formats (including Unknown) supported by the library. They are all instances of the ImageFormat class. The list is:
- IMAGE_FORMAT_UNKNOWN
- IMAGE_FORMAT_PNG
- IMAGE_FORMAT_GIF
- IMAGE_FORMAT_ICO
- IMAGE_FORMAT_TIFF
- IMAGE_FORMAT_JPEG
- IMAGE_FORMAT_BMP
- IMAGE_FORMAT_PSD
- IMAGE_FORMAT_PBM
- IMAGE_FORMAT_PGM
- IMAGE_FORMAT_PPM
- IMAGE_FORMAT_PNM
- IMAGE_FORMAT_TGA
- IMAGE_FORMAT_JBIG2
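To make the shape of the class concrete, here is a minimal stand-in written from the description above. It is not the library's source: the class name ImageFormatSketch is hypothetical, the boolean type of actual is my assumption, and the values passed for actual are placeholders, since its purpose is unclear to me. Only a few of the constants are shown.

```java
// Hypothetical stand-in for org.apache.sanselan.ImageFormat, for illustration only.
final class ImageFormatSketch {
    final String name;
    final String extension;
    // Assumption: modelled as a boolean flag; its real purpose is not covered here.
    final boolean actual;

    private ImageFormatSketch(String name, boolean actual) {
        this.name = name;
        this.extension = name; // name and extension hold the same value
        this.actual = actual;
    }

    // Each supported format is a shared constant instance of the class.
    static final ImageFormatSketch IMAGE_FORMAT_UNKNOWN = new ImageFormatSketch("UNKNOWN", false);
    static final ImageFormatSketch IMAGE_FORMAT_PNG = new ImageFormatSketch("PNG", true);
    static final ImageFormatSketch IMAGE_FORMAT_GIF = new ImageFormatSketch("GIF", true);
    static final ImageFormatSketch IMAGE_FORMAT_JPEG = new ImageFormatSketch("JPEG", true);

    @Override
    public String toString() {
        return "{" + name + ": " + extension + "}";
    }
}
```

Because every format is a single shared instance, callers can compare formats with == against these constants.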
guessFormat API – Under the hood
The guessFormat() API looks at the initial 2 to 4 bytes, also known as magic numbers, to determine the image format.
The algorithm is as follows:
- Read the first two bytes
- Match them against the known set of magic numbers to determine the format
- For JBIG2, additionally check bytes 3 and 4 to confirm the format
The magic numbers for the different formats are as follows:
| Format | Byte 1 | Byte 2 |
|---|---|---|
| IMAGE_FORMAT_GIF | 0x47 | 0x49 |
| IMAGE_FORMAT_PNG | 0x89 | 0x50 |
| IMAGE_FORMAT_JPEG | 0xff | 0xd8 |
| IMAGE_FORMAT_BMP | 0x42 | 0x4d |
| IMAGE_FORMAT_TIFF (Motorola byte order) | 0x4D | 0x4D |
| IMAGE_FORMAT_TIFF (Intel byte order) | 0x49 | 0x49 |
| IMAGE_FORMAT_PSD | 0x38 | 0x42 |
| IMAGE_FORMAT_PBM | 0x50 | 0x31 or 0x34 |
| IMAGE_FORMAT_PGM | 0x50 | 0x32 or 0x35 |
| IMAGE_FORMAT_PPM | 0x50 | 0x33 or 0x36 |
| IMAGE_FORMAT_JBIG2 | 0x97 | 0x4A |
In addition, for the IMAGE_FORMAT_JBIG2 format, byte 3 must be equal to 0x42 and byte 4 must be equal to 0x32.
Based on this table, Sanselan recognizes the image format. If the magic numbers do not match any entry in the table, it returns IMAGE_FORMAT_UNKNOWN.
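The lookup described above can be sketched in a few lines of plain Java. This is only an illustration built from the table, not Sanselan's actual code: the class MagicNumberSketch is hypothetical, and it returns plain strings instead of ImageFormat instances so that it runs without the library.

```java
// Hypothetical sketch of magic-number matching based on the table above.
final class MagicNumberSketch {

    /** Returns a format name for the leading bytes, or "UNKNOWN" if nothing matches. */
    static String guessFormat(byte[] bytes) {
        if (bytes == null || bytes.length < 2) {
            return "UNKNOWN";
        }
        // Java bytes are signed, so mask to get values in the range 0..255.
        int b1 = bytes[0] & 0xff;
        int b2 = bytes[1] & 0xff;

        if (b1 == 0x47 && b2 == 0x49) return "GIF";
        if (b1 == 0x89 && b2 == 0x50) return "PNG";
        if (b1 == 0xff && b2 == 0xd8) return "JPEG";
        if (b1 == 0x42 && b2 == 0x4d) return "BMP";
        if (b1 == 0x4d && b2 == 0x4d) return "TIFF"; // Motorola byte order
        if (b1 == 0x49 && b2 == 0x49) return "TIFF"; // Intel byte order
        if (b1 == 0x38 && b2 == 0x42) return "PSD";
        if (b1 == 0x50) {
            if (b2 == 0x31 || b2 == 0x34) return "PBM";
            if (b2 == 0x32 || b2 == 0x35) return "PGM";
            if (b2 == 0x33 || b2 == 0x36) return "PPM";
        }
        if (b1 == 0x97 && b2 == 0x4a && bytes.length >= 4) {
            // JBIG2 is the only format that also needs bytes 3 and 4.
            int b3 = bytes[2] & 0xff;
            int b4 = bytes[3] & 0xff;
            if (b3 == 0x42 && b4 == 0x32) return "JBIG2";
        }
        return "UNKNOWN";
    }
}
```

For example, passing the first bytes of a PNG file (0x89, 0x50, ...) takes the second branch, while a JBIG2 candidate whose bytes 3 and 4 do not match falls through to the unknown case.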