[Apache Sanselan] Demystifying how Sanselan determines image format

Apache Sanselan is a pure Java library for reading and writing image formats. It has recently graduated from the Incubator and is now a proud member of Apache Commons proper.

In previous articles, we saw how to retrieve image metadata and information. In this post we shall see how Sanselan guesses the image format. We shall take it in two steps:

  • First we shall look at the ImageFormat class
  • Then, we shall look into the implementation of the guessFormat() API

ImageFormat class

The org.apache.sanselan.ImageFormat class has three members: name, extension and actual. Name and extension hold the same value; however, I am not sure I understand the purpose of the actual variable.

The class holds the list of formats (including Unknown) supported by the library. They are all instances of the ImageFormat class. The list is:

  1. IMAGE_FORMAT_UNKNOWN
  2. IMAGE_FORMAT_PNG
  3. IMAGE_FORMAT_GIF
  4. IMAGE_FORMAT_ICO
  5. IMAGE_FORMAT_TIFF
  6. IMAGE_FORMAT_JPEG
  7. IMAGE_FORMAT_BMP
  8. IMAGE_FORMAT_PSD
  9. IMAGE_FORMAT_PBM
  10. IMAGE_FORMAT_PGM
  11. IMAGE_FORMAT_PPM
  12. IMAGE_FORMAT_PNM
  13. IMAGE_FORMAT_TGA
  14. IMAGE_FORMAT_JBIG2
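
To make the shape of the class concrete, here is a minimal sketch of how such a class could be laid out. It is a simplified stand-in, not the library's source: the class name ImageFormatSketch and the sampleFormats() helper are hypothetical, only three of the fourteen constants are reproduced, and treating actual as a boolean that is false only for the Unknown entry is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for org.apache.sanselan.ImageFormat; not the library source.
final class ImageFormatSketch {
    final String name;
    final String extension;
    // Assumption: a flag telling real formats apart from the Unknown placeholder.
    final boolean actual;

    private ImageFormatSketch(String name, boolean actual) {
        this.name = name;
        this.extension = name; // name and extension hold the same value
        this.actual = actual;
    }

    static final ImageFormatSketch IMAGE_FORMAT_UNKNOWN = new ImageFormatSketch("UNKNOWN", false);
    static final ImageFormatSketch IMAGE_FORMAT_PNG = new ImageFormatSketch("PNG", true);
    static final ImageFormatSketch IMAGE_FORMAT_GIF = new ImageFormatSketch("GIF", true);

    // Hypothetical helper: returns the three constants reproduced in this sketch.
    static List<ImageFormatSketch> sampleFormats() {
        return Arrays.asList(IMAGE_FORMAT_UNKNOWN, IMAGE_FORMAT_PNG, IMAGE_FORMAT_GIF);
    }

    @Override
    public String toString() {
        return "{" + name + ": " + extension + "}";
    }
}
```

Because every format is a shared constant, callers can compare formats by reference instead of comparing strings.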

guessFormat API – Under the hood

The guessFormat() API looks at the first two to four bytes of the image, also known as magic numbers, to determine the image format.

The algorithm is as follows:

  • Read the first two bytes
  • Match them against the known set of magic numbers to determine the format
  • For JBIG2 only, also check bytes 3 and 4
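
The first step, reading the leading bytes, has one Java-specific wrinkle worth sketching: byte is signed, so a value such as 0x89 has to be masked before it can be compared with a magic number. The class HeaderBytesSketch and its leadingBytes() method below are hypothetical names used for illustration only; they are not part of Sanselan.

```java
// Hypothetical helper, not part of Sanselan: extracts the leading bytes as unsigned values.
final class HeaderBytesSketch {

    // Returns the first `count` bytes of `data` as ints in the range 0..255.
    static int[] leadingBytes(byte[] data, int count) {
        if (data == null || data.length < count) {
            throw new IllegalArgumentException("Need at least " + count + " byte(s)");
        }
        int[] head = new int[count];
        for (int i = 0; i < count; i++) {
            // Java bytes are signed; masking turns (byte) 0x89 back into 137 instead of -119.
            head[i] = data[i] & 0xff;
        }
        return head;
    }
}
```

With the values in the 0..255 range, each one can be compared directly against hexadecimal literals such as 0x89 or 0xff.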

The list of magic numbers for the different formats is as follows:

Format                                     Byte 1   Byte 2
IMAGE_FORMAT_GIF                           0x47     0x49
IMAGE_FORMAT_PNG                           0x89     0x50
IMAGE_FORMAT_JPEG                          0xff     0xd8
IMAGE_FORMAT_BMP                           0x42     0x4d
IMAGE_FORMAT_TIFF (Motorola byte order)    0x4D     0x4D
IMAGE_FORMAT_TIFF (Intel byte order)       0x49     0x49
IMAGE_FORMAT_PSD                           0x38     0x42
IMAGE_FORMAT_PBM                           0x50     0x31 or 0x34
IMAGE_FORMAT_PGM                           0x50     0x32 or 0x35
IMAGE_FORMAT_PPM                           0x50     0x33 or 0x36
IMAGE_FORMAT_JBIG2                         0x97     0x4A

In addition, for the IMAGE_FORMAT_JBIG2 format, byte 3 must be equal to 0x42 and byte 4 must be equal to 0x32.

Based on this table, Sanselan recognizes the image format. If the magic numbers don’t match any entry in the table, it returns IMAGE_FORMAT_UNKNOWN.
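
The table and the JBIG2 rule above can be turned into a small, self-contained matcher. This is a sketch of the lookup as described in this post, not a copy of Sanselan.java: the class GuessFormatSketch is hypothetical, and it returns the format names as plain strings so that it runs without the Sanselan jar on the classpath.

```java
// Hypothetical stand-in for the matching step described above; not the library source.
final class GuessFormatSketch {

    static String guessFormat(byte[] data) {
        if (data == null || data.length < 2) {
            return "IMAGE_FORMAT_UNKNOWN";
        }
        // Mask to 0..255 because Java bytes are signed.
        int b1 = data[0] & 0xff;
        int b2 = data[1] & 0xff;

        if (b1 == 0x47 && b2 == 0x49) return "IMAGE_FORMAT_GIF";
        if (b1 == 0x89 && b2 == 0x50) return "IMAGE_FORMAT_PNG";
        if (b1 == 0xff && b2 == 0xd8) return "IMAGE_FORMAT_JPEG";
        if (b1 == 0x42 && b2 == 0x4d) return "IMAGE_FORMAT_BMP";
        // TIFF: "MM" (Motorola byte order) or "II" (Intel byte order).
        if ((b1 == 0x4d && b2 == 0x4d) || (b1 == 0x49 && b2 == 0x49)) return "IMAGE_FORMAT_TIFF";
        if (b1 == 0x38 && b2 == 0x42) return "IMAGE_FORMAT_PSD";

        // The PNM family shares byte 1 ('P'); byte 2 selects the variant.
        if (b1 == 0x50) {
            switch (b2) {
                case 0x31: case 0x34: return "IMAGE_FORMAT_PBM";
                case 0x32: case 0x35: return "IMAGE_FORMAT_PGM";
                case 0x33: case 0x36: return "IMAGE_FORMAT_PPM";
                default: break;
            }
        }

        // JBIG2 is the only format that also needs bytes 3 and 4.
        if (b1 == 0x97 && b2 == 0x4a && data.length >= 4) {
            int b3 = data[2] & 0xff;
            int b4 = data[3] & 0xff;
            if (b3 == 0x42 && b4 == 0x32) return "IMAGE_FORMAT_JBIG2";
        }
        return "IMAGE_FORMAT_UNKNOWN";
    }
}
```

A chain of comparisons keeps the sketch close to the table; a map keyed on the two-byte pair would work equally well for everything except the JBIG2 special case.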

References

Sanselan.java
