[Jun. 9th, 2005|03:27 pm]
For a recent project at work (to be open sourced RSN), we needed to extract text from PowerPoint files. There was some rather nasty code floating around to do it (scanning for the TextByteAtom record by checking every 2 bytes for the record type, 4008), but on closer inspection this code didn't get all the text, and was very, very nasty. It failed to get Unicode text, and occasionally triggered its text extraction on stuff that wasn't text.
So, I decided to improve things and write a better text extractor for PowerPoint. All the code made use of the Apache Jakarta POI project to provide access to the underlying OLE2 stream. I'd done some POI stuff before (for Excel and Word text extraction, and metadata extraction), so I knew how to get started.
I quite quickly came up with something better, just by looking at dumps of files to figure out how the format must work. Then I got some help from the guy who wrote the KOffice PowerPoint filter, and started on a real record-based text extractor. This went down quite well on the POI lists, so it became HSLF (Horrible SLideshow Format). Torchbox allowed me to keep developing it beyond what was needed for the project, and it got to the point of being accepted into the POI scratchpad.
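To give a rough idea of what "record based" means here, this is a minimal sketch, not the actual HSLF code. It assumes you already have the raw bytes of the PowerPoint document stream (getting those out of the OLE2 file is POI's job and isn't shown), and that each record starts with an 8 byte little-endian header: 2 bytes of version/instance, 2 bytes of record type, 4 bytes of length. The class and method names (RecordWalker, extractText) are made up for illustration. Type 4008 holds one byte per character, type 4000 holds 2 byte Unicode characters, and a version nibble of 0xF marks a container whose children follow straight after its header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI or HSLF.
public class RecordWalker {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as 2 byte (UTF-16LE) characters
    static final int TEXT_BYTES_ATOM = 4008; // text stored as 1 byte per character

    public static List<String> extractText(byte[] data) {
        List<String> out = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + 8 <= data.length) {
            int verAndInstance = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            long len = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            int body = pos + 8;

            if ((verAndInstance & 0x0F) == 0x0F) {
                // Container record: step inside it and read its children as records.
                pos = body;
                continue;
            }
            if (len > data.length - body) {
                break; // truncated or corrupt record, stop rather than guess
            }
            int n = (int) len;
            if (type == TEXT_BYTES_ATOM) {
                out.add(new String(data, body, n, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                out.add(new String(data, body, n, StandardCharsets.UTF_16LE));
            }
            // Skip the whole body, so byte patterns inside other records are never
            // mistaken for a text record header.
            pos = body + n;
        }
        return out;
    }
}
```

Because the walk moves from header to header using the stored lengths, it picks up the Unicode records too, and it can't misfire on a stray 4008 sitting in the middle of some unrelated record the way the every-2-bytes scan could.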
I kept going, figured out more of the file format, and the code improved. A little while ago, I was asked if I fancied becoming a committer to the POI project to take the PowerPoint code forward. I jumped at this, and last weekend I was voted into the project.
Today, I got my cvs.apache.org account, allowing me to commit code straight into the POI CVS repository. Now I just have to commit my most recent set of changes, then get to work supporting more of the PowerPoint format :)