We are getting PDF files from a client. They contain medical records with one of the items being the patient ID. My initial task was to find PDF parsing code on the web to pull the patient ID from the file then rename the file to include that as a prefix.
A quick search turned up PDF Box from Apache http://pdfbox.apache.org/ which sounded like it would do what I needed. Grab the download, whip up a simple Eclipse project and I have the text extracted. Very easy to use and understand API.
Of course the next part is where the real work comes in. Have the program respond to command line parameters and if none are provided a GUI is presented to the user. Let them pick the source and destination directories then process the files. Save the last used directories to an INI type file so they don't have to pick the directories every time they start the program.
Once that was working I was told they would really love to have the PDF pages converted to a JPG so they could be imported into our software. Turns out PDFBox can convert to PNG which we can also import. Update the code to handle the conversion. They also wanted to copy the originals of properly converted files to an archive directory. Added that to command line processing and GUI.
I asked since I know now how to use the library to convert PDF to PNG why don't we just import PDF files directly? Sure, why not, do that. I updated the main code base to allow user to choose PDF files along with other image formats. Next I changed the processing loop to handle the page count of each PDF for the import status and did all the code to write the converted pages to the temp directory and then import them as PNG files. All pretty easy stuff and it turns out this is something the clients have asked for over the years.
Before we did not want to import straight PDF files as then we would need a PDF viewer in the code. Adobe used to be pretty good for a C/C++ developer allowing you to embed the viewer using OLE and other API calls but that came to a halt I think around version 9. It has never been a fun task in Java. With the conversion to a normal image format I don't have to worry about any of that. Each page can be rotated, annotated, redacted, etc. using our basic image tools.
Now I asked about the original utility program I wrote. Seems pointless for it to convert the PDF to *page#.png files now that we can import things. This morning I am going to add a checkbox to the GUI and another command line parameter to allow you to control the export of page files which will be off by default. They still need the patient ID extract and rename but probably will not use the page extract. With the code already written it might as well stay in as an option.
Funny how a one off request can turn into something that benefits every customer. There are days you just need to do a little research to find really cool simple solutions to problems. I also got to learn about a new library that may come in handy for other projects. The JAR file is not small but what it does is pretty powerful. We added it as a lazy load JAR to our JNLP file. Hopefully QA does not find any massive issues with the conversion. It worked fine on all the PDF files I tried it on but I know there are a lot of ways for those to be messed up.