PDFBox in .NET Performance

PDFBox in .NET
PDFBox.NET is a .NET port of PDFBox created using IKVM.NET. The latest version (1.8.9) is available for download.
PDFBox.NET 1.7.0
.NET version of PDFBox library version 1.7.0. The download includes a compiled pdfbox.dll and all IKVM.NET dependencies.
PDFBox.NET 1.8.4
.NET version of PDFBox library version 1.8.4. The download includes a compiled pdfbox.dll and all IKVM.NET dependencies.
PDFBox.NET 1.8.7
.NET version of PDFBox library version 1.8.7. The download includes a compiled pdfbox.dll and all IKVM.NET dependencies.

Calling PDFBox through IKVM.NET carries a performance penalty. The following numbers give an idea of how fast PDFBox runs when called from .NET.

This experiment measures the time needed to extract text from a PDF file.

  • Both tests (.NET and Java) extract text from the US Copyright Act PDF file. The file is 5,354,317 bytes in size and has 336 pages. There is one big image on the first page; the other pages contain only text.
  • The same extraction routine was called three times in a row to see whether the time needed for warm-up is significant.
  • The size of the resulting text file is 981,998 bytes (both Java and .NET). The text files produced are identical.
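The measurement procedure described above can be sketched as a small timing harness. This is only an illustration, not the code used for the published numbers: the class name `ExtractionBenchmark`, the method `timeRounds`, and the placeholder workload are assumptions. In a real run the `Runnable` would wrap the PDFBox text-extraction routine.

```java
import java.util.concurrent.TimeUnit;

final class ExtractionBenchmark {

    /** Runs the task the given number of times and returns the wall-clock time of each round in ms. */
    static long[] timeRounds(int rounds, Runnable task) {
        long[] elapsedMs = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs[i] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Placeholder workload (hypothetical). Replace it with the routine that
        // loads the PDF and extracts its text through PDFBox.
        Runnable extractText = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 1_000_000) {
                throw new IllegalStateException("unexpected output size");
            }
        };

        // Three rounds in one program run, so warm-up shows up in the first one.
        long[] times = timeRounds(3, extractText);
        for (int i = 0; i < times.length; i++) {
            System.out.println((i + 1) + ". " + times[i] + " ms");
        }
    }
}
```

Running all rounds inside one process is what makes the first-round overhead visible separately from the steady-state time.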

Speed of PDF Parsing in .NET

Time taken by each of three consecutive parses of the PDF file within one program run:

  1. 13,626 ms
  2. 10,332 ms
  3. 10,345 ms

The first round took more than 13 seconds. The following rounds took about 10 seconds.

Speed of PDF Parsing in Java

Time taken by each of three consecutive parses of the PDF file within one program run:

  1. 5,260 ms
  2. 3,044 ms
  3. 3,006 ms

The first round took more than 5 seconds. The subsequent rounds took about 3 seconds.

Conclusion

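  • Running PDFBox in .NET through IKVM.NET was about 3x slower than running it in Java (roughly 10.3 seconds versus 3.0 seconds per round after warm-up).
  • That is noticeably slower, but not an order of magnitude slower.
  • Some time is needed for loading the libraries in both Java and .NET: the first round took about 3.3 seconds longer than the later rounds in .NET and about 2.2 seconds longer in Java. You may benefit from batching the PDF parsing tasks in both cases, so that this start-up cost is paid once per program run.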
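The figures behind these conclusions can be rechecked from the measured times. The sketch below is only arithmetic on the numbers listed above; the class name `SlowdownSummary` and its helper methods are hypothetical, and treating rounds 2 and 3 as "steady state" is an assumption made for illustration.

```java
import java.util.Locale;

final class SlowdownSummary {

    /** Mean of all rounds except the first (treated here as the steady state). */
    static double steadyMean(long[] timesMs) {
        long sum = 0;
        for (int i = 1; i < timesMs.length; i++) {
            sum += timesMs[i];
        }
        return (double) sum / (timesMs.length - 1);
    }

    /** Extra time spent in the first round compared with the steady state. */
    static double warmUpMs(long[] timesMs) {
        return timesMs[0] - steadyMean(timesMs);
    }

    public static void main(String[] args) {
        // Measured times in milliseconds, copied from the two result lists.
        long[] dotNet = {13_626, 10_332, 10_345};
        long[] jvm = {5_260, 3_044, 3_006};

        double steadyRatio = steadyMean(dotNet) / steadyMean(jvm);
        double firstRoundRatio = (double) dotNet[0] / jvm[0];

        System.out.println(String.format(Locale.ROOT,
                "Steady-state slowdown: %.2fx", steadyRatio));
        System.out.println(String.format(Locale.ROOT,
                "First-round slowdown: %.2fx", firstRoundRatio));
        System.out.println(String.format(Locale.ROOT,
                "Warm-up overhead: .NET %.1f ms, Java %.1f ms",
                warmUpMs(dotNet), warmUpMs(jvm)));
    }
}
```

With the times above, the steady-state slowdown comes out at roughly 3.4x and the first-round slowdown at roughly 2.6x, which is consistent with the "about 3x" summary.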