How to extract plain text from a PDF file using the PDFBox.NET library. A sample Visual Studio project (VB) is available for download.
Downloads
This sample requires the following DLLs from the PDFBox.NET package.
Add these to the project as references:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
In addition to these libraries, it is necessary to copy the following files to the application directory:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
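A file missing from this list tends to surface only at run time, as an assembly load error. As a minimal sketch, the check below reports which of the files listed above are absent from a folder. The module name DependencyCheck and the function FindMissingFiles are hypothetical helpers introduced here for illustration; they are not part of PDFBox.NET, and the sketch uses only System.IO.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DependencyCheck
    ' File names taken from the list above (hypothetical helper, not part of PDFBox.NET).
    Private ReadOnly RequiredFiles As String() = {
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    }

    ' Returns the names of the required files that are not present in the given folder.
    Public Function FindMissingFiles(ByVal folder As String) As List(Of String)
        Dim missing As New List(Of String)()
        For Each name As String In RequiredFiles
            If Not File.Exists(Path.Combine(folder, name)) Then
                missing.Add(name)
            End If
        Next
        Return missing
    End Function

    Sub Main()
        ' Check the directory the application is running from.
        Dim missing As List(Of String) = FindMissingFiles(AppDomain.CurrentDomain.BaseDirectory)
        If missing.Count > 0 Then
            Console.WriteLine("Missing files: " & String.Join(", ", missing))
        Else
            Console.WriteLine("All required files are present.")
        End If
    End Sub
End Module
```

Calling FindMissingFiles at startup, before any PDFBox type is touched, gives a readable message instead of a load failure deep inside the first PDDocument call.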
Sample code (VB):
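' PDFBox 1.8.x namespaces (place these at the top of the source file).
Imports org.apache.pdfbox.pdmodel
Imports org.apache.pdfbox.util

' Extracts the plain text of the PDF file at the given path.
Private Shared Function parseUsingPDFBox(ByVal input As String) As String
    Dim doc As PDDocument = Nothing
    Try
        ' Load the document from disk.
        doc = PDDocument.load(input)
        Dim stripper As New PDFTextStripper()
        Return stripper.getText(doc)
    Finally
        ' Always release the document, even if extraction fails.
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try
End Function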
See also how to convert PDF to text in C# (.NET).