Extract Text from DOCX

Much of the data you work with lives as text inside documents, images, and web pages, so text extraction is a common requirement. You may need to pull text or images from Word or PDF files, and as a C# developer you can do this programmatically. In this article, you will learn how to extract text from DOC or DOCX documents using C#.

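The following topics are covered:

  • C# API for Text Extraction
  • Extract Text from DOCX using C#
  • Get Formatted Text from DOCX using C#
  • Extract Formatted Text from Pages using C#
  • Get a Free License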

C# API for Text Extraction

I will use the GroupDocs.Parser for .NET API to extract text from DOCX documents. It can retrieve text, metadata, and images from Word, PDF, Excel, and PowerPoint files. It also supports raw, formatted, and structured text extraction.

You can either download the DLL or install it via NuGet.

Install-Package GroupDocs.Parser

Extract Text from DOCX using C#

Follow these steps to parse any document and extract plain text:

  • Create an instance of the Parser class
  • Pass the file path to its constructor
  • Call the GetText method
  • Receive the result in a TextReader object
  • Read the text with the ReadToEnd method

The code sample below demonstrates extracting text from a DOCX file with C#.
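
A minimal sketch of these steps, assuming the GroupDocs.Parser NuGet package is installed. The file name sample.docx and the class name are hypothetical placeholders, and the null check reflects the assumption that GetText returns null when plain text extraction is not supported for the format.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class ExtractPlainText
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        // The Parser constructor receives the path of the document to parse.
        using (Parser parser = new Parser(filePath))
        // GetText returns a TextReader over the document's plain text.
        using (TextReader reader = parser.GetText())
        {
            // Assumption: a null reader means text extraction is not supported.
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```

Both the parser and the reader are wrapped in using statements so that the file handle is released as soon as the text has been read.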

The Parser class provides parsing and extraction capabilities. The input file path is set in the class constructor.

The GetText() method extracts plain text from the specified document.

Get Formatted Text from DOCX using C#

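Use these steps to extract text while preserving style formatting:

  • Create an instance of the Parser class
  • Create a FormattedTextOptions object and set its Mode to HTML
  • Call the GetFormattedText method with these options
  • Receive the result in a TextReader object
  • Read the formatted text with the ReadToEnd method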

The code sample below shows how to extract formatted text from a DOCX file.
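
A minimal sketch of formatted extraction, assuming the GroupDocs.Parser NuGet package is installed and that FormattedTextOptions and FormattedTextMode live in the GroupDocs.Parser.Options namespace. The file name sample.docx and the class name are hypothetical placeholders.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedText
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        // Ask for HTML output so that style formatting is kept as markup.
        FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

        using (Parser parser = new Parser(filePath))
        using (TextReader reader = parser.GetFormattedText(options))
        {
            // Assumption: a null reader means formatted extraction is not supported.
            Console.WriteLine(reader == null
                ? "Formatted text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```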

The FormattedTextOptions class defines extraction options, including the Mode. Setting the mode to HTML returns the document text as HTML.

The GetFormattedText() method extracts formatted text from the specified document.

Extract Formatted Text from Pages using C#

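Follow these steps to extract formatted text from each page of a document:

  • Create an instance of the Parser class
  • Check the Features property to confirm that formatted text extraction is supported
  • Get the page count with the GetDocumentInfo method
  • Call the GetFormattedText method with the page index and a FormattedTextOptions object
  • Read the formatted text of the page with the ReadToEnd method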

The code sample below extracts formatted text from pages one by one.
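
A minimal sketch of page-by-page extraction, assuming the GroupDocs.Parser NuGet package is installed, that GetDocumentInfo reports the page count, and that GetFormattedText accepts a zero-based page index. The file name sample.docx and the class name are hypothetical placeholders.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedTextByPage
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Stop early when the format does not support formatted text.
            if (!parser.Features.FormattedText)
            {
                Console.WriteLine("Formatted text extraction isn't supported for this document.");
                return;
            }

            // Assumption: PageCount is 0 when the document has no pages.
            int pageCount = parser.GetDocumentInfo().PageCount;
            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

            // Page indexes are zero-based; print a one-based label for readers.
            for (int page = 0; page < pageCount; page++)
            {
                Console.WriteLine("Page {0}/{1}", page + 1, pageCount);
                using (TextReader reader = parser.GetFormattedText(page, options))
                {
                    // Guard against a null reader in case a page yields no result.
                    if (reader != null)
                    {
                        Console.WriteLine(reader.ReadToEnd());
                    }
                }
            }
        }
    }
}
```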

The Parser class exposes a Features property that indicates which capabilities, such as formatted text extraction, are supported for the loaded document. See the “Get Supported Features” section for details.
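
As a rough illustration, the sketch below prints a few of these flags for one document. It assumes the Features object exposes boolean Text, FormattedText, and Images properties. The file name sample.docx and the class name are hypothetical placeholders.

```csharp
using System;
using GroupDocs.Parser;

class ShowSupportedFeatures
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own document.
        string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Each flag reports whether the feature is available for this document.
            Console.WriteLine("Plain text: " + parser.Features.Text);
            Console.WriteLine("Formatted text: " + parser.Features.FormattedText);
            Console.WriteLine("Images: " + parser.Features.Images);
        }
    }
}
```

Checking a flag before calling the matching extraction method lets you skip unsupported formats without relying on a null result.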

Get a Free License

You can try the API without evaluation limits by requesting a free temporary license.

Conclusion

In this article, you learned how to extract text from Word documents using C#. Explore more about GroupDocs.Parser for .NET in the documentation. For questions, visit the forum.
