Extract Text from DOC or DOCX using C#

Much of the data you work with lives as text inside documents, images, and web pages, so text extraction is a common requirement. You may need to pull text or images out of Word or PDF files, and as a C# developer you can do this programmatically. In this article, you will learn how to extract text from DOC or DOCX documents using C#.

The following topics are covered:

C# API for Text Extraction
Extract Text from DOCX using C#
Get Formatted Text from DOCX using C#
Extract Formatted Text from Pages using C#

C# API for Text Extraction

I will use the GroupDocs.Parser for .NET API to extract text from DOCX documents. It can retrieve text, metadata, and images from Word, PDF, Excel, and PowerPoint files. It also supports raw, formatted, and structured text extraction.

You can either download the DLL or install it via NuGet.

Install-Package GroupDocs.Parser

Extract Text from DOCX using C#

Follow these steps to parse any document and extract plain text:

Create an instance of the Parser class
Provide the file path
Call the GetText method
Receive results in a TextReader object
Output the text with the ReadToEnd method

The code sample below demonstrates extracting text from a DOCX file with C#.
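The steps above can be sketched as follows. This is a minimal example, not a definitive implementation: it assumes the GroupDocs.Parser NuGet package is installed, and the file name sample.docx is a hypothetical placeholder for your own document.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class ExtractPlainText
{
    static void Main()
    {
        // Hypothetical input path (assumption); replace with your own DOC or DOCX file.
        const string filePath = "sample.docx";

        // The Parser constructor takes the path of the document to parse.
        using (Parser parser = new Parser(filePath))
        // GetText returns a TextReader, or null when plain text
        // extraction is not supported for the document.
        using (TextReader reader = parser.GetText())
        {
            Console.WriteLine(reader == null
                ? "Text extraction is not supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```

Both the parser and the reader are wrapped in using statements so that the file handle and reader are released even if extraction fails.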

The Parser class provides parsing and extraction capabilities. The input file path is passed to the class constructor.

The GetText() method extracts plain text from the specified document.

Get Formatted Text from DOCX using C#

Use these steps to extract text while preserving style formatting:

Create an instance of the Parser class
Provide the file path
Define FormattedTextOptions
Set FormattedTextMode to HTML
Call the GetFormattedText method
Receive results in a TextReader object
Output the text with the ReadToEnd method

The code sample below shows how to extract formatted text from a DOCX file.
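A minimal sketch of the steps above, under the same assumptions as before: the GroupDocs.Parser NuGet package is installed, and sample.docx is a hypothetical file name used only for illustration.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedText
{
    static void Main()
    {
        // Hypothetical input path (assumption); replace with your own DOC or DOCX file.
        const string filePath = "sample.docx";

        // Ask for the text as HTML so that style formatting is preserved.
        FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

        using (Parser parser = new Parser(filePath))
        // GetFormattedText returns null when formatted text
        // extraction is not supported for the document.
        using (TextReader reader = parser.GetFormattedText(options))
        {
            Console.WriteLine(reader == null
                ? "Formatted text extraction is not supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```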

The FormattedTextOptions class defines extraction options, including the Mode. Setting the mode to HTML returns the document text as HTML.

The GetFormattedText() method extracts formatted text from the specified document.

Extract Formatted Text from Pages using C#

Follow these steps to extract formatted text from a specific page:

Create an instance of the Parser class
Provide the file path
Verify that the Features.FormattedText property is true
Call GetDocumentInfo to obtain the page count
Ensure PageCount is greater than zero
Define FormattedTextOptions
Set FormattedTextMode to HTML
Call GetFormattedText for each page index
Receive results in a TextReader object
Output the text with the ReadToEnd method

The code sample below extracts formatted text from pages one by one.
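The page-by-page steps above might look like the following sketch. It assumes the GroupDocs.Parser NuGet package is installed, that page indexes are zero-based, and that sample.docx is a hypothetical file name; the null checks are defensive additions rather than part of the listed steps.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedTextByPage
{
    static void Main()
    {
        // Hypothetical input path (assumption); replace with your own DOC or DOCX file.
        const string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Stop early if the document does not support formatted text extraction.
            if (!parser.Features.FormattedText)
            {
                Console.WriteLine("Formatted text extraction is not supported for this document.");
                return;
            }

            // Read the document info to find out how many pages there are.
            IDocumentInfo documentInfo = parser.GetDocumentInfo();
            if (documentInfo == null || documentInfo.PageCount == 0)
            {
                Console.WriteLine("The document has no pages.");
                return;
            }

            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

            // Extract each page in turn, using a zero-based page index.
            for (int pageIndex = 0; pageIndex < documentInfo.PageCount; pageIndex++)
            {
                Console.WriteLine("Page {0}/{1}", pageIndex + 1, documentInfo.PageCount);

                using (TextReader reader = parser.GetFormattedText(pageIndex, options))
                {
                    // A null reader is handled defensively even after the feature check.
                    Console.WriteLine(reader == null ? string.Empty : reader.ReadToEnd());
                }
            }
        }
    }
}
```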

The Parser class exposes a Features property that indicates which capabilities are supported for the loaded document. See the “Get Supported Features” section of the documentation for details.
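As a rough illustration of checking capabilities before extracting, the sketch below reads two flags from the Features property. It assumes the GroupDocs.Parser NuGet package is installed and uses a hypothetical file name, sample.docx; only the Text and FormattedText flags are shown.

```csharp
using System;
using GroupDocs.Parser;

class CheckSupportedFeatures
{
    static void Main()
    {
        // Hypothetical input path (assumption); replace with your own document.
        const string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Each flag reports whether the corresponding extraction
            // feature is available for this particular document.
            Console.WriteLine("Plain text supported: {0}", parser.Features.Text);
            Console.WriteLine("Formatted text supported: {0}", parser.Features.FormattedText);
        }
    }
}
```

Checking a flag first lets you skip an unsupported extraction call instead of handling a null reader afterwards.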

Get a Free License

You can try the API without evaluation limits by requesting a free temporary license.

Conclusion

In this article, you learned how to extract text from Word documents using C#. Explore more about GroupDocs.Parser for .NET in the documentation. For questions, visit the forum.
