
Much data lives as text inside documents, images, and web pages, which makes text extraction a common requirement. You may need to pull text or images from Word or PDF files, and as a C# developer you can do this programmatically. In this article, you will learn how to extract text from DOC or DOCX documents using C#.
The following topics are covered:
- C# API for Text Extraction
- Extract Text from DOCX using C#
- Get Formatted Text from DOCX using C#
- Extract Formatted Text from Pages using C#
C# API for Text Extraction
I will use the GroupDocs.Parser for .NET API to extract text from DOCX documents. It can retrieve text, metadata, and images from Word, PDF, Excel, and PowerPoint files. It also supports raw, formatted, and structured text extraction.
You can either download the DLL or install the package from NuGet, for example by running the following command in the Package Manager Console:
Install-Package GroupDocs.Parser
Extract Text from DOCX using C#
Follow these steps to parse any document and extract plain text:
- Create an instance of the Parser class
- Provide the file path
- Call the GetText method
- Receive results in a TextReader object
- Output the text with the ReadToEnd method
The code sample below demonstrates extracting text from a DOCX file with C#.
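The steps above can be sketched as follows. This is a minimal sketch, not the article's original sample: it assumes the GroupDocs.Parser NuGet package is installed, and the file name `sample.docx` is a hypothetical placeholder.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class ExtractPlainText
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        // Parser is disposable, so it is wrapped in a using block.
        using (Parser parser = new Parser(filePath))
        // GetText returns a TextReader, or null when text extraction
        // is not supported for the document.
        using (TextReader reader = parser.GetText())
        {
            Console.WriteLine(reader == null
                ? "Text extraction is not supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```

The null check matters because an unsupported document yields no reader at all rather than an empty one.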

The Parser class provides parsing and extraction capabilities. The input file path is passed to the class constructor.
The GetText() method extracts plain text from the specified document.
Get Formatted Text from DOCX using C#
Use these steps to extract text while preserving style formatting:
- Create an instance of the Parser class
- Provide the file path
- Define FormattedTextOptions
- Set FormattedTextMode to HTML
- Call the GetFormattedText method
- Receive results in a TextReader object
- Output the text with the ReadToEnd method
The code sample below shows how to extract formatted text from a DOCX file.
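The steps above might look like the sketch below. It assumes the GroupDocs.Parser NuGet package is installed and that `FormattedTextOptions` and `FormattedTextMode` live in the `GroupDocs.Parser.Options` namespace. The file name `sample.docx` is a hypothetical placeholder.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedText
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        // Ask for the document text as HTML so style formatting is preserved.
        FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

        using (Parser parser = new Parser(filePath))
        // GetFormattedText returns null when formatted text extraction
        // is not supported for the document.
        using (TextReader reader = parser.GetFormattedText(options))
        {
            Console.WriteLine(reader == null
                ? "Formatted text extraction is not supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```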

The FormattedTextOptions class defines extraction options, including the Mode. Setting the mode to HTML returns the document text as HTML.
The GetFormattedText() method extracts formatted text from the specified document.
Extract Formatted Text from Pages using C#
Follow these steps to extract formatted text from a specific page:
- Create an instance of the Parser class
- Provide the file path
- Verify that the FormattedText property of Features is true
- Call GetDocumentInfo to obtain the page count
- Ensure PageCount is greater than zero
- Define FormattedTextOptions
- Set FormattedTextMode to HTML
- Call GetFormattedText for each page index
- Receive results in a TextReader object
- Output the text with the ReadToEnd method
The code sample below extracts formatted text from pages one by one.
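A page-by-page version of these steps could be sketched as shown here. It assumes the GroupDocs.Parser NuGet package is installed, that `IDocumentInfo`, `FormattedTextOptions`, and `FormattedTextMode` come from `GroupDocs.Parser.Options`, and that page indexes are zero-based. The file name `sample.docx` is a hypothetical placeholder.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

class ExtractFormattedTextFromPages
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own DOC or DOCX file.
        string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Stop early if the document does not support formatted text extraction.
            if (!parser.Features.FormattedText)
            {
                Console.WriteLine("Formatted text extraction is not supported for this document.");
                return;
            }

            // The document info carries the page count.
            IDocumentInfo documentInfo = parser.GetDocumentInfo();
            if (documentInfo.PageCount == 0)
            {
                Console.WriteLine("The document has no pages.");
                return;
            }

            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Html);

            // Extract each page in turn; the index passed to the API starts at 0.
            for (int pageIndex = 0; pageIndex < documentInfo.PageCount; pageIndex++)
            {
                Console.WriteLine("Page {0}/{1}", pageIndex + 1, documentInfo.PageCount);

                using (TextReader reader = parser.GetFormattedText(pageIndex, options))
                {
                    Console.WriteLine(reader.ReadToEnd());
                }
            }
        }
    }
}
```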

The Parser class exposes a Features property that indicates supported capabilities. See the “Get Supported Features” section for details.
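As a rough illustration, the Features property can be checked before choosing an extraction method. The sketch assumes the GroupDocs.Parser NuGet package is installed and uses only the `Text` and `FormattedText` flags. The file name `sample.docx` is a hypothetical placeholder.

```csharp
using System;
using GroupDocs.Parser;

class CheckSupportedFeatures
{
    static void Main()
    {
        // Hypothetical input path; replace it with your own document.
        string filePath = "sample.docx";

        using (Parser parser = new Parser(filePath))
        {
            // Each flag reports whether the loaded document supports that kind of extraction.
            Console.WriteLine("Plain text extraction supported: " + parser.Features.Text);
            Console.WriteLine("Formatted text extraction supported: " + parser.Features.FormattedText);
        }
    }
}
```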
Get a Free License
You can try the API without evaluation limits by requesting a free temporary license.
Conclusion
In this article, you learned how to extract plain and formatted text from Word documents using C#. To learn more about GroupDocs.Parser for .NET, see the documentation. For questions, visit the forum.