PDF2Structure

01S

I designed and developed the ML component of a system that identifies the structure (chapters, paragraphs…) of insurance documents in PDF format. The extracted text and structure are saved in XML format. Paragraph extraction is based on an ML classifier that uses formal and semantic text features. Both the methods used and the results obtained were excellent.
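As a rough illustration of the idea, and not the actual system, the flow from text lines to structured XML could be sketched as follows. Everything here is an assumption made for the example: the `Line` type, the feature names, the XML element names, and especially `classify`, which is a hand-written rule standing in for the trained classifier. Only formal (layout) features are shown; semantic features are left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Line:
    """One text line as it might come out of a PDF extractor (hypothetical type)."""
    text: str
    font_size: float
    bold: bool


def formal_features(line: Line) -> dict:
    """Compute layout-level ("formal") features for a line."""
    text = line.text.strip()
    return {
        "font_size": line.font_size,
        "bold": line.bold,
        "n_words": len(text.split()),
        "ends_with_period": text.endswith("."),
    }


def classify(features: dict) -> str:
    """Placeholder rule standing in for a trained classifier (assumption)."""
    is_short = features["n_words"] <= 8
    if features["bold"] and is_short and not features["ends_with_period"]:
        return "heading"
    return "body"


def to_xml(lines: list[Line]) -> ET.Element:
    """Group lines into chapters and paragraphs and build an XML tree."""
    root = ET.Element("document")
    chapter = None
    for line in lines:
        label = classify(formal_features(line))
        if label == "heading" or chapter is None:
            # A heading (or body text before any heading) opens a new chapter.
            chapter = ET.SubElement(root, "chapter")
            if label == "heading":
                chapter.set("title", line.text.strip())
                continue
        ET.SubElement(chapter, "paragraph").text = line.text.strip()
    return root


# Made-up sample lines, for illustration only.
sample = [
    Line("Article 1 - Scope of cover", 14.0, True),
    Line("The policy covers damage caused by fire.", 10.0, False),
    Line("Article 2 - Exclusions", 14.0, True),
    Line("Damage caused by war is excluded.", 10.0, False),
]
xml_text = ET.tostring(to_xml(sample), encoding="unicode")
print(xml_text)
```

In a real pipeline the rule in `classify` would be replaced by a model trained on labeled lines, and the feature dictionary would be extended with semantic features derived from the text itself.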

Stefano Fiorucci
NLP Engineer, Craftsman and Explorer 🧭 | Contributing to Haystack, the NLP/LLM Framework 🏗️