PDF2Structure


Last updated on Jun 22, 2022 | Work projects

I designed and developed the ML component of a system that identifies the structure (chapters, paragraphs, and so on) of an insurance document in PDF format. The extracted text and structure are saved in XML format. Paragraph extraction is based on an ML classifier that uses both formal and semantic text features. Both the methods used and the results obtained were excellent.
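To make the idea more concrete, here is a minimal sketch of two pieces of such a pipeline: computing a few formal features for a text line, and serializing already-labeled lines to XML. Everything in it is an assumption made for illustration: the feature names, the labels (`chapter`, `par_start`, `par_cont`), the XML tag names, and the helper functions `formal_features` and `to_xml` are hypothetical and are not the project's actual schema or code. In the real system the labels would come from the trained classifier, not from hand-written input.

```python
import re
import xml.etree.ElementTree as ET


def formal_features(line: str) -> dict:
    """Hypothetical formal (surface-level) features for one text line."""
    stripped = line.strip()
    return {
        "n_chars": len(stripped),
        "is_upper": stripped.isupper(),
        # Matches numbering such as "1. ", "2.3 " or "4) " at the line start.
        "starts_with_number": bool(re.match(r"\d+(\.\d+)*[.)]?\s", stripped)),
        "ends_with_period": stripped.endswith("."),
    }


def to_xml(labeled_lines) -> str:
    """Serialize (label, text) pairs to XML.

    Assumed labels: 'chapter', 'par_start', 'par_cont'.
    """
    root = ET.Element("document")
    chapter = None
    paragraph = None
    for label, text in labeled_lines:
        if label == "chapter":
            chapter = ET.SubElement(root, "chapter", title=text)
            paragraph = None
        elif label == "par_start" or paragraph is None:
            # Open a new paragraph under the current chapter, if any.
            parent = chapter if chapter is not None else root
            paragraph = ET.SubElement(parent, "paragraph")
            paragraph.text = text
        else:
            # Continuation line: append to the open paragraph.
            paragraph.text += " " + text
    return ET.tostring(root, encoding="unicode")


# Made-up example lines, with labels supplied by hand.
labeled = [
    ("chapter", "1. COVERAGE"),
    ("par_start", "The policy covers"),
    ("par_cont", "fire damage."),
    ("par_start", "Theft is excluded."),
]
xml_output = to_xml(labeled)
```

Semantic features (for example word embeddings) would be added alongside the formal ones before the lines are passed to a sequence model; that part is omitted here.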

text classification, word embeddings, conditional random fields, fastText, auto-sklearn, sklearn-crfsuite


Stefano Fiorucci

NLP Engineer, Craftsman and Explorer 🧭 | Contributing to Haystack, the NLP/LLM Framework 🏗️