PDF2Structure
01S
I designed and developed the ML component of a system that identifies the structure (chapters, paragraphs, …) of insurance documents in PDF format. The extracted text and structure are saved as XML. Paragraph extraction relies on an ML classifier that uses both formal and semantic text features. Both the methods used and the results obtained were excellent.
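To illustrate the general idea, here is a minimal Python sketch of such a pipeline, not the project's actual code. It assumes the text lines and their font sizes have already been extracted from the PDF. The feature names, the `classify` threshold rule (a placeholder standing in for the trained classifier), and the XML element names `document`, `chapter`, `title` and `paragraph` are all hypothetical. Only a few formal features are shown; semantic features are left out.

```python
import re
import xml.etree.ElementTree as ET


def line_features(text, font_size, body_font_size=10.0):
    """Compute a few formal features for one extracted line (illustrative names)."""
    stripped = text.strip()
    return {
        "rel_font_size": font_size / body_font_size,
        "is_numbered": bool(re.match(r"^\d+(\.\d+)*\.?\s", stripped)),
        "n_words": len(stripped.split()),
        "ends_with_period": stripped.endswith("."),
    }


def classify(features):
    """Placeholder rule standing in for the trained ML classifier (assumption)."""
    if (
        features["rel_font_size"] > 1.2
        and features["n_words"] <= 12
        and not features["ends_with_period"]
    ):
        return "heading"
    return "paragraph"


def to_xml(lines):
    """Turn (text, font_size) pairs into an XML string with chapters and paragraphs."""
    root = ET.Element("document")
    chapter = None
    for text, font_size in lines:
        label = classify(line_features(text, font_size))
        if label == "heading" or chapter is None:
            # A heading, or body text before any heading, opens a new chapter.
            chapter = ET.SubElement(root, "chapter")
            if label == "heading":
                ET.SubElement(chapter, "title").text = text.strip()
                continue
        ET.SubElement(chapter, "paragraph").text = text.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [
        ("1. General Conditions", 14.0),
        ("This policy covers damage caused by fire.", 10.0),
        ("2. Exclusions", 14.0),
        ("War risks are excluded.", 10.0),
    ]
    print(to_xml(sample))
```

In a real system the rule in `classify` would be replaced by a model trained on labeled lines, with the feature dictionary extended by semantic features derived from the text itself.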