Newspaper content is a valuable asset, and there is strong demand for article segmentation from both non-commercial readers and the media monitoring sector. Newspapers in searchable digital format are now accessible to many readers, and media monitoring companies require content at article level in a wide variety of formats.
Data Esperto has expertise and experience in working with contemporary newspapers and converting their content for use on a variety of digital platforms.
We have amassed a wealth of technical expertise in using state-of-the-art technology and in-house software to satisfy client demand.
Content from the daily press (broadsheets and tabloids), weekly press, weekend supplements, periodicals, and trade press is converted to digitized output in a variety of formats. For example, page-level PDF input files are converted to custom NITF XML files within 3 to 4 hours of receipt.
Our QA service ensures that critical classifications such as headline, sub-headline, byline (author) and body content are delivered with accuracy levels greater than 99.5%.
A typical conversion workflow is described below:
- PDF files are downloaded from the publisher
- Files are processed to extract all available text, image and structural data
- Visual proofing tools are used to segment articles, classify elements, and correct anomalies identified in the extracted content
- Article content is tagged and formatted in accordance with the output specification
- Output is parsed and validated in accordance with the DTD
- Output is checked for quality
- Output is uploaded to the delivery destination
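
The tagging and checking steps above can be illustrated with a minimal Python sketch. It is not our production code: the function names `build_nitf` and `missing_elements`, the article dictionary keys, and the list of required tags are all hypothetical, and the element layout follows the general NITF structure (`hedline`, `hl1`, `hl2`, `byline`, `body.content`) rather than any client's custom output specification. The standard library parser used here does not validate against a DTD, so the check is limited to well-formedness and the presence of a few key elements; full DTD validation would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimum set of tags an article must contain.
REQUIRED_TAGS = ("title", "hl1", "p")


def build_nitf(article: dict) -> ET.Element:
    """Tag one segmented article as a minimal NITF-style element tree."""
    nitf = ET.Element("nitf")
    head = ET.SubElement(nitf, "head")
    ET.SubElement(head, "title").text = article["headline"]

    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    hedline = ET.SubElement(body_head, "hedline")
    ET.SubElement(hedline, "hl1").text = article["headline"]
    # Sub-headline and byline are optional classifications.
    if article.get("sub_headline"):
        ET.SubElement(hedline, "hl2").text = article["sub_headline"]
    if article.get("byline"):
        ET.SubElement(body_head, "byline").text = article["byline"]

    content = ET.SubElement(body, "body.content")
    for paragraph in article["paragraphs"]:
        ET.SubElement(content, "p").text = paragraph
    return nitf


def missing_elements(root: ET.Element) -> list:
    """Return the required tags that do not occur anywhere in the tree."""
    return [tag for tag in REQUIRED_TAGS if next(root.iter(tag), None) is None]


if __name__ == "__main__":
    # Illustrative article data, not taken from a real publication.
    sample = {
        "headline": "Council approves new bridge",
        "sub_headline": "Work to begin next spring",
        "byline": "By A. Reporter",
        "paragraphs": ["First paragraph.", "Second paragraph."],
    }
    xml_text = ET.tostring(build_nitf(sample), encoding="unicode")
    # Re-parsing confirms the serialized output is well-formed XML.
    reparsed = ET.fromstring(xml_text)
    print(xml_text)
    print("Missing elements:", missing_elements(reparsed))
```

Building the tree with an XML library rather than by string concatenation means special characters in headlines and body text are escaped automatically, which removes one common source of parse failures at the validation step.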