From PDFs to Data – Elites, Networks and Power in modern China

by 9999biz.com
Introduction

In the digital age, the field of history is undergoing a transformative shift, with an increasing reliance on digital archives and tools to conduct research and analysis. Recognizing the unique needs of historians in navigating and organizing vast quantities of historical texts and documents has therefore become a necessity.

Within the ENP-China initiative, we are developing algorithms to convert extensive textual content into “data”. The task is complex because the sources are varied and historians have specific requirements, which makes it hard to settle on a meaningful unit of analysis. Should the analysis cover an entire newspaper, or focus on individual articles, paragraphs, or even sentences? Given this heterogeneity, and to process new, smaller data sets quickly and efficiently, we have introduced the “ENP-Corpus Creator.” This application is a collaborative platform where IT joins forces with historians, guiding them through the construction of personalized corpora from the ground up, tailored to their research needs.

At the heart of ENP-Corpus Creator lies a suite of functionalities designed for the demands of historical research. The application begins by converting PDF documents into images, a first step that preserves the link between each page image and its text while still allowing an entire collection to be split easily.
It then applies Optical Character Recognition (OCR) technology, powered by Google, to convert each page into editable and searchable text with high accuracy.

Because historical documents carry a good deal of page furniture, ENP-Corpus Creator includes features that automatically identify and remove elements such as page numbers, headers, and footers, which only add noise during data creation.
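As an illustration of the kind of cleanup described above, here is a minimal sketch of one possible heuristic. It is not the application's actual implementation: the function names, the `min_share` threshold, and the page-number pattern are all assumptions made for this example. It treats a first or last line as a header or footer when it recurs on most pages, and drops edge lines that contain only a number.

```python
import math
import re
from collections import Counter

# Assumed pattern: an edge line holding only a short number, optionally dashed.
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")


def _trim_page_numbers(lines):
    """Drop a bare number sitting on the first or last line of a page."""
    if lines and PAGE_NUMBER.match(lines[0]):
        lines = lines[1:]
    if lines and PAGE_NUMBER.match(lines[-1]):
        lines = lines[:-1]
    return lines


def strip_page_furniture(pages, min_share=0.6):
    """Remove page numbers and recurring first/last lines from OCR page texts.

    pages: list of strings, one per page.
    min_share: hypothetical threshold; an edge line counts as a header or
    footer when it appears on at least this share of pages (and at least twice).
    """
    split_pages = [
        _trim_page_numbers([ln.strip() for ln in page.splitlines() if ln.strip()])
        for page in pages
    ]
    heads = Counter(lines[0] for lines in split_pages if lines)
    tails = Counter(lines[-1] for lines in split_pages if lines)
    threshold = max(2, math.ceil(len(pages) * min_share))

    cleaned = []
    for lines in split_pages:
        if lines and heads[lines[0]] >= threshold:
            lines = lines[1:]
        if lines and tails[lines[-1]] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


# Made-up sample pages, for illustration only.
sample = [
    "DAILY GAZETTE\nFirst page body text.\n12",
    "DAILY GAZETTE\nSecond page body text.\n13",
    "DAILY GAZETTE\nThird page body text.\n14",
]
cleaned_sample = strip_page_furniture(sample)
```

A heuristic like this can misfire, for instance on a body line that happens to be a bare year, which is one reason a manual review step remains useful.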

At the core of ENP-Corpus Creator is its interactive annotation system. Because historians are best positioned to understand the significance and context of their materials, the application invites them to take an active part in creating the corpus. Through an intuitive interface, users can annotate their documents, specifying how they wish to segment their data, whether at the article, paragraph, or page level. This customization ensures that the resulting corpus matches the historian’s research objectives and methodologies.

In the final step of corpus creation, ENP-Corpus Creator prompts users to input additional metadata. This metadata, along with the structured annotations, forms the backbone of the final output: organized text files and structured data in JSON format, ready for integration into a Solr server, which HistText uses for advanced search and analysis.

ENP-Corpus Creator offers historians a tool to navigate the digital frontier of research with precision, efficiency, and a deep respect for the rich complexities of their sources.

From PDFs to Texts

The application starts the digitization process by converting PDF documents into high-resolution images; it can also be used directly with images. It then uses Google Cloud’s OCR technology to extract the text from each image. This OCR technology was chosen for its accuracy and efficiency on complex document images, comparable to the capabilities demonstrated by Google Lens. As this service is not free, we ask users to provide their own API key to perform OCR. Please note that 300 credits are free when you create your account, so you can already perform OCR on many collections.
Other tools such as Tesseract (free) will soon be available.
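To make the OCR step more concrete, here is a hedged sketch of what a request with a user-supplied API key could look like. The post only says "Google Cloud's OCR technology", so this assumes the Cloud Vision REST endpoint `images:annotate` with the `DOCUMENT_TEXT_DETECTION` feature. The function names are hypothetical and this is not necessarily how ENP-Corpus Creator calls the service.

```python
import base64
import json
import urllib.request

# Assumption: the Cloud Vision REST endpoint; the post does not name the API.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes, language_hints=None):
    """Build the JSON payload for one page image (no network access)."""
    request = {
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    if language_hints:
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


def extract_text(response):
    """Pull the full page text out of a decoded JSON response."""
    first = (response.get("responses") or [{}])[0]
    return first.get("fullTextAnnotation", {}).get("text", "")


def ocr_page(image_bytes, api_key, language_hints=None):
    """Send one page to the service; requires a valid key and network access."""
    body = json.dumps(build_ocr_request(image_bytes, language_hints)).encode("utf-8")
    req = urllib.request.Request(
        f"{VISION_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

Keeping the payload construction separate from the network call makes it possible to check the request format without spending any credits.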

Annotating the source

A distinctive feature of the application is its interactive annotation module, which invites historians to engage directly with the digitization process. Users are prompted to annotate the digitized content, specifying their preferences for data segmentation—whether at the article, paragraph, or page level. This participatory approach allows for customized corpus creation and ensures that the digitized data aligns with the specific research objectives and methodologies of the historians.
To facilitate this, the application displays each image side by side with its corresponding text, so that users can easily match visual boundaries on the page to positions in the text. Keyboard shortcuts speed up this manual delineation. The same interface also lets users identify and remove undesired elements from their data, such as OCR inaccuracies or irrelevant metadata.
This step then generates a file that identifies the delimiters needed to convert the text into data.

The example above shows the annotation of Announcement as the first level of separation. Levels can range from 1 to 9, and are used to identify the nesting of data, as with chapters and subchapters.
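The nesting of levels can be pictured as a tree. The sketch below is only an illustration under stated assumptions: the post does not describe the format of the delimiter file, so the `(line_index, level, label)` tuples and the `build_tree` function are hypothetical. A marker at level N closes any open segment at level N or deeper and opens a new segment under the nearest shallower one, much like chapters and subchapters.

```python
def build_tree(lines, markers):
    """Nest text lines according to annotation markers.

    lines: list of text lines.
    markers: hypothetical list of (line_index, level, label) tuples, where
    level is between 1 and 9 and each marker opens a segment at that line.
    Returns a root node; every node is a dict with level, label, text, children.
    """
    root = {"level": 0, "label": "root", "text": [], "children": []}
    stack = [root]
    starts = {index: (level, label) for index, level, label in markers}

    for i, line in enumerate(lines):
        if i in starts:
            level, label = starts[i]
            if not 1 <= level <= 9:
                raise ValueError(f"level must be between 1 and 9, got {level}")
            # Close every open segment at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {"level": level, "label": label, "text": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        # Each line belongs to the innermost open segment.
        stack[-1]["text"].append(line)
    return root


# Made-up sample text, for illustration only.
sample_lines = [
    "ANNOUNCEMENTS",
    "Mr. A has arrived.",
    "Details follow.",
    "SHIPPING",
    "The S.S. X sails Friday.",
]
sample_markers = [(0, 1, "Announcement"), (2, 2, "Paragraph"), (3, 1, "Announcement")]
tree = build_tree(sample_lines, sample_markers)
```

Here the level-2 segment is stored as a child of the first level-1 segment, and the second level-1 marker closes both before opening a new top-level segment.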

From Texts to Data

In the final phase, the application solicits additional metadata from the users, integrating this information with the annotated text to generate structured data outputs. The text files are organized according to the specified segmentation preferences, and the structured data is formatted into a JSON file. This structured format is particularly designed for compatibility with Solr servers, used by HistText, facilitating efficient data indexing and retrieval, which is paramount for large-scale historical research projects.
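As a rough sketch of this last step, the code below merges user-supplied metadata into each annotated segment and serializes the result as a JSON array of flat documents, a shape that Solr's JSON update handler accepts. The field names (`id`, `title`, `text`, `source`, `date`), the identifier scheme, and the function names are assumptions for illustration. The actual schema used by HistText is not given in the post.

```python
import json


def to_solr_docs(segments, metadata, corpus_id):
    """Combine corpus-level metadata with each segment into flat documents.

    segments: list of dicts with hypothetical keys "title" and "text".
    metadata: dict of user-supplied fields copied onto every document.
    corpus_id: prefix used to build a unique "id" per document.
    """
    docs = []
    for number, segment in enumerate(segments, start=1):
        doc = dict(metadata)  # copy so the caller's dict is left untouched
        doc.update(
            {
                "id": f"{corpus_id}_{number:04d}",
                "title": segment["title"],
                "text": segment["text"],
            }
        )
        docs.append(doc)
    return docs


def dump_docs(docs):
    """Serialize documents as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(docs, ensure_ascii=False, indent=2)


# Made-up sample values, for illustration only.
sample_segments = [
    {"title": "Announcement 1", "text": "Mr. A has arrived."},
    {"title": "Announcement 2", "text": "The S.S. X sails Friday."},
]
sample_metadata = {"source": "Sample Gazette", "date": "1930-01-01"}
sample_docs = to_solr_docs(sample_segments, sample_metadata, "sample")
sample_json = dump_docs(sample_docs)
```

Whatever field names are used, they have to match the fields defined in the target Solr schema for the documents to be indexed correctly.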



Cite this blog post
Baptiste Blouin (2024, February 27). ENP-Corpus Creator: From PDFs to Data. Elites, Networks and Power in modern China. Retrieved February 27, 2024, from https://enepchina.hypotheses.org/5482
