RATOM Hackathon 2020: Email Processing Tools, Scripts, and Workflows for Archives

As the current iteration of the RATOM project comes to a close, our team wanted to provide an opportunity to foster community engagement with the software and its surrounding work. From October 19 to 21, 2020, we hosted a hackathon to encourage community members to explore the software across three lightly scheduled and loosely structured days. The intent was to dedicate time and digital space for participants to interact with RATOM among fellow archival workers and with the guidance of our team.

We provided recommended topics for those interested in the project, including CLI and scripting, the RATOM web tool, and workflow or prototype development. Work and discussions largely took place on designated Slack channels and during a check-in Zoom session. RATOM’s Technical Lead, Kam Woods, led the hackathon’s opening kick-off meeting and provided reference materials about the software, installation guides, and sample data sets for participants to use. RATOM Software Engineer Antoine de Torcy, Co-PI Camille Tyndall Watson, and Investigator Jamie Patrick-Burns worked with hackathon participants, fielding a range of software usability questions and providing firsthand technical and archival expertise.

The CLI and scripting channel saw consistent activity across all three days. Most discussions revolved around RATOM-specific capabilities, but general threads also covered, for example, the pros and cons of tools such as SQLite and DBeaver. Participants asked whether the email system type (Gmail, Microsoft, etc.) would be included in metadata outputs. Others noted how the accuracy of entities extracted through spaCy varied with factors such as third-party tagging, incomplete sentences, and acronyms. Individuals also raised speed differences between older and newer email formats. Lengthier threads worked through issues of null headers resulting from notes and calendar data being extracted from the bodies of emails. Finally, participants suggested workflow modifications to the RATOM team. One participant specifically suggested a workflow in which formatted JSON files are used to export EML files for other tasks such as extracting, rendering, and indexing.

Beyond CLI and scripting, participants considered alternative approaches to access and delivery in the context of fulfilling records requests. This included case-specific examples of current delivery methods such as FTP, hard drives, and public interfaces. Conversations also touched on levels of digital literacy and education as broader yet necessary components of these workflows that could be examined and possibly integrated into future work. PDFs were brought up alongside Adobe’s redaction capabilities and the possibility of archiving to PDF, although the latter would require further development of the RATOM tool specifically. Closing remarks underscored the importance of understanding audience needs and how processes may need to function on a case-by-case basis depending on a number of researcher-related variables.

Collaborative troubleshooting of minor installation hurdles, Jupyter notebooks, mybinder, and Xcode version compatibility occurred throughout the hackathon, as did conversations about a future version of the web tool that could be deployed to a local container. Although we did not intend to use the hackathon for beta testing, the RATOM team will be able to implement a number of changes based on participant feedback. General comments from participants indicated that existing documentation was, for the most part, easy to understand and that libratom was simple to download. Still, we are working on a number of documentation modifications after watching participants work through it individually. We also had build issues related to compatibility with Xcode 12 on macOS, which have been resolved in a recent update to the codebase and are now tested automatically in our continuous integration workflow.

The RATOM team would like to note it was especially rewarding to see so many participants representing state or government archives during the hackathon. Although some identified themselves as “non-technical” people, it was exciting to see communities that are not typically part of these conversations engaging with the software and with broader discussions as they took place on Slack and Zoom.

You can learn more about the RATOM project here, or you can follow our work on GitHub here.

Registration now open for the RATOM 2020 Hackathon!

Registration for the 2020 RATOM Hackathon is now open!

Click here to register

Dates: October 19-21, 2020

This is a chance to get involved with other members of the community working on email preservation, dig into state-of-the-art open source tools, learn some new skills, or simply explore a new problem space and get feedback in a friendly and inclusive environment.

The hackathon is a free virtual event, and is open to everyone. We’ve provided three suggested areas of interest in the registration to help bring together those interested in similar features, workflows, or automation tasks.

Recently released: libratom 0.4.3

We’ve released libratom 0.4.3, which includes a new flag option (-m) for the entity and report CLI commands, allowing users to populate the message table of the sqlite3 database output with message bodies (stripped of markup and inline attachments) and headers. This option is intended to facilitate a broader range of dataset production workflows, particularly supporting downstream statistical content analysis and ML tasks.
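As a rough illustration of the kind of downstream use this enables, the sketch below opens a database file with Python's standard sqlite3 module and reports which columns the message table contains and how many rows it holds. Only the table name "message" is taken from the description above; the function name and the file path in the usage note are hypothetical, and the script inspects the schema rather than assuming any particular column names.

```python
import sqlite3
from contextlib import closing


def summarize_message_table(db_path, table="message"):
    """Return (column_names, row_count) for one table in a SQLite file.

    The default table name follows the release note above; the actual
    columns are discovered at run time, not assumed.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if not columns:
            raise ValueError(f"no table named {table!r} in {db_path}")
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return columns, count
```

Calling summarize_message_table("output.sqlite3"), where the path is a placeholder for your own database file, is a quick way to confirm that bodies and headers were populated before handing the data to an analysis tool.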

This release also includes additional improvements to the feedback provided by existing commands, bug fixes, and dependency updates.

You can download this release on PyPI, or read detailed instructions on installation and usage on GitHub.

Recently released: libratom 0.4.1

We’ve released libratom 0.4.1, which introduces a new CLI command to allow users to batch export preselected messages from PST files using machine-generated JSON. This command is intended to complement our web interface (in development; see the ratom-deploy repository on GitHub), which allows collecting organizations to select, appraise, and export messages for delivery. More info coming soon!

This release also includes additional improvements to the feedback provided by existing commands, bug fixes, and dependency updates.

You can download this release on PyPI, or read detailed instructions on installation and usage on GitHub.

Recently released: libratom 0.3.0

We’ve released libratom 0.3.0, which introduces a new CLI command to inspect locally installed spaCy models and install specific versions of models as desired. This release includes additional minor bug fixes and dependency updates.
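For readers who want a quick look at their environment from Python, here is a minimal sketch that lists installed packages whose names look like spaCy model packages, along with their versions. It is not the libratom command and does not use libratom or spaCy at all: it relies only on the standard library, and the name pattern is a heuristic assumption (names such as en_core_web_sm), so models with other naming schemes will be missed. Both function names are hypothetical.

```python
import re
from importlib import metadata

# Heuristic (an assumption): names such as en_core_web_sm or de_core_news_md.
MODEL_NAME = re.compile(r"^[a-z]{2,3}_core_(web|news)_(sm|md|lg|trf)$")


def filter_model_packages(packages):
    """Keep (name, version) pairs whose name matches the model pattern."""
    found = {}
    for name, version in packages:
        if not name:
            continue  # skip distributions with missing metadata
        # Package indexes may report hyphens where the import name has underscores.
        normalized = name.lower().replace("-", "_")
        if MODEL_NAME.match(normalized):
            found[normalized] = version
    return dict(sorted(found.items()))


def installed_model_packages():
    """Scan the current Python environment for model-like packages."""
    return filter_model_packages(
        (dist.metadata["Name"], dist.version) for dist in metadata.distributions()
    )
```

Running installed_model_packages() returns a dictionary mapping each matching package name to its installed version, or an empty dictionary if nothing matches.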

You can download this release on PyPI, or read detailed instructions on installation and usage on GitHub.