Title

Desperately Seeking Standards: Using Text Processing to Save Your Time

Document Type

Conference Proceeding

Publication Date

7-26-2021

Publication Title

ASEE Annual Conference and Exposition, Conference Proceedings

Abstract

Purpose/Hypothesis: We aim to analyze our standards-use, interlibrary loan, and document-delivery-request data on a more regular basis to inform collections management decisions. However, manually searching for standards titles within interlibrary loan and document-delivery-request data is time-consuming and therefore unlikely to happen regularly. We were also interested in a method that could be applied to large blocks of text, such as theses and dissertations.

Design/Method: To detect engineering standards and other standards documents in tabular datasets as well as in large blocks of text, the first step was to develop a regular expression, using Python in Jupyter Notebooks. Regular expressions (or regex), used for text processing and querying, identify patterns within written text. The pattern was tested against sample text that included a series of known standards, such as ANS 10.5-2006. It was also checked against words and phrases it should not match, including web addresses and mathematical equations. As a proof of concept, the text processing code was evaluated against a collection of sample PDF dissertations, one of which included standards documents in the text and references list. Because a standard can be named in many different ways, we were unable to restrict the regex matching criteria any further. As a result, false positives appeared, such as the “state name and zip code” combination, report numbers, and chemical formulas. To help distinguish true matches from false positives, we expanded the regular expression to also pull the words surrounding each match, giving context to the results. This does not prevent false positives, but it allows us to quickly distinguish a false positive from an actual match. Once the pattern was established, it was applied (using Python and the pandas package) to compiled spreadsheets to identify standards in tabular collections data. We compared these results to an earlier manual search performed on the same data set. We also tested the text processing method on a set of dissertations.

Results: The new method required 25% less time to complete, and the outcomes were similar. Although we predicted that the text processing method would locate more standards than the manual search, it missed three standards that had previously been detected and located one standard that had not. The regular expression also successfully detected standards documents mentioned in large blocks of text.

Conclusion: We developed and assessed an open-source text processing method to flag potential standards mentioned in text and tabular datasets. This method is a substantial improvement over manual searching, providing similar results in a quarter of the time. The new method requires less than half a standard workday to analyze 10,000 interlibrary loan or document delivery requests. Our pilot test of the method on large blocks of text shows that it will also detect standards used in materials that are not regularly indexed for citations, such as theses and dissertations, as well as technical reports and other gray literature.
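The abstract does not reproduce the regular expression itself, so the following is only a minimal sketch of the general approach it describes: match an acronym followed by a numeric designation, then return the surrounding words as context. The pattern `STANDARD_RE`, the function `find_standards`, the `window` parameter, and the sample text are all illustrative assumptions, not the authors' code.

```python
import re

# Hypothetical pattern (an assumption, not the authors' expression):
# one or more uppercase acronyms, a separator, then a numeric designation.
STANDARD_RE = re.compile(
    r"\b[A-Z]{2,6}(?:/[A-Z]{2,6})*"  # issuing body, e.g. ANS or ANSI/ASHRAE
    r"[ -]"                          # space or hyphen separator
    r"[A-Z]{0,2}\d+(?:[.:-]\d+)*"    # designation, e.g. 10.5-2006 or D4169-16
    r"\b"
)


def find_standards(text, window=4):
    """Return (match, context) pairs for every candidate standard in text.

    The context holds up to `window` words on each side of the match so a
    reviewer can judge quickly whether the hit is a real standard.
    """
    results = []
    for m in STANDARD_RE.finditer(text):
        before = text[:m.start()].split()[-window:]
        after = text[m.end():].split()[:window]
        context = " ".join(before + [m.group()] + after)
        results.append((m.group(), context))
    return results


# Illustrative sample text: two standards, one address, one URL, one equation.
sample = (
    "Shielding calculations followed ANS 10.5-2006 and packaging tests used "
    "ASTM D4169-16. Send requests to Las Vegas, NV 89154 or see "
    "https://example.org/page-2 where x = 10.5 - 2."
)

for match, context in find_standards(sample):
    print(f"{match!r:20} | {context}")
```

In this sample the sketch flags both standards and skips the web address and the equation, but it also flags the state and zip code pair, which is the kind of false positive the abstract reports. The printed context makes that hit easy to dismiss. For tabular data, the same function could be mapped over a text column, for example with pandas `Series.apply`.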

Disciplines

Communication Technology and New Media | Programming Languages and Compilers
