Large data volumes (“big data”), coupled with the use of new technologies, can greatly increase the amount of Personally Identifiable Information (PII) an organization collects. Correspondingly, security breaches involving PII have escalated, contributing to the loss of millions of records over the past few years. The recommended mitigation strategy is to adopt security postures in accordance with industry best practices, which include adequate training for technology users. However, an organization cannot properly protect PII if it does not know where PII resides on its computers and servers. One solution is to purchase commercial products that scan for, extract, and report PII. Such products are often prohibitively expensive, and they tend to suffer from “feature bloat,” which makes them overly complex for simple use cases. A compromise is to develop scripted components that use regular expressions and keyword searches to discover PII instances in textual content. A significant challenge is that most PII content is stored in common binary file formats (such as PDF), which cannot be searched directly the way plain text can.
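The regular-expression and keyword approach described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the prototype's own scripting environment, and the pattern set, the keyword list, and the `find_pii` name are all assumptions made for illustration; they are not taken from any CSIAC tool, and real PII detection would need broader, validated rules.

```python
import re

# Illustrative patterns only (assumed for this sketch, US-centric formats).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

# Hypothetical keyword list; matched case-insensitively as plain substrings.
KEYWORDS = ("date of birth", "social security", "passport")


def find_pii(text):
    """Return a list of (kind, matched_text) pairs found in plain text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, m.group(0)) for m in pattern.finditer(text))
    lowered = text.lower()
    hits.extend(("keyword", kw) for kw in KEYWORDS if kw in lowered)
    return hits


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com, SSN 123-45-6789, date of birth on file."
    for kind, value in find_pii(sample):
        print(kind, value)
```

A scan like this only works on text that has already been extracted, which is why binary formats such as PDF need a separate extraction step first.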
This webinar will discuss a CSIAC-developed prototype for detecting and extracting PII from over a thousand binary file formats by leveraging the widely used open-source Apache Tika toolkit. The prototype, called “BFAS – Binary File Application Scanner”, integrates Tika through a custom PowerShell cmdlet that injects a text extraction facility into the standard (existing) PowerShell pipeline. A graphical user interface (GUI) was developed to facilitate multiprocessing and XML-based reporting and visualization. Ideas for extending the BFAS architecture to leverage machine learning (ML) methods will also be discussed.
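The extract-then-scan pipeline outlined above can be sketched conceptually as a chain of stages. This is not the BFAS implementation: it uses Python generators in place of a PowerShell pipeline, a stand-in extractor in place of Tika, and a hypothetical XML report layout. All function names and the `fake_store` data are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # single illustrative pattern


def extract_stage(paths, extract_text):
    """Yield (path, text) pairs; extract_text is the pluggable extractor."""
    for path in paths:
        yield path, extract_text(path)


def scan_stage(documents):
    """Yield (path, match) for every pattern hit in the extracted text."""
    for path, text in documents:
        for match in SSN.finditer(text):
            yield path, match.group(0)


def to_xml_report(findings):
    """Serialize findings as a small XML report (hypothetical schema)."""
    root = ET.Element("report")
    for path, value in findings:
        ET.SubElement(root, "finding", file=path, kind="ssn").text = value
    return ET.tostring(root, encoding="unicode")


# Stand-in extractor for the demo: maps a file name to already-extracted text.
# A real extractor would hand the file to a text extraction toolkit instead.
fake_store = {"a.pdf": "SSN 123-45-6789", "b.docx": "no identifiers"}
report = to_xml_report(scan_stage(extract_stage(fake_store, fake_store.get)))
print(report)
```

Keeping extraction behind a single callable means the scanning and reporting stages do not need to know which file formats are supported, which mirrors the idea of injecting text extraction into an existing pipeline.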