Large data volumes (aka “big data”), coupled with the use of new technologies, can greatly increase the amount of Personally Identifiable Information (PII) collected by an organization. Correspondingly, there has been an escalation of security breaches involving PII, which have contributed to the loss of millions of records over the past few years. The recommended mitigation strategy is to adopt a security posture in accordance with industry best practices, including adequate training for technology users. However, an organization cannot properly protect PII if it does not know that PII resides on its computers and servers. One solution is to purchase commercial products that scan for, extract, and report PII. Such products are often prohibitively expensive, and they tend to suffer from “feature bloat”, which makes them difficult to use and overly complex for simple use cases. A compromise is to develop scripted components that use regular expressions and keyword searches to discover PII instances in textual content. A significant challenge is that most PII content is encoded in common binary file formats (such as PDF), which are not directly searchable the way plain text is.
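As a rough illustration of the scripted regular-expression and keyword approach described above, the following is a minimal Python sketch. It is not code from the tool discussed in this webinar. The pattern set, the keyword list, and the function name `find_pii` are all assumptions chosen for illustration, and the patterns cover only a US-style Social Security number and a simple e-mail address.

```python
import re

# Illustrative patterns only (assumed for this sketch); real PII detection
# needs a broader, validated rule set.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Hypothetical keywords that hint at nearby PII.
PII_KEYWORDS = ("date of birth", "passport number")


def find_pii(text):
    """Return a list of (kind, matched_text) tuples found in plain text."""
    hits = []
    # Regular-expression pass: collect every match of every pattern.
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, m.group(0)) for m in pattern.finditer(text))
    # Keyword pass: case-insensitive substring search.
    lowered = text.lower()
    for keyword in PII_KEYWORDS:
        if keyword in lowered:
            hits.append(("keyword", keyword))
    return hits
```

A function like this only works on content that is already text, which is exactly the limitation noted above for binary formats such as PDF.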
This webinar will discuss a prototype software tool developed to detect and extract PII from over a thousand binary file formats. The prototype, called “BFAS – Binary File Application Scanner”, seamlessly injects a text extraction facility into the standard (existing) PowerShell pipeline. A graphical user interface (GUI) was developed to facilitate multiprocessing and XML-based reporting and visualization. Ideas for extending the BFAS architecture to leverage machine learning (ML) methods will also be discussed.
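The general shape of such a pipeline (extract text, scan it, report findings as XML) can be sketched in a few lines of Python. This is not the BFAS implementation, which is built around the PowerShell pipeline. The `extract_text` callable is a hypothetical stand-in for whatever text extraction facility is used, the XML element and attribute names are invented for this sketch, and only a single SSN-like pattern is checked.

```python
import re
import xml.etree.ElementTree as ET

# Single illustrative pattern (assumed): US-style Social Security number.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_files(paths, extract_text):
    """Yield (path, match) for each SSN-like string in each file's text.

    extract_text is a hypothetical callable (path -> str) standing in for
    a text extraction step that handles binary formats.
    """
    for path in paths:
        for m in SSN.finditer(extract_text(path)):
            yield path, m.group(0)


def to_xml_report(findings):
    """Build an XML string from (path, match) pairs.

    The <report>/<finding> layout is made up for this sketch.
    """
    root = ET.Element("report")
    for path, value in findings:
        hit = ET.SubElement(root, "finding", file=path, kind="ssn")
        hit.text = value
    return ET.tostring(root, encoding="unicode")
```

Passing the extractor in as a parameter keeps the scanning and reporting steps independent of any particular file format, so a dictionary lookup can stand in for real extraction when trying the sketch out.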