• Home
  • Resources
    • Find Resources by Topic Tags
    • Cybersecurity Policy Chart
    • CSIAC Reports
    • Webinars
    • Podcasts
    • Cybersecurity Digest
    • Standards & Reference Docs
    • Journals
    • Certifications
    • Acronym DB
    • Cybersecurity Related Websites
  • Services
    • Free Technical Inquiry
    • Core Analysis Task (CAT) Program
    • Subject Matter Expert (SME) Network
    • Training
    • Contact Us
  • Community
    • Upcoming Events
    • Cybersecurity
    • Modeling & Simulation
    • Knowledge Management
    • Software Engineering
  • About
    • About the CSIAC
    • The CSIAC Team
    • Subject Matter Expert (SME) Support
    • DTIC’s IAC Program
    • DTIC’s R&E Gateway
    • DTIC STI Program
    • FAQs
  • Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar
  • Skip to footer
Login / Register

CSIAC

Cyber Security and Information Systems Information Analysis Center

  • Resources
    • Find Resources by Topic Tags
    • Cybersecurity Policy Chart
    • CSIAC Reports
    • Webinars
    • Podcasts
    • Cybersecurity Digest
    • Standards & Reference Docs
    • Journals
    • Certifications
    • Acronym DB
    • Cybersecurity Websites
  • Services
    • Free Technical Inquiry
    • Core Analysis Task (CAT) Program
    • Subject Matter Expert (SME) Network
    • Training
    • Contact
  • Community
    • Upcoming Events
    • Cybersecurity
    • Modeling & Simulation
    • Knowledge Management
    • Software Engineering
  • About
    • About the CSIAC
    • The CSIAC Team
    • Subject Matter Expert (SME) Support
    • DTIC’s IAC Program
    • DTIC’s R&E Gateway
    • DTIC STI Program
    • FAQs
  • Cybersecurity
  • Modeling & Simulation
  • Knowledge Management
  • Software Engineering
/ Journal Issues / Cyber Science & Technology at the Army Research Laboratory (ARL) / Machine Learning and Network Intrusion Detection: Results from Grammatical Inference

Machine Learning and Network Intrusion Detection: Results from Grammatical Inference

Published in Journal of Cyber Security and Information Systems
Volume: 5 Number: 1 - Cyber Science & Technology at the Army Research Laboratory (ARL)

Author: Dr. Richard Harang
Posted: 01/23/2017 | Leave a Comment

Machine learning and Intrusion detection

The literature on machine learning and intrusion detection is vast (see references in [1] for a partial overview; also, short reviews by [18] and [19] which contain more details about specific machine learning methods that have been attempted); however, it divides broadly into the two main categories of “anomaly detection” and “signature inspired” [20] (in machine learning terminology, “supervised learning”). Anomaly detection, including such systems as Anagram [21] and McPAD [22], focuses on constructing a “normal” model of traffic and producing alerts when traffic that does not fit this model is observed. Supervised learning systems (see [23], [24], and [25] for representative examples) are provided with both malicious and benign traffic, and attempt to learn rules to distinguish them. While less common in the domain of intrusion detection, active learning (i.e. interactive) approaches for outlier detection have been presented as well, as in [26].

In all cases, the general formulation of the problem is approximately the same. Network messages are – by construction – designed to be parsed and interpreted by a machine, and hence can be characterized as formal grammars which accept and reject on specific strings. Within the space of all possible strings M received by the network service, there is the subset A⊆M of messages that are accepted by the network endpoint; this subset itself contains S⊆A of “safe” messages that represent normal use of the endpoint. In learning a function that can identify which subset of a given message lies within – in particular for some string s whether or not s∈S – we are constructing a recognizer for S, thus placing the problem exactly within the domain of grammatical inference.

Consider, for example the case of a standard machine learning approach in which we are presented with samples of normal traffic from S, and hostile traffic from the set A\S, each appropriately labeled, and we wish to train a function to determine whether a candidate string lies in one set or the other. This is precisely the problem of learning a grammar from a complete presentation, and as such we may readily apply existing results. Even if we make the simplifying assumption that the protocol under consideration (or the union of protocols if deployed on a multi-protocol endpoint) is regular, we are still attempting to learn a Discrete Finite Automata from a complete presentation. If we wish to learn it precisely (in the limit) then we have that the problem is NP-complete [10]. If we wish to learn it in a more practical sense (i.e. PAC), then we have that the identification problem is “merely” cryptographically hard [11]. This forces us to accept the conclusion that even if we obtain empirically good performance for a particular algorithm in a particular setting, we cannot be sure that it will generalize to a new domain.

If we consider (as in McPAD [22]) that only positive (‘normal’) data S is available, and continue to assume that we are observing then we are attempting to learn a grammar from its positive presentation, with all associated complexities. While specific examples of grammars are clearly at least PAC-learnable in this setting, as shown by the results of [22], it follows immediately from the difficulty of learning from a positive presentation that McPAD must fail to generalize to at least some classes of grammars; whether or not those grammars are of practical relevance to intrusion detection cannot be decided in any fashion other than empirically. We are thus left with no foundational guarantee of correctness; simply empirical observations.

Clearly, when we consider a more realistic scenario in which several protocols may be present in the same set of network traffic, the problem often becomes significantly more difficult; the problem of learning the mixture of grammars is at a minimum at least as hard as learning the most complex one, and depending upon the closure properties of that class, may in fact be more difficult. While most languages in the Chomsky hierarchy are in fact closed under unions, it is often not clear whether or not restricted classes (such as those that have finite thickness) may be.

While somewhat outside the realm of network intrusion detection, more powerful inference models such as learning from an informant have been shown to generate positive results in areas such as developing usable models of program execution (stateful typestates) [27]. This approach obtains a direct generative model of program outputs which can be examined for various security properties. Standard fuzzing techniques [28] are perhaps a more direct application within the security domain, in which a subset C⊆A of inputs that lead to crashes is learned from interaction with a program, frequently combined with additional information about the execution of code paths [29], however these methods do not typically produce formal descriptions or generative models of incorrect inputs, and rather seek to enumerate a useful subset of them, typically (in defensive settings) attempting to reduce the size of the set A. The work of [30] explores methods to leverage attacker knowledge in constructing fuzzing inputs via a descriptive language, which could be used in an iterative fashion to eventually describe a subset of the target grammar.

In many cases, we do not even require the results of grammatical inference to show that a particular classifier cannot (provably) learn a sharp distinction between malicious and benign traffic. A key step in any machine learning process is that of ‘feature extraction’ in which the raw data that is to be classified is converted into some numerical representation that can then be operated on by the learning algorithm. N-grams (and minor variations on the concept such as skip-grams) are the core feature representation used in a number of anomaly-based intrusion detection systems, including Anagram [21] and McPAD [22], in which every n-byte substring within the payload is tabulated (for instance, the string “learning” would have 3-grams of “lea”, “ear”, “arn”, “rni”, and so on).

However such representations can be shown to be insufficiently powerful to distinguish between many members of the class of regular languages. For example, the rather trivial regular languages (ab)x(ba)y(ab)z and (ba)x(ba)y(ba)z cannot be distinguished from each other on the basis of 2-grams (note that the 2-grams bb and aa both appear exactly once in each, with variable numbers of ab and ba tokens), while constructing a recognizing DFA for each is trivial. Similar counterexamples can be constructed for n-grams of arbitrary length. This immediately implies that any learning algorithm that first reduces a sequence to n-gram counts is a priori incapable of learning large subsets of the class of regular grammars, and as a consequence we may expect that – even if empirically good performance is shown on some data sets – this performance cannot be relied upon to generalize.

Other feature representations are also used. Perhaps most widely known are those of the (now severely out-of-date but still regularly studied) KDD’99 data set [31], which parses the network traffic to extract a number of features suggested by expert knowledge to be of use. In addition to “metadata” describing flow-based properties such as the duration of the connection and the number of concurrent connections to the same host, a number of content-based features are extracted from both the payload of the packets (e.g., the presence of a shell prompt, extracting FTP commands to obtain a tally of outbound ones, etc.) and headers of the packets (identifying the network protocol, various ports and addresses, protocol-specific flags, and so on). These features obviously make no attempt to model any significant portion of the content of the packets, and so make the prospect of inferring a grammar from them infeasible; at best, some of the manually extracted features act as “telltales” for specific attacks, and thus allow what is effectively signature-based detection.

And indeed, the most effective current approach in intrusion detection remains (anecdotally at least) signature-based solutions [1] such as Snort [3]. The effectiveness of such solutions can be explained precisely within the context of grammatical inference, as a well-written content-based signature is equivalent to a production that is not (or is very rarely) a production of the grammar underlying “good” traffic, and hence form a telltale for the set A\S. And indeed, Snort and Bro [2] both contain sophisticated pattern-matching rules that are capable of recognizing a wide range of malicious traffic, in effect acting as small recognizers for subsets of the malicious grammars under consideration. It is worth noting, however, that the rule generation in this case is often done by hand, and even when done in an automated fashion is typically attempting to match a finite subset of malicious traffic for a specific attack, and then tested on a set of larger normal traffic to assess false positives; this is equivalent to the bounded version of the problem posed in [10], which can take place in polynomial time. A key distinction here is that – rather than attempting to model all ‘safe’ or ‘malicious’ productions – any method that produces some form of signature is attempting to model a finite number of productions of a single protocol under heavily supervised conditions, and so does not address the question of novel attacks that machine learning-based solutions are often attempting to address [1].

Pages: Page 1 Page 2 Page 3

Previous Article:
« The Cyber Security Collaborative Research Alliance: Unifying...
Next Article:
Synergistic Architecture for Human-Machine Intrusion Detection »

References

  1. R. Sommer and V. Paxson, "Outside the closed world: On using machine learning for network intrusion detection," in IEEE symposium on security and privacy, 2010.
  2. V. Paxson, "Bro: a system for detecting network intruders in real-time," Computer networks, vol. 31, no. 23, pp. 2435-2463, 1999.
  3. Roesch, Martin, "Snort: Lightweight Intrusion Detection for Networks," in LISA , Seattle, Washington, 1999.
  4. S. Axelsson, "The base-rate fallacy and the difficulty of intrusion detection," ACM Transactions on Information and System Security (TISSEC), vol. 3, no. 3, pp. 186-205, 2000.
  5. A. Kott, "Towards fundamental science of cyber security," in Network Science and Cybersecurity, New York, Springer, 2014, pp. 1-13.
  6. M. Sipser, Introduction to the Theory of Computation., Boston, MA: Thomson Course Technology, 2006., 2006.
  7. C. De la Higuera, Grammatical inference: learning automata and grammars, Cambridge University Press, 2010.
  8. M. E. Gold, "Language identification in the limit," Information and control, vol. 10, no. 5, pp. 447-474, 1967.
  9. L. G. Valiant, "A theory of the learnable," Commun. ACM, vol. 27, no. 11, pp. 1134--1142, 1984.
  10. M. E. Gold, "Complexity of automaton identification from given data," Information and control, vol. 37, no. 3, pp. 302-320, 1978.
  11. M. Kearns and L. Valiant, "Cryptographic limitations on learning Boolean formulae and finite automata," Journal of the ACM, vol. 41, no. 1, pp. 67-95, 1994.
  12. D. Angluin, "Learning regular sets from queries and counterexamples," Information and computation, vol. 75, no. 2, pp. 87-106, 1987.
  13. V. H. Rajesh Parekh, "Learning DFA from simple examples," Machine Learning, vol. 44, pp. 9-35, 2001.
  14. D. Angluin, "Finding patterns common to a set of strings," in Proceedings of the eleventh annual ACM symposium on Theory of computing, 1979.
  15. D. Angluin, "Inductive inference of formal languages from positive data," Information and control , vol. 45, no. 2, pp. 117-135, 1980.
  16. K. N. Wood and R. E. Harang, "Grammatical Inference and Language Frameworks for LANGSEC," in 2015 IEEE Security and Privacy Workshops, 2015.
  17. T. Motoki, T. Shinohara and K. Wright, "Correct definition of finite elasticity," Research Institute of Fundamental Information Science, Kyushu University Japan, 1990.
  18. C.-F. Tsai, Y.-F. Hsu, C.-Y. Lin and W.-Y. Lin, "Intrusion detection by machine learning: A review," Expert Systems with Applications, vol. 36, no. 10, pp. 11994--12000, 2009.
  19. P. Laskov, P. Düssel, C. Schäfer and K. Rieck, "Learning intrusion detection: supervised or unsupervised?," in International Conference on Image Analysis and Processing, Heidelberg, 2005.
  20. S. Axelsson, "Intrusion detection systems: A survey and taxonomy," 2000.
  21. K. Wang, J. J. Parekh and S. J. Stolfo, "Anagram: A content anomaly detector resistant to mimicry attack," in Recent Advances in Intrusion Detection, Heidelberg, 2006.
  22. R. Perdisci, D. Ariu, P. Fogla, G. Giacinto and W. Lee, "McPAD: A multiple classifier system for accurate payload-based anomaly detection," Computer Networks , vol. 53, no. 6, pp. 864--881, 2009.
  23. W. Hu, W. Hu and S. Maybank, "Adaboost-based algorithm for network intrusion detection," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) , vol. 38, no. 2, pp. 577--583, 2008.
  24. P. Sangkatsanee, N. Wattanapongsakorn and C. Charnsripinyo, "Practical real-time intrusion detection using machine learning approaches," Computer Communications , vol. 38, no. 18, pp. 2227--2235, 2011.
  25. O. Linda, T. Vollmer and M. Manic, "Neural network based intrusion detection system for critical infrastructures," in International Joint Conference on Neural Networks, 2009.
  26. N. Abe, B. Zadrozny and J. Langford, "Outlier detection by active learning," in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006.
  27. H. Xiao, J. Sun, Y. Liu, S.-W. Lin and C. Sun, "Tzuyu: Learning stateful typestates," in 28th International Conference on Automated Software Engineering , 2013.
  28. M. Sutton, A. Greene and P. Amini, Fuzzing: brute force vulnerability discovery, Pearson Education, 2007.
  29. M. Zalewski, "american fuzzy lop," [Online]. Available: http://lcamtuf.coredump.cx/afl/.
  30. D. Aitel, "The advantages of block-based protocol analysis for security testing," Immunity Inc, 2002.
  31. S. Stolfo, "The Third International Knowledge Discovery and Data Mining," University of California Irvine, 2002. [Online]. Available: .
  32. C. Sheridan and R. Harang, "Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec," in IEEE Security and Privacy Workshop on Language-Theoretic Security, 2016.

Author

Dr. Richard Harang
Dr. Richard Harang
Dr. Richard Harang received his PhD in Statistics and Applied Probability from the University of California Santa Barbara in 2010. After a year of postdoctoral research in the Computational Science and Engineering group under Dr. Linda Petzold, he began work at the U.S. Army Research Laboratory in 2011 focusing on applications of statistics and statistical machine learning to problems in network security. His current research interests include machine learning on structured data, analysis and attribution of source code and binary samples, and using generative models of time series data to explore properties of the underlying process.

Reader Interactions

Leave a Comment Cancel

You must be logged in to post a comment.

sidebar

Blog Sidebar

Featured Content

Data Privacy Day - Jan 28

Data Privacy Day is January 28th

You can help create a global community that respects privacy, safeguards data, and enables trust. You can help teach others about privacy at home, at work, and in your community.

Learn How

Featured Subject Matter Expert (SME): Daksha Bhasker

A dynamic CSIAC SME, Senior Principal Cybersecurity Architect, Daksha Bhasker has 20 years of experience in the telecommunications services provider industry. She has worked in systems security design and architecture in production environments of carriers, often leading multidisciplinary teams for cybersecurity integration, from conception to delivery of complex technical solutions. As a CSIAC SME, Daksha's contributions include several published CSIAC Journal articles and a webinar presentation on the sophiscated architectures that phone carriers use to stop robocalls.

View SME's Contributed Content

The DoD Cybersecurity Policy Chart

The DoD Cybersecurity Policy Chart

This chart captures the tremendous breadth of applicable policies, some of which many cybersecurity professionals may not even be aware, in a helpful organizational scheme.

View the Policy Chart

CSIAC Report - Smart Cities, Smart Bases and Secure Cloud Architecture for Resiliency by Design

Integration of Smart City Technologies to create Smart Bases for DoD will require due diligence with respect to the security of the data produced by Internet of Things (IOT) and Industrial Internet of Things (IIOT). This will increase more so with the rollout of 5G and increased automation "at the edge". Commercially, data will be moving to the cloud first, and then stored for process improvement analysis by end-users. As such, implementation of Secure Cloud Architectures is a must. This report provides some use cases and a description of a risk based approach to cloud data security. Clear understanding, adaptation, and implementation of a secure cloud framework will provide the military the means to make progress in becoming a smart military.

Read the Report

CSIAC Journal - Data-Centric Environment: Rise of Internet-Based Modern Warfare “iWar”

CSIAC Journal Cover Volume 7 Number 4

This journal addresses a collection of modern security concerns that range from social media attacks and internet-connected devices to a hypothetical defense strategy for private sector entities.

Read the Journal

CSIAC Journal M&S Special Edition - M&S Applied Across Broad Spectrum Defense and Federal Endeavors

CSIAC Journal Cover Volume 7 Number 3

This Special Edition of the CSIAC Journal highlights a broad array of modeling and simulation contributions – whether in training, testing, experimentation, research, engineering, or other endeavors.

Read the Journal

CSIAC Journal - Resilient Industrial Control Systems (ICS) & Cyber Physical Systems (CPS)

CSIAC Journal Cover Volume 7 Number 2

This edition of the CSIAC Journal focuses on the topic of cybersecurity of Cyber-Physical Systems (CPS), particularly those that make up Critical Infrastructure (CI).

Read the Journal

Recent Video Podcasts

  • Privacy Impact Assessment: The Foundation for Managing Privacy Risk Series: The CSIAC Podcast
  • Agile Condor: Supercomputing at the Edge for Intelligent Analytics Series: CSIAC Webinars
  • Securing the Supply Chain: A Hybrid Approach to Effective SCRM Policies and Procedures Series: The CSIAC Podcast
  • DoD Vulnerability Disclosure Program (VDP) Series: CSIAC Webinars
  • 5 Best Practices for a Secure Infrastructure Series: The CSIAC Podcast
View all Podcasts

Upcoming Events

Fri 22

SANS Cyber Security Central: Jan 2021

January 18 - January 23
Organizer: SANS Institute
Fri 22

SANS Cyber Threat Intelligence Summit 2021

January 21 - January 22
Organizer: SANS Institute
Fri 22

SANS Cyber Threat Intelligence Solutions Track 2021

January 22 @ 09:00 - 17:00 EST
Organizer: SANS Institute
Wed 27

Enterprise Data Governance Online 2021

January 27 @ 08:00 - 13:30 EST
Organizer: DATAVERSITY
Thu 28

Data Privacy Day

January 28
View all Events

Footer

CSIAC Products & Services

  • Free Technical Inquiry
  • Core Analysis Tasks (CATs)
  • Resources
  • Events Calendar
  • Frequently Asked Questions
  • Product Feedback Form

About CSIAC

The CSIAC is a DoD-sponsored Center of Excellence in the fields of Cybersecurity, Software Engineering, Modeling & Simulation, and Knowledge Management & Information Sharing.Learn More

Contact Us

Phone:800-214-7921
Email:info@csiac.org
Address:   266 Genesee St.
Utica, NY 13502
Send us a Message
US Department of Defense Logo USD(R&E) Logo DTIC Logo DoD IACs Logo

Copyright 2012-2021, Quanterion Solutions Incorporated

Sitemap | Privacy Policy | Terms of Use | Accessibility Information
Accessibility / Section 508 | FOIA | Link Disclaimer | No Fear Act | Policy Memoranda | Privacy, Security & Copyright | Recovery Act | USA.Gov

This website uses cookies to provide our services and to improve your experience. By using this site, you consent to the use of our cookies. To read more about the use of our site, please click "Read More". Otherwise, click "Dismiss" to hide this notice. Dismiss Read More
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary
Always Enabled

Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.

Non-necessary

Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.