Accession Number : AD1033861


Title : Finding Malicious Cyber Discussions in Social Media


Descriptive Note : Technical Report


Corporate Author : MASSACHUSETTS INST OF TECH LEXINGTON LEXINGTON United States


Personal Author(s) : Lippmann, Richard P ; Campbell, Joseph P ; Weller-Fahy, David J ; Mensch, Alyssa C ; Campbell, William M


Full Text : https://apps.dtic.mil/dtic/tr/fulltext/u2/1033861.pdf


Report Date : 02 Feb 2016


Pagination or Media Count : 7


Abstract : Security analysts gather essential information on cyber attacks, exploits, vulnerabilities, and victims by manually searching social media sites. This effort can be dramatically reduced using natural language machine learning techniques. Using a new English text corpus containing more than 250k discussions from Stack Exchange, Reddit, and Twitter on cyber and non-cyber topics, we demonstrate the ability to detect more than 90% of the cyber discussions with fewer than 1% false alarms. If an original searched document corpus includes only 5% cyber documents, then our processing provides an enriched corpus for analysts where 83% to 95% of the documents are on cyber topics. Good performance was obtained using TF-IDF features and logistic regression. A classifier trained using prior historical data accurately detected 86% of emergent Heartbleed discussions and retrospective experiments demonstrate that classifier performance remains stable up to a year without retraining.
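
The enrichment figure in the abstract can be checked with a short calculation. The sketch below is not taken from the report; it applies the standard precision formula to the numbers the abstract quotes (5% cyber documents, 90% detection, 1% false alarms). The function name `enriched_fraction` and the second operating point (0.25% false alarms) are assumptions chosen for illustration only.

```python
def enriched_fraction(prevalence, detection_rate, false_alarm_rate):
    """Fraction of flagged documents that are truly on cyber topics (precision)."""
    # Cyber documents that the classifier correctly flags.
    true_hits = prevalence * detection_rate
    # Non-cyber documents that the classifier flags by mistake.
    false_hits = (1.0 - prevalence) * false_alarm_rate
    return true_hits / (true_hits + false_hits)


# Operating point quoted in the abstract: 5% cyber documents,
# 90% detection, 1% false alarms.
baseline = enriched_fraction(0.05, 0.90, 0.01)
print(f"abstract operating point: {baseline:.3f}")  # about 0.826

# Hypothetical stricter operating point (assumed, not from the report):
# the same detection rate with a 0.25% false alarm rate.
stricter = enriched_fraction(0.05, 0.90, 0.0025)
print(f"assumed stricter point:   {stricter:.3f}")  # about 0.950
```

At the quoted operating point roughly 83% of the flagged documents are on cyber topics, which matches the lower end of the range in the abstract. The upper end requires a lower false alarm rate or a higher detection rate; the abstract does not say which operating point produced it.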


Descriptors : feature selection, preprocessing, denial of service attack, computer networks, computer security, cyberattacks, social networking services, online communications, social media, malware, natural languages, machine learning


Subject Categories : Computer Systems
      Computer Programming and Software


Distribution Statement : APPROVED FOR PUBLIC RELEASE