Honeypots are active sensors deployed to obtain information about attacks. In their search for vulnerabilities, attackers generate large volumes of logs, whose analysis is time consuming and cumbersome. We here evaluate whether Natural Language Processing (NLP) approaches can provide meaningful representations to find common traits in attackers' activity. We consider a widely used SSH/Telnet honeypot to record more than 200,000 sessions, including 61,000 unique shell scripts, some containing sequences of more than 100 Bash commands. We first parse the sessions to separate Bash commands, options and parameters. Next, we project each session in a metric space opposing two common tools used in NLP: Bag of Words and Word2Vec. Last, we leverage a clustering algorithm to aggregate the sessions while offering an instrumental representation of the clustering process. In the end, we obtain few tens of clusters that we analyze to explain the attackers' goals, i.e., obtain system information, inject malicious accounts, download and run executables, etc. Our work is a first step towards automatically identifying attack patterns on honeypots, thus effectively supporting security activities.

Towards NLP-based Processing of Honeypot Logs / Boffa, Matteo; Milan, Giulia; Vassio, Luca; Drago, Idilio; Mellia, Marco; Ben Houidi, Zied. - STAMPA. - (2022), pp. 314-321. (Intervento presentato al convegno 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW) tenutosi a Genoa, Italy nel 06-10 June 2022) [10.1109/EuroSPW55150.2022.00038].

Towards NLP-based Processing of Honeypot Logs

Boffa, Matteo;Vassio, Luca;Drago, Idilio;Mellia, Marco;
2022

Abstract

Honeypots are active sensors deployed to obtain information about attacks. In their search for vulnerabilities, attackers generate large volumes of logs, whose analysis is time consuming and cumbersome. We here evaluate whether Natural Language Processing (NLP) approaches can provide meaningful representations to find common traits in attackers' activity. We consider a widely used SSH/Telnet honeypot to record more than 200,000 sessions, including 61,000 unique shell scripts, some containing sequences of more than 100 Bash commands. We first parse the sessions to separate Bash commands, options and parameters. Next, we project each session in a metric space opposing two common tools used in NLP: Bag of Words and Word2Vec. Last, we leverage a clustering algorithm to aggregate the sessions while offering an instrumental representation of the clustering process. In the end, we obtain few tens of clusters that we analyze to explain the attackers' goals, i.e., obtain system information, inject malicious accounts, download and run executables, etc. Our work is a first step towards automatically identifying attack patterns on honeypots, thus effectively supporting security activities.
2022
978-1-6654-9560-8
File in questo prodotto:
File Dimensione Formato  
Article_IEEE.pdf

non disponibili

Descrizione: IEEE Explorer Version
Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 774.13 kB
Formato Adobe PDF
774.13 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Workshop_WTMC.pdf

accesso aperto

Descrizione: Accepted Manuscript
Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 742.69 kB
Formato Adobe PDF
742.69 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2969416