DNS Malware Classification System

AWS | Python | Predictive Modeling | Scaling | Classification

Problem Statement:

To create a product that predicts whether an FQDN (Fully Qualified Domain Name) or URL was generated by a DGA (Domain Generation Algorithm) or is benign. The product is presented to customers as a RESTful web service API. The first iteration of the product scales to 1,000,000 predictions per minute (roughly 16,700 per second).
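
As a rough illustration of the web service described above, here is a minimal sketch of a request handler in the shape AWS Lambda expects behind an API Gateway proxy integration. The `domain` query parameter, the JSON response fields, and the `make_handler` and `predict` names are assumptions made for this example, not the project's actual interface. The model call is injected as a plain function so the sketch runs without any AWS resources.

```python
import json
from typing import Callable


def make_handler(predict: Callable[[str], bool]):
    """Build a Lambda-style handler around a prediction function.

    `predict` is a hypothetical stand-in for the deployed model call:
    it takes a domain string and returns True if it looks like a DGA.
    """

    def handler(event, context=None):
        # API Gateway proxy events carry query parameters here (may be None).
        params = event.get("queryStringParameters") or {}
        domain = (params.get("domain") or "").strip()

        if not domain:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "missing 'domain' query parameter"}),
            }

        return {
            "statusCode": 200,
            "body": json.dumps({"domain": domain, "is_dga": bool(predict(domain))}),
        }

    return handler


# Usage with a toy predictor (illustration only, not a real model).
handler = make_handler(lambda d: d != "www.google.com")
response = handler({"queryStringParameters": {"domain": "www.google.com"}})
```

Keeping the prediction function separate from the request parsing means the same handler can be exercised locally with a stub and wired to a real model endpoint in deployment.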

Tools and Technologies: Python, AWS: Lambda, SageMaker, API Gateway, S3, Load Balancer, CloudWatch Logs

Tasks:
* Designed a web service API that maps a domain name to a DGA verdict: www.google.com -> [TRUE, FALSE]
* Created a system level design
* Data collection and cleaning
* Feature Engineering
* Model Selection
* Hyperparameter Tuning
* Model Deployment

The system architecture is shown below:

Data Collection and Cleaning:
Built a custom dataset of 10 million+ domains generated by 45+ Domain Generation Algorithms (DGAs), together with collected benign domain names. After data collection, the domain names were extracted using the 'tldextract' library in Python.
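
The extraction step can be sketched roughly as follows. This is an illustrative example, not the project's code: the helper names `make_offline_extractor` and `registered_label` are hypothetical, and the choice to pass `suffix_list_urls=()` (which tells tldextract to use its bundled Public Suffix List snapshot instead of fetching a fresh copy) is an assumption made so the sketch does not need network access.

```python
from typing import Callable, Optional


def make_offline_extractor():
    """Build a tldextract extractor that skips the live suffix-list download."""
    import tldextract  # third-party package: pip install tldextract

    # An empty suffix_list_urls makes tldextract fall back to its bundled snapshot.
    return tldextract.TLDExtract(suffix_list_urls=())


def registered_label(fqdn_or_url: str, extractor: Optional[Callable] = None) -> str:
    """Return the registered domain label, without subdomain or public suffix.

    `extractor` is any callable returning an object with a `.domain`
    attribute; by default a tldextract extractor is created on demand.
    """
    extractor = extractor or make_offline_extractor()
    return extractor(fqdn_or_url).domain
```

With tldextract installed, `registered_label("http://forums.bbc.co.uk/")` would be expected to give `bbc`, since the library separates the subdomain (`forums`) and the public suffix (`co.uk`) from the registered label. When processing millions of rows, building the extractor once and passing it in avoids re-creating it for every domain.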

Feature Engineering:
For each domain, the following features and preprocessing steps were applied: count of capital letters, conversion to lower case, domain length, count of digits, count of consecutive consonants, ratio of vowel count to length, entropy, count of unique characters, and conversion of domain characters to numbers.
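
A minimal sketch of how these features could be computed with the Python standard library is shown below. Several details are assumptions made for illustration and are not confirmed by the project: "count of consecutive consonants" is interpreted as the longest run of consonants, entropy is taken to be Shannon entropy in bits over the characters, and the character-to-number mapping uses a hypothetical fixed alphabet with 0 reserved for unknown characters.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
# Hypothetical alphabet for the integer encoding; index 0 means "unknown".
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-_."
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def longest_consonant_run(text: str) -> int:
    """Length of the longest run of consecutive consonants (assumed meaning)."""
    longest = current = 0
    for ch in text:
        if ch in CONSONANTS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def domain_features(domain: str) -> dict:
    """Compute the listed features for one domain string."""
    capitals = sum(ch.isupper() for ch in domain)  # counted before lower-casing
    lowered = domain.lower()
    length = len(lowered)
    vowels = sum(ch in VOWELS for ch in lowered)
    return {
        "capitals": capitals,
        "length": length,
        "digits": sum(ch.isdigit() for ch in lowered),
        "longest_consonant_run": longest_consonant_run(lowered),
        "vowel_ratio": vowels / length if length else 0.0,
        "entropy": shannon_entropy(lowered),
        "unique_chars": len(set(lowered)),
        "encoded": [CHAR_TO_INT.get(ch, 0) for ch in lowered],
    }


features = domain_features("Google")
```

Capitals are counted before lower-casing because that information is lost once the domain is normalized. The integer-encoded sequence is the kind of input a sequence model such as an LSTM would consume, while the scalar features suit tree-based models.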

Model Selection:
The following classification algorithms were implemented:
* Logistic Regression
* Support Vector Machines
* Decision Trees
* Random Forest
* Long Short Term Memory (LSTM)
* XGBoost

Amongst these, the XGBoost model was selected for the following reasons:
* Easier to deploy on AWS
* Greater speed and efficiency compared to other algorithms
* Better accuracy than LSTM and Random Forest

After hyperparameter tuning, the XGBoost model achieved an accuracy of 93.16% on the test dataset.

For a quick overview of the project, please refer to the PowerPoint presentation here.

The project files can be found here.