DNS Malware Classification System

AWS | Python | Predictive Modeling | Scaling | Classification
Problem Statement:
To build a product that predicts whether an FQDN (Fully Qualified Domain Name) or URL was generated by a DGA (Domain Generation Algorithm) or is benign. The product is presented to customers as a RESTful web service API. The first iteration of the product scales to 1,000,000 predictions per minute.
Tools and Technologies: Python, AWS: Lambda, SageMaker, API Gateway, S3, Load Balancer, CloudWatch Logs
Tasks:
* Designed a web service API that takes a domain name and returns a boolean prediction: www.google.com -> [TRUE, FALSE]
* Created a system level design
* Data collection and cleaning
* Feature Engineering
* Model Selection
* Hyperparameter Tuning
* Model Deployment
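The API contract from the first task (a domain in, a TRUE/FALSE prediction out) could be served by a Lambda function behind API Gateway. The following is a minimal sketch, not the project's actual handler. It assumes the API Gateway proxy integration event format, a hypothetical query parameter named `fqdn`, and a hypothetical `predict_is_dga` placeholder standing in for the call to the deployed model.

```python
import json


def predict_is_dga(fqdn: str) -> bool:
    """Hypothetical placeholder for the model call (for example, a SageMaker endpoint)."""
    raise NotImplementedError("wire this to the deployed model")


def _response(status_code: int, payload: dict) -> dict:
    # Response shape expected by an API Gateway proxy integration.
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context, predictor=predict_is_dga):
    # 'fqdn' is an assumed parameter name, not taken from the project.
    params = event.get("queryStringParameters") or {}
    fqdn = (params.get("fqdn") or "").strip()
    if not fqdn:
        return _response(400, {"error": "missing 'fqdn' query parameter"})

    # TRUE means the domain is predicted to be DGA-generated.
    is_dga = bool(predictor(fqdn))
    return _response(200, {"fqdn": fqdn, "is_dga": is_dga})
```

The `predictor` argument exists only so the handler can be exercised locally with a stub, for example `lambda_handler({"queryStringParameters": {"fqdn": "www.google.com"}}, None, predictor=lambda d: False)`.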
The system architecture is shown below:

Created a custom-built dataset of 10 million+ domains generated from 45+ Domain Generation Algorithms (DGAs), along with collected benign domain names. After data collection, the domain names were extracted using the 'tldextract' library in Python.
Feature Engineering:
The following features and transformations were applied to each domain: count of capital letters, conversion to lower case, domain length, count of digits, count of consecutive consonants, ratio of vowel count to length, entropy, count of unique characters, and conversion of domain characters to numbers.
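The features listed above can be sketched with the standard library alone. This is an illustrative sketch, not the project's code, and the function and key names are hypothetical. Two interpretations are assumptions: "count of consecutive consonants" is taken to mean the longest run of consecutive consonants, and "convert domain characters to numbers" is taken to mean mapping each character to its Unicode code point.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_consonant_run(text: str) -> int:
    """Length of the longest run of consecutive consonants (assumed definition)."""
    longest = current = 0
    for ch in text:
        current = current + 1 if ch in CONSONANTS else 0
        longest = max(longest, current)
    return longest


def extract_features(domain: str) -> dict:
    # Capitals must be counted before the domain is lower-cased.
    capitals = sum(ch.isupper() for ch in domain)
    lowered = domain.lower()
    length = len(lowered)
    vowels = sum(ch in VOWELS for ch in lowered)
    return {
        "capitals": capitals,
        "length": length,
        "digits": sum(ch.isdigit() for ch in lowered),
        "consonant_run": longest_consonant_run(lowered),
        "vowel_ratio": vowels / length if length else 0.0,
        "entropy": shannon_entropy(lowered),
        "unique_chars": len(set(lowered)),
    }


def encode_chars(domain: str) -> list:
    """Map each character to an integer (assumed encoding: Unicode code point)."""
    return [ord(ch) for ch in domain.lower()]
```

Calling `extract_features("Google")` returns one dictionary per domain, which can be stacked into a feature matrix for the tabular models, while `encode_chars` produces the integer sequence that a sequence model such as an LSTM would typically consume.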
Model Selection:
The following classification algorithms were implemented:
* Logistic Regression
* Support Vector Machines
* Decision Trees
* Random Forest
* Long Short Term Memory (LSTM)
* XGBoost
Amongst these, the XGBoost model was selected for the following reasons:
* Easier to deploy on AWS
* Greater speed and efficiency compared to other algorithms
* Better accuracy than LSTM and Random Forest
After hyperparameter tuning, the XGBoost model achieved an accuracy of 93.16% on the test dataset.
For a quick overview of the project, please refer to the PowerPoint presentation here.
The project files can be found here.