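Networks play a central role in modern life, and network security has become an important research field. An intrusion detection system (IDS) is a key network security technology that monitors the state of the software and hardware running in a network. Despite decades of development, existing IDSs still face challenges in improving detection accuracy, reducing false alarm rates, and detecting unknown attacks. To address these problems, many researchers have focused on building IDSs that use machine learning methods.

Machine learning methods can automatically discover the essential differences between normal and abnormal data, with high accuracy. They also generalize well, which lets them detect previously unseen attacks.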

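Classification of IDS

There are two ways to classify IDSs: by data source and by detection method. By data source, IDSs are either host-based or network-based. The main drawback of this approach is that domain knowledge is needed to set up and maintain these systems and to monitor their inferences. By detection method, IDSs divide into misuse detection and anomaly detection.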

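Why deep learning?

Deep learning is a branch of machine learning whose strong performance has made it a research hotspot. Given enough training data, deep-learning-based IDSs can reach satisfactory detection levels, and the models generalize well enough to detect attack variants and novel attacks. They also depend little on domain knowledge, which makes them easy to design and build.

Compared with traditional machine learning techniques, deep learning methods handle large datasets better. They can also learn feature representations automatically from raw data and then output results; they operate end to end, which makes them practical.

In this article we focus on machine learning methods that combine deep networks with time series principles. The basic requirement for applying deep learning to a time series problem is data: a univariate or multivariate time series, that is, observations recorded in chronological order.

This approach reduces attack detection to anomaly detection. The reasoning is that attacks are usually rare, but when they occur their signature (more precisely, their distribution) differs markedly from the signature under normal operating conditions. This change in signature shows up in the time series data. We use an autoencoder to learn the distribution of normal-state data; with this model we can determine whether incoming data has a significantly different signature.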

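What is an autoencoder?

An autoencoder is a neural network with two symmetric components, an encoder and a decoder, as shown in Figure 3. The encoder extracts features (also called latent representations) from the raw data, and the decoder reconstructs the input from those latent representations. During training, the difference between the encoder's input and the decoder's output gradually shrinks, and the latent representations produced by the encoder come to capture the essence of the original data. Importantly, the whole process requires no supervision. Well-known autoencoder variants include denoising, variational, and sparse autoencoders.

Autoencoders are mainly used to reduce the feature space, with the resulting features (latent representations) feeding other models further down the workflow. They are also very good at capturing complex multivariate distributions of the input feature space, which is why they are widely used for anomaly detection.

Because this is a time series problem, we use LSTM networks. These are a variant of RNNs (recurrent neural networks) designed to retain time-dependent features over long sequences. They require each sample to have shape (time steps, features), where the number of time steps is tunable. The input data is therefore a 3D array of shape (samples, time steps, features).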

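Dataset description

To model the intrusion detection task we use the KDD99 dataset, the most widely used IDS (intrusion detection system) benchmark. It consists of 41-dimensional features extracted from an earlier dataset, DARPA1998, which contains raw TCP (Transmission Control Protocol) packets and labels. Because raw packets are of little direct use to machine learning models, a new dataset, KDD99, was curated from them.

KDD99 has the same labels as DARPA1998. It has four types of features: basic features, content features, host-based statistical features, and time-based statistical features. The dataset can be downloaded from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Several versions of the dataset are available at the link above. In this work we use the full version for training and testing.

Training data: kddcup.data.gz
Test data: corrected.gz
Unlabeled data (production data): kddcup.newtestdata_10_percent_unlabeled.gz
Column names: kddcup.names

We train the model on the training set, then tune decision parameters (such as the classifier threshold) and measure effectiveness metrics on the test set, which here doubles as the validation set. Finally, we use the production set (the unlabeled data) to identify anomalies.

Importing Python libraries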

# importing all the necessary packages
import random
import tqdm
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits import mplot3d
from scipy.spatial.distance import euclidean
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, f1_score, precision_score, recall_score, accuracy_score
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed, RepeatVector
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

Data preprocessing

Let us first load the required training file.

TRAIN_DATA_PATH = 'kdd_train.csv'
COLUMN_NAMES_PATH = 'columns_names.txt'
TEST_DATA_PATH = 'kdd_test.csv'
PROD_DATA_PATH = 'kdd_production.csv'

# reading the training data file
df = pd.read_csv(TRAIN_DATA_PATH, header=None)

# reading the file containing feature names
with open(COLUMN_NAMES_PATH, 'r') as txt_file:
    col_names = txt_file.readlines()
    
col_names_cleaned = [i.split(':')[0] for i in col_names]
# adding an extra column for the indicator
col_names_cleaned.extend(['result'])
# extracting only continuous features
continous_col_indices = [col.split(':')[0] for i, col in enumerate(col_names) if col.split(':')[1]==' continuous\n' or col.split(':')[1]==' continuous']

df.columns = col_names_cleaned
df.head()

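The training dataset has more than 4 million rows, of which only about 20% are normal. The last column of the dataframe, result, specifies whether a connection is normal or an attack. There are many attack types, such as back (dos), buffer_overflow (u2r), ftp_write (r2l), and guess_passwd (r2l). The test set contains some new attacks in addition to those in the training set; these are variants of existing attacks, included to measure the model's effectiveness against unseen attacks. We do not intend to classify the different attack types, however, and treat them all as anomalies.

The bar chart below shows that the dataset contains more anomalies than normal instances.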

df['result'].value_counts().plot(kind='bar', figsize=(20, 10))
plt.ylabel('number of instances')
plt.xticks(fontsize=13)
plt.grid()
plt.show()

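Next, we look at the different service types present in the data, under the field named service.

print('different types of services: {}'.format(df['service'].unique()))

Here we can see that the data contains many service types. Analyzing the data per service shows that their distributions under normal conditions differ. We therefore consider only the http service type, since it is the most common in internet traffic.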

# extracting only the rows which have http service.
df_http = df[(df['service']=='http')]

df_http['result'].value_counts().plot(kind='bar', figsize=(20, 10))
plt.ylabel('number of instances')
plt.xticks(fontsize=13)
plt.grid()
plt.show()

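Next, we extract only the rows with normal data to train the autoencoder. We also drop all categorical columns. Such columns are usually converted to one-hot encoded features, but after doing so they turned out to carry little weight in determining the output.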

normal_instances = df_http[df_http['result']=='normal.'].shape[0]
anomalous_instances = df_http[df_http['result']!='normal.'].shape[0]

# extracting only the instances belonging to the normal class
# .copy() gives an independent frame so the in-place drops below do not act on a slice of df_http
df_http_normal = df_http[df_http['result']=='normal.'].copy()

# dropping categorical columns
df_http_normal.drop(['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login', 'result'], axis=1, inplace=True)
# dropping columns with no std deviation
df_http_normal.drop(['wrong_fragment', 'urgent', 'num_failed_logins', 'su_attempted', 'num_file_creations', 'num_outbound_cmds'], axis=1, inplace=True)

df_http_normal.boxplot(figsize=(20, 10))
plt.show()
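The zero-variance columns dropped above are listed by hand. As a hedged sketch, the same list could be derived from the data instead; the toy frame and its values below are hypothetical stand-ins for df_http_normal, not the article's data.

```python
import pandas as pd

# hypothetical toy frame standing in for df_http_normal
toy = pd.DataFrame({
    'duration': [0, 1, 2, 3],
    'urgent': [0, 0, 0, 0],             # constant column
    'src_bytes': [200, 215, 230, 190],
    'num_outbound_cmds': [0, 0, 0, 0],  # constant column
})

# a column with a single distinct value has zero standard deviation
constant_cols = toy.columns[toy.nunique() <= 1].tolist()

# drop the constant columns and keep the informative ones
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(toy_reduced.columns.tolist())
```

Whichever way the list is obtained, it should be computed on the training data only and then reused unchanged for the test and production data, so that every dataset ends up with the same columns.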

Below is a heatmap of the correlations between the individual features.

# scaling the data using standard scaler
scaler = StandardScaler()
df_http_normal = pd.DataFrame(scaler.fit_transform(df_http_normal), columns=df_http_normal.columns)
# examining the correlation between different features
plt.figure(figsize=(15, 10))
sns.heatmap(df_http_normal.corr(), cmap='viridis')
plt.show()

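Most features are uncorrelated with each other, but a few strong correlations exist. To remove them and bring all features to the same scale, we scale the data with StandardScaler and reduce its dimensionality with PCA (principal component analysis). Because the principal components are mutually orthogonal, this also removes the correlations.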

# reducing the dimensionality of the data using PCA and covering 80% of the variance in the original data
pca = PCA(n_components=0.8)
pca.fit(df_http_normal)

pca_cols = ['PCA_'+ str(i) for i in range(pca.n_components_)]
df_pca = pd.DataFrame(pca.transform(df_http_normal), columns=pca_cols)
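To check the two claims made about PCA here (that a fractional n_components keeps just enough components to reach the requested variance, and that the resulting components are uncorrelated), a small sketch on synthetic data may help. The data, the seed, and the names X, Z and pca_demo are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# toy data: 4 independent features plus 2 features that nearly copy the first two
base = rng.normal(size=(500, 4))
X = np.column_stack([
    base,
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1] + 0.1 * rng.normal(size=500),
])
X_scaled = StandardScaler().fit_transform(X)

# keep the smallest number of components explaining at least 80% of the variance
pca_demo = PCA(n_components=0.8)
Z = pca_demo.fit_transform(X_scaled)

# cumulative explained variance of the kept components
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# correlation between the projected columns; off-diagonal entries should be ~0
corr = np.corrcoef(Z, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(pca_demo.n_components_, cum_var[-1], np.abs(off_diag).max())
```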

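Finally, windows are extracted from this data using a given window length and stride. The stride controls how far consecutive windows are shifted, so a stride smaller than the window length makes neighbouring windows overlap and preserves continuity between them.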

def get_windows(df, window_size=10, stride=5):
  windows_arr = []
  for i in tqdm.tqdm(range(0, len(df)-window_size+1, stride)):
    windows_arr.append(df.iloc[i:i+window_size, :].to_numpy())
  return np.array(windows_arr)
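The model code that follows refers to window_size and windows_shuffled, which the article never defines. The sketch below is only a guess at that missing step: the values of window_size and stride, the helper name make_windows, the seed, and the toy frame standing in for df_pca are all assumptions.

```python
import numpy as np
import pandas as pd

# assumed values; the article only shows a window of 10 being used at test time
window_size = 10
stride = 10

def make_windows(frame, window_size, stride):
    """Same slicing as get_windows, without the progress bar."""
    starts = range(0, len(frame) - window_size + 1, stride)
    return np.array([frame.iloc[s:s + window_size].to_numpy() for s in starts])

# toy stand-in for df_pca: 95 rows, 3 components
toy_pca = pd.DataFrame(
    np.arange(95 * 3, dtype=float).reshape(95, 3),
    columns=['PCA_0', 'PCA_1', 'PCA_2'],
)
windows = make_windows(toy_pca, window_size, stride)

# shuffle whole windows; the time order inside each window is untouched
rng = np.random.default_rng(42)
windows_shuffled = windows[rng.permutation(len(windows))]
print(windows.shape)  # (samples, time steps, features)
```

With 95 rows, a window of 10 and a stride of 10, the last 5 rows do not fill a complete window and are dropped.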

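LSTM autoencoder model

Using the TensorFlow 2.0 framework, we build the following autoencoder.

Note that we use the Huber loss rather than mean squared error, to make the model less sensitive to outliers. The Huber loss combines MSE and MAE: when the absolute difference between the actual and predicted values exceeds a tunable value delta, an MAE-like linear penalty applies; otherwise an MSE-like quadratic penalty applies.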
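To make the two branches of the Huber loss concrete, here is a minimal NumPy sketch of the per-element formula. The function name huber and the sample errors are illustrative; tf.keras.losses.Huber additionally averages these values.

```python
import numpy as np

def huber(error, delta=1.0):
    """Per-element Huber loss: quadratic near zero, linear in the tails."""
    abs_err = np.abs(error)
    quadratic = 0.5 * abs_err ** 2               # MSE-like branch, |error| <= delta
    linear = delta * (abs_err - 0.5 * delta)     # MAE-like branch, |error| > delta
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.5, 3.0, -3.0])
losses = huber(errors, delta=1.0)
print(losses)  # values: 0.125, 2.5, 2.5
```

One consequence is worth keeping in mind: the larger delta is, the more errors fall in the quadratic branch, so with a very large delta the loss behaves much like half the squared error.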

K.clear_session()
# number of PCA components per time step (14 in this article's setup)
n_features = pca.n_components_
# encoder model with stacked LSTM
encoder = Sequential([
    LSTM(80, return_sequences=True, activation='selu', input_shape=(window_size, n_features), dropout=0.2),
    LSTM(50, activation='selu', return_sequences=True),
    LSTM(20, activation='selu')], name='encoder')
# decoder model with output dimension same as input dimension
decoder = Sequential([
    RepeatVector(window_size),
    LSTM(50, activation='selu', return_sequences=True),
    LSTM(80, activation='selu', return_sequences=True),
    TimeDistributed(Dense(n_features, activation='linear'))], name='decoder')
# creating sequential autoencoder using encoder, decoder as layers
autoencoder = Sequential([encoder, decoder], name='autoencoder')
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.Huber(100.))
autoencoder.summary()

encoder.summary(), decoder.summary()

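We also use the ModelCheckpoint callback during training. It saves the model and weights that perform best on the validation set, so the model can be restored later for inference.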

# training the autoencoder
check_point = tf.keras.callbacks.ModelCheckpoint('autoencoder.h5', monitor='val_loss', save_best_only=True, mode='min', verbose=1)
# the target is the input itself, matching how the reconstruction error is computed at inference time
train_hist = autoencoder.fit(windows_shuffled, windows_shuffled, batch_size=64, validation_split=0.2, epochs=100, callbacks=[check_point])

# loading the autoencoder with best set of weights
autoencoder_loaded = tf.keras.models.load_model('autoencoder.h5')

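Inference and threshold setting

Reconstruction error measures how likely a sample is to be anomalous, because the autoencoder was trained only on normal data. If the model encounters an anomalous sample at inference time, the encoder compresses it into a latent representation resembling that of normal data. This latent representation loses the information tied to the anomalous characteristics, so the decoder reconstructs something that looks like a normal sample, producing a large reconstruction error.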
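The principle can be seen with a much simpler stand-in than the LSTM autoencoder. The sketch below uses a linear "autoencoder" (projection onto the top principal direction, computed with NumPy's SVD) fit only on synthetic normal data; this is a swapped-in technique for illustration, and the data, seed and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# "normal" data lies close to a single direction in 3-D
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
normal = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 3))

# fit on normal data only: keep its top principal direction
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]                       # shape (1, 3)

def reconstruction_error(x):
    code = (x - mean) @ components.T      # "encode" to one latent value
    recon = code @ components + mean      # "decode" back to 3-D
    return np.linalg.norm(x - recon, axis=1)

normal_err = reconstruction_error(normal)
anomaly = np.array([[0.0, 0.0, 2.0]])     # far off the learned direction
anomaly_err = reconstruction_error(anomaly)
print(normal_err.max(), anomaly_err[0])
```

The anomalous point cannot be expressed in the latent space learned from normal data, so its reconstruction lands back near the normal region and its error is far larger than that of any normal sample.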

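After computing the reconstruction error for every sample in the test data, we scale the errors to the range [0, 1] and call the result the anomaly score. A threshold can then be set on this score; samples above it are flagged as anomalies.

As good practice, the threshold should be set on a separate dataset rather than on the test set. Since we do not have a dataset set aside for this purpose, we rely on the test set both to set the threshold and to evaluate the method's effectiveness. Afterwards, we use another, unlabeled dataset to predict anomalies (without an expert's diagnosis we cannot be sure whether they really are anomalies).

The test data, and later the production data, go through the same processing steps as the training data.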

# loading the test dataframe
test_df = pd.read_csv(TEST_DATA_PATH, header=None, names=col_names_cleaned)

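# slicing only the rows belonging to http service
# .copy() gives an independent frame so the in-place drops below do not act on a slice of test_df
test_df_http = test_df[test_df['service']=='http'].copy()

Likewise, a test window that contains one or more anomalous records is itself treated as anomalous. This matters because in practice anomalies may appear over a short but continuous stretch of time.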

# binary indicator to represent anomalies
status = pd.Series([0 if i=='normal.' else 1 for i in test_df_http['result']])
test_labels = [1 if np.sum(status[i:i+window_size])>0 else 0 for i in range(0, len(status)-window_size+1, stride)]

# dropping categorical columns
test_df_http.drop(['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login', 'result'], axis=1, inplace=True)
# dropping columns with no std deviation
test_df_http.drop(['wrong_fragment', 'urgent', 'num_failed_logins', 'su_attempted', 'num_file_creations', 'num_outbound_cmds'], axis=1, inplace=True)

test_df_http = pd.DataFrame(scaler.transform(test_df_http), columns=test_df_http.columns)
test_df_http_pca = pd.DataFrame(pca.transform(test_df_http), columns=pca_cols)

Now we extract the test windows and reconstruct them with the loaded autoencoder.

# extracting windows from test data
test_windows = get_windows(test_df_http_pca, window_size=10, stride=10)

# reconstructing test windows using the trained autoencoder
test_windows_pred = autoencoder_loaded.predict(test_windows)

The reconstruction error for each window is computed with pure TensorFlow operations so that it runs faster when a GPU is available.

# calculating reconstruction error for each sample
# implemented in tensorflow for faster execution when gpu is available

def get_recon_errors(true_windows, pred_windows):
  y_true = tf.constant(true_windows.astype(np.float32))
  y_pred = tf.constant(pred_windows.astype(np.float32))
  # Euclidean norm over the feature axis gives one error per time step
  step_errors = tf.norm(y_true - y_pred, ord='euclidean', axis=2)
  # averaging over the time axis gives one error per window
  return tf.math.reduce_mean(step_errors, axis=1).numpy()

recon_errors = get_recon_errors(test_windows, test_windows_pred)
recon_errors = np.array(recon_errors).reshape(-1, 1)

# scaling the set of reconstruction errors to [0, 1] scale
mm_scaler = MinMaxScaler()
anomaly_scores = mm_scaler.fit_transform(recon_errors).flatten()

plt.figure(figsize=(20, 10))
plt.plot(test_labels, c='blue', label='original')
plt.plot(anomaly_scores, c='red', label='predicted')
plt.yticks(np.arange(0, 1.1, 0.1))
plt.xlabel('samples')
plt.ylabel('anomaly score')
plt.grid()
plt.legend()
plt.show()

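Clearly, truly anomalous samples get high anomaly scores while truly normal samples get very low ones. However, 0.5 is not the best decision boundary. The model has only seen normal samples (scores near 0), so it identifies them strongly. This is a typical consequence of class imbalance. If we had a comparable number of anomalous samples, training a classifier with a supervised approach would make sense. From the plot, the decision threshold appears to lie between 0.05 and 0.15.

In a typical classification setting, ROC-AUC (the area under the receiver operating characteristic curve) is the de facto metric of classifier effectiveness, that is, of how well the classifier separates the two classes. The ROC curve plots TPR (true positive rate) against FPR (false positive rate) across thresholds.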
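Each point on an ROC curve is just the (FPR, TPR) pair obtained at one threshold. As a small worked sketch with made-up labels and scores (the helper name tpr_fpr is hypothetical), the two rates can be computed by hand:

```python
import numpy as np

def tpr_fpr(labels, scores, threshold):
    """Return (TPR, FPR) when samples scoring above threshold are flagged."""
    labels = np.asarray(labels)
    predicted = np.asarray(scores) > threshold
    tp = np.sum(predicted & (labels == 1))    # anomalies correctly flagged
    fp = np.sum(predicted & (labels == 0))    # normal samples wrongly flagged
    fn = np.sum(~predicted & (labels == 1))   # anomalies missed
    tn = np.sum(~predicted & (labels == 0))   # normal samples correctly passed
    return tp / (tp + fn), fp / (fp + tn)

# made-up example: three normal windows followed by two anomalous ones
labels = [0, 0, 0, 1, 1]
scores = [0.02, 0.08, 0.30, 0.12, 0.90]
for t in (0.05, 0.10, 0.50):
    print(t, tpr_fpr(labels, scores, t))
```

Raising the threshold lowers both rates; the ROC curve traces that trade-off across all thresholds.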

fpr, tpr, thresholds = roc_curve(test_labels, anomaly_scores)
auc = roc_auc_score(test_labels, anomaly_scores)

#plotting the ROC
plt.figure(figsize=(10,5))
plt.plot([0, 1], [0, 1], color = 'black', linestyle='--')
plt.plot(fpr, tpr, label='AUC={}'.format(auc))
plt.grid()
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.title('ROC')
plt.show()

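This ROC curve shows that our classification workflow separates the classes very well. It is worth noting, however, that ROC-AUC is not always a good indicator of classifier effectiveness, particularly when the classes are heavily imbalanced.

We therefore use the F-1 score, the harmonic mean of precision (which reflects the impact of false positives) and recall (which reflects the impact of false negatives). Finally, we choose the threshold with the highest F-1 score.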

anomaly_combinations = [(anomaly_scores>i).astype(np.int32) for i in thresholds]
f1_scores = [f1_score(test_labels, i) for i in anomaly_combinations]

# plotting f1 score vs thresholds
plt.figure(figsize=(10, 5))
plt.plot(thresholds, f1_scores)
plt.grid()
plt.xlabel('Thresholds')
plt.ylabel('F-1 Score')
plt.title('F-1 Score vs Thresholds')
plt.show()

max_f1_score = np.max(f1_scores)
best_threshold = thresholds[f1_scores.index(max_f1_score)]
print('best threshold = {}'.format(best_threshold))

The code above prints 0.0999448, the threshold that yields the best F-1 score.

For this threshold, the confusion matrix is computed as follows:

anomaly_indicator = (anomaly_scores>best_threshold).astype(np.int32)
confusion_matrix(test_labels, anomaly_indicator)
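For readers who want to see how precision, recall and F-1 follow from a confusion matrix, here is a worked sketch. The counts are hypothetical and are not the article's results; the layout matches what scikit-learn's confusion_matrix returns for binary labels.

```python
import numpy as np

# hypothetical counts: rows are true classes (0 = normal, 1 = anomaly),
# columns are predicted classes in the same order
cm_demo = np.array([[90, 10],
                    [5, 45]])
tn, fp, fn, tp = cm_demo.ravel()

precision_demo = tp / (tp + fp)   # share of flagged windows that are real anomalies
recall_demo = tp / (tp + fn)      # share of real anomalies that were flagged
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
print(precision_demo, recall_demo, f1_demo)
```

Because F-1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it is a reasonable criterion for picking the threshold.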

The workflow's classification metrics are computed as follows:

precision = precision_score(test_labels, anomaly_indicator)
recall = recall_score(test_labels, anomaly_indicator)
f1_sc = f1_score(test_labels, anomaly_indicator)
accuracy_sc = accuracy_score(test_labels, anomaly_indicator)

plt.figure(figsize=(20,10))
sns.scatterplot(x=np.arange(0, len(anomaly_scores)), y= anomaly_scores, hue=['normal' if i==0 else 'anomaly' for i in anomaly_indicator],
                palette=['blue', 'red'], legend='full')
plt.axhline(y = best_threshold, linestyle='--', label='threshold')
plt.xlabel('samples')
plt.ylabel('anomaly score')
plt.legend()
plt.grid()
plt.show()
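The article mentions scoring unlabeled production data but does not show that step. One point to be careful about is that the production errors should be scaled with the minimum and maximum learned on the test set (in the article's terms, mm_scaler.transform rather than a fresh fit), so that the chosen threshold keeps its meaning. A minimal NumPy sketch under that assumption, with made-up error values and a hypothetical helper name:

```python
import numpy as np

def score_new_errors(new_errors, err_min, err_max, threshold):
    """Scale new reconstruction errors with previously fitted bounds, then flag."""
    scores = (np.asarray(new_errors, dtype=float) - err_min) / (err_max - err_min)
    flags = (scores > threshold).astype(np.int32)
    return scores, flags

# made-up reconstruction errors from the labelled test set
test_errors = np.array([0.2, 0.4, 1.0, 5.2])
err_min, err_max = test_errors.min(), test_errors.max()

# made-up reconstruction errors from production windows
prod_errors = np.array([0.3, 0.6, 6.0])
prod_scores, prod_flags = score_new_errors(prod_errors, err_min, err_max, threshold=0.1)
print(prod_scores, prod_flags)
```

Scores for production data can fall outside [0, 1] when an error is larger than anything seen in the test set; such windows simply land well above the threshold.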

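Final thoughts

Although this is an effective way to detect network intrusions, several other advanced deep learning techniques could also be used:

GANs can be used instead of autoencoders to model the data. When trained well, GANs can capture the data distribution more accurately. They can also augment the dataset by generating new examples of the normal or anomalous class.
Autoencoder variants such as sparse and variational autoencoders can be used. They place special constraints on the latent representation, which leads to more robust and effective latent representations.
The approach described above is a pure machine learning method that uses little to no domain knowledge. In any field, combining domain knowledge with ML usually gives the best results.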

