Fine-grained Named Entity Recognition and Knowledge Graph Construction

Paper published at https://dl.acm.org/doi/abs/10.1145/3540250.3558920

cve-ner: Fine-grained Named Entity Recognition

Neo4j-D3-VKG: Vulnerability knowledge graph visualization

1. Introduction

1.1 Project Introduction

This is my machine learning project. The system is a platform that extracts knowledge from the vulnerability descriptions in CVE, the current mainstream vulnerability database, and visualizes the extraction results. The results are displayed as a knowledge graph so that the value of the vulnerability information can be explored in depth, for example along the time dimension, the space dimension, and the vulnerability field dimension.

For example, the text below is the description of CVE-2009-1194, taken from http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1194:

Integer overflow in the pango_glyph_string_set_size function in pango/glyphstring.c in Pango before 1.24 allows context-dependent attackers to cause a denial of service (application crash) or possibly execute arbitrary code via a long glyph string that triggers a heap-based buffer overflow, as demonstrated by a long document.location value in Firefox.

Basically, what I need to do is extract information from the description above: elements such as cause, location, consequence, and version need to be recognized. For this specific instance, the extracted information should look like this:

cause: Integer overflow

location: in the pango_glyph_string_set_size function in pango/glyphstring.c

version: in Pango before 1.24

attacker: context-dependent attackers

consequence: denial of service (application crash) or possibly execute arbitrary code

triggering operation: a long glyph string

After extracting this information and adding some key vulnerability fields from the CVE website, we can construct a knowledge graph and visualize it.

1.2 Previous Steps
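As a rough illustration of how the extracted fields might become graph edges, the sketch below turns the CVE-2009-1194 example into (subject, relation, object) triples. The relation names (HAS_CAUSE and so on) and the function name to_triples are assumptions made for this example, not the project's actual schema.

```python
# Hypothetical sketch: relation naming is an assumption, not the project's schema.
extracted = {
    "cause": "Integer overflow",
    "location": "in the pango_glyph_string_set_size function in pango/glyphstring.c",
    "version": "in Pango before 1.24",
    "attacker": "context-dependent attackers",
    "consequence": "denial of service (application crash) or possibly execute arbitrary code",
    "triggering operation": "a long glyph string",
}


def to_triples(cve_id, entities):
    """Turn one CVE's extracted entities into (subject, relation, object) triples."""
    triples = []
    for label, text in entities.items():
        # "triggering operation" -> "HAS_TRIGGERING_OPERATION"
        relation = "HAS_" + label.upper().replace(" ", "_")
        triples.append((cve_id, relation, text))
    return triples


triples = to_triples("CVE-2009-1194", extracted)
for triple in triples:
    print(triple)
```

Each triple links the CVE identifier to one extracted phrase, which is the shape a graph database import typically expects.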

There is a lot of work to be done before visualizing the knowledge graph.

create own dataset

For this project, the dataset comes from the article "A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries".

data labeling

Of all 3000 records, 1000 were labeled.
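The labeling format is not described here. A common choice for NER is BIO tagging, where the first token of an entity gets a B- tag, the following tokens get I- tags, and everything else gets O. The sketch below assumes that scheme and whitespace tokenization, and uses a shortened version of the CVE-2009-1194 description; the function name bio_tags is hypothetical.

```python
def bio_tags(tokens, spans):
    """Assign BIO tags to tokens; spans maps a label to the entity's token list."""
    tags = ["O"] * len(tokens)
    for label, entity in spans.items():
        n = len(entity)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == entity:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                break  # tag only the first occurrence of each entity
    return tags


# Shortened example sentence (assumption: simple whitespace tokenization).
tokens = "Integer overflow in Pango before 1.24 allows context-dependent attackers".split()
spans = {
    "cause": ["Integer", "overflow"],
    "version": ["in", "Pango", "before", "1.24"],
    "attacker": ["context-dependent", "attackers"],
}
tags = bio_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```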

NER (model training and prediction)

I use Google's pretrained model BERT-base (bert-base-uncased) for the NER task.

Pretrained BERT model: https://huggingface.co/bert-base-uncased/tree/main
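A token classification model outputs one tag per token, so the predictions have to be grouped back into entity phrases before they can be used. The sketch below shows one way to do that, assuming BIO-style tags; the tag format and the function name decode_entities are assumptions, not taken from the project code.

```python
def decode_entities(tokens, tags):
    """Group BIO-tagged tokens into a list of (label, text) entities."""
    entities = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush the one in progress, if any.
            if label:
                entities.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            # "O" or an inconsistent I- tag closes the current entity.
            if label:
                entities.append((label, " ".join(words)))
            label, words = None, []
    if label:
        entities.append((label, " ".join(words)))
    return entities


tokens = ["Integer", "overflow", "in", "Pango", "before", "1.24", "allows"]
tags = ["B-cause", "I-cause", "B-version", "I-version", "I-version", "I-version", "O"]
entities = decode_entities(tokens, tags)
print(entities)
```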

The distribution of the labeled dataset is as follows:

entity type	training set (915)	dev set (102)
version	901	100
consequence	871	94
attacker	823	88
triggering operation	819	86
location	755	84
cause	730	75
happened scenario	64	9
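To make the imbalance between entity types easier to see, the sketch below divides each count by the size of its split. It is not stated whether the counts are records or entity mentions, so the ratios should be read as a rough indication only.

```python
train_total, dev_total = 915, 102

# (training set count, dev set count) per entity type, from the distribution above.
counts = {
    "version": (901, 100),
    "consequence": (871, 94),
    "attacker": (823, 88),
    "triggering operation": (819, 86),
    "location": (755, 84),
    "cause": (730, 75),
    "happened scenario": (64, 9),
}

ratios = {
    label: (train / train_total, dev / dev_total)
    for label, (train, dev) in counts.items()
}

for label, (train_ratio, dev_ratio) in ratios.items():
    print(f"{label:22s} train {train_ratio:6.1%}  dev {dev_ratio:6.1%}")
```

The "happened scenario" type is far rarer than the others in both splits, which may make it harder for the model to learn.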

After tuning the parameters over many training runs, the model performance after 20 epochs is as follows:
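The exact metrics behind this result are not stated in the text. A common choice for NER is entity-level precision, recall, and F1 with exact matching, where a predicted entity counts as correct only if both its label and its text match the gold annotation. The sketch below assumes that definition; the function name prf1 and the sample predictions are made up for illustration.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall, and F1 using exact (label, text) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [
    ("cause", "Integer overflow"),
    ("version", "in Pango before 1.24"),
    ("attacker", "context-dependent attackers"),
]
# Hypothetical model output: one exact match, one boundary error, one miss.
predicted = [
    ("cause", "Integer overflow"),
    ("version", "Pango before 1.24"),
]

precision, recall, f1 = prf1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, the boundary error on the version entity counts as both a false positive and a false negative.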

import data into Neo4j

the data in Neo4j
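One possible way to prepare the import is to generate parameterized Cypher MERGE statements, so that re-running the import does not create duplicate nodes. The node labels (Vulnerability, Entity), the relationship type HAS_ENTITY, and the property names in the sketch below are assumptions; the project's real graph schema is not described here.

```python
# Hypothetical schema: labels, relationship type, and properties are assumptions.
MERGE_QUERY = (
    "MERGE (v:Vulnerability {id: $cve_id}) "
    "MERGE (e:Entity {type: $label, text: $text}) "
    "MERGE (v)-[:HAS_ENTITY]->(e)"
)


def build_statements(cve_id, entities):
    """Return one (query, parameters) pair per extracted entity."""
    return [
        (MERGE_QUERY, {"cve_id": cve_id, "label": label, "text": text})
        for label, text in entities.items()
    ]


statements = build_statements(
    "CVE-2009-1194",
    {"cause": "Integer overflow", "version": "in Pango before 1.24"},
)
for query, parameters in statements:
    print(query, parameters)
```

With the official Neo4j Python driver, each pair could then be executed with session.run(query, parameters). Passing values as parameters rather than formatting them into the query string avoids quoting problems in descriptions that contain quotes or parentheses.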

Once all the previous steps are done, the final step is to visualize the graph in Neo4j.
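D3 force-directed layouts are usually fed a JSON object with a list of nodes and a list of links. The sketch below shows how graph triples might be converted into that shape. The field names (id, type, source, target, relation) follow common D3 examples and are assumptions, as is the function name to_d3.

```python
import json


def to_d3(triples):
    """Convert (subject, relation, object) triples into D3-style nodes and links."""
    nodes, links = {}, []
    for subject, relation, obj in triples:
        # setdefault keeps each node once even if it appears in several triples.
        nodes.setdefault(subject, {"id": subject, "type": "vulnerability"})
        nodes.setdefault(obj, {"id": obj, "type": relation})
        links.append({"source": subject, "target": obj, "relation": relation})
    return {"nodes": list(nodes.values()), "links": links}


graph = to_d3([
    ("CVE-2009-1194", "cause", "Integer overflow"),
    ("CVE-2009-1194", "version", "in Pango before 1.24"),
])
print(json.dumps(graph, indent=2))
```

The type field on each node is what a front end could use to color nodes and to switch whole node types on or off.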

2. User Guide

If the graph does not appear at first, refresh the page a few times until it loads.
Use the mouse wheel to zoom in or out of the graph.
Hover over any node to show all the nodes related to it and the relationships between them; the related information is displayed automatically on the right side.
Use the mode switch button to switch between the two visual representations of nodes: circle or text.
The colored bars on the left represent the different node types, and each On/Off switch shows or hides all nodes of that type.
