Conversing with Vulnerabilities: AI-Assisted CVE Search

Artificial Intelligence is among the most consequential tools in today’s fast-changing technological landscape, with the potential to redefine how we work, communicate, and innovate. It simplifies complex tasks and drives a new generation of solutions. The Retrieval-Augmented Generation (RAG) approach is one of the techniques leading this shift: it lets AI systems draw on knowledge they were never trained on, easing one of the main limitations of pre-trained models.

The core benefit of the RAG approach lies in its ability to combine the strengths of both retrieval-based and generative information processing. Instead of relying solely on what a pre-trained model already knows, the RAG strategy lets an AI system fetch relevant external knowledge first and then generate an answer from it, which makes the output more likely to be accurate and grounded in context. This synergy of data retrieval and content generation is paving the way for more versatile, informed, and responsive AI systems capable of understanding and catering to nuanced user needs.
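
To make the retrieve-then-generate idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions: plain word overlap stands in for real semantic retrieval, the names `retrieve` and `build_prompt` are made up for illustration, and the final call to a language model is left out entirely. It is not how any particular RAG platform implements the approach.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query.

    Word overlap is a stand-in (assumption) for embedding-based retrieval.
    """
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble a prompt that places the retrieved documents before the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "XSS in a WordPress plugin",
        "Buffer overflow in a proxy server",
        "SQL injection in a forum",
    ]
    print(build_prompt("Which WordPress XSS issues exist?", docs, k=1))
```

The resulting prompt is what would be handed to a generative model, so the answer is built from retrieved records rather than from the model’s memory alone.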

I decided to experiment with RAG to create a new way of analyzing common software vulnerabilities. To achieve that with minimal effort, I used a CVE database export and Arguflow.

What is Arguflow?

Arguflow is a tool designed to enhance how we interact with and utilize data. It is an all-in-one source-available platform for AI-driven semantic search and Retrieval-Augmented Generation (RAG) on user-provided datasets. With Arguflow, you can interact with data in a meaningful and context-aware way rather than simply searching through it.

Arguflow’s adaptability and low barrier to entry are some of its main advantages. It lets you extract insights from any kind of data you have, and thanks to its AI-powered natural-language query workflow, the learning curve is almost non-existent compared to traditional BI tools that rely on specialized languages like SQL. Also, it’s more fun.

Arguflow’s killer feature, in my personal opinion, is its chat-like functionality. This allows users to engage in long-term, natural language sessions with the AI, all while maintaining references to retrieved data and conversation context. Imagine accessing a vast database by simply asking questions like “What are the most common software vulnerabilities in JavaScript applications?” or “Show me trends in renewable energy research from the last decade.”

Arguflow is a software project in active development, licensed under the Business Source License (BSL). The server component performs most of the magic, outsourcing the AI-related tasks to OpenAI APIs or any model that supports the localai package. It is written in fairly readable Rust and is easy to extend to one’s needs. The chat and search interfaces are two separate JavaScript applications that share the same backend, so you upload your data once and can interact with it in both ways.

The Arguflow community, particularly its maintainers, has been quite hands-on and supportive. When I contributed minor improvements to the codebase, I was met with constructive feedback and genuine appreciation. More than just guiding me through the contribution process, they actively engaged with my idea of an AI-powered CVE database. Their collaborative approach included brainstorming and even incorporating some additional features to facilitate the concept. It’s refreshing to see a community that’s not just about pushing a product but genuinely values collaboration and shared growth.

Ingesting the CVE database into Arguflow

Let’s be honest: the official CVE search functionality is not very user-friendly. However, with the CVE database export available to everyone and with Arguflow doing the heavy lifting, it should be relatively easy to set up a CVE-aware AI agent capable of answering a wide range of cybersecurity questions!

Self-hosting Arguflow was quite straightforward. All dependencies are neatly packaged into a Docker Compose setup managed through a helper script; configuration is done through environment variables. Cloning the repo, installing dependencies, and starting the background services didn’t take more than 20 minutes (including some minor hiccups related to my environment).

When it came to populating Arguflow with data, there were two strategies:

Uploading data manually through the user interface
Leveraging the API route for card creation

Considering the sheer volume of CVE records (hundreds of thousands of them), it was clear that manually uploading each card via the UI was impractical. To leverage the API route, I wrote a Python script that reads the CVE records from JSON files and converts them to the “card” structure Arguflow uses.
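
A conversion along these lines can be sketched as follows. It is an illustration rather than the exact script: it assumes the export uses the CVE JSON 5.x record layout (`cveMetadata` plus `containers.cna.descriptions`), and the card fields `content`, `link`, and `metadata` are hypothetical names, not Arguflow’s confirmed card schema.

```python
import json


def cve_to_card(record):
    """Convert one CVE record (assumed CVE JSON 5.x layout) into a card-like dict.

    The output keys are illustrative assumptions, not a confirmed Arguflow schema.
    """
    meta = record.get("cveMetadata", {})
    cna = record.get("containers", {}).get("cna", {})
    cve_id = meta.get("cveId", "UNKNOWN")

    # Take the first English description; fall back to an empty string.
    text = next(
        (
            d.get("value", "")
            for d in cna.get("descriptions", [])
            if d.get("lang", "").startswith("en")
        ),
        "",
    )

    return {
        "content": f"{cve_id}: {text}",
        "link": f"https://www.cve.org/CVERecord?id={cve_id}",
        "metadata": {"cve_id": cve_id, "published": meta.get("datePublished")},
    }


def load_card(path):
    """Read a single CVE JSON file from disk and convert it to a card."""
    with open(path, encoding="utf-8") as handle:
        return cve_to_card(json.load(handle))


if __name__ == "__main__":
    sample = {
        "cveMetadata": {"cveId": "CVE-2021-0001", "datePublished": "2021-01-01"},
        "containers": {
            "cna": {"descriptions": [{"lang": "en", "value": "Example flaw."}]}
        },
    }
    print(cve_to_card(sample))
```

Keeping the CVE identifier in both the text and the metadata means a card can be found by meaning as well as traced back to its official record.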

Creating a card involves several calls: to OpenAI for embeddings, to the Qdrant database for vector storage, and to Postgres for metadata retention. As a result, the script spent a considerable amount of time waiting for the server’s replies. To address this, I divided the input data by year and created separate batches for ingestion. Then I ran 12 ingestion threads concurrently, which significantly reduced the time required. I let the script run overnight, and by morning most of the data had been ingested.
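
The batching idea can be sketched with Python’s standard `concurrent.futures` module. This is a hedged outline, not the script I ran: the `upload` callable is a placeholder for whatever sends a card to the server, and the helper names (`batch_by_year`, `ingest_batch`, `ingest_all`) and the card shape are assumptions made for illustration. Threads suit this workload because each upload mostly waits on network replies.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def year_of(cve_id):
    """Extract the year from an identifier such as 'CVE-2021-0001'."""
    return cve_id.split("-")[1]


def batch_by_year(cards):
    """Group cards into one batch per CVE year."""
    batches = defaultdict(list)
    for card in cards:
        batches[year_of(card["metadata"]["cve_id"])].append(card)
    return dict(batches)


def ingest_batch(cards, upload):
    """Upload every card in a batch sequentially; return how many were sent."""
    sent = 0
    for card in cards:
        upload(card)  # placeholder: a real version would call the server here
        sent += 1
    return sent


def ingest_all(cards, upload, workers=12):
    """Run one ingestion task per year on a pool of worker threads."""
    batches = batch_by_year(cards)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            year: pool.submit(ingest_batch, batch, upload)
            for year, batch in batches.items()
        }
        # result() blocks until the batch finishes and re-raises any error.
        return {year: future.result() for year, future in futures.items()}


if __name__ == "__main__":
    demo = [
        {"metadata": {"cve_id": f"CVE-{year}-{n:04d}"}}
        for year in (2020, 2021)
        for n in range(3)
    ]
    uploaded = []
    print(ingest_all(demo, uploaded.append))
```

A real run would also want retries and a record of failed cards, so that an overnight job can be resumed instead of restarted.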

AI-powered vulnerability expert

After loading the CVE database into Arguflow, I was keen to see how it would handle real-world queries. My first test was a semantic search. I typed in “vulnerabilities in commonly public-facing proxy servers” and waited. The results were impressive. Arguflow pinpointed issues associated with servers like “Java System Web Proxy Server,” “myproxy-server,” and the “CERN Proxy Server.” Not bad.
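
For readers curious about what happens behind a semantic search, the core step is ranking stored vectors by their similarity to the query vector. The sketch below shows that step with cosine similarity in plain Python. The three-dimensional vectors and card names are invented for illustration (real embeddings have hundreds or thousands of dimensions), and a vector database performs this ranking far more efficiently than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, k=3):
    """Return the ids of the k cards whose vectors are closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [card_id for card_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up toy embeddings; a real system would obtain these from a model.
    index = {
        "proxy-card": [0.9, 0.1, 0.0],
        "wordpress-card": [0.1, 0.9, 0.0],
        "kernel-card": [0.0, 0.1, 0.9],
    }
    print(search([1.0, 0.2, 0.0], index, k=2))
```

Because similarity is computed on vectors rather than keywords, a query can match a card that never uses the query’s exact words.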

With that success, I turned my attention to Arguflow’s chat feature. I posed a simple question: “How secure is WordPress?” The AI’s response was thorough. It highlighted three different vulnerabilities linked to WordPress and its plugins, directing me to the specific cards that had all the details. To top it off, it offered some handy advice on improving the security of a WordPress setup.

Wanting to dig deeper, I refined my question to focus on just XSS vulnerabilities within WordPress. Arguflow didn’t disappoint. It provided three examples of such vulnerabilities, all fitting the criteria perfectly. So far, so good.

Now, on to the real challenge: “What cybersecurity challenges are anticipated with the rise of quantum computing?” A simple lookup cannot answer this type of question, since CVE descriptions do not generally mention problems that are not exploitable with current technology. The answer clearly combined the LLM’s own knowledge of quantum computing with data from the CVE database.

Conclusion

Embarking on this project with Arguflow has been both enlightening and enjoyable. The seamless integration of a vast CVE database into a platform like Arguflow showcased the potential of AI in reshaping how we perceive and interact with data. For those keen to explore, the fruits of this endeavor are available for access at cve-search.arguflow.ai and cve-chat.arguflow.ai.

While the project was an exploration, I aim to keep the database updated, ensuring it remains a relevant and valuable resource for all. A special nod to the Arguflow team for their unwavering support throughout this journey. Their collaborative approach not only facilitated the project but also enriched it.

As we reflect on the experiences shared, one thing is clear: the role of AI assistance in various tasks is not just a fleeting trend; it’s the future. Whether it’s cybersecurity, data analysis, or any other domain, the fusion of technology and intuitive AI platforms like Arguflow is set to redefine how we approach research and innovation.

To everyone who joined me on this exploration, whether as a reader or a collaborator, thank you. Here’s to many more adventures at the intersection of AI, data, and boundless curiosity!
