Manuelabenzoni

Overview

Exploring DeepSeek-R1’s Agentic Capabilities Through Code Actions

I ran a quick experiment examining how DeepSeek-R1 performs on agentic jobs, in spite of not supporting tool usage natively, and I was rather satisfied by preliminary outcomes. This experiment runs DeepSeek-R1 in a single-agent setup, where the design not only prepares the actions but likewise creates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% correct, and other models by an even bigger margin:

The experiment followed design use guidelines from the DeepSeek-R1 paper and the design card: Don’t use few-shot examples, avoid including a system prompt, and set the temperature level to 0.5 – 0.7 (0.6 was used). You can find more assessment details here.

Approach

DeepSeek-R1‘s strong coding abilities enable it to serve as a representative without being clearly trained for tool use. By permitting the design to produce actions as Python code, it can flexibly interact with environments through code execution.

Tools are carried out as Python code that is consisted of straight in the timely. This can be a simple function meaning or a module of a larger plan – any legitimate Python code. The model then generates code actions that call these tools.

Results from executing these actions feed back to the design as follow-up messages, driving the next actions up until a final response is reached. The agent structure is an easy iterative coding loop that mediates the conversation between the model and its environment.

Conversations

DeepSeek-R1 is used as chat design in my experiment, where the model autonomously pulls extra context from its environment by utilizing tools e.g. by using a search engine or bring information from web pages. This drives the discussion with the environment that continues till a final response is reached.

On the other hand, o1 designs are known to when utilized as chat designs i.e. they don’t attempt to pull context during a discussion. According to the linked post, o1 models carry out best when they have the full context available, with clear directions on what to do with it.

Initially, I also tried a full context in a single prompt method at each step (with results from previous steps consisted of), however this caused considerably lower ratings on the GAIA subset. Switching to the conversational technique explained above, I was able to reach the reported 65.6% efficiency.

This raises an interesting question about the claim that o1 isn’t a chat model – perhaps this observation was more relevant to older o1 designs that did not have tool usage abilities? After all, isn’t tool usage support an essential mechanism for allowing designs to pull additional context from their environment? This conversational approach certainly appears reliable for DeepSeek-R1, though I still require to carry out similar try outs o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on mathematics and coding tasks, it is remarkable that generalization to agentic tasks with tool usage by means of code actions works so well. This capability to generalize to agentic tasks reminds of recent research by DeepMind that reveals that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn’t investigated because work.

Despite its capability to generalize to tool use, DeepSeek-R1 frequently produces long reasoning traces at each step, compared to other models in my experiments, restricting the usefulness of this design in a single-agent setup. Even simpler tasks sometimes take a long period of time to finish. Further RL on agentic tool usage, be it by means of code actions or not, might be one choice to improve efficiency.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model often switches between various thinking ideas without adequately checking out promising courses to reach a right option. This was a significant factor wiki.rrtn.org for excessively long thinking traces produced by DeepSeek-R1. This can be seen in the recorded traces that are available for download.

Future experiments

I’m also curious about how reasoning models that currently support tool use (like o1, o3, …) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI’s Deep Research or Hugging Face’s open-source Deep Research, which also utilizes code actions, look interesting.