Kernelsphere Documentation

Framework Overview

Kernelsphere is an open-source Python framework for helping software interact with real websites through natural-language browser automation.

The framework separates browser automation into reusable components for page understanding, task planning, semantic element selection, browser action execution, state diffing, and goal validation.

The framework-style design makes it easier to build custom browser agents, swap individual modules, test pieces independently, and use Kernelsphere internals inside your own automation stack.

Introduction

Kernelsphere is a web automation agent built on Google Gemini and Playwright. You give it a task in plain text, point it at a URL, and it completes the task in a real Chromium browser. Typical tasks look like "find the price of this laptop" or "get the cheapest flight from Hyderabad to Berlin on March 15".

There are no CSS selectors involved, no XPath, no recorded scripts that break whenever a site updates its layout. The agent reads the page fresh at every step, looks at what is actually there, and decides what to do next. That is the whole idea.

Getting Started

Requirements

You need Python 3.10 or higher, a Google Gemini API key, and Chromium installed through Playwright.

Installation

bash
git clone https://github.com/Kernelsphere-ai/kernelsphere-v2.git
cd kernelsphere-v2
pip install -r requirements.txt
playwright install chromium

Environment Setup

Create a .env file in the project root and add your Gemini API key:

env
GOOGLE_API_KEY=your_gemini_api_key

# Only needed if you are using Browserbase
BROWSERBASE_API_KEY=your_key
BROWSERBASE_PROJECT_ID=your_project_id

GOOGLE_API_KEY is the only variable that is required. Everything else depends on what features you need.
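
A startup check for these variables could look like the sketch below. It assumes the values are already in the process environment (for example, loaded from .env by whatever loader you use). The function name `check_env` is hypothetical and not part of Kernelsphere.

```python
import os

# The only variable every run needs.
REQUIRED = ("GOOGLE_API_KEY",)
# Needed together, and only when Browserbase is used.
BROWSERBASE = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID")


def check_env(env=None, use_browserbase=False):
    """Return the names of required variables that are missing or empty."""
    env = os.environ if env is None else env
    needed = REQUIRED + (BROWSERBASE if use_browserbase else ())
    return [name for name in needed if not env.get(name)]


if __name__ == "__main__":
    missing = check_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Running a check like this before a long batch avoids finding out about a missing key several steps into a task.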

Your First Run

bash
python main.py "Find the starting price of the MacBook Air M2" \
  --start-url "https://www.apple.com"

The agent opens a browser, navigates the site, finds the price, and writes the result to final_output.json. You can watch it happen in real time unless you pass --headless.

How the Agent Works

Understanding what happens internally makes it easier to write good tasks and figure out what went wrong when something fails.

The Step Loop

Every task runs as a series of steps up to the limit you set with --max-steps, which defaults to 30. Each step follows the same sequence.

Observe. The agent takes a screenshot of the current page and runs a JavaScript extraction to collect every visible interactive element: buttons, input fields, links, dropdowns, anything that responds to a click or keypress. Elements get indexed and scored by importance. Submit buttons and search fields rank higher than decorative links. Up to 200 elements make it into the list.

Decide. The screenshot, the element list, a text excerpt of the page, the previous action result, and the original task all go to the configured Gemini model (Gemini 2.0 Flash by default) as a single prompt. The model returns a JSON object specifying which action to take, which element to act on, and its reasoning.

Execute. Playwright runs the action in the browser. The supported actions are navigate, click_element, input_text, select_dropdown, search, scroll, go_back, extract, send_keys, wait, close_cookie_popup, close_popup, and done.

Check. After each action, the agent checks whether the URL changed, whether the DOM hash changed, and whether a new dialog appeared. If nothing changed and the action was not a passive one like wait or scroll, the agent marks it as a failed step and takes note.

Repeat or finish. If the task is done, the agent calls done and writes the answer. Otherwise it moves to the next step with the updated page state and starts the cycle again.
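
The cycle above can be sketched in a few lines. This is an illustration of the control flow only, not Kernelsphere's actual code: `PageState`, `page_changed`, and `run_task` are hypothetical names, and the `observe`, `decide`, and `execute` callables stand in for the Playwright and Gemini calls.

```python
import hashlib
from dataclasses import dataclass

# Actions that are not expected to change the page (per the Check step).
PASSIVE_ACTIONS = {"wait", "scroll"}


@dataclass(frozen=True)
class PageState:
    url: str
    dom: str                  # serialized DOM or visible text
    dialog_open: bool = False

    @property
    def dom_hash(self) -> str:
        return hashlib.sha256(self.dom.encode("utf-8")).hexdigest()


def page_changed(before: PageState, after: PageState) -> bool:
    """True if the URL changed, the DOM hash changed, or a new dialog appeared."""
    return (
        before.url != after.url
        or before.dom_hash != after.dom_hash
        or (after.dialog_open and not before.dialog_open)
    )


def run_task(task, observe, decide, execute, max_steps=30):
    """Observe -> decide -> execute -> check, until `done` or the step limit."""
    history = []
    last_result = None
    for step in range(1, max_steps + 1):
        before = observe()
        decision = decide(task, before, last_result)  # dict with an "action" key
        if decision["action"] == "done":
            return {"success": True, "answer": decision.get("answer"), "steps": history}
        execute(decision)
        after = observe()
        changed = page_changed(before, after)
        # A non-passive action that changed nothing counts as a failed step.
        failed = not changed and decision["action"] not in PASSIVE_ACTIONS
        last_result = {"step": step, "action": decision["action"],
                       "changed": changed, "failed": failed}
        history.append(last_result)
    return {"success": False, "answer": None, "steps": history}
```

Passing the three stages in as callables keeps the loop testable without a browser or an API key.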

Stagnation Detection

Loops are one of the harder problems in browser automation. The agent tracks consecutive identical actions and consecutive no-progress steps. If the same action runs on the same element twice with no page change, the stagnation detector activates. Three consecutive steps where nothing changes (URL, DOM, and dialogs all stay the same) also trigger it.

When stagnation is detected, the agent switches to a recovery prompt that pushes it toward a different element, a different approach, or navigating backward. Most loops resolve within one or two recovery steps.
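
The two triggers described above can be modeled with a small counter class. This is a sketch under assumptions: the class name, method name, and reset behavior are hypothetical, and the real detector may track more signals.

```python
class StagnationDetector:
    """Flags loops: a repeated action on the same element, or several idle steps."""

    def __init__(self, repeat_limit=2, no_progress_limit=3):
        self.repeat_limit = repeat_limit            # same action + element, no change
        self.no_progress_limit = no_progress_limit  # consecutive unchanged steps
        self._last_key = None
        self._repeats = 0
        self._no_progress = 0

    def record(self, action, element_index, page_changed):
        """Record one step; return True when the recovery prompt should be used."""
        if page_changed:
            # Any visible progress clears both counters (an assumption).
            self._last_key, self._repeats, self._no_progress = None, 0, 0
            return False
        self._no_progress += 1
        key = (action, element_index)
        self._repeats = self._repeats + 1 if key == self._last_key else 1
        self._last_key = key
        return (
            self._repeats >= self.repeat_limit
            or self._no_progress >= self.no_progress_limit
        )
```

Whether passive actions such as wait and scroll should count toward the no-progress limit is a design choice this sketch leaves open. Here every unchanged step counts.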

Extraction

When the agent reaches the page with the information it needs, it calls the extract action with a description of what to pull out. Five strategies run in sequence, and the first one that returns a valid result wins.

The first strategy targets elements by data-testid, itemprop, and aria-label attributes matched to the goal type. The second finds card or article containers and extracts fields from inside them. The third scans the viewport for price and rating patterns using JavaScript. The fourth runs regex patterns over the raw page text. The fifth does a full-page pass combining headings, prices, and ratings.

Once extraction returns something valid, that result is checked against the original task. If it satisfies the request, the agent calls done.
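
The "first valid result wins" pattern is a simple fallback chain. The sketch below shows that pattern with two stand-in strategies operating on plain text. All names are hypothetical, and the real strategies work against the live DOM rather than a string.

```python
import re


def run_strategies(strategies, page_text, is_valid=bool):
    """Try each (name, function) pair in order; return the first valid result."""
    for name, strategy in strategies:
        result = strategy(page_text)
        if is_valid(result):
            return name, result
    return None, None


# A dollar amount such as $1,099 or $19.99.
PRICE_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def attribute_lookup(page_text):
    """Stand-in for the attribute strategy; it needs a DOM, so it finds nothing here."""
    return None


def regex_price(page_text):
    """Stand-in for the regex strategy: the first dollar amount in the raw text."""
    match = PRICE_RE.search(page_text)
    return match.group(0) if match else None


STRATEGIES = [("attributes", attribute_lookup), ("regex", regex_price)]
```

Returning the strategy name alongside the value makes it easy to log which strategy produced the answer.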

Running Tasks

Basic Usage

bash
python main.py "your task here" --start-url "https://example.com"

All Flags

| Flag | Default | Description |
| --- | --- | --- |
| --start-url | required | Where to begin |
| --max-steps | 30 | Steps before stopping |
| --headless | false | Run without a visible browser window |
| --model | gemini-2.0-flash-exp | Which Gemini model to use |
| --output | final_output.json | Path for the result file |
| --use-browserbase | false | Use cloud browser with CAPTCHA handling |
| --browserbase-timeout | 600 | Session timeout in seconds, max 21600 |
| --use-proxy | false | Rotate proxies between requests |
| --proxy-country | none | Preferred country code, e.g. US, DE |
| --enable-proxy-health-check | false | Background monitoring of proxy pool |
| --viewport-width | 1280 | Browser width in pixels |
| --viewport-height | 720 | Browser height in pixels |

Output Format

Every run writes a JSON file with the result and the full step history.

json
{
  "task": "Find the starting price of the MacBook Air M2",
  "start_url": "https://www.apple.com",
  "success": true,
  "total_steps": 7,
  "final": {
    "final_answer": "MacBook Air M2 starts at $1,099.",
    "reasoning": "Found on the Mac product page under MacBook Air"
  },
  "steps": [
    {
      "step": 1,
      "url": "https://www.apple.com",
      "title": "Apple",
      "actions": [
        {
          "action": "click_element",
          "success": true,
          "url_changed": true,
          "dom_changed": true
        }
      ]
    }
  ]
}

success is true when the agent found and validated the answer. When it is false, it means the agent either ran out of steps or could not reach the target information. The steps array shows every action taken in order, which is the most useful thing to look at when a task fails.
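
When a run fails, a short script can pull the unsuccessful actions out of the steps array. This sketch relies only on the fields shown in the example output above. The function names are hypothetical.

```python
import json


def failed_actions(result):
    """Yield (step number, url, action name) for every action that did not succeed."""
    for step in result.get("steps", []):
        for action in step.get("actions", []):
            if not action.get("success", False):
                yield step.get("step"), step.get("url"), action.get("action")


def summarize(path="final_output.json"):
    """Print the overall outcome and each failed action from a result file."""
    with open(path, encoding="utf-8") as fh:
        result = json.load(fh)
    print("success:", result.get("success"), "| steps:", result.get("total_steps"))
    for step_no, url, name in failed_actions(result):
        print(f"  step {step_no}: {name} failed at {url}")
```

Call `summarize()` from the directory that contains final_output.json, or pass the path you gave to --output.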

CAPTCHA and Bot Detection

Stealth Configuration

Every browser session starts with a stealth configuration applied automatically. This disables the AutomationControlled Chromium flag, masks navigator.webdriver in JavaScript, randomizes the user agent across a set of realistic Chrome and Firefox strings, and spoofs canvas fingerprinting. For a large portion of sites, this is enough to get through without any issues.

Browserbase

For sites with active bot detection like Cloudflare challenges, hCaptcha, or reCAPTCHA, local stealth often is not enough. Browserbase runs cloud browser sessions with built-in CAPTCHA solving and residential proxy routing.

bash
python main.py "your task" \
  --start-url "https://example.com" \
  --use-browserbase

Browserbase mode needs both credentials in your .env file:

env
BROWSERBASE_API_KEY=your_key
BROWSERBASE_PROJECT_ID=your_project_id

Session timeout defaults to 600 seconds. For longer tasks, increase it:

bash
python main.py "your task" --start-url "https://example.com" \
  --use-browserbase --browserbase-timeout 3600

Manual Fallback

If you are running locally without Browserbase and a CAPTCHA appears, the agent pauses and waits. The default wait is 120 seconds, which gives you time to solve it manually in the browser window. Pass --captcha-max-wait 300 if you need more time.

Proxies

Setup

The proxy manager supports ProxyEmpire, Smartproxy, Oxylabs, Webshare, Proxy6, and custom providers. Add proxies to your .env file:

env
PROXY_LIST=host1:port1:user1:pass1,host2:port2:user2:pass2
PROXY_PROVIDER=smartproxy
PROXY_TYPE=residential

Enable proxy rotation per run:

bash
python main.py "your task" \
  --start-url "https://example.com" \
  --use-proxy \
  --proxy-country US

Health Tracking

The proxy manager tracks success and failure counts per proxy. When a proxy fails three consecutive sessions, it gets marked unhealthy and skipped. Healthy proxies are always preferred. If no proxies are available in the requested country, it falls back to any available healthy proxy.
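
The health rules above can be sketched as follows. The class and function names are hypothetical. Two details are assumptions rather than documented behavior: a success resets the consecutive-failure count, and the first healthy candidate is chosen instead of rotating among them.

```python
from dataclasses import dataclass

UNHEALTHY_AFTER = 3  # consecutive failed sessions


@dataclass
class Proxy:
    host: str
    country: str
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < UNHEALTHY_AFTER


def record_result(proxy: Proxy, ok: bool) -> None:
    """Update the per-proxy counters after a session."""
    if ok:
        proxy.successes += 1
        proxy.consecutive_failures = 0  # assumption: a success clears the streak
    else:
        proxy.failures += 1
        proxy.consecutive_failures += 1


def pick_proxy(pool, country=None):
    """Prefer a healthy proxy in `country`; otherwise fall back to any healthy one."""
    healthy = [p for p in pool if p.healthy]
    if country:
        in_country = [p for p in healthy if p.country == country]
        if in_country:
            return in_country[0]
    return healthy[0] if healthy else None
```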

Pass --enable-proxy-health-check to run background health monitoring. This starts a separate thread that periodically tests each proxy and updates its status.

Site-Specific Handlers

Three sites have dedicated automation modules because their interfaces do not respond reliably to the generic agent flow.

Google Flights

Google Flights uses autocomplete fields for origin and destination that require typing, waiting for dropdown suggestions, and selecting the right entry. The date picker is a multi-step calendar widget. Cabin class and passenger count each have their own interaction patterns.

The dedicated handler manages the full sequence: origin field, then destination, then departure date, then return date if it is a round trip, then cabin class, then search, then result extraction. When your task starts at google.com/flights or google.com/travel/flights, this handler activates automatically.
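
A URL check for the two activation paths might look like this. It is only a guess at the matching logic: the function name is hypothetical, and the real handler may also accept localized Google domains, which this sketch does not.

```python
from urllib.parse import urlparse

FLIGHTS_PATHS = ("/flights", "/travel/flights")
GOOGLE_HOSTS = ("google.com", "www.google.com")


def is_google_flights(url: str) -> bool:
    """True for google.com/flights and google.com/travel/flights (and sub-paths)."""
    # Accept scheme-less input such as "google.com/flights".
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    if host not in GOOGLE_HOSTS:
        return False
    path = parsed.path.rstrip("/")
    return any(path == p or path.startswith(p + "/") for p in FLIGHTS_PATHS)
```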

Google Maps

The Maps handler covers place search, nearby search, directions, and review extraction. It pulls structured data from place cards including name, rating, address, phone number, and website.

Booking.com

Hotel search on Booking requires setting check-in and check-out dates, guest count, and room count across a multi-step form before results appear. The handler manages the full form sequence and extracts pricing and availability from the results.

Running the WebVoyager Benchmark

The Dataset

WebVoyager is a benchmark of 643 tasks across 15 websites. Each task is a natural language question paired with a starting URL. The dataset is at data/WebVoyager_data.jsonl.

The 15 sites are Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking.com, Cambridge Dictionary, Coursera, ESPN, GitHub, Google Flights, Google Maps, Google Search, Hugging Face, and Wolfram Alpha.

Running It

bash
python webvoyager_runner.py \
  --tasks-file data/WebVoyager_data.jsonl \
  --output-dir results \
  --max-steps 30 \
  --model gemini-2.0-flash-exp

To run only specific websites:

bash
python webvoyager_runner.py \
  --tasks-file data/WebVoyager_data.jsonl \
  --websites "Amazon" "GitHub" "Allrecipes"

To filter by task ID range, which is useful for resuming an interrupted run:

bash
python webvoyager_runner.py \
  --tasks-file data/WebVoyager_data.jsonl \
  --task-id-start 100 \
  --task-id-end 200

Several tasks in the dataset reference specific dates for hotel and flight searches. Those dates are in the past now. Update them before running, otherwise those tasks will fail immediately.
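
To find the tasks that need new dates, you can scan the file for date-like phrases and review the hits by hand. The sketch below assumes each JSONL line has an `id` key and a `ques` key holding the task text. Those key names are assumptions about the dataset layout, so check them against your copy. The pattern only covers English month-name dates such as "December 25, 2023".

```python
import json
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|September|October|"
    "November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec"
)
# Month name, day, optional ordinal suffix, optional ", YYYY".
DATE_RE = re.compile(rf"\b(?:{MONTHS})\.? \d{{1,2}}(?:st|nd|rd|th)?(?:, \d{{4}})?")


def find_dated_tasks(lines, id_key="id", text_key="ques"):
    """Return (task id, [date strings]) for every task whose text mentions a date."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        dates = DATE_RE.findall(task.get(text_key, ""))
        if dates:
            hits.append((task.get(id_key), dates))
    return hits


if __name__ == "__main__":
    with open("data/WebVoyager_data.jsonl", encoding="utf-8") as fh:
        for task_id, dates in find_dated_tasks(fh):
            print(task_id, dates)
```

The script only reports candidates. Choosing replacement dates is left to you, since a sensible stay length or travel window depends on the task.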

Parallel Execution

Running Multiple Tasks at Once

Default concurrency is 3 workers. Each browser instance uses roughly 300 to 500 MB of memory, so the right concurrency limit depends on available RAM.

bash
python parallel_runner.py \
  --tasks-file data/tasks.jsonl \
  --output-dir results \
  --concurrency 5 \
  --max-steps 30

Failed tasks retry automatically with exponential backoff before being marked as failed.
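
Bounded concurrency with retry and exponential backoff can be sketched with asyncio as shown below. This is not the parallel_runner.py implementation: the function names, the attempt count, and the delay values are all assumptions chosen for illustration.

```python
import asyncio
import random


async def run_with_retry(task_fn, max_attempts=3, base_delay=1.0,
                         max_delay=30.0, sleep=asyncio.sleep):
    """Await task_fn(); on failure wait base_delay * 2**(attempt - 1) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await task_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller mark the task as failed
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so workers do not all retry at the same moment.
            await sleep(delay + random.uniform(0, delay * 0.1))


async def run_all(task_fns, concurrency=3, **retry_kwargs):
    """Run tasks with at most `concurrency` in flight; never raises per-task errors."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(fn):
        async with semaphore:
            try:
                result = await run_with_retry(fn, **retry_kwargs)
                return {"success": True, "result": result}
            except Exception as exc:
                return {"success": False, "error": str(exc)}

    return await asyncio.gather(*(worker(fn) for fn in task_fns))
```

The semaphore is what caps memory use: with a limit of 3, at most three browser sessions exist at once regardless of how many tasks are queued.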

Splitting Large Files

For large task files, split them into batches before running:

bash
python batch_processor.py split data/tasks.jsonl --batch-size 50
python batch_processor.py split data/tasks.jsonl --by-website

After multiple batch runs, merge the results into one file:

bash
python batch_processor.py aggregate results/batch_1 results/batch_2 --output combined.json

Evaluation

After a benchmark run, use GPT-4V to evaluate whether answers are correct. The evaluator sends the task, the agent's final answer, and the last N screenshots to GPT-4V, which returns a SUCCESS or NOT SUCCESS verdict for each task.

bash
python auto_eval.py \
  --process_dir results \
  --api_key your_openai_key \
  --api_model gpt-4-vision-preview \
  --max_attached_imgs 3

--max_attached_imgs controls how many screenshots from the end of the session get sent to the evaluator. Three is usually sufficient. For tasks where the visual end state matters a lot, increase it.

Logging

Each task writes a step-by-step log with the action taken, the element targeted, the reasoning, the result, and a timestamp. Logs are organized by website:

text
logs/
  Amazon/
    Amazon--task_1.json
    Amazon--task_2.json
  GitHub/
    GitHub--task_1.json

Cross-run statistics are tracked in task_tracker.json, which records success and failure counts per website across all runs. That file is the quickest way to see where the agent is having consistent trouble.

Troubleshooting

The agent keeps repeating the same action. The stagnation detector should catch this before it gets too far. If it does not, check the step log to see which element it is targeting. Usually the element is not responding because a popup or overlay is sitting in front of it. Increase --max-steps if the task needs more room to recover.

Tasks fail on sites with login walls. Check that OTP_EMAIL and OTP_EMAIL_PASSWORD are set if the site sends verification codes. Also confirm that the agent is actually reaching the login page. Some sites redirect away from the --start-url you passed, so the agent may end up on a different page than you expect.

CAPTCHA failures. Switch to --use-browserbase. If that is not an option, run without --headless so you can solve CAPTCHAs manually during the pause window.

Memory errors during parallel runs. Reduce --concurrency. Three workers run stably on most machines. Five is fine if you have around 8 GB of free memory. Going higher than that without plenty of free RAM will cause instability.

Benchmark tasks fail because of outdated dates. Update the dates directly in WebVoyager_data.jsonl before running. The flight and hotel tasks have hardcoded dates from when the benchmark was created.

OTP code not found. The default timeout is 60 seconds. If your email provider takes longer to deliver, pass --otp-timeout 120 to extend the wait.