Kernelsphere Documentation
Framework Overview
Kernelsphere is an open-source Python framework for helping software interact with real websites through natural-language browser automation.
The framework separates browser automation into reusable components for page understanding, task planning, semantic element selection, browser action execution, state diffing, and goal validation.
The framework-style design makes it easier to build custom browser agents, swap individual modules, test pieces independently, and use Kernelsphere internals inside your own automation stack.
Introduction
Kernelsphere is a web automation agent built on Google Gemini and Playwright. You give it a task in plain text, point it at a URL, and it completes the task in a real Chromium browser. Tasks like "find the price of this laptop" or "get the cheapest flight from Hyderabad to Berlin on March 15" are the kinds of things it handles.
There are no CSS selectors involved, no XPath, no recorded scripts that break whenever a site updates its layout. The agent reads the page fresh at every step, looks at what is actually there, and decides what to do next. That is the whole idea.
Getting Started
Requirements
You need Python 3.10 or higher, a Google Gemini API key, and Chromium installed through Playwright.
Installation
git clone https://github.com/Kernelsphere-ai/kernelsphere-v2.git
cd kernelsphere-v2
pip install -r requirements.txt
playwright install chromium
Environment Setup
Create a .env file in the project root and add your Gemini API key:
GOOGLE_API_KEY=your_gemini_api_key
# Only needed if you are using Browserbase
BROWSERBASE_API_KEY=your_key
BROWSERBASE_PROJECT_ID=your_project_id
GOOGLE_API_KEY is the only variable that is required. Everything else depends on what features you need.
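A quick pre-flight check can catch a missing key before a browser ever opens. The sketch below is not part of Kernelsphere; the helper name `missing_env_vars` is hypothetical, and it assumes the values from .env have already been loaded into a mapping such as `os.environ`.

```python
# Hypothetical pre-flight check; not part of the Kernelsphere codebase.
REQUIRED_VARS = ("GOOGLE_API_KEY",)
BROWSERBASE_VARS = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID")


def missing_env_vars(env, use_browserbase=False):
    """Return the names of needed variables that are unset or empty in `env`."""
    needed = REQUIRED_VARS + (BROWSERBASE_VARS if use_browserbase else ())
    return [name for name in needed if not env.get(name)]
```

Calling `missing_env_vars(os.environ, use_browserbase=True)` before a Browserbase run returns an empty list when everything is in place.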
Your First Run
python main.py "Find the starting price of the MacBook Air M2" \
--start-url "https://www.apple.com"
The agent opens a browser, navigates the site, finds the price, and writes the result to final_output.json. You can watch it happen in real time unless you pass --headless.
How the Agent Works
Understanding what happens internally makes it easier to write good tasks and figure out what went wrong when something fails.
The Step Loop
Every task runs as a series of steps up to the limit you set with --max-steps, which defaults to 30. Each step follows the same sequence.
Observe. The agent takes a screenshot of the current page and runs a JavaScript extraction to collect every visible interactive element: buttons, input fields, links, dropdowns, anything that responds to a click or keypress. Elements get indexed and scored by importance. Submit buttons and search fields rank higher than decorative links. Up to 200 elements make it into the list.
Decide. The screenshot, the element list, a text excerpt of the page, the previous action result, and the original task all go to the configured Gemini model (gemini-2.0-flash-exp by default) as a single prompt. The model returns a JSON object specifying which action to take, which element to act on, and its reasoning.
Execute. Playwright runs the action in the browser. The supported actions are navigate, click_element, input_text, select_dropdown, search, scroll, go_back, extract, send_keys, wait, close_cookie_popup, close_popup, and done.
Check. After each action, the agent checks whether the URL changed, whether the DOM hash changed, and whether a new dialog appeared. If nothing changed and the action was not a passive one like wait or scroll, the agent marks it as a failed step and takes note.
Repeat or finish. If the task is done, the agent calls done and writes the answer. Otherwise it moves to the next step with the updated page state and starts the cycle again.
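The cycle above can be sketched in a few lines. This is a simplified illustration, not the actual Kernelsphere loop: `observe`, `decide`, and `execute` are hypothetical callables standing in for the page snapshot, the Gemini call, and the Playwright action, and `PageState` is an assumed shape for what the check step compares.

```python
from dataclasses import dataclass

# Actions that are allowed to leave the page unchanged without counting as failures.
PASSIVE_ACTIONS = {"wait", "scroll"}


@dataclass(frozen=True)
class PageState:
    url: str
    dom_hash: str
    dialog_open: bool = False


def page_changed(before, after):
    """True if the URL changed, the DOM hash changed, or a new dialog appeared."""
    return (
        before.url != after.url
        or before.dom_hash != after.dom_hash
        or (after.dialog_open and not before.dialog_open)
    )


def run_steps(observe, decide, execute, max_steps=30):
    """Run observe -> decide -> execute -> check until `done` or the step limit."""
    history = []
    last_result = None
    for step in range(1, max_steps + 1):
        before = observe()
        action = decide(before, last_result)  # e.g. {"action": "click_element", "index": 3}
        if action["action"] == "done":
            return {"success": True, "answer": action.get("answer"), "steps": history}
        execute(action)
        after = observe()
        changed = page_changed(before, after)
        failed = not changed and action["action"] not in PASSIVE_ACTIONS
        last_result = {"step": step, "action": action["action"],
                       "changed": changed, "failed": failed}
        history.append(last_result)
    # Step limit reached without the model calling `done`.
    return {"success": False, "answer": None, "steps": history}
```

Passing the three stages in as callables keeps the loop testable without a browser or an API key.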
Stagnation Detection
Loops are one of the harder problems in browser automation. The agent tracks consecutive identical actions and consecutive no-progress steps. If the same action runs on the same element twice with no page change, the stagnation detector activates. Three consecutive steps where nothing changes (URL, DOM, and dialogs all stay the same) also trigger it.
When stagnation is detected, the agent switches to a recovery prompt that pushes it toward a different element, a different approach, or navigating backward. Most loops resolve within one or two recovery steps.
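The two triggers described above fit in a small counter class. This is a minimal sketch under the stated thresholds (two identical no-change actions, three no-change steps); the class and method names are hypothetical, and the real detector may track more signals.

```python
class StagnationDetector:
    """Flags a stuck agent from repeated actions or consecutive no-progress steps."""

    def __init__(self, repeat_limit=2, no_progress_limit=3):
        self.repeat_limit = repeat_limit
        self.no_progress_limit = no_progress_limit
        self._last_key = None
        self._repeats = 0
        self._no_progress = 0

    def record(self, action, element_index, page_changed):
        """Record one step; return True if the agent looks stuck."""
        if page_changed:
            # Any visible progress resets both counters.
            self._last_key = None
            self._repeats = 0
            self._no_progress = 0
            return False
        key = (action, element_index)
        self._repeats = self._repeats + 1 if key == self._last_key else 1
        self._last_key = key
        self._no_progress += 1
        return (self._repeats >= self.repeat_limit
                or self._no_progress >= self.no_progress_limit)
```

When `record` returns True, the caller would swap in the recovery prompt for the next step.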
Extraction
When the agent reaches the page with the information it needs, it calls the extract action with a description of what to pull out. Five strategies run in sequence, and the first one that returns a valid result wins.
The first strategy targets elements by data-testid, itemprop, and aria-label attributes matched to the goal type. The second finds card or article containers and extracts fields from inside them. The third scans the viewport for price and rating patterns using JavaScript. The fourth runs regex patterns over the raw page text. The fifth does a full-page pass combining headings, prices, and ratings.
Once extraction returns something valid, that result is checked against the original task. If it satisfies the request, the agent calls done.
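The "first valid result wins" pattern is a plain ordered fallback chain. The sketch below shows it with two toy strategies standing in for the five real ones; the function names, the `page` dictionary shape, and the price regex are all assumptions made for illustration.

```python
import re

# Matches prices such as $999, $1,099, or $91.58.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def by_attributes(page):
    """Toy stand-in for attribute-targeted extraction."""
    return page.get("attributes", {}).get("price")


def by_regex(page):
    """Toy stand-in for the regex pass over raw page text."""
    match = PRICE_RE.search(page.get("text", ""))
    return match.group(0) if match else None


def first_valid(strategies, page, is_valid=bool):
    """Run strategies in order; return (name, result) for the first valid one."""
    for name, strategy in strategies:
        result = strategy(page)
        if is_valid(result):
            return name, result
    return None, None


STRATEGIES = [("attributes", by_attributes), ("regex", by_regex)]
```

Ordering the list from most precise to most permissive means the regex pass only runs when structured attributes yield nothing.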
Running Tasks
Basic Usage
python main.py "your task here" --start-url "https://example.com"
All Flags
| Flag | Default | Description |
|---|---|---|
| --start-url | required | Where to begin |
| --max-steps | 30 | Steps before stopping |
| --headless | false | Run without a visible browser window |
| --model | gemini-2.0-flash-exp | Which Gemini model to use |
| --output | final_output.json | Path for the result file |
| --use-browserbase | false | Use cloud browser with CAPTCHA handling |
| --browserbase-timeout | 600 | Session timeout in seconds, max 21600 |
| --captcha-max-wait | 120 | Seconds to wait for a manual CAPTCHA solve |
| --otp-timeout | 60 | Seconds to wait for an emailed verification code |
| --use-proxy | false | Rotate proxies between requests |
| --proxy-country | none | Preferred country code, e.g. US, DE |
| --enable-proxy-health-check | false | Background monitoring of proxy pool |
| --viewport-width | 1280 | Browser width in pixels |
| --viewport-height | 720 | Browser height in pixels |
Output Format
Every run writes a JSON file with the result and the full step history.
{
  "task": "Find the starting price of the MacBook Air M2",
  "start_url": "https://www.apple.com",
  "success": true,
  "total_steps": 7,
  "final": {
    "final_answer": "MacBook Air M2 starts at $1,099.",
    "reasoning": "Found on the Mac product page under MacBook Air"
  },
  "steps": [
    {
      "step": 1,
      "url": "https://www.apple.com",
      "title": "Apple",
      "actions": [
        {
          "action": "click_element",
          "success": true,
          "url_changed": true,
          "dom_changed": true
        }
      ]
    }
  ]
}
success is true when the agent found and validated the answer. When it is false, it means the agent either ran out of steps or could not reach the target information. The steps array shows every action taken in order, which is the most useful thing to look at when a task fails.
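When a run fails, a short script can pull the interesting steps out of the result file. The sketch below relies only on the fields shown in the sample output; `summarize_run` is a hypothetical helper, not something shipped with Kernelsphere.

```python
import json


def load_result(path="final_output.json"):
    """Read a result file written by a run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_run(result):
    """Collect actions that failed or that succeeded without changing the page."""
    failed, no_effect = [], []
    for step in result.get("steps", []):
        for action in step.get("actions", []):
            label = (step.get("step"), action.get("action"))
            if not action.get("success", False):
                failed.append(label)
            elif not (action.get("url_changed") or action.get("dom_changed")):
                no_effect.append(label)
    return {
        "success": result.get("success", False),
        "failed": failed,
        "no_effect": no_effect,
    }
```

Entries in `no_effect` are not always problems, since wait and scroll legitimately leave the page unchanged, but a run of them on click_element usually marks the point where the agent got stuck.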
CAPTCHA and Bot Detection
Stealth Configuration
Every browser session starts with a stealth configuration applied automatically. This disables the AutomationControlled Chromium flag, masks navigator.webdriver in JavaScript, randomizes the user agent across a set of realistic Chrome and Firefox strings, and spoofs canvas fingerprinting. For a large portion of sites, this is enough to get through without any issues.
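For readers building their own Playwright sessions, the first three measures can be approximated as below. This is a sketch, not Kernelsphere's stealth module: the function name, the user-agent pool, and the returned dictionary shape are assumptions, and canvas spoofing is left out entirely.

```python
import random

# Illustrative user-agent strings; a real pool would be larger and kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Intended to run before any page script so navigator.webdriver reads as undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def build_stealth_config(rng=random, width=1280, height=720):
    """Return launch args, a random user agent, a viewport, and an init script."""
    return {
        "launch_args": ["--disable-blink-features=AutomationControlled"],
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": {"width": width, "height": height},
        "init_script": HIDE_WEBDRIVER_JS,
    }
```

With Playwright's Python API, these values map onto `chromium.launch(args=...)`, `browser.new_context(user_agent=..., viewport=...)`, and `context.add_init_script(...)`.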
Browserbase
For sites with active bot detection like Cloudflare challenges, hCaptcha, or reCAPTCHA, local stealth often is not enough. Browserbase runs cloud browser sessions with built-in CAPTCHA solving and residential proxy routing.
python main.py "your task" \
--start-url "https://example.com" \
--use-browserbase
Browserbase mode needs these two variables in your .env file:
BROWSERBASE_API_KEY=your_key
BROWSERBASE_PROJECT_ID=your_project_id
Session timeout defaults to 600 seconds. For longer tasks, increase it:
--browserbase-timeout 3600
Manual Fallback
If you are running locally without Browserbase and a CAPTCHA appears, the agent pauses and waits. The default wait is 120 seconds, which gives you time to solve it manually in the browser window. Pass --captcha-max-wait 300 if you need more time.
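The pause behaves like a poll-with-deadline loop. The sketch below shows one way to write it; `captcha_present` is a hypothetical callable that checks the page, and the injectable `clock` and `sleep` exist only so the loop can be tested without real waiting.

```python
import time


def wait_for_captcha_cleared(captcha_present, max_wait=120.0, poll_interval=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll until the CAPTCHA is gone; return False if `max_wait` seconds pass first."""
    deadline = clock() + max_wait
    while captcha_present():
        if clock() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

A False return is the point at which a run would give up on the page rather than wait indefinitely.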
Proxies
Setup
The proxy manager supports ProxyEmpire, Smartproxy, Oxylabs, Webshare, Proxy6, and custom providers. Add proxies to your .env file:
PROXY_LIST=host1:port1:user1:pass1,host2:port2:user2:pass2
PROXY_PROVIDER=smartproxy
PROXY_TYPE=residential
Enable proxy rotation per run:
python main.py "your task" \
--start-url "https://example.com" \
--use-proxy \
--proxy-country US
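The PROXY_LIST format from the setup step packs four colon-separated fields into each comma-separated entry. A parser for that format might look like the sketch below; the `Proxy` dataclass and `parse_proxy_list` are hypothetical names, and the `http://` scheme in `server` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    host: str
    port: int
    username: str
    password: str

    @property
    def server(self):
        # Scheme is assumed; adjust for SOCKS or HTTPS proxies.
        return f"http://{self.host}:{self.port}"


def parse_proxy_list(raw):
    """Parse 'host:port:user:pass,host:port:user:pass' into Proxy objects."""
    proxies = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split at most three times so a colon inside the password survives.
        parts = entry.split(":", 3)
        if len(parts) != 4:
            raise ValueError(f"expected host:port:user:pass, got {entry!r}")
        host, port, username, password = parts
        proxies.append(Proxy(host, int(port), username, password))
    return proxies
```

Because entries are comma-separated, a password containing a comma cannot be expressed in this format.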
Health Tracking
The proxy manager tracks success and failure counts per proxy. When a proxy fails three consecutive sessions, it gets marked unhealthy and skipped. Healthy proxies are always preferred. If no proxies are available in the requested country, it falls back to any available healthy proxy.
Pass --enable-proxy-health-check to run background health monitoring. This starts a separate thread that periodically tests each proxy and updates its status.
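The health rules above reduce to a failure counter per proxy plus a country-aware pick. The sketch below is illustrative only: `ProxyPool` is a hypothetical class, it assumes a single success resets the failure streak, and it returns the first candidate instead of rotating.

```python
class ProxyPool:
    """Tracks consecutive failures per proxy and prefers healthy, in-country ones."""

    def __init__(self, proxies, max_failures=3):
        # `proxies` maps a proxy name to its country code, e.g. {"p1": "US"}.
        self._country = dict(proxies)
        self._failures = {name: 0 for name in self._country}
        self.max_failures = max_failures

    def record(self, name, ok):
        """Reset the streak on success (an assumption), extend it on failure."""
        self._failures[name] = 0 if ok else self._failures[name] + 1

    def is_healthy(self, name):
        return self._failures[name] < self.max_failures

    def pick(self, country=None):
        """Return a healthy proxy, preferring `country`; None if none are healthy."""
        healthy = [name for name in self._country if self.is_healthy(name)]
        preferred = [name for name in healthy
                     if country and self._country[name] == country]
        candidates = preferred or healthy
        return candidates[0] if candidates else None
```

A background health check would call `record` from its own thread, so a real implementation needs a lock around the counters.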
Site-Specific Handlers
Three sites have dedicated automation modules because their interfaces do not respond reliably to the generic agent flow.
Google Flights
Google Flights uses autocomplete fields for origin and destination that require typing, waiting for dropdown suggestions, and selecting the right entry. The date picker is a multi-step calendar widget. Cabin class and passenger count each have their own interaction patterns.
The dedicated handler manages the full sequence: origin field, then destination, then departure date, then return date if it is a round trip, then cabin class, then search, then result extraction. When your task starts at google.com/flights or google.com/travel/flights, this handler activates automatically.
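The activation rule is a URL match on the two paths named above. One way to express it is sketched below; `is_google_flights` is a hypothetical helper, and the real handler may match more hosts (for example country-specific Google domains).

```python
from urllib.parse import urlparse

FLIGHTS_PATHS = ("/flights", "/travel/flights")


def is_google_flights(url):
    """True for google.com/flights or google.com/travel/flights, with or without www."""
    parsed = urlparse(url if "://" in url else f"https://{url}")
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "google.com":
        return False
    path = parsed.path.rstrip("/")
    # Match the path itself or anything nested under it, but not look-alike prefixes.
    return any(path == base or path.startswith(base + "/") for base in FLIGHTS_PATHS)
```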
Google Maps
The Maps handler covers place search, nearby search, directions, and review extraction. It pulls structured data from place cards including name, rating, address, phone number, and website.
Booking.com
Hotel search on Booking requires setting check-in and check-out dates, guest count, and room count across a multi-step form before results appear. The handler manages the full form sequence and extracts pricing and availability from the results.
Running the WebVoyager Benchmark
The Dataset
WebVoyager is a benchmark of 643 tasks across 15 websites. Each task is a natural language question paired with a starting URL. The dataset is at data/WebVoyager_data.jsonl.
The 15 sites are Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking.com, Cambridge Dictionary, Coursera, ESPN, GitHub, Google Flights, Google Maps, Google Search, Hugging Face, and Wolfram Alpha.
Running It
python webvoyager_runner.py \
--tasks-file data/WebVoyager_data.jsonl \
--output-dir results \
--max-steps 30 \
--model gemini-2.0-flash-exp
To run only specific websites:
python webvoyager_runner.py \
--tasks-file data/WebVoyager_data.jsonl \
--websites "Amazon" "GitHub" "Allrecipes"
To filter by task ID range, which is useful for resuming an interrupted run:
python webvoyager_runner.py \
--tasks-file data/WebVoyager_data.jsonl \
--task-id-start 100 \
--task-id-end 200
Several tasks in the dataset reference specific dates for hotel and flight searches. Those dates are in the past now. Update them before running, otherwise those tasks will fail immediately.
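Finding the affected tasks by hand is tedious, so a small scan helps. The sketch below flags any task whose question text mentions a year earlier than the current one. It assumes each JSONL line has `id` and `ques` fields, which should be checked against the actual file, and it only lists candidates for manual review: it will miss dates written without a year and will also flag non-travel questions that happen to mention an old year.

```python
import json
import re
from datetime import date

YEAR_RE = re.compile(r"\b(20\d{2})\b")


def tasks_with_past_years(lines, today=None, text_field="ques"):
    """Return tasks whose text mentions a year before the current year."""
    current_year = (today or date.today()).year
    flagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        years = [int(year) for year in YEAR_RE.findall(task.get(text_field, ""))]
        if any(year < current_year for year in years):
            flagged.append(task)
    return flagged
```

Opening data/WebVoyager_data.jsonl and passing the file object as `lines` works directly, since iterating a text file yields one line at a time.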
Parallel Execution
Running Multiple Tasks at Once
Default concurrency is 3 workers. Each browser instance uses roughly 300 to 500 MB of memory, so the right concurrency limit depends on available RAM.
python parallel_runner.py \
--tasks-file data/tasks.jsonl \
--output-dir results \
--concurrency 5 \
--max-steps 30
Failed tasks retry automatically with exponential backoff before being marked as failed.
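Bounded concurrency with retry and exponential backoff can be sketched with asyncio. This is not the code in parallel_runner.py, which may use a different mechanism; `run_task` is a hypothetical coroutine that executes one task, and the retry count and base delay are assumed values.

```python
import asyncio


async def run_with_retry(task, run_task, retries=2, base_delay=1.0, sleep=asyncio.sleep):
    """Try a task up to retries + 1 times, doubling the delay after each failure."""
    for attempt in range(retries + 1):
        try:
            result = await run_task(task)
            return {"task": task, "success": True, "result": result,
                    "attempts": attempt + 1}
        except Exception as exc:
            if attempt == retries:
                return {"task": task, "success": False, "error": str(exc),
                        "attempts": attempt + 1}
            await sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


async def run_all(tasks, run_task, concurrency=3, **retry_kwargs):
    """Run tasks with at most `concurrency` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(task):
        async with semaphore:
            return await run_with_retry(task, run_task, **retry_kwargs)

    return await asyncio.gather(*(worker(task) for task in tasks))
```

At 300 to 500 MB per browser, the semaphore size is effectively a memory budget: five workers means planning for roughly 1.5 to 2.5 GB for browsers alone.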
Splitting Large Files
For large task files, split them into batches before running:
python batch_processor.py split data/tasks.jsonl --batch-size 50
python batch_processor.py split data/tasks.jsonl --by-website
After multiple batch runs, merge the results into one file:
python batch_processor.py aggregate results/batch_1 results/batch_2 --output combined.json
Evaluation
After a benchmark run, use GPT-4V to evaluate whether answers are correct. The evaluator sends the task, the agent's final answer, and the last N screenshots to GPT-4V, which returns a SUCCESS or NOT SUCCESS verdict for each task.
python auto_eval.py \
--process_dir results \
--api_key your_openai_key \
--api_model gpt-4-vision-preview \
--max_attached_imgs 3
--max_attached_imgs controls how many screenshots from the end of the session get sent to the evaluator. Three is usually sufficient. For tasks where the visual end state matters a lot, increase it.
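Two details of this flow are easy to get wrong in a custom evaluator: "NOT SUCCESS" contains the substring "SUCCESS", and slicing the last N items misbehaves when N is zero. The helpers below are a hypothetical sketch of both, not code from auto_eval.py.

```python
def parse_verdict(reply):
    """Map an evaluator reply to True, False, or None when no verdict is present."""
    text = reply.upper()
    # Check the negative form first, because it contains the positive one.
    if "NOT SUCCESS" in text:
        return False
    if "SUCCESS" in text:
        return True
    return None


def last_screenshots(paths, count):
    """Return the final `count` screenshot paths, oldest first."""
    # paths[-0:] would return the whole list, so handle zero explicitly.
    return list(paths[-count:]) if count > 0 else []
```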
Logging
Each task writes a step-by-step log with the action taken, the element targeted, the reasoning, the result, and a timestamp. Logs are organized by website:
logs/
  Amazon/
    Amazon--task_1.json
    Amazon--task_2.json
  GitHub/
    GitHub--task_1.json
Cross-run statistics are tracked in task_tracker.json, which records success and failure counts per website across all runs. That file is the quickest way to see where the agent is having consistent trouble.
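Sorting the tracker by success rate puts the weakest sites at the top. The sketch below assumes task_tracker.json maps each website to an object with `success` and `failure` counts; that layout is a guess, so check the real file and adjust the keys before relying on it.

```python
def rank_by_success_rate(tracker):
    """Return (site, success_rate, total_runs) tuples, lowest success rate first."""
    rows = []
    for site, counts in tracker.items():
        succeeded = counts.get("success", 0)
        failed = counts.get("failure", 0)
        total = succeeded + failed
        if total:  # skip sites with no recorded runs
            rows.append((site, succeeded / total, total))
    return sorted(rows, key=lambda row: row[1])
```

Loading the file with `json.load` and passing the resulting dictionary in is enough to use it.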
Troubleshooting
The agent keeps repeating the same action. The stagnation detector should catch this before it gets too far. If it does not, check the step log to see which element it is targeting. Usually the element is not responding because a popup or overlay is sitting in front of it. Increase --max-steps if the task needs more room to recover.
Tasks fail on sites with login walls. Check that OTP_EMAIL and OTP_EMAIL_PASSWORD are set if the site sends verification codes. Also confirm that the agent is actually reaching the login page. Some sites redirect away from the --start-url you passed, so the agent may end up on a different page entirely.
CAPTCHA failures. Switch to --use-browserbase. If that is not an option, run without --headless so you can solve CAPTCHAs manually during the pause window.
Memory errors during parallel runs. Reduce --concurrency. Three workers run stably on most machines. Five is fine if you have around 8 GB of free memory. Going higher than that without plenty of free RAM will cause instability.
Benchmark tasks fail because of outdated dates. Update the dates directly in WebVoyager_data.jsonl before running. The flight and hotel tasks have hardcoded dates from when the benchmark was created.
OTP code not found. The default timeout is 60 seconds. If your email provider takes longer to deliver, pass --otp-timeout 120 to extend the wait.