DebloatProxy is a CGI or shell script that will modify a given website to remove bloat. Removing bloat in this case means that it will apply a configurable set of regex rules to remove certain HTML tags.
DebloatProxy works as a CGI script; despite its name, it does not function like an actual proxy.
Let's say you use DebloatProxy on a website. The URL you actually want to retrieve is given by the `site` parameter in the query string of the DebloatProxy URL. E.g., instead of calling the website https://example.org directly, you call your DebloatProxy installation with `?site=https://example.org` appended.
Note that you need to URL-encode any parameters that are passed on to the original URL. E.g., to fetch the website https://example.org?site=index you would write `%3F` rather than `?`, so that its own query string is not confused with DebloatProxy's parameters.
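For illustration, this is how the target URL can be percent-encoded in Python before it is put into the `site` parameter; it is a helper snippet, not part of DebloatProxy itself:

```python
from urllib.parse import quote

# The original URL, including its own query string.
target = "https://example.org?site=index"

# Encode every reserved character, so "?" becomes "%3F", "/" becomes "%2F", etc.
encoded = quote(target, safe="")
print(encoded)  # https%3A%2F%2Fexample.org%3Fsite%3Dindex
```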
As long as you continue browsing on a site served by DebloatProxy, you will keep using DebloatProxy: all links in the delivered pages are rewritten so that they, too, are served through the same DebloatProxy instance.
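A rough sketch of how such link rewriting can work with a regex; the `site` parameter is the one described above, while the proxy location and the helper name are placeholders, not DebloatProxy's actual code:

```python
import re
from urllib.parse import quote

PROXY = "debloat.py"  # placeholder for wherever your DebloatProxy instance lives

def rewrite_links(html: str) -> str:
    """Point every absolute href back at the same DebloatProxy instance."""
    def repl(match: re.Match) -> str:
        return 'href="{}?site={}"'.format(PROXY, quote(match.group(1), safe=""))
    return re.sub(r'href="(https?://[^"]+)"', repl, html)

print(rewrite_links('<a href="https://example.org/page">Page</a>'))
# <a href="debloat.py?site=https%3A%2F%2Fexample.org%2Fpage">Page</a>
```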
You can add further options as URL parameters by giving the full option name and setting that parameter to `true` (to activate the option) or `false` (to deactivate it). E.g., to call the website https://example.org?site=index with vector graphics enabled and classes disabled, you would add the corresponding option parameters, set to `true` and `false` respectively, to the DebloatProxy URL.
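One way to build such a URL in Python; `classes` is a documented option (see the option list below), while `vectorgraphics` only stands in for the long name of the vector-graphics option, which may differ in practice (the help page shows the real names):

```python
from urllib.parse import urlencode

params = {
    "site": "https://example.org?site=index",
    "vectorgraphics": "true",   # hypothetical long option name
    "classes": "false",
}
print("debloat.py?" + urlencode(params))
# debloat.py?site=https%3A%2F%2Fexample.org%3Fsite%3Dindex&vectorgraphics=true&classes=false
```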
DebloatProxy considers itself called from a shell when the environment variable `GATEWAY_INTERFACE` is not set.
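The check itself is simple; a minimal sketch of that distinction (illustrative, not DebloatProxy's actual source):

```python
import os
import sys

# A CGI-capable webserver sets GATEWAY_INTERFACE (e.g., "CGI/1.1"); a shell does not.
if os.environ.get("GATEWAY_INTERFACE"):
    # CGI mode: emit HTTP headers, then read the "site" parameter from the query string.
    print("Content-Type: text/html\n")
else:
    # Shell mode: the first positional argument is the file to process.
    if len(sys.argv) < 2:
        sys.exit("usage: debloat.py FILE")
    with open(sys.argv[1], encoding="utf-8") as fh:
        html = fh.read()
    # ... apply the debloating rules and write the result to stdout ...
```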
The first argument given to DebloatProxy when called from the shell is the file to be processed.
E.g., to debloat a file called `index.html`, you call `debloat.py index.html`.
The debloated result is sent to stdout. Therefore, use the `>` operator to redirect the output to a file. E.g., to debloat `index.html` and write the result to `debloated_index.html`, you call:

`debloat.py index.html > debloated_index.html`
Options are given as short and long options that enable or disable the corresponding processing step; the argument is `false` to disable or `true` to enable it. E.g., to enable vector graphics and disable classes on the file `index.html`, you would call:

`debloat.py -V true --classes=false index.html > debloated_index.html`
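If you want to run the same options over many files, the documented command line can be driven from a short script. A sketch, assuming an executable `debloat.py` in the current directory:

```python
import pathlib
import subprocess

# Debloat every .html file in the current directory with the options
# from the example above (-V true --classes=false).
for src in pathlib.Path(".").glob("*.html"):
    dst = src.with_name(f"debloated_{src.name}")
    with open(dst, "w") as out:
        subprocess.run(
            ["./debloat.py", "-V", "true", "--classes=false", str(src)],
            stdout=out,
            check=True,
        )
```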
- Removes `<svg>` tags, that is, vector graphics that are embedded in the website code (see the regex sketch after this list).
- Removes various tags from the HTML header, including stylesheets, preload and prefetch information, and vendor-specific tags (like Facebook, Twitter).
- Condenses `<picture>` tags, which include alternative images for different screen resolutions, and replaces each with only the primary image.
- Removes all classes, IDs, and names from all HTML elements. These are normally used for the execution of scripts and for CSS assignments.
- Removes all `<div>` and `<span>` tags.
- Removes whitespace such as tabs and empty lines, much of which is also generated by the preceding operations.
- Removes everything but `<p>`, `<h*>`, and `<a>` tags. This is the most extreme form of debloating and will most likely lead to false positives.
- Removes all `<style>` tags, which are used to define CSS properties inside the HTML file.
- Removes the `<footer>` tags. What is normally written in the footer seems rather irrelevant to the actual page content.
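To give an idea of what such rules look like, here is a rough regex sketch for two of the cases above, removing `<svg>` and `<style>` blocks. It illustrates the approach, not debloat.py's actual implementation:

```python
import re

# Illustrative rules only; DebloatProxy's real rule set is configurable and more involved.
RULES = [
    (re.compile(r"<svg\b.*?</svg>", re.IGNORECASE | re.DOTALL), ""),
    (re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL), ""),
]

def strip_bloat(html: str) -> str:
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html

print(strip_bloat("<p>text</p><style>p{color:red}</style><svg><circle r='1'/></svg>"))
# <p>text</p>
```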
DebloatProxy needs only packages from the Python standard library; they should normally be included in every standard Python distribution.
To install DebloatProxy for shell usage, download debloat.py and store it in a directory of your choice. If you use the `owncss` option, you will also need the files dark.css and light.css; they need to be in the same directory as the output files.
To install DebloatProxy for CGI usage, download debloat.py, dark.css, light.css.
Move these files to any subdirectory of the document root.
Make sure that the file `debloat.py` is executable, e.g., by executing `chmod u+x debloat.py` (depending on the webserver, more execution permissions might be required).
You might have to enable execution of .py files as a CGI.
When running Apache, this can be achieved by creating a file named `.htaccess` in the same directory as the other files, with the following content:

    Options +ExecCGI
    AddHandler cgi-script .py
Use your web browser to access debloat.py where you stored it on the webserver. Without any parameters, a help page will be shown.
Yes, that's what they tell you, and they are right. But we are facing adversarial websites: XML parsers are often not intended for untrusted content, and these sites probably do not contain valid XML/HTML anyway. So regexes it is, for now.