DebloatProxy is a CGI or shell script that will modify a given website to remove bloat. Removing bloat in this case means that it will apply a configurable set of regex rules to remove certain HTML tags.
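As a rough illustration of what a regex rule set could look like, the sketch below deletes whole elements by pattern. The rule names mirror option names used in this document, but the patterns themselves and the `RULES` and `apply_rules` names are assumptions made for this example, not the rules shipped with debloat.py.

```python
import re

# Hypothetical rule table: option name -> pattern whose matches are deleted.
# These patterns are illustrative assumptions, not debloat.py's actual rules.
RULES = {
    "script": re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL),
    "style": re.compile(r"<style\b.*?</style\s*>", re.IGNORECASE | re.DOTALL),
    "vector": re.compile(r"<svg\b.*?</svg\s*>", re.IGNORECASE | re.DOTALL),
}


def apply_rules(html, enabled):
    """Delete every match of each enabled rule from the HTML string."""
    for name in enabled:
        html = RULES[name].sub("", html)
    return html


page = '<p>Hi</p><script>alert(1)</script><svg width="1"><circle/></svg>'
print(apply_rules(page, ["script", "vector"]))  # <p>Hi</p>
```

Keeping the rules in a table keyed by option name makes it easy to switch individual rules on and off, which is what "configurable" suggests here.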
DebloatProxy only works with reasonably cooperative websites. In general, any website that does not function without JavaScript will not work, for example when you are required to accept or decline a cookie policy before the actual content loads.
DebloatProxy works as a CGI, that is, it does not function like a proxy.
Let's say you use DebloatProxy on the website https://debloatproxy.org.
The URL you actually want to call is given by the site parameter in the query of the URL.
E.g., instead of calling the website https://example.org, you call
https://debloatproxy.org?site=https://example.org
Note that you need to URL-encode any parameters passed to the original URL.
E.g., instead of calling the website https://example.org?site=index you would call
https://debloatproxy.org?site=https://example.org%3Fsite=index
(note the %3F rather than ?).
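The encoding step can be done with Python's standard library. The following is a minimal sketch, assuming the instance address from the example above; `PROXY` and `proxied_url` are names invented for illustration. It also encodes `=` as `%3D`, which is stricter than encoding only the `?`.

```python
from urllib.parse import quote

# Assumed instance address, taken from the example in this document.
PROXY = "https://debloatproxy.org"


def proxied_url(target):
    """Build the DebloatProxy URL for a target address.

    ':' and '/' stay readable; other reserved characters such as '?', '='
    and '&' are percent-encoded so they cannot be mistaken for parameters
    of the proxy itself.
    """
    return PROXY + "?site=" + quote(target, safe=":/")


print(proxied_url("https://example.org?site=index"))
# https://debloatproxy.org?site=https://example.org%3Fsite%3Dindex
```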
As long as you continue browsing on a site served by DebloatProxy, you will always use DebloatProxy.
All links will be replaced by the same ones, served through the same DebloatProxy instance.
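How such link rewriting might work can be sketched as follows. This is not the code used by debloat.py; the `HREF` pattern, the `rewrite_links` helper and the `PROXY` address are assumptions for illustration. Relative links are first resolved against the page address, then wrapped in a proxy URL.

```python
import re
from urllib.parse import quote, urljoin

PROXY = "https://debloatproxy.org"  # assumed instance address

# Matches double-quoted href attributes only; a simplification for this sketch.
HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html, base_url):
    """Point every double-quoted href at the proxy instead of the original site."""

    def replace(match):
        # Resolve relative links against the address of the current page.
        absolute = urljoin(base_url, match.group(1))
        return 'href="' + PROXY + "?site=" + quote(absolute, safe=":/") + '"'

    return HREF.sub(replace, html)


print(rewrite_links('<a href="/about?x=1">About</a>', "https://example.org/index.html"))
# <a href="https://debloatproxy.org?site=https://example.org/about%3Fx%3D1">About</a>
```

Note that this pattern would also rewrite href attributes on tags other than `<a>`, such as `<link>`.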
You can add further options as URL parameters by giving the full option name and setting that parameter to true (to activate the option) or false (to deactivate it).
E.g., for calling the website https://example.org?site=index with the vector option enabled and the classes option disabled, you would call
https://debloatproxy.org?site=https://example.org%3Fsite=index&vector=true&classes=false
DebloatProxy assumes it was called from a shell when the variable GATEWAY_INTERFACE is not set in the environment.
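The check itself is small. A minimal sketch, assuming a helper name (`running_as_cgi`) that is not taken from debloat.py:

```python
import os


def running_as_cgi(environ=None):
    """Return True when the CGI variable GATEWAY_INTERFACE is present."""
    env = os.environ if environ is None else environ
    return "GATEWAY_INTERFACE" in env


# Web servers set this variable for CGI scripts, typically to "CGI/1.1".
print(running_as_cgi({"GATEWAY_INTERFACE": "CGI/1.1"}))  # True
print(running_as_cgi({}))  # False
```

Passing the environment as a parameter keeps the function testable without modifying the real process environment.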
The first argument given to DebloatProxy when called from the shell is the file to be processed.
E.g., to debloat a file called index.html, you call
debloat.py index.html
The debloated result is sent to stdout.
Therefore, use the > operator to redirect the output to a file. E.g., to debloat index.html and write the result to debloated_index.html, you call
debloat.py index.html > debloated_index.html
Options are available in a short and a long form. Each takes the argument true to enable or false to disable the corresponding processing step.
E.g., to enable the vector option and disable the classes option on the file index.html, you would call
debloat.py -V true --classes=false index.html > debloated_index.html
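An option syntax of this shape can be modelled with `argparse`. The sketch below is an assumption about how such a command line could be parsed, not the parser used by debloat.py; only two of the options are shown, and `str_to_bool` and `build_parser` are hypothetical helpers.

```python
import argparse


def str_to_bool(text):
    """Convert the literal strings 'true' and 'false' to booleans."""
    lowered = text.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError("expected 'true' or 'false', got %r" % text)


def build_parser():
    parser = argparse.ArgumentParser(prog="debloat.py")
    # None means "not given on the command line", so a default can apply later.
    parser.add_argument("-V", "--vector", type=str_to_bool, default=None)
    parser.add_argument("-c", "--classes", type=str_to_bool, default=None)
    parser.add_argument("file", help="HTML file to process")
    return parser


args = build_parser().parse_args(["-V", "true", "--classes=false", "index.html"])
print(args.vector, args.classes, args.file)  # True False index.html
```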
%### split: Split regions
%Long name: split
%Short name: S
%Splits the document by its section elements, that is, <header>, <nav>, <main>, <footer>
Long name: vector
Short name: V
Removes <svg> tags, that is, vector graphics embedded in the website code.
Long name: script
Short name: s
Removes <script> tags. These are the tags that contain JavaScript code to be executed.
Long name: header
Short name: H
Removes various tags from the HTML header, including stylesheets, preload and prefetch information, and vendor-specific tags (like Facebook, Twitter).
Long name: picture
Short name: p
Condenses <picture> tags, which include alternative images for different screen resolutions, replacing each with only its primary image.
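One way to express this condensing step as a regex is sketched below. It assumes that the "primary image" is the fallback `<img>` inside the `<picture>` element; the `PICTURE` pattern and the `condense_pictures` name are illustrative and not taken from debloat.py.

```python
import re

# Captures the first <img> inside a <picture> element.
PICTURE = re.compile(
    r"<picture\b[^>]*>.*?(<img\b[^>]*>).*?</picture\s*>",
    re.IGNORECASE | re.DOTALL,
)


def condense_pictures(html):
    """Replace each <picture> element with the <img> tag it contains."""
    return PICTURE.sub(r"\1", html)


sample = '<picture><source srcset="a.webp"><img src="a.jpg" alt="A"></picture>'
print(condense_pictures(sample))  # <img src="a.jpg" alt="A">
```

A `<picture>` element without any `<img>` would confuse this pattern, so a real rule would need to guard against that case.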
Long name: classes
Short name: c
Removes all classes, IDs and names from all HTML elements.
These are normally used for the execution of scripts and for CSS assignments.
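A possible regex for this step is sketched below. The `ATTRS` pattern is an assumption for illustration, not the rule used by debloat.py, and it only handles quoted attribute values.

```python
import re

# Matches class, id and name attributes with single- or double-quoted values.
ATTRS = re.compile(
    r"""\s+(?:class|id|name)\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def strip_attrs(html):
    """Remove class, id and name attributes, keeping everything else."""
    return ATTRS.sub("", html)


print(strip_attrs('<p class="lead" id="intro">Text</p>'))  # <p>Text</p>
```

Because the pattern requires whitespace directly before the attribute name, attributes such as `data-id` are left alone.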
Long name: div
Short name: d
Removes all <div> and <span> tags.
These tags are used as "meaningless" structuring elements and only fulfill a function in conjunction with CSS and JavaScript.
Long name: whitespace
Short name: w
Removes whitespace such as tabs and empty lines, some of which is left over from prior operations.
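A whitespace cleanup of this kind could look like the following sketch. `squeeze_whitespace` is a hypothetical helper, and the exact rules (collapsing runs of spaces and tabs, dropping empty lines) are assumptions rather than the behaviour of debloat.py.

```python
import re


def squeeze_whitespace(html):
    """Collapse runs of spaces and tabs, trim lines, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in html.splitlines())
    return "\n".join(line for line in lines if line)


messy = "<p>\t a  b</p>\n\n   \n<p>c</p>\n"
print(squeeze_whitespace(messy))
```

A cleanup like this also changes the content of `<pre>` blocks, where whitespace is significant.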
Long name: novisible
Short name: n
Removes everything but <p>, <h*> and <a> tags.
This is the most extreme form of debloating and is likely to lead to false positives.
Long name: style
Short name: C
Removes all <style> tags. These are used to define CSS properties inside the HTML file.
Long name: footer
Short name: f
Removes <footer> tags. What is written in the footer is usually irrelevant to the actual page content.
Long name: button
Short name: b
Removes buttons from websites.
Note that this likely breaks the page when buttons and JavaScript, instead of regular <a> hyperlinks, are used to navigate to further pages.
However, DebloatProxy will in most cases break these links anyway, so the buttons can normally be safely removed.
DebloatProxy needs only Python standard packages, in particular:
They should normally be included in every standard Python distribution.
To install DebloatProxy for shell usage, download debloat.py and store it in a directory on your $PATH, e.g., ~/.bin/.
If you use the owncss option, you will also need the files dark.css and light.css.
They need to be in the same directory as the output files.
To install DebloatProxy for CGI usage, download debloat.py, dark.css and light.css.
Move these files to any subdirectory of the document root.
Make sure that the file debloat.py is executable, e.g., by executing chmod u+x debloat.py (depending on the web server, more execution permissions might be required).
You might have to enable execution of .py files as CGI scripts.
With Apache, this can be achieved by creating a file named .htaccess in the same directory as the other files with the following content:
Options +ExecCGI
AddHandler cgi-script .py
Use your web browser to access debloat.py where you stored it on the web server.
Without any parameters, a help page will be shown.
Yes, that's what they tell you. And they are right. But we are facing adversarial websites. XML parsers are often not intended for untrusted content. And these sites probably have no valid XML/HTML code. So regexes it is, for now.