I am often annoyed by bloat on websites, so I wrote DebloatProxy. It is a CGI or shell script that modifies a given website to remove bloat. Here, removing bloat means applying a configurable set of regex rules that remove or modify certain HTML tags and attributes.
DebloatProxy only works with reasonably cooperative websites. That is, the website has to be accessible without JavaScript and has to deliver its content without JavaScript. Websites that require you to accept or decline cookie spyware before loading the actual content will not work.
Despite the name, DebloatProxy runs as a CGI script, not as a literal HTTP proxy.
Let's say DebloatProxy is installed as https://tolledomain.ch/debloat.py and you want to call the website https://example.org.
Then you pass the URL to the CGI with the site parameter.
I.e., you would call the URL
https://tolledomain.ch/debloat.py?site=https://example.org
Note that you need to URL-encode any parameters passed to the original URL.
E.g., instead of calling the website https://example.org?site=index you would call
https://tolledomain.ch/debloat.py?site=https://example.org%3Fsite=index
(note the %3F rather than ?).
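The encoding step can also be done programmatically. The following is a minimal sketch, not part of DebloatProxy itself: the helper name `proxy_url` is hypothetical, and the installation URL is the example one used above. It encodes `=` as well as `?`, which is stricter than the hand-written example but decodes to the same target URL.

```python
from urllib.parse import quote

# Example installation URL from the text above (an assumption for this sketch).
PROXY = "https://tolledomain.ch/debloat.py"


def proxy_url(target: str) -> str:
    """Return the DebloatProxy URL for `target` with the target percent-encoded."""
    # safe=":/" keeps the scheme and path readable; "?", "=", "&" and "#" are encoded.
    return f"{PROXY}?site={quote(target, safe=':/')}"


print(proxy_url("https://example.org?site=index"))
```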
DebloatProxy rewrites all links so that they point back to itself.
Once you start browsing through DebloatProxy, you will not leave it.
Further options (as detailed below) can be added as URL parameters by giving the full option name and setting it to true (to activate) or false (to deactivate).
E.g., to call https://example.org?site=index with the vector option enabled and the classes option disabled:
https://tolledomain.ch/debloat.py?site=https://example.org%3Fsite=index&vector=true&classes=false
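To illustrate how such a request could be taken apart on the server side, here is a minimal sketch. It assumes the options arrive as ordinary `&`-separated query parameters; the function name `parse_request` is hypothetical and this is not necessarily how debloat.py does it.

```python
from urllib.parse import parse_qs


def parse_request(query_string: str):
    """Split a query string into the target site and a dict of boolean options."""
    params = parse_qs(query_string)  # also undoes the percent-encoding
    site = params.get("site", [None])[0]
    options = {}
    for name, values in params.items():
        value = values[0].lower()
        if name != "site" and value in ("true", "false"):
            options[name] = value == "true"
    return site, options


site, options = parse_request(
    "site=https://example.org%3Fsite%3Dindex&vector=true&classes=false"
)
print(site, options)
```

In a CGI context the query string would typically come from the `QUERY_STRING` environment variable.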
DebloatProxy will run in shell mode when the environment variable GATEWAY_INTERFACE is not set.
The first argument given to DebloatProxy is the file to be processed; the output is sent to stdout.
E.g., to debloat a file called index.html, you call
$ debloat.py index.html
Use the > operator to redirect the output to a file.
E.g., to debloat index.html and write the result to debloat_index.html, you call
$ debloat.py index.html > debloat_index.html
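The switch between CGI and shell mode described above can be sketched as follows. This is only an illustration of the GATEWAY_INTERFACE check; the function name `run_mode` and the returned strings are assumptions, not taken from debloat.py.

```python
import os


def run_mode(environ=None) -> str:
    """Return "cgi" when GATEWAY_INTERFACE is set, otherwise "shell"."""
    env = os.environ if environ is None else environ
    return "cgi" if "GATEWAY_INTERFACE" in env else "shell"


# A web server sets GATEWAY_INTERFACE (e.g. to "CGI/1.1") for CGI scripts.
print(run_mode({"GATEWAY_INTERFACE": "CGI/1.1"}))
print(run_mode({}))
```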
Options are given as short or long options, followed by the argument true to enable or false to disable.
E.g., to enable the vector option and disable the classes option for the file index.html, call
$ debloat.py -V true --classes=false index.html
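A command line of this shape can be parsed with the standard `argparse` module. The sketch below covers only the two options from the example and is an assumption about how such parsing might look; the helper `str2bool` and the default of `None` (meaning "not given") are hypothetical.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert the literal strings "true"/"false" into a bool."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="debloat.py")
    # Only two of the documented options are shown here.
    parser.add_argument("-V", "--vector", type=str2bool, default=None)
    parser.add_argument("-c", "--classes", type=str2bool, default=None)
    parser.add_argument("file", help="HTML file to process")
    return parser


args = build_parser().parse_args(["-V", "true", "--classes=false", "index.html"])
print(args.vector, args.classes, args.file)
```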
Long name: split
Short name: S
Splits the document by its sectioning elements, that is, <header>, <nav>, <main>, and <footer>.
Long name: vector
Short name: V
Removes <svg> tags, that is, vector graphics embedded in the website code.
Long name: script
Short name: s
Removes <script> tags. These are the tags that contain and execute JavaScript code.
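As an illustration of what such a regex rule can look like, here is a minimal sketch for the script option. The pattern is an assumption, not the rule DebloatProxy actually ships, and like any regex approach it can be fooled by unusual markup.

```python
import re

# Match an opening <script ...> tag, its content, and the closing tag.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL
)


def remove_scripts(html: str) -> str:
    """Drop all <script> elements, including inline code."""
    return SCRIPT_RE.sub("", html)


page = '<p>a</p><script src="x.js"></script><SCRIPT>\nalert(1)\n</SCRIPT><p>b</p>'
print(remove_scripts(page))
```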
Long name: header
Short name: H
Removes various tags from the HTML header, including stylesheets, preload and prefetch information, and vendor-specific tags (like Facebook, Twitter).
Long name: picture
Short name: p
Condenses <picture> tags, which offer alternative images for different screen resolutions, and replaces each of them with only the primary image.
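A possible regex for this step is sketched below. It assumes that the "primary image" is the `<img>` fallback inside the `<picture>` element; that interpretation and the pattern itself are assumptions, and a `<picture>` without an `<img>` would not be handled correctly.

```python
import re

# Capture the first <img> inside each <picture> and discard everything else.
PICTURE_RE = re.compile(
    r"<picture\b[^>]*>.*?(<img\b[^>]*>).*?</picture\s*>",
    re.IGNORECASE | re.DOTALL,
)


def condense_pictures(html: str) -> str:
    """Replace every <picture> element with the <img> it contains."""
    return PICTURE_RE.sub(r"\1", html)


snippet = (
    '<picture><source srcset="a.webp" media="(min-width: 800px)">'
    '<img src="a.jpg" alt="A"></picture>'
)
print(condense_pictures(snippet))
```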
Long name: classes
Short name: c
Removes all classes, IDs and names from all HTML elements.
These are normally used for the execution of scripts and for CSS assignments.
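The attribute removal can be sketched with a single substitution. The pattern below is an assumption for illustration and not necessarily the rule used by DebloatProxy; since it works on raw text, it would also strip something like ` id=5` appearing in visible page text.

```python
import re

# Match class=..., id=... or name=... with double-quoted, single-quoted or bare values.
ATTR_RE = re.compile(
    r"""\s+(?:class|id|name)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def remove_classes(html: str) -> str:
    """Strip class, id and name attributes."""
    return ATTR_RE.sub("", html)


print(remove_classes('<div class="a b" id=\'main\'><a name=top href="/x">t</a></div>'))
```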
Long name: div
Short name: d
Removes all <div> and <span> tags.
These tags are used as "meaningless" structuring elements and only fulfill a function in conjunction with CSS and JavaScript.
Long name: whitespace
Short name: w
Removes whitespace such as tabs and empty lines, much of which is left over from prior operations.
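A whitespace cleanup of this kind could look like the sketch below. The exact rules are assumptions; note that this simple version would also alter the contents of `<pre>` blocks.

```python
import re


def squeeze_whitespace(html: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines."""
    html = re.sub(r"[ \t]+", " ", html)    # tabs and repeated spaces -> one space
    html = re.sub(r" ?\n ?", "\n", html)   # no spaces around line breaks
    html = re.sub(r"\n{2,}", "\n", html)   # remove empty lines
    return html.strip()


print(squeeze_whitespace("<p>a</p>\n\n\t\n  <p>b\t c</p>\n"))
```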
Long name: novisible
Short name: n
Removes everything but <p>, <h*> and <a> tags.
This is the most extreme form of debloating and is likely to lead to false positives.
Long name: style
Short name: C
Removes all <style> tags. These are used to define CSS properties inside the HTML file.
Long name: footer
Short name: f
Removes the <footer> tags. What is normally written in the footer seems rather irrelevant to the actual page content.
Long name: button
Short name: b
Removes buttons from websites.
Note that this likely breaks the page when buttons and JavaScript are used instead of regular <a> hyperlinks to navigate to further pages.
However, DebloatProxy will in most cases break these links anyway, so the buttons can normally be removed safely.
Long name: onclick
Short name: o
Removes actions on elements that happen on a click or any other action, in particular onclick, oncontextmenu, ondblclick, onmousedown, onmouseenter, onmouseleave, onmousemove, onmouseout, onmouseover, onmouseup.
Since JS is likely deactivated, these actions are without meaning.
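The handler removal can be sketched by building one pattern from the list of event names. This is an illustrative assumption rather than the actual rule in debloat.py, and it only handles quoted attribute values.

```python
import re

EVENTS = [
    "onclick", "oncontextmenu", "ondblclick", "onmousedown", "onmouseenter",
    "onmouseleave", "onmousemove", "onmouseout", "onmouseover", "onmouseup",
]

# One alternation over all event names, followed by a quoted value.
HANDLER_RE = re.compile(
    r"\s+(?:" + "|".join(EVENTS) + r""")\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)


def remove_handlers(html: str) -> str:
    """Strip inline mouse event handlers from all tags."""
    return HANDLER_RE.sub("", html)


print(remove_handlers('<a href="/x" onclick="go()" onMouseOver=\'hi()\'>t</a>'))
```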
Long name: jsvoid
Short name: v
javascript:void(0) links are used when a link has a listener in JavaScript, but performs no other action.
These links will be entirely removed.
Long name: adaptlinks
Short name: a
Adapts the links so that they link back to DebloatProxy.
If this option is not activated, only absolute URLs will work; relative and absolute paths will link to non-existent paths on the web server where debloat.py is running.
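The link adaptation can be sketched as follows: each link is first resolved against the URL of the page it came from and then wrapped into a DebloatProxy URL. The names `adapt_links` and `PROXY` are hypothetical, the installation URL is the example one from above, and only double-quoted href attributes are handled.

```python
import re
from urllib.parse import quote, urljoin

# Example installation URL (an assumption for this sketch).
PROXY = "https://tolledomain.ch/debloat.py"
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def adapt_links(html: str, base_url: str) -> str:
    """Rewrite every href so that it points back to the proxy."""

    def rewrite(match: re.Match) -> str:
        # Resolve relative and absolute paths against the original page URL.
        absolute = urljoin(base_url, match.group(1))
        return f'href="{PROXY}?site={quote(absolute, safe=":/")}"'

    return HREF_RE.sub(rewrite, html)


print(adapt_links('<a href="/about?x=1">a</a>', "https://example.org/blog/"))
```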
DebloatProxy needs only standard Python packages which you find in every Python distribution, in particular:
To install DebloatProxy for shell usage, download debloat.py and store it in a directory in your $PATH, e.g., ~/.bin/.
If you use the owncss option, you will also need the files dark.css and light.css.
They need to be in the same directory as the output files.
To install DebloatProxy for CGI usage, download debloat.py, dark.css, and light.css.
Move the files to any subdirectory of the document root.
Make sure that debloat.py is executable by running chmod u+x debloat.py (group or other execute permissions might be required).
You might have to enable execution of .py files as CGI scripts.
With Apache, this can be achieved by creating a file named .htaccess in the same directory as the other files with the following content:
Options +ExecCGI
AddHandler cgi-script .py
Use your web browser to access debloat.py where you stored it on the web server.
Without any parameters, a help page will be shown.
That's what they tell you. And they are right. However, facing adversarial websites, there might be no valid HTML/XML content, and XML parsers are often not intended for untrusted content. So regexes it is, for now.