DebloatProxy

I am often annoyed by bloat on websites, so I wrote DebloatProxy. It is a CGI or shell script that modifies a given website to remove bloat. Here, removing bloat means applying a configurable set of regex rules that remove or modify certain HTML tags and attributes.

DebloatProxy only works with kind-of cooperative websites. That is, the website has to be accessible, and has to deliver its content, without JavaScript. Websites that require you to accept or decline cookie spyware before loading the actual content will not work.

Usage (web)

Despite the name, DebloatProxy runs as a CGI and not a literal HTTP proxy. Let's say DebloatProxy is installed as https://tolledomain.ch/debloat.py and you want to call the website https://example.org. Then you pass the URL with the site parameter to the CGI. I.e., you would call the URL

https://tolledomain.ch/debloat.py?site=https://example.org

Note that you need to do URL encoding for any parameters passed to the original URL. E.g., instead of calling the website https://example.org?site=index you would call

https://tolledomain.ch/debloat.py?site=https://example.org%3Fsite=index

(note the %3F rather than ?). DebloatProxy rewrites all links so that they point back to itself. Once you have started browsing through DebloatProxy, you will not leave it.
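
The encoding step can be left to Python's urllib.parse. The following is a minimal sketch; proxy_url is a made-up helper for illustration and not part of DebloatProxy. It percent-encodes the whole target URL, which is stricter than the hand-written example above (it also encodes : and /), and it assumes the CGI decodes the query string in the standard way.

```python
from urllib.parse import parse_qs, quote, urlsplit

def proxy_url(base, site):
    """Build a DebloatProxy-style URL with the target URL percent-encoded."""
    # safe="" encodes every reserved character, including ':', '/', '?' and '='.
    return base + "?site=" + quote(site, safe="")

url = proxy_url("https://tolledomain.ch/debloat.py",
                "https://example.org?site=index")

# Decoding the query string the standard way recovers the original target.
decoded = parse_qs(urlsplit(url).query)["site"][0]
print(url)
print(decoded)
```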

Further options (as detailed below) can be added as URL parameters by giving the full option name and setting it to true (to activate) or false (to deactivate). E.g., to call https://example.org?site=index with the vector option enabled and the classes option disabled:

https://tolledomain.ch/debloat.py?site=https://example.org%3Fsite=index&vector=true&classes=false
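
On the receiving side, a CGI could read such parameters roughly as follows. This is only a sketch under assumptions: the options arrive as ordinary &-separated query parameters, parse_request is a hypothetical name, and DebloatProxy's real parsing may differ.

```python
from urllib.parse import parse_qs

def parse_request(query_string):
    """Split a query string into the target site and boolean option flags."""
    params = parse_qs(query_string)
    site = params.pop("site", [None])[0]
    # Only the literal values "true" and "false" are accepted; anything
    # else is ignored. If an option is repeated, the last value wins.
    options = {name: values[-1] == "true"
               for name, values in params.items()
               if values[-1] in ("true", "false")}
    return site, options

site, options = parse_request(
    "site=https://example.org%3Fsite=index&vector=true&classes=false")
print(site, options)
```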

Usage (shell)

DebloatProxy will run in shell mode when the environment variable GATEWAY_INTERFACE is not set. The first argument given to DebloatProxy is the file to be processed, the output is sent to stdout. E.g., to debloat a file called index.html, you call

$ debloat.py index.html

Since the output goes to stdout, use the > operator to redirect it to a file. E.g., to debloat index.html and write the result to debloat_index.html, you call

$ debloat.py index.html > debloat_index.html
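
The switch between shell mode and CGI mode can be pictured like this. It is a minimal sketch, not the actual code of debloat.py; running_as_cgi is a made-up name.

```python
import os

def running_as_cgi(environ=os.environ):
    """Return True when the script was started by a web server via CGI."""
    # Web servers set GATEWAY_INTERFACE (typically "CGI/1.1") for CGI scripts;
    # in a normal shell session the variable is absent.
    return "GATEWAY_INTERFACE" in environ

mode = "cgi" if running_as_cgi() else "shell"
print(mode)
```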

Options can be given in their short or long form. Each option takes the argument true to enable it or false to disable it. E.g., to enable the vector option and disable the classes option on the file index.html, call

$ debloat.py -V true --classes=false index.html
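
Since getopt is among the modules DebloatProxy uses, the command line above could be parsed along these lines. This is a sketch covering only two of the options; parse_args and the NAMES table are made up for illustration.

```python
import getopt

# Map every accepted flag to its long option name (two options shown).
NAMES = {"-V": "vector", "--vector": "vector",
         "-c": "classes", "--classes": "classes"}

def parse_args(argv):
    """Return (options, files) for arguments such as -V true --classes=false."""
    # "V:c:" and the trailing "=" mean that each option requires an argument.
    opts, files = getopt.getopt(argv, "V:c:", ["vector=", "classes="])
    options = {NAMES[flag]: value == "true" for flag, value in opts}
    return options, files

options, files = parse_args(["-V", "true", "--classes=false", "index.html"])
print(options, files)
```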

Options

split: Split regions

Long name: split
Short name: S
Splits the document by its section elements, that is, <header>, <nav>, <main> and <footer>.

vector: Remove vector graphics

Long name: vector
Short name: V
Removes <svg> tags, that is, vector graphics which are embedded in the website code.

script: Remove JavaScript

Long name: script
Short name: s
Removes <script> tags. These are the tags that contain and execute JavaScript code.
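
A regex rule of this kind might look like the following sketch. The function name remove_element is made up and the pattern is an assumption, not necessarily the rule DebloatProxy ships. The same shape would also fit other paired elements that are dropped together with their content.

```python
import re

def remove_element(html, tag):
    """Delete every <tag ...>...</tag> element together with its content."""
    name = re.escape(tag)
    # DOTALL lets the content span several lines; the lazy ".*?" stops at
    # the first closing tag, so nested elements of the same name are not
    # handled correctly.
    pattern = re.compile(r"<%s\b[^>]*>.*?</%s\s*>" % (name, name),
                         re.IGNORECASE | re.DOTALL)
    return pattern.sub("", html)

page = '<p>a</p><script src="x.js"></script><SCRIPT>alert(1)</SCRIPT><p>b</p>'
cleaned = remove_element(page, "script")
print(cleaned)
```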

header: Clean up header

Long name: header
Short name: H
Removes various tags from the HTML header, including stylesheets, preload and prefetch information, and vendor-specific tags (like Facebook, Twitter).
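
As an illustration, two such header rules are sketched below. Which tags DebloatProxy really removes is not spelled out here, so the selection (stylesheet, preload and prefetch links; meta tags prefixed with og:, fb: or twitter:) and the name clean_head are assumptions.

```python
import re

# <link> tags whose rel value starts with stylesheet, preload or prefetch.
LINK_RULE = re.compile(
    r'<link\b[^>]*\brel=["\']?(?:stylesheet|preload|prefetch)\b[^>]*>',
    re.IGNORECASE)
# <meta> tags carrying Open Graph, Facebook or Twitter properties.
META_RULE = re.compile(
    r'<meta\b[^>]*\b(?:property|name)=["\']?(?:og|fb|twitter):[^>]*>',
    re.IGNORECASE)

def clean_head(html):
    """Drop the matching <link> and <meta> tags, keep everything else."""
    return META_RULE.sub("", LINK_RULE.sub("", html))

head = ('<head><title>T</title><link rel="stylesheet" href="a.css">'
        '<meta property="og:title" content="T">'
        '<meta name="twitter:card" content="s"><meta charset="utf-8"></head>')
cleaned_head = clean_head(head)
print(cleaned_head)
```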

picture: Reduce picture tags to one img tag

Long name: picture
Short name: p
Condenses <picture> tags, which include alternative images for different screen resolutions, and replaces each of them with only the primary image.
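
A possible rule is sketched below. It assumes that the primary image is the fallback <img> inside the <picture> element and that every <picture> contains one; condense_pictures is a made-up name.

```python
import re

# Capture the first <img> inside a <picture> element and drop the rest.
PICTURE_RULE = re.compile(
    r"<picture\b[^>]*>.*?(<img\b[^>]*>).*?</picture\s*>",
    re.IGNORECASE | re.DOTALL)

def condense_pictures(html):
    """Replace each <picture> element by the <img> tag it contains."""
    return PICTURE_RULE.sub(r"\1", html)

snippet = ('<picture><source srcset="big.webp" media="(min-width: 800px)">'
           '<img src="small.jpg" alt="x"></picture>')
condensed = condense_pictures(snippet)
print(condensed)
```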

classes: Remove all class, ID and name information

Long name: classes
Short name: c
Removes all classes, IDs and names from all HTML elements. These are normally used for the execution of scripts and for CSS assignments.
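
One way to express this with regexes is sketched below; remove_identifiers is a hypothetical name. The sketch only touches attributes inside opening tags and assumes that attribute values contain no > character.

```python
import re

# An opening tag, e.g. <p class="x">. Closing tags are skipped.
TAG_RULE = re.compile(r"<[a-zA-Z][^>]*>")
# A class, id or name attribute with a quoted or unquoted value.
ATTR_RULE = re.compile(
    r'\s+(?:class|id|name)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)',
    re.IGNORECASE)

def remove_identifiers(html):
    """Strip class, id and name attributes from every opening tag."""
    return TAG_RULE.sub(lambda m: ATTR_RULE.sub("", m.group(0)), html)

stripped = remove_identifiers('<p class="lead big" id=intro>Set id = "x" here</p>')
print(stripped)
```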

div: Remove div and span tags

Long name: div
Short name: d
Removes all <div> and <span> tags. These tags are used as "meaningless" structuring elements and fulfill a function only in conjunction with CSS and JavaScript.
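
A sketch of such a rule follows. It assumes that only the tags themselves are dropped while their content stays in place; unwrap_div_span is a made-up name.

```python
import re

# Opening and closing <div>/<span> tags, with or without attributes.
DIV_SPAN_RULE = re.compile(r"</?(?:div|span)\b[^>]*>", re.IGNORECASE)

def unwrap_div_span(html):
    """Remove <div> and <span> tags but keep whatever they enclose."""
    return DIV_SPAN_RULE.sub("", html)

unwrapped = unwrap_div_span('<div class="box"><p>Hi <span>there</span></p></div>')
print(unwrapped)
```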

whitespace: Clean up non-visible whitespaces

Long name: whitespace
Short name: w
Removes whitespace such as tabs and empty lines. Much of it is left behind by the preceding operations.
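
The clean-up could be done in three small steps, as in this sketch (squeeze_whitespace is a made-up name). Note that this simple version would also collapse whitespace inside <pre> blocks.

```python
import re

def squeeze_whitespace(html):
    """Collapse runs of blanks and drop empty lines."""
    html = re.sub(r"[ \t]+", " ", html)    # runs of spaces/tabs -> one space
    html = re.sub(r" ?\n ?", "\n", html)   # no blanks around line breaks
    html = re.sub(r"\n{2,}", "\n", html)   # no empty lines
    return html.strip()

squeezed = squeeze_whitespace("<p>a</p>\n\n\t \n  <p>b\t\tc</p>  \n")
print(squeezed)
```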

novisible: Let only readable visible paragraphs remain

Long name: novisible
Short name: n
Removes everything but <p>, <h*> and <a> tags. This is the most extreme form of debloating and is likely to produce false positives, that is, to remove content you actually wanted to keep.
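
Instead of deleting everything else, the same effect can be had by collecting the elements to keep. This is a sketch; keep_readable is a made-up name, and joining the kept elements with line breaks is an assumption.

```python
import re

# A complete <p>, <h1>..<h6> or <a> element; \1 matches the same tag name.
KEEP_RULE = re.compile(r"<(p|h[1-6]|a)\b[^>]*>.*?</\1\s*>",
                       re.IGNORECASE | re.DOTALL)

def keep_readable(html):
    """Return only the paragraphs, headings and links, one per line."""
    return "\n".join(m.group(0) for m in KEEP_RULE.finditer(html))

page = ('<nav><ul><li>Menu</li></ul></nav><h1>Title</h1>'
        '<div><p>Text with <a href="/x">link</a>.</p></div>'
        '<a href="/y">More</a>')
readable = keep_readable(page)
print(readable)
```

Links inside a kept paragraph stay where they are; only links outside of paragraphs and headings become separate lines.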

style: Remove style tags

Long name: style
Short name: C
Removes all <style> tags. These are used to define CSS properties inside the HTML file.

footer: Remove the footer

Long name: footer
Short name: f
Removes the <footer> tags. What is normally written in the footer seems rather irrelevant to the actual page content.

button: Remove buttons

Long name: button
Short name: b
Removes buttons from websites. Note that this likely breaks the page if it uses buttons and JavaScript instead of regular <a> hyperlinks to navigate to further pages. However, DebloatProxy will in most cases break these links anyway, so they can normally be safely removed.

onclick: Remove onclick items and similar

Long name: onclick
Short name: o
Removes actions on elements that happen on a click or any other action, in particular onclick, oncontextmenu, ondblclick, onmousedown, onmouseenter, onmouseleave, onmousemove, onmouseout, onmouseover, onmouseup. Since JS is likely deactivated, these actions are without meaning.
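
The handler list above can be turned into a single pattern, as in this sketch. remove_event_handlers is a made-up name, and the pattern assumes quoted attribute values. It is applied to the whole text, so prose that happens to look like such an attribute would be removed as well.

```python
import re

EVENTS = ["onclick", "oncontextmenu", "ondblclick", "onmousedown",
          "onmouseenter", "onmouseleave", "onmousemove", "onmouseout",
          "onmouseover", "onmouseup"]
# One of the event attributes followed by a quoted value.
EVENT_RULE = re.compile(
    r'\s+(?:%s)\s*=\s*(?:"[^"]*"|\'[^\']*\')' % "|".join(EVENTS),
    re.IGNORECASE)

def remove_event_handlers(html):
    """Strip inline event handler attributes such as onclick."""
    return EVENT_RULE.sub("", html)

link = '<a href="/x" onclick="track(\'x\')" onMouseOver="hi()">x</a>'
plain = remove_event_handlers(link)
print(plain)
```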

jsvoid: Remove javascript:void(0) links

Long name: jsvoid
Short name: v
javascript:void(0) links are used when a link has a listener in JavaScript, but performs no other action. These links will be entirely removed.
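
A sketch of such a rule is given below. It reads "entirely removed" as dropping the whole <a> element including its link text, which is an assumption; remove_jsvoid_links is a made-up name.

```python
import re

# An <a> element whose href is javascript:void(0), with optional blanks
# and an optional trailing semicolon inside the attribute value.
JSVOID_RULE = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*["\']\s*javascript:\s*void\(\s*0\s*\);?\s*["\']'
    r'[^>]*>.*?</a\s*>',
    re.IGNORECASE | re.DOTALL)

def remove_jsvoid_links(html):
    """Delete links that only exist as hooks for JavaScript listeners."""
    return JSVOID_RULE.sub("", html)

menu = '<a href="javascript:void(0)">Menu</a><a href="/home">Home</a>'
remaining = remove_jsvoid_links(menu)
print(remaining)
```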

adaptlinks: Adapt links

Long name: adaptlinks
Short name: a
Adapts the links so that they link back to DebloatProxy. If this action is not activated, only links with absolute URLs will work. Links with relative or absolute paths will point to non-existent paths on the web server where debloat.py is running.
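
The rewriting can be sketched with urljoin and quote from urllib.parse. The name adapt_links, the proxy path /debloat.py and the restriction to double-quoted href values are assumptions made for this example.

```python
import re
from urllib.parse import quote, urljoin

# The href attribute of an <a> tag, split into prefix, URL and closing quote.
HREF_RULE = re.compile(r'(<a\b[^>]*\bhref\s*=\s*")([^"]*)(")', re.IGNORECASE)

def adapt_links(html, page_url, proxy="/debloat.py"):
    """Point every link back to the proxy, resolving relative URLs first."""
    def rewrite(match):
        # urljoin turns relative and absolute paths into full URLs.
        absolute = urljoin(page_url, match.group(2))
        target = proxy + "?site=" + quote(absolute, safe="")
        return match.group(1) + target + match.group(3)
    return HREF_RULE.sub(rewrite, html)

adapted = adapt_links('<a href="/about?x=1">About</a>',
                      "https://example.org/news/index.html")
print(adapted)
```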

Installation

DebloatProxy needs only standard Python packages which you find in every Python distribution, in particular:

getopt
os
re
sys
urllib

To install DebloatProxy for shell usage, download debloat.py and store it in a directory in your $PATH, e.g., ~/.bin/. If you use the owncss option, you will also need the files dark.css and light.css. They need to be in the same directory as the output files.

To install DebloatProxy for CGI usage, download debloat.py, dark.css and light.css. Move the files to any subdirectory of the document root. Make sure that debloat.py is executable by running chmod u+x debloat.py (depending on your server, group or other execute permissions might be required as well).

You might have to enable execution of .py files as a CGI. This can be achieved (running Apache) by creating a file named .htaccess in the same directory as the other files with the following content:

Options +ExecCGI
AddHandler cgi-script .py

Use your web browser to access debloat.py where you stored it on the web server. Without any parameters, a help page will be shown.

But one should not parse HTML with regex?!

That's what they tell you. And they are right. However, facing adversarial websites, there might be no valid HTML/XML content, and XML parsers are often not intended for untrusted content. So regexes it is, for now.