
DebloatProxy

DebloatProxy is a CGI or shell script that will modify a given website to remove bloat. Removing bloat in this case means that it will apply a configurable set of regex rules to remove certain HTML tags.
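As a rough illustration of what a configurable set of regex rules can look like, the following sketch keeps a table of named rules and applies only the enabled ones. The table `RULES`, the function `debloat`, and the two patterns are assumptions made for this example; they are not taken from debloat.py.

```python
import re

# Hypothetical rule table: option name -> (compiled pattern, replacement).
RULES = {
    "script": (re.compile(r"<script\b[^>]*>.*?</script\s*>", re.I | re.S), ""),
    "style": (re.compile(r"<style\b[^>]*>.*?</style\s*>", re.I | re.S), ""),
}

def debloat(html, enabled):
    """Apply every rule whose name is switched on in `enabled`."""
    for name, (pattern, replacement) in RULES.items():
        if enabled.get(name, False):
            html = pattern.sub(replacement, html)
    return html
```

Calling `debloat(page, {"script": True})` would strip script elements only and leave style elements in place.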

DebloatProxy only works with reasonably cooperative websites. In general, any website that does not function without JavaScript will not work, for example a site that requires you to accept or decline a cookie policy before it loads the actual content.

Usage (web)

DebloatProxy works as a CGI, that is, it does not function like a proxy. Let's say DebloatProxy is installed on the website https://debloatproxy.org. The URL you actually want to call is given by the site parameter in the query of the URL. E.g., instead of calling the website https://example.org, you call

https://debloatproxy.org?site=https://example.org

Note that you need to do URL encoding for any parameters passed to the original URL. E.g., instead of calling the website https://example.org?site=index you would call

https://debloatproxy.org?site=https://example.org%3Fsite=index

(note the %3F rather than ?). As long as you continue browsing on a site served by DebloatProxy, you will always use DebloatProxy. All links are rewritten so that their targets are served through the same DebloatProxy instance.
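The encoding and link rewriting described above could be sketched as follows. This is a simplified illustration using Python's `urllib.parse`; the names `PROXY`, `proxied_url` and `rewrite_links` and the `href` pattern are assumptions for the example, not DebloatProxy's actual code. Note that `quote` also encodes `=` as `%3D`, which is stricter than the example above but decodes to the same target URL.

```python
import re
from urllib.parse import quote, urljoin

# Assumed address of the DebloatProxy instance (taken from the example above).
PROXY = "https://debloatproxy.org"

def proxied_url(site):
    """Return the DebloatProxy URL for `site`, percent-encoding reserved characters."""
    # ':' and '/' stay readable; '?', '&', '=' and '#' are encoded.
    return f"{PROXY}?site={quote(site, safe=':/')}"

# Simplified pattern (an assumption): only double-quoted href attributes.
HREF = re.compile(r'href="([^"]*)"')

def rewrite_links(html, base):
    """Point every matched href at the proxy, resolving relative links against `base`."""
    def replace(match):
        target = urljoin(base, match.group(1))
        return f'href="{proxied_url(target)}"'
    return HREF.sub(replace, html)
```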

You can add further options as URL parameters by giving the full option name and setting that parameter to true (to activate the option) or false (to deactivate it). E.g., for calling the website https://example.org?site=index with the vector option enabled and the classes option disabled, you would call

https://debloatproxy.org?site=https://example.org%3Fsite=index&vector=true&classes=false
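On the server side, a CGI script could read such parameters from the query string roughly as shown below. This is a sketch under assumptions: it expects the options to be separated by `&` as in a standard query string, and the function name `read_options` is made up for this example.

```python
from urllib.parse import parse_qs

def read_options(query_string):
    """Split a query string into the target site and a dict of boolean options."""
    # In a CGI script the query string would come from os.environ["QUERY_STRING"].
    params = parse_qs(query_string)
    site = params.get("site", [""])[0]
    options = {
        name: values[-1].lower() == "true"
        for name, values in params.items()
        if name != "site" and values[-1].lower() in ("true", "false")
    }
    return site, options
```

Because `parse_qs` decodes percent-escapes, the `%3F` in the site parameter turns back into `?` in the returned target URL.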

Usage (shell)

DebloatProxy assumes that it has been called from a shell when the variable GATEWAY_INTERFACE is not set in the environment. The first argument given to DebloatProxy when called from the shell is the file to be processed. E.g., to debloat a file called index.html, you call

debloat.py index.html
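The check for the GATEWAY_INTERFACE variable mentioned above might look like this in Python. The function name `called_from_shell` is hypothetical; only the variable name comes from the description.

```python
import os

def called_from_shell(environ=os.environ):
    """Return True when no CGI environment is present."""
    # Web servers set GATEWAY_INTERFACE (e.g. "CGI/1.1") when running a CGI script.
    return "GATEWAY_INTERFACE" not in environ
```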

The debloated result is written to stdout, so use the > operator to redirect the output to a file. E.g., to debloat index.html and write the result to debloated_index.html, you call

debloat.py index.html > debloated_index.html

Options can be given in short or long form, each followed by true to enable or false to disable the corresponding processing step. E.g., to enable the vector option and disable the classes option for the file index.html, you would call

debloat.py -V true --classes=false index.html > debloated_index.html
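A command line of this shape can be parsed with Python's `argparse`, as in the sketch below. It is not the parser used by debloat.py: the helper `str2bool`, the function `build_parser`, and the choice of `None` as "not specified" are assumptions, and only two of the options are shown.

```python
import argparse

def str2bool(value):
    """Convert the literal strings 'true'/'false' to a bool."""
    lowered = value.lower()
    if lowered not in ("true", "false"):
        raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")
    return lowered == "true"

def build_parser():
    parser = argparse.ArgumentParser(prog="debloat.py")
    parser.add_argument("file", help="HTML file to debloat")
    # default=None means "not given on the command line" (an assumption of this sketch).
    parser.add_argument("-V", "--vector", type=str2bool, default=None, metavar="BOOL")
    parser.add_argument("-c", "--classes", type=str2bool, default=None, metavar="BOOL")
    return parser

# Parse the example command line from above instead of sys.argv.
args = build_parser().parse_args(["-V", "true", "--classes=false", "index.html"])
```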

Options

split: Split regions

Long name: split
Short name: S
Splits the document by its section elements, that is, <header>, <nav>, <main>, <footer>.

vector: Remove vector graphics

Long name: vector
Short name: V
Remove <svg> tags, that is, vector graphics which are embedded in the website code.
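A regex that removes an element together with its content could look like the following sketch. The helper `remove_element` is hypothetical; it does not handle nested elements of the same name or self-closing tags.

```python
import re

def remove_element(html, tag):
    """Remove every <tag>...</tag> block, including its content."""
    name = re.escape(tag)
    # Non-greedy match across line breaks, case-insensitive.
    pattern = re.compile(rf"<{name}\b[^>]*>.*?</{name}\s*>", re.I | re.S)
    return pattern.sub("", html)
```

For the vector option this would be called as `remove_element(page, "svg")`; the same helper would fit other options that drop a paired tag with its content.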

script: Remove JavaScript

Long name: script
Short name: s
Removes <script> tags. These are the tags that contain the JavaScript code to be executed.

header: Clean up header

Long name: header
Short name: H
Removes various tags from the HTML header, including stylesheets, preload and prefetch information, and vendor-specific tags (like Facebook, Twitter).
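A header cleanup of this kind could be approximated with two patterns, as sketched below. The exact set of tags that DebloatProxy removes is not listed here, so the patterns are assumptions: `<link>` tags with rel set to stylesheet, preload or prefetch, and `<meta>` tags whose property or name starts with `og:`, `fb:` or `twitter:`.

```python
import re

# Assumed patterns; the real rule set of DebloatProxy may differ.
LINK = re.compile(
    r'<link\b[^>]*\brel\s*=\s*["\']?(?:stylesheet|preload|prefetch)\b[^>]*>', re.I
)
META = re.compile(
    r'<meta\b[^>]*\b(?:property|name)\s*=\s*["\']?(?:og|fb|twitter):[^>]*>', re.I
)

def clean_header(html):
    """Drop stylesheet/preload/prefetch links and social-media meta tags."""
    return META.sub("", LINK.sub("", html))
```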

picture: Reduce picture tags to one img tag

Long name: picture
Short name: p
Condenses <picture> tags, which contain alternative images for different screen resolutions, by replacing each of them with only its primary image.
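Condensing a picture element can be sketched with a single substitution that keeps only the first `<img>` tag found inside it. The function name `condense_picture` is made up, and the sketch assumes that every `<picture>` contains an `<img>`; otherwise the match could run on into a later element.

```python
import re

# Capture the first <img> inside a <picture> block and discard the rest.
PICTURE = re.compile(
    r"<picture\b[^>]*>.*?(<img\b[^>]*>).*?</picture\s*>", re.I | re.S
)

def condense_picture(html):
    """Replace each <picture>...</picture> block by its <img> tag."""
    return PICTURE.sub(r"\1", html)
```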

classes: Remove all class, ID and name information

Long name: classes
Short name: c
Removes all classes, IDs and names from all HTML elements. These are normally used for the execution of scripts and for CSS assignments.
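Stripping these attributes could be done by rewriting each opening tag, as in the sketch below. The names `TAG`, `ATTR` and `strip_attrs` are illustrative. The pattern is a simplification: it would also hit text such as ` id = 3` inside another attribute's value.

```python
import re

# An attribute named class, id or name with a quoted or unquoted value.
ATTR = re.compile(r'\s+(?:class|id|name)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)', re.I)
# Any opening tag; attributes are only touched inside such a tag.
TAG = re.compile(r"<[a-zA-Z][^>]*>")

def strip_attrs(html):
    """Remove class, id and name attributes from all opening tags."""
    return TAG.sub(lambda m: ATTR.sub("", m.group(0)), html)
```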

div: Remove div and span tags

Long name: div
Short name: d
Removes all <div> and <span> tags. These tags are "meaningless" structuring elements that only fulfill a function in conjunction with CSS and JavaScript.
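One way to drop these tags is to delete the opening and closing tags and leave the enclosed content in place. Whether DebloatProxy keeps the content is an assumption of this sketch, as is the function name `unwrap`.

```python
import re

# Matches opening and closing <div> and <span> tags, not their content.
DIV_SPAN = re.compile(r"</?(?:div|span)\b[^>]*>", re.I)

def unwrap(html):
    """Delete <div>/<span> tags while keeping whatever they enclose."""
    return DIV_SPAN.sub("", html)
```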

whitespace: Clean up non-visible whitespaces

Long name: whitespace
Short name: w
Removes whitespace such as tabs and empty lines, much of which is left behind by the preceding operations.
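A whitespace cleanup along these lines might look as follows. The function `clean_whitespace` is a sketch, not the rule used by DebloatProxy; it would also change whitespace inside `<pre>` blocks.

```python
import re

def clean_whitespace(html):
    """Collapse tabs and repeated spaces, and drop empty lines."""
    html = html.replace("\t", " ")
    html = re.sub(r" {2,}", " ", html)  # collapse runs of spaces
    lines = [line.strip() for line in html.splitlines()]
    return "\n".join(line for line in lines if line)
```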

novisible: Let only readable visible paragraphs remain

Long name: novisible
Short name: n
Removes everything but <p>, <h*> and <a> tags. This is the most extreme form of debloating and is likely to lead to false positives, that is, to the removal of relevant content.
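Keeping only these elements can be approximated by collecting every match of a pattern for `<p>`, `<h1>` to `<h6>` and `<a>` blocks and discarding the rest. This is one possible reading of the option; the name `keep_readable` and the choice to join the kept blocks with newlines are assumptions.

```python
import re

# A kept element: <p>, <h1>..<h6> or <a>, up to its matching closing tag.
KEEP = re.compile(r"<(p|h[1-6]|a)\b[^>]*>.*?</\1\s*>", re.I | re.S)

def keep_readable(html):
    """Return only the matched blocks, one per line, in document order."""
    return "\n".join(match.group(0) for match in KEEP.finditer(html))
```

Links nested inside a kept paragraph stay part of that paragraph, while tags such as `<pre>` or `<article>` are not matched because of the word boundary after the tag name.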

style: Remove style tags

Long name: style
Short name: C
Removes all <style> tags. These are used to define CSS properties inside the HTML file.

footer: Remove the footer

Long name: footer
Short name: f
Removes the <footer> tags. The content of the footer is usually irrelevant to the actual page content.

button: Remove buttons

Long name: button
Short name: b
Removes buttons from websites. Note that this is likely to break a page that uses buttons and JavaScript instead of regular <a> hyperlinks to navigate to further pages. However, DebloatProxy will in most cases break these links anyway, so they can normally be removed safely.

Installation

DebloatProxy needs only Python standard packages, in particular:

They should normally be included in every standard Python distribution.

To install DebloatProxy for shell usage, download debloat.py and store it in a directory on your $PATH, e.g., ~/.bin/. If you use the owncss option, you will also need the files dark.css and light.css, which must be in the same directory as the output files.

To install DebloatProxy for CGI usage, download debloat.py, dark.css and light.css, and move these files to any subdirectory of the document root. Make sure that the file debloat.py is executable, e.g., by executing chmod u+x debloat.py (depending on the web server, more execution permissions might be required).

You might have to enable execution of .py files as CGI scripts. On Apache, this can be achieved by creating a file named .htaccess in the same directory as the other files with the following content:

Options +ExecCGI
AddHandler cgi-script .py

Use your web browser to access debloat.py at the location where you stored it on the web server. Without any parameters, a help page will be shown.

One should not parse HTML with regex?!

Yes, that's what they tell you. And they are right. But we are facing adversarial websites. XML parsers are often not intended for untrusted content. And these sites probably have no valid XML/HTML code. So regexes it is, for now.