WOOT! Fu*k Meta-data! New tool from Tor: "MAT"

spurr

Hello,

This thread is just a heads up about a Google Summer of Code student's project for the Tor Project, funded by Google. Many here are familiar with the term "metadata" from the GPS data hidden in JPEG images. However, metadata is in lots of other files too, such as documents one may write on their computer, and it can include the computer name, registration numbers and even the MAC address.

This tool is still very young, and it's written in Python so one would need to install Python on their computer (as well as other dependencies listed below). I haven't tested this yet, as it was only noted a few days ago that it's ready for real-world testing.

Anyway, enough chit-chat, here is info and links. I'll post more once I test. After that point I hope this gets a sticky so people are made aware ...


"MAT - Metadata Anonymisation Toolkit"
The blog of the GSoC of Julien (jvoisin) Voisin for the Tor project.
http://mat-tor.blogspot.com/




[tor-talk] [GSoC] Metadata Anonymisation Toolkit
https://lists.torproject.org/pipermail/tor-talk/2011-August/020949.html

Hello,

The end of the GSoC is near, and I think I have almost achieved my objectives (except for zip files, but I'm working on a patch). My project is also meant to run well on Tails; dependencies are python-hachoir-core and python-hachoir-parser; optional dependencies are python-mutagen for more audio format support, and python-poppler and python-cairo for pdf support.

I need some testing/usability comments of my project.

Thank you, and have a nice day ! -- VOISIN Julien

| pgp key : C48815F2 <http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x9768FD3CC48815F2>
| dustri.org
https://gitweb.torproject.org/user/jvoisin/mat.git/blob/HEAD:/README
METADATA:

Metadata consist of information that characterizes data. Metadata are used to provide documentation for data products. In essence, metadata answer who, what, when, where, why, and how about every facet of the data that are being documented.

METADATA AND PRIVACY:


Metadata within a file can tell a lot about you. Cameras record data about when a picture was taken and what camera was used. Office software automatically adds author and company information to documents, spreadsheets and PDF files. Maybe you don't want to disclose that information on the web.
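
To see this for yourself: the office formats listed further down are zip containers, so Python's standard library is enough to look at what they carry. The sketch below is not part of Mat. It is written for Python 3, the helper names are made up for illustration, and it only looks for the meta.xml file and the docProps folder that this README names.

```python
import zipfile


def is_metadata_member(name):
    """True for the ODF meta.xml file and anything under the OOXML docProps/ folder."""
    return name == "meta.xml" or name.startswith("docProps/")


def list_metadata_members(path):
    """Return the names of archive members that look like metadata parts."""
    with zipfile.ZipFile(path) as archive:
        return [name for name in archive.namelist() if is_metadata_member(name)]


def read_member_text(path, member):
    """Return one archive member decoded as text so it can be read by eye."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(member).decode("utf-8", errors="replace")
```

Calling list_metadata_members on one of your own .odt or .docx files and printing each member with read_member_text shows which author, company and date fields the file would give away.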

WARNING :

Mat only removes metadata from your files; it does not anonymise their content, nor can it handle watermarking, steganography, or any overly custom metadata field or system.

If you really want to be anonymous, use a format that does not contain any metadata, or better: use plain text.


DEPENDENCIES:


  1. python2.6 (at least)
  2. python-hachoir-core and python-hachoir-parser
  3. shred (should be already installed)

OPTIONAL DEPENDENCIES:

  1. python-poppler and python-cairo : for pdf support
  2. python-mutagen : for massive audio format support

USAGE:


  • [command prompt] python cli.py --help
  • or
  • [command prompt] python gui.py

SUPPORTED FORMATS:


  • Portable Network Graphics (.png)
    • support : full
    • metadata : textual metadata + date
    • method : removal of harmful fields is done with hachoir


  • Jpeg (.jpeg, .jpg)
    • support : full
    • metadata : comment + exif/photoshop/adobe
    • method : removal of harmful fields is done with hachoir


  • Open Document [i.e., LibreOffice; formerly OpenOffice.org] (.odt, .odx, .ods, ...)
    • support : full
    • metadata : a meta.xml file
    • method : removal of the meta.xml file


  • Office Open XML (.docx, .pptx, .xlsx, ...)
    • support : full
    • metadata : a docProps folder containing XML metadata files
    • method : removal of the docProps folder


  • Portable Document Format (.pdf)
    • support : full
    • metadata : a lot
    • method : rendering of the pdf file on a cairo surface with the help of poppler in order to remove all the internal metadata, then removal of the remaining metadata fields of the pdf itself with pdfrw (the next version of python-cairo will support metadata, so we should get rid of pdfrw)


  • Tape ARchive (.tar, .tar.bz2, .tar.gz)
    • support : full
    • metadata : metadata from the file itself, metadata from the files contained in the archive, and metadata added by tar to each file at the creation of the archive
    • method : extraction of each file, treatment of the file, then addition of the treated file to a new archive; right before the add, the metadata added by tar itself is removed. When the new archive is complete, all of its metadata is removed.


  • Zip (.zip)
    • support : partial
    • metadata : metadata from the file itself, metadata from the files contained in the archive, and metadata added by zip to each file when it is added to the archive.
    • method : extraction of each file, treatment of the file, then addition of the treated file to a new archive. When the new archive is complete, all of its metadata is removed.


  • MPEG Audio (.mp3, .mp2, .mp1)
    • support : full
    • metadata : id3
    • method : removal of harmful fields is done with hachoir


  • Ogg Vorbis (.ogg)
    • support : full
    • metadata : Vorbis
    • method : removal of harmful fields is done with mutagen


  • Free Lossless Audio Codec (.flac)
    • support : full
    • metadata : Flac, Vorbis
    • method : removal of harmful fields is done with mutagen
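
Two of the methods in the list above need nothing beyond Python's standard library, so they can be sketched roughly as follows. This is not Mat's own code: it is written for Python 3, every function name is hypothetical, and the per-file cleaning step is left as a placeholder. The first function drops meta.xml and docProps from a zip-based office document (it keeps the remaining members' timestamps as they were). The second copies the regular files of a tar archive into a new, uncompressed archive whose headers no longer carry the owner or modification time.

```python
import io
import tarfile
import zipfile


def is_office_metadata(name):
    """True for the ODF meta.xml file and anything under OOXML docProps/."""
    return name == "meta.xml" or name.startswith("docProps/")


def strip_office_metadata(src, dst):
    """Copy a zip-based office document to dst without its metadata members."""
    with zipfile.ZipFile(src) as old, zipfile.ZipFile(dst, "w") as new:
        for info in old.infolist():
            if not is_office_metadata(info.filename):
                # Reusing the ZipInfo keeps each member's compression method
                # (and its timestamp, which this sketch does not touch).
                new.writestr(info, old.read(info.filename))


def identity(name, data):
    """Placeholder for the per-file cleaning step; returns data unchanged."""
    return data


def rewrite_tar(src, dst, clean=identity):
    """Copy the regular files of a tar archive into dst with neutral headers."""
    with tarfile.open(src) as old, tarfile.open(dst, "w") as new:
        for member in old.getmembers():
            if not member.isfile():
                continue  # directories, links and devices are skipped here
            data = clean(member.name, old.extractfile(member).read())
            # A fresh TarInfo has mtime 0, uid/gid 0 and empty user/group
            # names, so the owner and timestamp tar recorded are dropped.
            info = tarfile.TarInfo(name=member.name)
            info.size = len(data)
            new.addfile(info, io.BytesIO(data))
```

In a real tool the clean argument would be replaced by a function that strips the metadata of each contained file according to its type, which is the "treatment of the file" step the list describes.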

LICENSE:


This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.


THANKS:

Mat would not exist without :
- the Google Summer of Code,
- the Python language
- the amazing (and messy) hachoir library,
- poppler and cairo's python bindings,
- and the mutagen library.

Many thanks to them!


KNOWN BUGS:

Zipfiles are not totally cleaned, I know. I am working on a patch for zipfile.py
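
For context on what can be left behind in a zip file: besides the contents, each entry records a modification time, an optional comment, an "extra" field and the system that created it, and the archive as a whole can carry a comment. The sketch below is an illustration only, not the patch mentioned above. It is Python 3, the names normalise_zip and FIXED_DATE are hypothetical, and it rebuilds an archive with those fields set to fixed values.

```python
import zipfile

# Earliest timestamp the zip format can store; used for every entry.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def normalise_zip(src, dst):
    """Copy every entry of src into dst with neutral per-entry headers."""
    with zipfile.ZipFile(src) as old, zipfile.ZipFile(dst, "w") as new:
        for name in old.namelist():
            # A fresh ZipInfo starts with an empty comment and extra field.
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3            # same value on every platform
            info.external_attr = 0o644 << 16  # plain rw-r--r-- permissions
            new.writestr(info, old.read(name))
        # new.comment is left at its default (empty), so the archive-level
        # comment of the source is not carried over.
```

This only deals with the container. The files inside the archive still need their own cleaning, as the README's method for zip describes.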

Git repo for code:

https://gitweb.torproject.org/user/jvoisin/mat.git

:tiphat:
 

spurr

This tool is good for us PDF freaks, e.g., if we edit or create PDFs on our own computers.
 

smokefrogg

That's handy for PDFs and MS Office files. I already use exiftool to clean images, but this sounds much handier since it handles so many file types. Thanks for the heads up!
 

bonsai

FYI for Mac users: OS X ships with Python. Snow Leopard ships with Python 2.6, which is the minimum version required to run MAT.
Furthermore, you'll need to install git to download MAT. Git is source-code management software written by the primary author of the Linux kernel. I recommend using Homebrew to install packages on OS X; here's a video on installing Homebrew and git: http://vimeo.com/14649488
 
