SoxMosh

Perfect Echo

Overview

This is a Python command-line application for manipulating images by applying audio effects to them.

The idea was inspired by some initial experiments I tried using Audacity, which is fun but also quite tedious and prone to errors.

Since the approach is completely non-visual anyway, I figured that writing some simple code to handle the transformations wouldn't go against the spirit of the endeavour.

The project is based around pysox, which is itself a wrapper around the SoX tool.

Note: I am far from the first to try this idea out. Two resources I used for inspiration & reference are:

Webapp

The most direct way to try this out is to use this webapp

Download

A very early version can be downloaded here.

Note 1: this has only been tested on macOS at this stage - so may not work on Windows / Linux (but maybe it will!)

Note 2: I don't have a personal Apple developer account and so, unfortunately, the zip is not signed / notarised. This means you'll have to:

  1. Trust me not to do anything nefarious
  2. Go into System Preferences -> Security & Privacy and allow the software to be run

All of the dependencies are bundled into the application - so there's no need to install any extra software

Usage

At this stage the usage is fairly simple, but requires using a terminal / console application

  1. Unzip the downloaded archive and open a terminal window inside the newly created directory
  2. Run with the following pattern: ./soxmosh_cli input_image output_image effects
  3. Try ./soxmosh_cli perfect_blue_face.bmp perfect_face_moshed.bmp example_effects.json directly to see an example using a provided image & effects data file

How does it work?

The reason this works is due to the way a computer interprets media such as audio & images. Though we have dedicated audio software to handle reading & playback of sounds, image software for displaying & editing images, video software... etc, these are really just specific ways of interpreting the data or information contained in a media file. Of course different encodings exist, various compression algorithms etc which complicate this view somewhat - but essentially the computer is just acting on binary streams of 0s and 1s without really seeing the forest for the trees.

One of the keys to applying this kind of audio effect treatment to images is SoX's ability to load data in a raw format. Formats like WAV, FLAC, MP3 etc are "self-describing", in that they contain a header which describes aspects of the audio data such as sample rate, number of channels, length, encoding etc. Because all of the pertinent details are contained in the header, SoX can figure out how to interpret the data by itself.
Raw formats don't contain such a header, so in order to interpret the data as expected this information must be explicitly provided to SoX. This allows us to manipulate SoX to our own ends. We specify the data encoding to be a-law (though I can't claim to know why this particular encoding works while others don't...), which expects each sample to be 8-bit (this will be important later).

It then becomes quite simple to input an image to sox on the command line and have it output that same image:
sox -t al -c 1 -r 48k path/to/image.bmp -t al path/to/output.bmp

This is obviously a totally pointless example as it just outputs a 1:1 copy of the same image (in fact even this isn't really true - I do spot some artefacts appearing) but it shows that by specifying the -t al flag, which sets the filetype to be a-law, we can pass through image data.

Perfect Original Perfect Moshed
We've told SoX that we're working with raw data, but that isn't actually true; it's just a nice hack. We're working with bitmap files, which actually do have a header. Using that header we can figure out where the data portion begins and only perform our operations from that point onwards (i.e. removing the header, performing our transformations & then finally reattaching the header to the modified block of data). This avoids corrupting the header.
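The header handling can be sketched roughly as follows. This is not the SoxMosh source: the function names are made up for illustration, and it assumes a standard BMP file in which bytes 10 to 13 hold the little-endian offset of the pixel data. In the real tool the transform would be SoX acting on the pixel bytes; here any bytes-to-bytes function will do.

```python
import struct


def split_bmp(raw: bytes):
    """Split a BMP file into (header, pixel_data) using the offset stored at byte 10."""
    if raw[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Bytes 10..13 hold the little-endian offset at which the pixel data starts.
    (data_offset,) = struct.unpack_from("<I", raw, 10)
    return raw[:data_offset], raw[data_offset:]


def mosh_bmp(raw: bytes, transform) -> bytes:
    """Apply transform to the pixel data only, then reattach the untouched header."""
    header, pixels = split_bmp(raw)
    return header + transform(pixels)


if __name__ == "__main__":
    # Tiny fake file: 54 header bytes (offset field = 54) followed by 4 data bytes.
    fake = b"BM" + bytes(8) + struct.pack("<I", 54) + bytes(40) + b"\x01\x02\x03\x04"
    moshed = mosh_bmp(fake, lambda data: data[::-1])
    print(moshed[:54] == fake[:54], moshed[54:].hex())
```

Because the header is sliced off before the transform and glued back on afterwards, whatever the transform does to the bytes it can never touch the width, height or offset fields.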

So, now we're ready to look into what SoX actually does to our data to achieve these effects. It's useful to briefly look into what happens to audio data first to get a feel for what we can expect to see. The simplest and clearest demonstration is probably the delay effect:
delay {position(=)}: Delay one or more audio channels such that they start at the given position.

This has the effect of offsetting all of the data samples by a given number of seconds. The resulting output contains only the delayed samples, i.e. it's 100% wet and doesn't include any of the dry original samples.

Adding a delay effect to audio

We will work with this simple synth note from an SH101:


This file contains 11680 audio samples, which at a sampling rate of 44100 Hz corresponds to a length of 0.264 seconds

Running the command:

sox tiny_rave_SH101_C4.wav tiny_rave_SH101_C4_out.wav delay 1.0

has the effect of creating a new WAV file with the samples delayed by 1.0 seconds:


This file contains 55780 audio samples, which with the same sampling rate corresponds to 1.264 seconds of audio. This is quite self explanatory and is in line with what we expected.
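The sample counts and durations quoted here follow directly from the sample rate. A minimal sketch of the arithmetic (the helper names are just for illustration):

```python
SAMPLE_RATE = 44100  # Hz, the rate quoted for the SH101 note


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of the audio in seconds."""
    return num_samples / sample_rate


def samples_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples that a time span occupies."""
    return round(seconds * sample_rate)


original = 11680
delayed = original + samples_for(1.0)  # a 1.0 s delay adds 44100 samples
print(delayed)
print(f"{duration_seconds(original):.4f} s -> {duration_seconds(delayed):.4f} s")
```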
To belabor the point, we can look at the first 50 samples of the audio signal data contained within the original file:
            
            [ 0.00000000e+00 -2.05039978e-05 -9.76562500e-04 -4.47976589e-03
            -9.54377652e-03 -1.52943134e-02 -2.14105845e-02 -2.78952122e-02
            -3.46711874e-02 -4.15842533e-02 -4.85666991e-02 -5.54741621e-02
            -6.22984171e-02 -6.83031082e-02 -7.33066798e-02 -7.80060292e-02
            -8.25167895e-02 -8.67084265e-02 -9.07150507e-02 -9.43706036e-02
            -9.76780653e-02 -1.00758076e-01 -1.03464842e-01 -1.05913758e-01
            -1.07764721e-01 -1.09514713e-01 -1.10503912e-01 -1.11728311e-01
            -1.11741304e-01 -1.12472177e-01 -1.10652328e-01 -1.13386512e-01
            -7.16021061e-02  1.30800843e-01  2.30443716e-01  2.74714947e-01
            3.09366226e-01  3.36177826e-01  3.61080170e-01  3.82261157e-01
            4.03543711e-01  4.21683312e-01  4.40880895e-01  4.54923272e-01
            4.70372915e-01  4.83487487e-01  4.98167157e-01  5.10751724e-01
            5.24211765e-01  5.37293554e-01]
            
        
This clearly shows audio amplitude data, as we would expect.

Doing the same for the delayed audio file yields:
            
                [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
                0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
                0. 0.]
            
        
Once again this is what we expect to see considering we've shifted all of the samples by 1 second. In order to find the data in the array we have to index from the sample equivalent of 1.0s, which with our 44100 Hz sample rate is actually just sample 44100 (sample = time * sampling-rate), yielding:
            
                [ 0.00000000e+00 -2.05039978e-05 -9.76562500e-04 -4.47976589e-03
                 -9.54377652e-03 -1.52943134e-02 -2.14105845e-02 -2.78952122e-02
                 -3.46711874e-02 -4.15842533e-02 -4.85666991e-02 -5.54741621e-02
                 -6.22984171e-02 -6.83031082e-02 -7.33066798e-02 -7.80060292e-02
                 -8.25167895e-02 -8.67084265e-02 -9.07150507e-02 -9.43706036e-02
                 -9.76780653e-02 -1.00758076e-01 -1.03464842e-01 -1.05913758e-01
                 -1.07764721e-01 -1.09514713e-01 -1.10503912e-01 -1.11728311e-01
                 -1.11741304e-01 -1.12472177e-01 -1.10652328e-01 -1.13386512e-01
                 -7.16021061e-02  1.30800843e-01  2.30443716e-01  2.74714947e-01
                 3.09366226e-01  3.36177826e-01  3.61080170e-01  3.82261157e-01
                 4.03543711e-01  4.21683312e-01  4.40880895e-01  4.54923272e-01
                 4.70372915e-01  4.83487487e-01  4.98167157e-01  5.10751724e-01
                 5.24211765e-01  5.37293554e-01]
            
        
which is exactly the same amplitude data as we saw at the beginning of the original file.

What is important to note here is that the sample data is just translated in time, while the actual amplitude values remain unaffected. Other effects like compression, reverb, echoes etc will affect some combination of the temporal sample position & the amplitude values themselves (i.e. changing the shape of the overall signal). This is what makes delay an ideal candidate for exploring the impact on images.
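As a toy model of this behaviour (an assumption for illustration, not SoX's actual implementation), a delay can be thought of as prepending zero-valued samples and leaving everything else alone:

```python
def delay(samples, seconds, sample_rate):
    """Return the samples preceded by seconds * sample_rate zero-valued samples."""
    offset = round(seconds * sample_rate)
    return [0.0] * offset + list(samples)


# First few amplitude values from the original file's listing.
original = [0.0, -2.05039978e-05, -9.76562500e-04, -4.47976589e-03]
delayed = delay(original, 1.0, 44100)

print(delayed[:4])                        # silence at the start
print(delayed[44100:44104] == original)   # the untouched data, found 1.0 s later
```

The amplitude values come out bit-for-bit identical; only their index changes.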

Adding a delay effect to an image

The first thing to do is to get a small & simple bitmap file to work with in order to clearly see what is going on. This won't necessarily give us an interesting image to look at, but it will help uncover some of the mystery (I hope!).
For this I've copied the bitmap data from this blog by Uday Hiwarale. Well worth a read if you want to understand more about the bitmap format.

This is the bitmap data we're going to be dealing with:

            
                42 4D
                00 00 00 00
                00 00
                00 00
                36 00 00 00

                28 00 00 00
                05 00 00 00
                05 00 00 00
                01 00
                18 00
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00

                FF FF FF   00 00 00   00 00 00   00 00 00   00 FF FF   00
                00 00 00   00 00 00   00 00 00   00 00 00   00 00 00   00
                00 00 00   00 00 00   00 FF 00   00 00 00   00 00 00   00
                00 00 00   00 00 00   00 00 00   00 00 00   00 00 00   00
                00 00 FF   00 00 00   00 00 00   00 00 00   FF 00 00   00
            
        
If you copy and paste that into an application like Hex Fiend (for macOS) and save the file with a bitmap extension you should be able to view the (tiny! so zoom in a lot) image we've created:

test bitmap

In the hex data everything up to the final block (starting with FF FF FF) is header information.
The most important points to note are:

  1. In this encoding a single pixel is represented by 3 bytes, giving each colour channel a value from 0 to 255.
  2. The bytes are interpreted in BGR order, which kind of feels backwards.
  3. Another peculiarity is that the image is read starting from the bottom-left 24 bits of the pixel data as listed above, moving right and then up. Those bottom-left 24 bits correspond to the first pixel in the image (i.e. top left).

This means that in our example the bottom-left bytes 00 00 FF are the red pixel in the top left of the image. The rest should be self evident.
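To check this reading of the hex data, here is a small sketch that decodes the test bitmap by hand. It assumes the standard BMP layout (pixel data offset at byte 10, width and height at bytes 18 and 22, bits per pixel at byte 28) and that rows are stored bottom-up and padded to a multiple of 4 bytes; the function name is only illustrative.

```python
import struct

# Hex data of the 5x5 test bitmap, copied from the listing above.
HEX = """
42 4D 00 00 00 00 00 00 00 00 36 00 00 00
28 00 00 00 05 00 00 00 05 00 00 00 01 00 18 00
00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00
FF FF FF 00 00 00 00 00 00 00 00 00 00 FF FF 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 FF 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 FF 00 00 00 00 00 00 00 00 00 FF 00 00 00
"""
bmp = bytes.fromhex("".join(HEX.split()))

(data_offset,) = struct.unpack_from("<I", bmp, 10)
width, height = struct.unpack_from("<ii", bmp, 18)
(bits_per_pixel,) = struct.unpack_from("<H", bmp, 28)
# Each row is padded up to the next multiple of 4 bytes.
stride = (width * bits_per_pixel // 8 + 3) // 4 * 4


def decode_pixels(raw, offset, width, height, stride):
    """Yield (x, y, (r, g, b)) with y = 0 at the top of the displayed image."""
    for stored_row in range(height):
        y = height - 1 - stored_row  # rows are stored bottom-up
        start = offset + stored_row * stride
        for x in range(width):
            b, g, r = raw[start + 3 * x : start + 3 * x + 3]  # BGR byte order
            yield x, y, (r, g, b)


non_black = {
    (x, y): rgb
    for x, y, rgb in decode_pixels(bmp, data_offset, width, height, stride)
    if rgb != (0, 0, 0)
}
for position, rgb in sorted(non_black.items()):
    print(position, rgb)
```

This reports red in the top left, blue in the top right, green in the centre, white in the bottom left and yellow in the bottom right.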

Now if we take that image as our input and run the following with the SoxMosh Python command-line tool:

./soxmosh_cli test.bmp test_output.bmp delay.json --sample-rate=60

Where delay.json is a file simply containing:

{ "effects": [ {"delay": {"positions":[0.05]}} ] }

We obtain as output:

test bitmap output

And the corresponding hex data:

            
                42 4D
                00 00 00 00 
                00 00 
                00 00
                36 00 00 00 

                28 00 00 00 
                05 00 00 00 
                05 00 00 00
                01 00 
                18 00 
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00
                00 00 00 00

                FF FF FF   FF FF FF   00 00 00   00 00 00   00 00 00   00
                FF FF 00   00 00 00   00 00 00   00 00 00   00 00 00   00
                00 00 00   00 00 00   00 00 00   00 FF 00   00 00 00   00
                00 00 00   00 00 00   00 00 00   00 00 00   00 00 00   00
                00 00 00   00 00 FF   00 00 00   00 00 00   00 00 00   FF
                00 00 00
            
        
This isn't exactly what we expect to see based on our previous analysis of how the delay deals with samples - some of these discrepancies I can explain, some I can't.

What worked?

The top left red pixel moved to the right as expected, as did the green middle block. They appear to have moved by 1 pixel. This is a result of experimenting with the sample rate and delay time to get it just right.
Notice the sample rate of 60 Hz. What does this really mean for an image? It's obviously related to time in some way since the unit is 1/s. If I choose a higher value (e.g. >= 200 Hz) then I simply end up with a white image, so that's obviously too high for a 5x5 pixel image. So I landed on 60 Hz as something that worked.

Now, the μ-law / a-law encoding schemes that we're using expect 8-bit audio samples, but as previously mentioned our bitmap encoding scheme uses 24 bits for a single pixel. This means that shifting everything in the image by 1 sample (8 bits) shifts each byte one place to the right. E.g. a blue FF 00 00 pixel would become 00 FF 00, which changes the colour of the pixel to green while it occupies the same space in the image. What we actually want, considering two neighbouring pixels, is FF 00 00 00 00 00 -> 00 00 00 FF 00 00, which moves the blue pixel one space to the right and preserves its colour, i.e. we want to shift 3 samples across.

In order to achieve this I've used 0.05s as the delay time, since that corresponds to a 3 sample shift (0.05 * 60).
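The relationship between pixels, samples and delay time can be sketched like this. The byte shift below is only a toy stand-in for the delay effect (it assumes the delay pads with 00 bytes and that the data keeps its length), and the helper names are hypothetical; the JSON structure is the one used in delay.json.

```python
import json

BYTES_PER_PIXEL = 3  # 24-bit bitmap; one a-law sample is one byte


def delay_for_pixel_shift(pixels: int, sample_rate: float) -> float:
    """Delay time in seconds that moves the data by a whole number of pixels."""
    return pixels * BYTES_PER_PIXEL / sample_rate


def shift_bytes(data: bytes, n: int) -> bytes:
    """Push everything n bytes to the right, keeping the original length."""
    return (bytes(n) + data)[: len(data)]


two_pixels = bytes.fromhex("FF0000000000")  # blue pixel followed by a black one (BGR)
print(shift_bytes(two_pixels, 1).hex(" "))  # 1 sample: the pixel changes colour
print(shift_bytes(two_pixels, 3).hex(" "))  # 3 samples: the pixel moves, colour kept

effects = {"effects": [{"delay": {"positions": [delay_for_pixel_shift(1, 60)]}}]}
print(json.dumps(effects))
```

With a sample rate of 60 Hz a one-pixel shift works out to 3 / 60 = 0.05 s, the value used in delay.json.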

We also observe that the blue pixel which occupied the top right has gone. That's simply because we've retained the same image size, and so it's been delayed out of existence.
The yellow pixel has disappeared, but the delayed position is occupied by a light blue pixel (FF FF 00). This is because each row is zero padded (the trailing 00s) to the nearest 4-byte boundary, i.e. 15 -> 16 bytes. When we shift the 00 FF FF pixel, it passes through the 00 padding and hence changes colour in the process. This pixel would actually need a 3 pixel shift in order to retain its colour.

What didn't work

The bottom left white pixel was correctly shifted to the right, but it also left behind a white pixel, not a black one as would be expected.
Here we can see the issue with our white pixels: instead of simply shifting FF FF FF 00 00 00 -> 00 00 00 FF FF FF as we would expect, we see FF FF FF 00 00 00 -> FF FF FF FF FF FF, i.e. it has simply created a new white pixel for us!
In addition, the file size has increased from 134 bytes to 137.
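One way to narrow down where the output departs from expectations is to compare it against a naive model in which the delay simply prepends 3 zero bytes to the pixel data. That model is an assumption for the sake of comparison (it ignores whatever SoX does when decoding and re-encoding a-law), and the pixel bytes are copied from the two hex listings with the header left out.

```python
def pixel_bytes(hex_text: str) -> bytes:
    """Turn a whitespace-separated hex listing into bytes."""
    return bytes.fromhex("".join(hex_text.split()))


ORIGINAL = pixel_bytes("""
FF FF FF 00 00 00 00 00 00 00 00 00 00 FF FF 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 FF 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 FF 00 00 00 00 00 00 00 00 00 FF 00 00 00
""")

OBSERVED = pixel_bytes("""
FF FF FF FF FF FF 00 00 00 00 00 00 00 00 00 00
FF FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 FF 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 FF 00 00 00 00 00 00 00 00 00 FF
00 00 00
""")

SHIFT = 3  # 0.05 s * 60 Hz
expected = bytes(SHIFT) + ORIGINAL  # naive model: the delay prepends zero bytes

mismatches = [i for i, (e, o) in enumerate(zip(expected, OBSERVED)) if e != o]
print(len(ORIGINAL), len(OBSERVED), len(expected))
print(mismatches)
```

Under this model the lengths agree (the 3 extra bytes are the 3 delayed samples), and the only bytes that differ are the first three: the space opened up by the delay came out as FF FF FF rather than 00 00 00. Everything after that matches a plain 3-byte shift.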