Base85x-java available on SourceForge

I am thrilled to announce that the first base85x code is available online on SourceForge for free! During the last couple of weeks I tried differen approaches to implement base85x in java with a good performance.

The project as well as the source code is available here: https://sourceforge.net/projects/base85xjava/. Downloads will be packed as soon as there are more tests and better comments in the code.

Next step will be a simple but less efficient PHP implementation in order to be able to provider an online encoder and decoder for base85x.

Cheers!

Minor correction of base85x encoding

In the transformation between base 85 values and ASCII characters there was a minor mistake last time. 40 minus 2 is 38 of course, not 39.

The corrected formulars are

From base 85 to ACSII
out = (c >= 51 ? c+42 : (c >= 22 ? c+41 : (c >= 2 ? c+38 : c+36)));

From ASCII to base 85
c = ((in >= 93) ? in-42 : ((in >= 63) ? in-41 : ((in >= 40) ? in-38 : in-36)));

Getting rid of the backslash \ in the base85x alphabet

Escaping sucks, right? Last version of the base85x alphabet included the backslash character which is used in a bunch of popular programming languages like for example C, Java and PHP.

But we want to be able to store small binary files inside of sourcecode without taking care about escaping. Just like

$image_data = '
?%AvZOxjj)vwXPQ_*hhA2Sj~vME^X32O]TMX/0+P41T2AtZ4;ER04^WhSDqym%r;
ei,%N}}O|3b3w{WawiL4twR51zu4CEQ{OvNL/ebH:wK)KWF{x5IC.C2[Hlpq|vh9
9g/]vUEa6/,lF[YjpYrV%d2zqr5nQlp)PXzqXbWtE}]ZcN*FKK2$~YcoYK6S:W4G
^u9Yw:h^m6d.ukMTGEiCt^8?(ybvbCr,ij;QDoHTjuV(,$niJj7,);/3Mx9oVygq
%w9q*i0_pBy)Np~(/oYnO32H8t~3u+gAxF6M.YutDQ+v]Ege$UDFCb6lnAWjJ2un
DfCV.(h%f}N4YL(knk+dcy{bc%4,.],7}Uo{ox(*lRH,r.AoHi,614GxIF7?8Oy[
g_u3hyf5e?/s?/cPbvuP3VSmpIRiWgT})HmnZ*,pFE[%y*mH^rQZo]iF3P)bg0wY
Ko%){H.qQhA/GdlS^E,X}M/6Nvt{Vti2Z{k9A|C(35mc1rkml+IP,Jva6.P1Qs[f
nwAY.m}i?cyimPKMl%9Xq](WEabr77
';

Thus the base85x alphabet changed as follows:

In formular style that’s the same as

out = (char) (c1 >= 51 ? c1 + 42 : (c1 >= 22 ? c1 + 41 : (c1 >= 2 ? c1 + 39 : c1 + 36)));

0
$
1
%
2
(
3
)
4
*
5
+
6
,
7
-
8
.
9
/
10
0
11
1
12
2
13
3
14
4
15
5
16
6
17
7
18
8
19
9
20
:
21
;
22
?
23
@
24
A
25
B
26
C
27
D
28
E
29
F
30
G
31
H
32
I
33
J
34
K
35
L
36
M
37
N
38
O
39
P
40
Q
41
R
42
S
43
T
44
U
45
V
46
W
47
X
48
Y
49
Z
50
[
51
]
52
^
53
_
54
`
55
a
56
b
57
c
58
d
59
e
60
f
61
g
62
h
63
i
64
j
65
k
66
l
67
m
68
n
69
o
70
p
71
q
72
r
73
s
74
t
75
u
76
v
77
w
78
x
79
y
80
z
81
{
82
|
83
}
84
~

New alphabet for the base85x encoding

Unfortunately it turned out that the base85x encoding proposed in the last post is damn slow. To encode a 400 KB picture from memory to memory, the PHP implementation needs 0.2 seconds and the Java implementation needs 0.1 seconds. That’s only 4 MB/s.

A lot of this time is spent on alphabet table lookups. So I have the hope to be able to reduce the number of lookups. The original base85 just adds 33 to every value 0,…,84 in order to get the right ASCII character. That is fast but leads to all ASCII characters between 33 and 117 including “<” and “>”.

So the question was, what is the simplest function to map 0,…,84 to a subset of the 94 printable ASCII values excluding: < > ” ‘. The best I could find was this one:

In Java I could reduce the encoding time from 0.1 s to 0.075 s by using the following formula instead of a lookup table:

out = (char) ((c1 >= 21) ? c1 + 42 : ((c1 != 0) ? c1 + 39 : 33)));

I think some more enhancements are possible, especially in c/c++ where you can do better binary operations and use unsigned variables.

Overall this leads to the following new base85x encoding alphabet:

0
!
1
(
2
)
3
*
4
+
5
,
6
-
7
.
8
/
9
0
10
1
11
2
12
3
13
4
14
5
15
6
16
7
17
8
18
9
19
:
20
;
21
?
22
@
23
A
24
B
25
C
26
D
27
E
28
F
29
G
30
H
31
I
32
J
33
K
34
L
35
M
36
N
37
O
38
P
39
Q
40
R
41
S
42
T
43
U
44
V
45
W
46
X
47
Y
48
Z
49
[
50
\
51
]
52
^
53
_
54
`
55
a
56
b
57
c
58
d
59
e
60
f
61
g
62
h
63
i
64
j
65
k
66
l
67
m
68
n
69
o
70
p
71
q
72
r
73
s
74
t
75
u
76
v
77
w
78
x
79
y
80
z
81
{
82
|
83
}
84
~

But why is base85 so much slower compared to base64? Well, 64 is a power of 2 (2^6 = 64) which means that a multiplication by 64 is equal to a binary shift of 6 positions to the left and a division by 64 is equal to a right shift of 6 positions. Additionally the modulo operation a%64 is the same as cutting the lower 5 bits of a. All these essential operations are slower in base 85.

The base85x encoding

How to store binary data within XML files

Kullo will use a slightly modified form of the base64/base85 encoding to store binary files inside a xml file: base85x.

The general idea is to encode 4 bytes of binary data into 5 bytes of printable characters, which means that on the one hand, file size increases by 25 % but on the other hand binary data can be displayed, copy’n’pasted and stored in text files and source code.

Given four binary bytes b_1, …, b_4 out of [0, 255], we use the following base transformation to get the values c_1, …, c_5 in the range of [0, 84]:
7c0d52b0582cf30c85e6fe72bf2f2455

Then we use another alphabet than Adobe’s base85, since that includes the crucial XML characters < and >. The alphabet we’re using is the following:

0
0
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
A
11
B
12
C
13
D
14
E
15
F
16
G
17
H
18
I
19
J
20
K
21
L
22
M
23
N
24
O
25
P
26
Q
27
R
28
S
29
T
30
U
31
V
32
W
33
X
34
Y
35
Z
36
a
37
b
38
c
39
d
40
e
41
f
42
g
43
h
44
i
45
j
46
k
47
l
48
m
49
n
50
o
51
p
52
q
53
r
54
s
55
t
56
u
57
v
58
w
59
x
60
y
61
z
62
?
63
!
64
.
65
:
66
-
67
_
68
,
69
;
70
(
71
)
72
[
73
]
74
=
75
|
76
{
77
}
78
*
79
+
80
@
81
#
82
%
83
~
84
^

There are 94 printable characters in ACSII (plus the space ” “). The spare characters we are not using are: "$&'/<>\`

All these 94 ASCII chars are represented as 1-Byte characters in UTF-8, thus the size will not increase when storing base85x blocks in XML files.

In order to get consistent results, we need to use a standardized padding if the input data is not divisible by 4: We add as many ^ characters as we need to get a full 4-bytes block. And to store the number of tildes which were added, we trim the resulting 5-bytes output by exactly this number.

If we had for example 42 bytes of binary data, ending on “n!”, the padding rule expands this block to “n!^^” before encoding.

base85xEncode(“n!^^”) = 110*256³ + 33*256² + 94*256 + 94 = (35,33,54,28,76)85 = ZXsS{

In this Example, we cut the last 2 characters and get “ZXs”

Note: Since the first version of this encoding there have been incompatible updates in the alphabet. See all related posts.

First draft of the Kullo message format

Today I am proud to present you the first version of the Kullo message model.

This model will be implemented as nested XML files to make it easy accessable in every programming language as well as via a text editor.

envelope envelopeTo address
symmetric (RSA encrypted) symmetricAlgorithm string symmetricKey string symmetricIV string
letterContainer (encrypted)
letter (compressed)
header to string cc string bcc string from string subject string date datetime messageID string InReplyTo string
body
footer
imprint name string rank string company string logo file comments text
attachmentsIndex
item name string mime string start int length int
letterSignature string
attachmentsContainer (encrypted) attachments file attachmentsSignature string