Most of what I do involves the messy world of text, and I think this is a great resource. I wish the software I depended on tested against it.
I can think of a few more cases that I've seen cause havoc:
- U+FEFF in the middle of a string (people are used to seeing it at the beginning of a string, because Microsoft, but elsewhere it may be more surprising)
- U+0 (it's encoded as the null byte!)
- U+1B (the codepoint for "escape")
- U+85 (Python's "codecs" module thinks this is a newline, while the "io" module and the Python 3 standard library don't)
- U+2028 and U+2029 (even weirder linebreaks that cause disagreement when used in JSON literals)
- A glyph with a million combining marks on it, but not in NFC order (do your Unicode algorithms use insertion sort?)
- The sequence U+100000 U+010000 (triggers a weird bug in Python 3.2 only)
- "Forbidden" strings that are still encodable, such as U+FFFF, U+1FFFF, and for some reason U+FDD0
People should also test what happens with isolated surrogate codepoints, such as U+D800. But these can't properly be encoded in UTF-8, so I guess don't put them in the BLNS. (If you put the fake UTF-8 for them in a file, the best thing for a program to do would be to give up on reading the file.)
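Several of the cases above are cheap to check from a REPL; here is a quick sketch of the U+0085, U+2028, combining-mark-ordering, and lone-surrogate behaviors as they act on CPython 3 (exact behavior is version-dependent, so treat this as illustrative):

```python
import json
import unicodedata

# U+0085 (NEL) counts as a line boundary for str.splitlines().
assert "a\u0085b".splitlines() == ["a", "b"]

# U+2028/U+2029 are legal inside JSON strings, but were illegal in
# JavaScript string literals before ES2019; hence the disagreement.
s = "one\u2028two"
assert json.loads(json.dumps(s)) == s

# Canonical ordering: combining marks sort by combining class under
# normalization, so acute (ccc 230) moves after ogonek (ccc 202).
assert unicodedata.normalize("NFD", "a\u0301\u0328") == "a\u0328\u0301"

# Lone surrogates such as U+D800 cannot be strictly encoded as UTF-8.
try:
    "\ud800".encode("utf-8")
    surrogate_encoded = True
except UnicodeEncodeError:
    surrogate_encoded = False
assert surrogate_encoded is False
```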
Isolated UTF-16 surrogate code points definitely crash Unity when it tries to display them. (Seen when I pasted some emoji in a text box in TIS-100 and tried to backspace.)
One fun (and very interesting) string is EICAR[0]. I worked for an antivirus company once, and we had the EICAR string for testing but couldn't check it into source control because it triggered the AV software, which we dogfooded...
Fun times indeed. Windows Defender picks up a test.txt with those contents as malicious (and closes the file handle, causing Notepad to misbehave), but if you add a space between EI and CAR it doesn't see anything.
Edit: Seriously, Microsoft?
Category: Virus
Description: This program is dangerous and replicates by infecting other files.
Recommended action: Remove this software immediately.
Items: file:C:\Users\Adam\Desktop\test.txt
Yeah, I would make the SQL injection and command injection tests a little less kinetic =). Using a simple SELECT test, like SELECT @@VERSION, would be a little safer... Edit: Forgot to say thanks! This is a pretty cool list.
It's not completely clear to me which encoding the blns.txt file uses. Since this project is all about weird/evil bytestrings, the encoding of the file itself is very important.
Using a newline as a delimiter in that file excludes newlines from being part of the strings you are testing - but newlines are an important "naughty" character to consider. Unfortunately the same is true of basically any other common delimiter character.
Maybe base64-encoding the strings would be one way to solve for this? You could use base64-encoded values in JSON, for example.
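A minimal sketch of the base64-in-JSON idea: arbitrary bytes, including newlines, NULs, and invalid UTF-8, round-trip cleanly (the sample strings here are invented for illustration):

```python
import base64
import json

# Raw test strings, including bytes that no text-based delimiter
# scheme can carry safely.
naughty = [b"plain", b"with\nnewline", b"\x00nul", b"\xffinvalid utf-8"]

# Encode each entry as base64 and store the list as JSON.
blob = json.dumps([base64.b64encode(s).decode("ascii") for s in naughty])

# Decoding recovers the original bytes exactly.
decoded = [base64.b64decode(x) for x in json.loads(blob)]
assert decoded == naughty
```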
Fair question. The encoding is UTF-8. This is fine for the time being, since UTF-8 is ubiquitous.
I had it set as UTF-16 for the two-byte characters when first writing it, but that caused issues. If there is demand, a second list can be added.
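The practical difference is easy to demonstrate: the same text yields entirely different bytes under the two encodings, and a strict UTF-8 decoder rejects UTF-16 output outright (a small Python sketch):

```python
s = "naïve ✓"

# Same text, different bytes on disk.
utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16")
assert utf8_bytes != utf16_bytes

# Strict UTF-8 decoding rejects the UTF-16 byte stream (the BOM byte
# 0xFF is never a valid UTF-8 lead byte).
try:
    utf16_bytes.decode("utf-8")
    cross_decoded = True
except UnicodeDecodeError:
    cross_decoded = False
assert cross_decoded is False
```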
For anyone testing web sites: I built a Chrome extension that makes things like this available in the right-click menu [1]
The code is on GitHub, so it can be easily extended [2]
Yeah, the other exploit strings do innocuous stuff like putting up javascript alerts or touching files, but the SQL injection ones aren't innocuous at all. I wonder if there's something better to replace those with. Something like `1'; CREATE TABLE blns ...--` would be more akin to what the shell exploits do.
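For what it's worth, the reason these strings belong in a test corpus at all is that parameterized queries treat them as plain data; a sketch with an in-memory SQLite database (table and column names invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

payload = "1'; DROP TABLE users--"
# Bound as a parameter, the payload is stored verbatim; no SQL runs.
conn.execute("INSERT INTO users (name) VALUES (?)", (payload,))

# The table still exists and the string came back intact.
stored = conn.execute("SELECT name FROM users").fetchone()[0]
assert stored == payload
```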
The list seems to be missing the simplest naughty string of all: The empty string!
(Well, the text file has empty lines separating the comments and example strings so it technically includes the empty string, but it's not in the JSON file.)
Is the scope just well-formed strings or would you consider adding binary nasties like null bytes, mal-encoded characters, or even just newlines on their own?
What about XML billion laughs strings, or parser-busting very long runs of parentheses?
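For reference, the billion-laughs trick nests entity definitions so a few kilobytes of XML expand to gigabytes when parsed; here is a two-level miniature of the payload, shown purely as data (the real version goes roughly ten levels deep, and feeding it to a parser without entity-expansion limits is what causes the blow-up):

```python
# Each &lol2; reference expands to ten &lol;s; stacking ~10 such
# levels yields on the order of 10^9 copies of "lol".
billion_laughs = """<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
]>
<lolz>&lol2;</lolz>"""

# Expansion factor of this miniature: ten "lol"s from one reference.
assert billion_laughs.count("&lol;") == 10
```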
Nice; sort of a programming complement to Shutterstock's _List of Dirty, Naughty, Obscene, and Otherwise Bad Words_[0]. So helpful to have a bunch of minds working on useful lists like this. Good to see that GitHub passes this test!
I worked on a swear filter at a previous job. Not quite sure how this list could benefit anyone unless you are matching a whole string (e.g. the title of a photo) rather than individual words within the title.
There are so many creative ways to get around swearing. Replace letters with numbers, drop consonants and vowels. And you almost always need to check for word boundaries otherwise somebody from Scunthorpe might be upset you banned them. And then there are cases where word boundaries aren't enough. Good luck ;-)
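The word-boundary point is easy to illustrate: a naive substring check flags innocent words, while a boundary-anchored pattern does not (a toy Python sketch with a mild example word, not from any real filter list):

```python
import re

banned = "ass"
text = "a classic assessment of brass instruments"

# Naive substring matching: false positive on every one of these words.
assert banned in text

# Word-boundary matching: no standalone occurrence, so no false positive.
assert re.search(rf"\b{re.escape(banned)}\b", text) is None
```

As the comment notes, boundaries alone still aren't enough: deliberate misspellings, digit substitutions, and spacing tricks all sail past this pattern.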
I doubt the value of this repository. The first "naughty" French word, "allumé", can't be considered naughty, dirty, or bad, like, at all. And many others aren't naughty in most circumstances...
Except for a very few swear words, word filtering is pretty much useless.
* How could this be used to test 'corrupt' characters? Isn't the process of saving the file itself as UTF-8 going to un-corrupt... the file?
* Is there some recommended way to group these into "strings that should pass validation" versus "strings that should fail"... or is that too application-specific?
If you really intend this for use in testing, I'd suggest making the injections less nasty. I could easily see a junior dev slapping this in and deleting some important stuff.
I'd also add more invalid UTF encodings and embedded null bytes, etc. The JSON format would be preferable to plain text for that though.
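A few classically malformed UTF-8 sequences that such a section could include; any strict decoder must reject all of them (sketch in Python, whose UTF-8 codec is strict by default):

```python
invalid = [
    b"\xc0\xaf",          # overlong encoding of '/'
    b"\x80",              # stray continuation byte
    b"\xe2\x82",          # truncated three-byte sequence
    b"\xed\xa0\x80",      # encoded surrogate U+D800
    b"\xf5\x80\x80\x80",  # lead byte beyond U+10FFFF
]

for seq in invalid:
    try:
        seq.decode("utf-8")
        raise AssertionError(f"decoder accepted {seq!r}")
    except UnicodeDecodeError:
        pass  # expected: a strict decoder rejects each sequence
```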
/dev/urandom can also be used as a source of random and unusual input data: given enough output it will produce all 256 byte values, all 65536 2-byte sequences, all 16M 3-byte sequences, and so on, and will eventually (with probability 1) emit every possible finite string.
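As a quick illustration of why random bytes make good text-handling tests: a few kilobytes of random data are essentially never valid UTF-8, so they exercise every decoder's error path (sketch using os.urandom, which draws from the same pool as /dev/urandom):

```python
import os

sample = os.urandom(4096)

# 4 KiB of uniformly random bytes is valid UTF-8 only with vanishing
# probability, so a strict decode is all but guaranteed to fail.
try:
    sample.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
assert is_valid_utf8 is False
```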
I absolutely love strange Unicode strings. They're handy if you ever want to find out what a server's running. One time, I put a bunch of emoji in a GET param of a Google site and got back a big Java error page. I had no idea Google ran Java.
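Emoji reach a server by percent-encoding their UTF-8 bytes in the query string, and whether the server decodes them back correctly is exactly what such a probe tests; a round-trip sketch with Python's urllib:

```python
from urllib.parse import urlencode, parse_qs

# Each emoji becomes four percent-encoded UTF-8 bytes on the wire.
qs = urlencode({"q": "😀🔥"})
assert "%F0%9F%98%80" in qs  # U+1F600 GRINNING FACE as UTF-8

# A well-behaved server-side parser recovers the original text.
assert parse_qs(qs)["q"] == ["😀🔥"]
```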
`/dev/null; rm -rf /*; echo "` That's a little aggressive for testing, no?
`1;DROP TABLE users` and `1'; DROP TABLE users--`
Seems a bit hairy to have those in there, in case someone tries to run these tests on their prod environment.
Is it naughty to include it here?
[0] https://en.wikipedia.org/wiki/EICAR_test_file
https://code.google.com/p/fuzzdb/
Fuzz lists are to web pentesters what drain snakes are to plumbers.
As other commenters noted, strings like `DROP TABLE` should be used with caution!
[1] - https://chrome.google.com/webstore/detail/bug-magnet/efhedld...
[2] - https://github.com/gojko/bugmagnet
Edit: Found this two minutes later: https://github.com/googlei18n/libphonenumber, seems to be an official Google product and Apache licensed.
I only added what was off the top of my head for those sections; the list will be continually updated.
בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ (Genesis 1:1, "In the beginning God created the heavens and the earth": right-to-left pointed Hebrew, full of combining marks)
[0] https://github.com/shutterstock/List-of-Dirty-Naughty-Obscen...
"Eventually" being the key word here. Fuzzing with purely random inputs will take eons to actually reveal non-trivial bugs...
Edit: Another one that tends to be fun is [] in the param, like http://example.com/?get[]=[].
And you can put things inside, like http://example.com/?get['"%05<!]=[%FE%FF]
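How those bracket parameters parse depends entirely on the stack: PHP builds an array from repeated `get[]` keys, while, for instance, Python's urllib treats the brackets as literal key characters (a sketch of the latter, with the PHP behavior noted in a comment):

```python
from urllib.parse import parse_qs

# PHP would expose this as $_GET['get'] = ['a', 'b']; Python's parser
# instead keeps "get[]" as an opaque key name.
params = parse_qs("get[]=a&get[]=b")
assert params == {"get[]": ["a", "b"]}
```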