The Big List of Naughty Strings

Last year, I was working on some database-related features and found a weird message.

The message read thus:

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

Confused, I looked around me. Was this a prank?

I fired up my trusty search engine and found out that the aforementioned text is but one entry of something called The Big List of Naughty Strings – "a list of strings which have a high probability of causing issues when used as user-input data."

Repository: minimaxir/big-list-of-naughty-strings

Input Validation, also known as data validation or data sanitization, is a practice that consists of pre-processing and preparing any possible user-generated input so that it'll be handled gracefully and appropriately by the software.

Since the Big List of Naughty Strings (BLNS) contains a wide variety of string and character samples, it can be used to test for a wide range of problems at once.
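As a rough sketch of what such a test could look like, the snippet below feeds every string from the list to a function and records which ones make it raise. It assumes the list is available as plain text with one string per line, where empty lines and lines starting with `#` are not test strings. The names `parse_naughty_strings` and `find_failures` are made up for this illustration and are not part of the BLNS project.

```python
def parse_naughty_strings(raw_text):
    """Return the test strings from a BLNS-style text blob.

    Assumption: one string per line; empty lines and lines starting
    with '#' are section comments rather than test strings.
    """
    strings = []
    # Split on "\n" only, so unusual Unicode line separators inside a
    # test string are not treated as line breaks.
    for line in raw_text.split("\n"):
        if line == "" or line.startswith("#"):
            continue
        strings.append(line)
    return strings


def find_failures(handler, strings):
    """Call handler on every string and collect the ones that raise."""
    failures = []
    for text in strings:
        try:
            handler(text)
        except Exception as exc:
            # Record the input and the exception type for later triage.
            failures.append((text, type(exc).__name__))
    return failures


# A tiny inline sample standing in for the real list.
sample = "#\tReserved Strings\n\nundefined\nnull\n\n#\tNumeric Strings\n\n0\n1/0\n"
naughty = parse_naughty_strings(sample)

# Hypothetical code under test: a handler that expects an integer.
failures = find_failures(int, naughty)
```

In a real project, `handler` would be whatever entry point accepts user input, and a crash is only one kind of failure; checking the returned value or the rendered output would catch more.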

Issues caused by bad input range from mild to critical. For instance, unexpected characters could cause the software to crash.

Another major and infamous kind of problem is SQL injection: the execution of malicious database calls via fields exposed in the front end of the application.

Her daughter is named Help I'm trapped in a driver's license factory.
XKCD #327 – Exploits of a Mom
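To make the idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The `users` table and both lookup functions are invented for the example. The first function splices user input into the SQL text, while the second passes it as a bound parameter.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL statement itself.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_user_safe(conn, name):
    # Parameterized: the input is bound as data and never parsed as SQL.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# A classic injection payload: it closes the quote and adds a condition
# that is always true.
payload = "' OR '1'='1"

unsafe_rows = find_user_unsafe(conn, payload)  # matches every row
safe_rows = find_user_safe(conn, payload)      # matches no row
```

With the unsafe version the payload rewrites the `WHERE` clause and every user is returned. With the parameterized version the same text is simply treated as an unusual name that nobody has.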

Data validation is not limited exclusively to the detection of malicious exploits, though. For example, a user's input could result in involuntary errors in the software; sometimes, this results in UI problems.

A user's input can also be falsely flagged as invalid. A well-known case is the Scunthorpe problem, where real-world names or words are misidentified as offensive language.
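The sketch below shows how this kind of false positive can arise. It uses a deliberately mild, made-up one-word blocklist and two hypothetical filters: one that looks for blocked words anywhere inside the text, and one that only compares whole words. The second is not a complete fix, just an illustration of why plain substring matching misfires.

```python
import re

# Hypothetical blocklist, kept mild for the sake of the example.
BLOCKLIST = {"hell"}


def naive_is_offensive(text):
    # Flags the text if a blocked word appears anywhere, even inside
    # a longer, harmless word.
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def whole_word_is_offensive(text):
    # Splits the text into runs of letters and compares whole words only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BLOCKLIST for token in tokens)
```

Here `naive_is_offensive("Michelle")` returns `True` because the name happens to contain the blocked letters, while `whole_word_is_offensive("Michelle")` returns `False`. Both filters still flag the blocked word when it stands alone.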

Today, a lack of support for multi-byte characters (for example, text encoded as UTF-8) is another common pitfall. Software developed to support only ASCII (or a single-byte "ANSI" code page) will likely fail as soon as multi-byte characters are introduced. With the globalisation of software and the rise in popularity of emoji, it is necessary to adopt better standards.
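A short sketch of why the ASCII assumption breaks: in UTF-8, one character may take anywhere from one to four bytes, so "number of characters" and "number of bytes" stop being the same thing. The helper names below are made up for this example.

```python
def char_and_byte_length(text):
    # Number of Unicode code points vs. number of bytes once encoded as UTF-8.
    return len(text), len(text.encode("utf-8"))


def fits_in_ascii(text):
    # True only if every character can be represented in 7-bit ASCII.
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


plain = "naive"
accented = "na\u00efve"   # the same word with an i-diaeresis (U+00EF)
emoji = "\U0001F600"      # a single grinning-face emoji
```

For `plain` both lengths are 5. For `accented` the result is 5 characters but 6 bytes, and the single `emoji` occupies 4 bytes. Code that sizes buffers or database columns by character count while storing bytes, or that assumes every input passes `fits_in_ascii`, will fail on such input.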

So, next time you're developing software, make sure to test it against the Big List of Naughty Strings.