For most serious applications, you will often have to have persistent storage (storage that still exists after your applications stops running) of some sort. For new developers, it can be quite daunting to decide which option to go for. Is a simple flat file enough? When should you use something like a database? Which database should you use? There are so many options that are available it becomes quite daunting to decide which way to go for.
This is a starting guide to provide an overview of some of the many data storage options that are available for you and how you can go about deciding. One thing to keep in mind is that if you are developing an application which is either planned or has a possibility to scale over time, your underlying database might also grow overtime. It may be quick and easy to implement a file as storage, but as your data grows it might be better to use a relational database but it will take a little bit more effort. Let’s look at this a bit deeper
What are the possible ways to store data?
There are many methods of persistent storage that you can use (persistent storage means that after your program is finished running your data is not lost). The typical ways you can do this is either by using a file which you save data to, or by using the python pickle mechanism. Firstly I will explain what some of the persistent storage options are:
- File: This is where you store the data in a text based file in format such as CSV (comma separated values), JSON, and others
- Python Pickle: A python pickle is a mechanism where you can save a data structure directly to a file, and then you can retrieve the data directly from the file next time you run your program. You can do this with a library called “pickle”
- Config files: config files are similar to File and Python Pickle in that the data is stored in a file format but is intended to be directly edited by a user
- Database SQLite: this is a database where you can run queries to search for data, but the data is stored in a file
- Database Postgres (or other SQL based database): this is a database service where there’s another program that you run to manage the database, and you call functions (or SQL queries) on the database service to get the data back in an efficient manner. SQL based databases are great for structured data – e.g. table-like/excel-like data. You would search for data by category fields as an example
- Key-value database (e.g redis is one of the most famous): A key-value database is exactly that, it contains a database where you search by a key, and then it returns a value. This value can be a single value or it can be a set of fields that are associated with that value. A common use of a key-value database is for hash-based data. Meaning that you have a specific key that you want to search for, and then you get all the related fields associated with that key – much like a dictionary in python, but the benefit being its in a persistent storage
- Graph Database (e.g. Neo4J): A graph database stores data which is built to navigate relationships. This is something that is rather cumbersome to do in a relational database where you need to have many intermediary tables but becomes trivial with GraphQL language
- Text Search (e.g. Elastic Search): A purpose built database for text search which is extremely fast when searching for strings or long text
- Time series database (e.g. influx): For IoT data where each record is stored with a timestamp key and you need to do queries in time blocks, time series databases are ideal. You can do common operations such as to aggregate, search, slice data through specific query operations
- NOSQL document database (e.g. mongodb, couchdb): this is a database that also runs as a separate service but is specifically for “unstructured data” (non-table like data) such as text, images where you search for records in a free form way such as by text strings.
There is no one persistent storage mechanism that fits all, it really depends on your purpose (or “use case”) to determine which database works best for you as there are pros and cons for each.
|Setup||Editable outside Python||Volume||Read Speed||Write Speed||Inbuilt Redundancy|
|File||None – you can create a file in your python code||For text based||Small||Slow||Slow||No – manual|
|Python Pickle||None- you can create this in your python code||No – only in python||Small||Slow||Slow||No – manual|
|Config File||Optional. You can create a config file before hand||Yes – you can use any text based editor||Small||Slow||Slow||No – manual|
|Database SQLite||None – database created automatically||No – only in python||Small-Med||Slow-Med||Slow-Med||No – manual|
|Relational SQL Database||Separate installation of server||Through the SQL console or other SQL clients||Large||Fast||Fast||Yes, require extra setup|
|NoSQL Column Database||Separate installation of server||Yes, through external client||Very large||Very fast||Very fast||Yes, inbuilt|
|Key-Value database||Separate installation of server||Yes, through external client||Very large||Very fast||Fast-Very Fast||Yes, require extra setup|
|Graph Database||Separate installation of serverSeparate installation of server||Yes, through external client||Large||Med||Med||Yes, require extra setup|
|Time Series Database||Separate installation of server||Yes, through external client||Very large||Very fast||Fast||Yes, require extra setup|
|Text Search Database||Separate installation of server||Yes, through external client||Very large||Very fast||Fast||Yes, require extra setup|
|NoSQL Documet DB||Separate installation of server||Yes, through external client||Very large||Very fast||Fast||Yes, require extra setup|
A big disclaimer here, for some of the responses, the more accurate answer is “it depends”. For example, for redundancy for relational databases, some have it inbuilt such as Oracle RAC enterprise databases and for others you can set up redundancy where you could have an infrastructure solution. However, to provide a simpler guidance, I’ve made this a bit more prescriptive. If you would like to dive deeper, then please don’t rely purely on the table above! Look into the documentation of the particular database product you are considering or reach out to me and I’m happy to provide some advice.
There are in fact plenty of SaaS-based options for database or persistent storage that are popping up which is exciting. These newer SaaS options (for example, firebase, restdb.io, anvil.works etc) are great in that they save you time on the heavy lifting, but then there may be times you still want to manage your own database. This may be because you want to keep your data yourself, or simply because you want to save costs as you already have an environment either on your own laptop, or you’re paying a fixed price for a virtual machine. Hence, managing your own persistent storage may be more cost effective rather than paying for another SaaS. However, certainly don’t discount the SaaS options altogether, as they will at least help you with things like backups, security updates etc for you.