I have a database that will receive around 2 000 000 new rows per day into a data table. I'm currently testing this on MySQL, but I'm open to suggestions and different approaches.
Nature of the data is:
- text label (up to 200 characters)
- value (int)
Each data row belongs to a category; categories are divided by country, category and type. A category is either completed or not (meaning the data in it is complete). My table schema is here: pastebin. All tables are InnoDB (although I'm not using foreign keys).
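In case the link is inconvenient, here is a simplified sketch of the tables, reconstructed from the query below - column types are approximate and primary keys are omitted; the pastebin has the exact schema:

```sql
-- Simplified sketch only; exact types, primary keys and constraints are in the pastebin schema.
CREATE TABLE ref (
  id    BINARY(16)   NOT NULL PRIMARY KEY,   -- UNHEX(MD5(label))
  label VARCHAR(200) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE data (
  data_id      BINARY(16)  NOT NULL,         -- same hash as ref.id (no FK constraint)
  country_code CHAR(2)     NOT NULL,
  category     VARCHAR(50) NOT NULL,
  type         VARCHAR(50) NOT NULL,
  date_checked DATETIME    NOT NULL,
  value        INT         NOT NULL
) ENGINE=InnoDB;

CREATE TABLE data_categories (
  country_code CHAR(2)     NOT NULL,
  category     VARCHAR(50) NOT NULL,
  type         VARCHAR(50) NOT NULL,
  positions    INT         NOT NULL,
  completed    TINYINT(1)  NOT NULL,
  date_checked DATETIME    NOT NULL
) ENGINE=InnoDB;
```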
The query that will be used most on the data is:
select ref.label, data.date_checked, cats.country_code, cats.category, cats.type,
       cats.positions, cats.completed, data.value
from ref ref
join data data
  on data.data_id = ref.id
join data_categories cats
  on cats.country_code = data.country_code
 and cats.category = data.category
 and cats.type = data.type
 and date(cats.date_checked) = date(data.date_checked)
where ref.label = 'my data label'
ref.id and data.data_id are of type binary(16) and store the unhexed MD5 hash of the label; labels recur throughout the data set - 189759 unique labels across 3877629 data items - and there won't be many new labels.
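In case it matters, the ids are generated roughly like this (a sketch - the real inserts happen in the application, and the values below are placeholders):

```sql
-- Sketch of how a label gets its binary(16) id (the application does the equivalent).
INSERT INTO ref (id, label)
VALUES (UNHEX(MD5('my data label')), 'my data label');

-- Data rows carry the same hash, so they join back to ref without storing the label text.
INSERT INTO data (data_id, country_code, category, type, date_checked, value)
VALUES (UNHEX(MD5('my data label')), 'US', 'some category', 'some type', NOW(), 42);
```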
Indexes, besides the primary keys (DDL sketch after the list):
- data.data_id - non unique
- ref.label - unique
- (data_categories.country_code, data_categories.category, data_categories.type) - compound, non unique
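In DDL terms, roughly the following (the index names here are just placeholders, not what I actually used):

```sql
-- Sketch of the secondary indexes listed above; index names are placeholders.
ALTER TABLE data            ADD INDEX idx_data_id (data_id);                   -- non-unique
ALTER TABLE ref             ADD UNIQUE INDEX idx_label (label);                -- unique
ALTER TABLE data_categories ADD INDEX idx_cat (country_code, category, type);  -- compound, non-unique
```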
Here's the EXPLAIN output for the query (pastebin raw) and the profiling info from phpMyAdmin (imgur). The query takes ~5 ms to execute over 3877629 data items on a freshly restarted database with no other load on the machine - but that's my PC with an SSD, etc.; my target environment won't be operating under such ideal conditions.
5 ms is good, far better than I expected for almost 4 million data items. But that's only two days' worth of data, so what will happen in 6 months? Are there any obvious bottlenecks here, anything I could have done better? Maybe MySQL is not really suited for this and I should be using something else? Unfortunately, paid solutions are out of the question for me.
One obvious optimization would be to normalize the category data. I took the current approach because the data table now contains all the relevant information: if the application querying the database knows the exact label and doesn't need the category data, it can query the data table alone (see the sketch below).
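By that I mean something like this (a sketch of the idea, not a query I've benchmarked):

```sql
-- Sketch: if the application knows the label, it can compute the hash itself
-- and hit the data table directly, skipping both joins.
SELECT date_checked, country_code, category, type, value
FROM data
WHERE data_id = UNHEX(MD5('my data label'));
```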
I know it's a long post, but I didn't know how to explain myself better. I would greatly appreciate any advice on the issue; it's been a while since I've played with databases.
tl;dr: there's a lot of data, I'm worried whether there's anything I could do better - help me avoid a headache in 6 months.