My fun journey of managing a large PostgreSQL table

Recently, I have had a chance to work on a very large data table. By very large, I mean 1.5 billion rows up to this point, growing by about 300,000 rows every single minute.

The table takes in a constant stream of data from IoT Edge devices, and I periodically run a data consolidation script (let's call it the consolidator) to aggregate the stream into per-minute data points.

The consolidator, at first, runs reasonably fast for the table's size: it takes about a minute to run a SELECT/GROUP BY/ORDER BY query. One might be surprised that a minute counts as fast given the number of rows in the table. The reason is that I do use an index to keep track of the data. Let me take you with me through my journey to the data wonderland.

For the reader who just wants the TL;DR:

  1. Try indexing the columns that are used in conditional clauses.
  2. Think about the nature of your data. If it has an intrinsic order, using BRIN might help.
  3. If you are using GROUP BY DATE_TRUNC or a similar aggregate expression, you might as well index the expression's result (see the sketch right after this list).
  4. Every time rows are inserted, the indices are also updated.
  5. SQL is declarative, so the end result of GROUP BY is order-independent. But indexing is not, and you might need to start the GROUP BY with whichever column is indexed.
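
As a minimal sketch of point 3, using the data_table and data_timestamp names from the queries further down (the index name itself is hypothetical):

-- Hypothetical expression index on the per-minute truncated timestamp.
-- Caveat: DATE_TRUNC('minute', ts) is only IMMUTABLE for timestamp
-- without time zone; for timestamptz, PostgreSQL 14+ offers the
-- three-argument form DATE_TRUNC('minute', ts, 'UTC') instead.
CREATE INDEX idx_data_minute ON data_table (DATE_TRUNC('minute', data_timestamp));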

At the very first step, for anyone who works with a big data table, indexing is the first go-to solution. Indexing basically rearranges your sequential data into a data structure that is more search-friendly in terms of time complexity. Postgres automatically creates indices for unique fields, including primary key fields. The default index data structure is the B-tree, a balanced tree with 8KB pages, which keeps the depth at just about 3 levels even when the number of records runs into the billions!
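
Creating one by hand is a one-liner; a sketch, again assuming the data_table and data_timestamp names from the queries below (the index name is mine):

-- A plain B-tree index on the timestamp column used in the WHERE clause.
CREATE INDEX idx_data_ts ON data_table USING btree (data_timestamp);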

I found this to be a great resource explaining indexing, if you have some spare time.

Since there is no free lunch, we sacrifice space in the time-space trade-off. The space used for indexing? About 11GB per index!
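
You can check this yourself with PostgreSQL's built-in size helpers; a sketch using the hypothetical index name from above:

-- Human-readable on-disk size of a single index.
SELECT pg_size_pretty(pg_relation_size('idx_data_ts'));

-- Total size of all indexes on the table.
SELECT pg_size_pretty(pg_indexes_size('data_table'));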


Luckily, the data is a time series. This enables another option for indexing: the Block Range Index (aka BRIN). It requires minimal space, just over 4kB.
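
Switching to BRIN is just as short; a sketch with the same assumed names (pages_per_range is shown at its default value just to make the knob visible):

-- BRIN stores only per-block-range summaries (min/max), hence the tiny footprint.
CREATE INDEX idx_data_ts_brin ON data_table USING brin (data_timestamp)
WITH (pages_per_range = 128);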


This index type is based on the assumption that a column's values are well-correlated with the physical order of the table's rows. In my specific case, the order of the data is well-correlated: the timestamp of each measurement grows as rows are appended. And what kind of order is NOT well-correlated, you might ask? Order that is artificially created. For example, user IDs 0001, 0002, 0003, … tell you nothing besides the order itself; they don't naturally correlate with the rows' physical placement.
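
Postgres actually tracks this physical correlation in its statistics, so you can verify the assumption before committing to BRIN; values close to 1.0 (or -1.0) make good BRIN candidates:

-- Physical-order correlation of the column, as estimated by ANALYZE.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'data_table' AND attname = 'data_timestamp';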

So, all I need is to go ahead and implement the BRIN index, then go enjoy life in the lavender field? No, not so fast…

My SQL query looks like this:

SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);

And this query took about 14 minutes. Not ideal, but compare it to this:

SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY DATE_TRUNC('minute', data_timestamp), data2;

That one ran for more than 12 hours before I realized that, perhaps, I'd rather give up waiting.

So, what's the difference here? What makes the same query (from a declarative perspective) take so much longer with a different order of GROUP BY columns?

  • Aggregation is expensive.
  • You need to be careful about indexes when you want to optimize an aggregation query. You can see what the planner actually does in each case with EXPLAIN, as sketched below, comparing the plan under the wrong index against the plan under the DATE_TRUNC expression index.
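
A sketch of how to inspect the plans yourself, reusing the query from above:

-- EXPLAIN (ANALYZE, BUFFERS) executes the query for real, so try it on the
-- fast variant; look for an Index Scan or Bitmap Heap Scan versus a Seq Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);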

Remember that the table I am trying to optimize takes a constant stream of data from IoT devices? That implies the DB engine will, most of the time, be performing INSERT operations.

This is taken from the official PostgreSQL documentation:

Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will use the index in queries when it thinks doing so would be more efficient than a sequential table scan.

The keyword here is "update"! Meaning, the index needs to be rearranged every time new data is inserted. That doesn't sound too bad for the BRIN index type. However, to keep the query planner making educated decisions, the statistics also have to stay fresh: either the system runs the ANALYZE command for you (via autovacuum), or you have to run it periodically yourself.
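
A sketch of doing that by hand, plus a quick way to check when the statistics were last refreshed:

-- Refresh planner statistics for the table.
ANALYZE data_table;

-- When was the table last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'data_table';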

In conclusion, if you are working with big data in PostgreSQL, you may very well need to consider the nature of your application seriously. Choosing the right index can help speed things up, and a BRIN index can help keep the index size in check. However, you should double-check with the query planner that it does what you expect.

The next article is about how I managed to break the large table (and its indexes) down into partitions using pg_partman.
