Generate Time Series Data with TimescaleDB

Alessandro Mengoli

Generate Time Series Data with TimescaleDB

When developing applications that work with time series data, we often need test data to validate our system’s functionality. Creating this data manually can be time-consuming and prone to errors. Moreover, having a good amount of test data is crucial for performance testing and identifying potential issues before going into production.

Let’s explore a TimescaleDB feature that can help us generate test data efficiently. Using the generate_series function, we can create time-based sequences that can be enriched with random or calculated values. This approach allows us to quickly generate days, months, or even years of data with the desired time interval.

To demonstrate this concept, we’ll create a practical example: generating financial data for the last 6 months with readings every 5 minutes. This could be useful when developing trading platforms, financial analysis tools, or monitoring dashboards.

First, let’s create our table structure:

CREATE TABLE stock_prices (
    time TIMESTAMPTZ NOT NULL,
    symbol TEXT NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    volume INTEGER NOT NULL
);

-- Convert to hypertable for better time series management
SELECT create_hypertable('stock_prices', 'time');

Now, let’s generate our test data:

WITH params AS (
  SELECT 
    100.00::decimal as initial_price,  -- Starting price for our stock
    0.02 as volatility                 -- Price variation factor
),
time_series AS (
  SELECT 
    generate_series(
      NOW() - INTERVAL '6 months',
      NOW(),
      INTERVAL '5 minutes'
    ) AS time
)
INSERT INTO stock_prices
SELECT 
  time,
  'DEMO' as symbol,
  -- Generate price with random variations
  (
    params.initial_price * (1 + params.volatility * (random() - 0.5))
  )::decimal(10,2) as price,
  -- Simulate trading volume
  floor(random() * 10000)::integer as volume
FROM time_series, params;

Let’s examine what we’ve created with a few queries:

-- Check the latest data points
SELECT time, symbol, price, volume 
FROM stock_prices 
ORDER BY time DESC 
LIMIT 5;

-- Get daily statistics
SELECT 
    date_trunc('day', time) as trading_day,
    avg(price) as avg_price,
    max(price) as high_price,
    min(price) as low_price,
    sum(volume) as total_volume
FROM stock_prices
GROUP BY trading_day
ORDER BY trading_day DESC
LIMIT 5;

With minimal effort, we’ve created a substantial dataset that we can use for testing our applications. While this example focuses on financial data, the same approach can be applied to any time series data you need to generate, such as sensor readings, system metrics, or user activity logs.

The beauty of this approach lies in its simplicity and flexibility. By adjusting the time interval, data range, and value generation logic, you can quickly create different scenarios for your testing needs. This allows you to focus on developing and testing your application’s functionality rather than spending time on manual data creation.

Advanced generate_series Configuration

The generate_series function offers several advanced configuration parameters that give us more control over the generated data. The basic syntax accepts parameters for start timestamp, end timestamp, and the interval between values. Additionally, you can control the series growth pattern through the optional growth parameter.

-- Basic syntax demonstration
generate_series(
    start,    -- Start timestamp
    stop,     -- End timestamp
    step,     -- Interval between values
    growth    -- Optional: Control series growth pattern
)

-- Examples with different configurations
SELECT * FROM generate_series(
    '2024-01-01'::timestamp,
    '2024-01-02'::timestamp,
    '1 hour'
);

When working with real-world scenarios, you might need to generate data that follows specific business hours or patterns. Here’s how you can generate data only for trading hours:

-- Generate only during trading hours (9:30 AM - 4:00 PM)
SELECT * FROM generate_series(
    '2024-01-01 09:30:00'::timestamp,
    '2024-01-01 16:00:00'::timestamp,
    '5 minutes'
) AS t(time)
WHERE EXTRACT(HOUR FROM time) BETWEEN 9 AND 16;

Complex Data Patterns

Real-world time series data often follows cyclical patterns or seasonal trends. For example, temperature readings typically follow daily and yearly cycles, while retail sales might show weekly patterns and holiday season peaks. We can simulate these patterns by combining TimescaleDB’s generate_series with mathematical functions.

Here’s how we can generate temperature data that follows a daily cycle:

WITH time_series AS (
  SELECT generate_series(
    '2024-01-01'::timestamp,
    '2024-01-07'::timestamp,
    '1 hour'
  ) AS time
)
SELECT 
  time,
  20 + 5 * sin((EXTRACT(HOUR FROM time)::float / 24) * 2 * pi()) AS temperature
FROM time_series;

This query generates a week of hourly temperature readings that follow a sinusoidal pattern, simulating daily temperature variations. The base temperature is 20 degrees, with a 5-degree amplitude variation throughout the day.

For more complex patterns that include both seasonal trends and random variations, we can layer multiple periodic functions:

WITH RECURSIVE dates AS (
  SELECT generate_series(
    '2024-01-01'::timestamp,
    '2024-12-31'::timestamp,
    '1 day'
  ) AS date
)
SELECT 
  date,
  100 + (25 * sin((EXTRACT(DOY FROM date)::float / 365) * 2 * pi())) AS base_value,
  random() * 10 AS noise
FROM dates;

When generating large volumes of test data, performance becomes a crucial consideration. TimescaleDB offers several optimization techniques for efficient data generation and insertion. One effective approach is to use batch inserts rather than inserting rows individually:

INSERT INTO measurements 
SELECT * FROM generate_series(
    NOW() - INTERVAL '1 year',
    NOW(),
    '1 minute'
) AS time
CROSS JOIN LATERAL (
    SELECT generate_series(1, 1000) AS sensor_id
) AS sensors;

Data validation is another crucial aspect of test data generation. It’s important to verify that the generated data maintains the expected distribution and characteristics. Here’s a query that helps analyze the distribution of generated values:

SELECT 
    width_bucket(price, min(price), max(price), 10) as bucket,
    count(*),
    min(price) as min_price,
    max(price) as max_price
FROM stock_prices
GROUP BY bucket
ORDER BY bucket;

Special attention should be paid to edge cases and temporal anomalies. For instance, when generating time series data that spans daylight saving time transitions, you need to handle timezone changes correctly:

-- Handle timezone transitions
SELECT generate_series(
    '2024-03-10 01:00:00-05'::timestamptz,
    '2024-03-10 03:00:00-04'::timestamptz,
    '30 minutes'
)::time with time zone;

The step parameter in generate_series accepts various interval formats, from microseconds to years, providing the flexibility to generate data at any granularity needed for your use case. When choosing the appropriate interval, consider both the requirements of your application and the performance implications of the data volume you’ll be generating.

Remember that while having comprehensive test data is important, it’s equally crucial to ensure that the data generation process itself is maintainable and well-documented. Include clear comments explaining any complex patterns or business rules, and consider creating reusable functions for commonly needed data generation patterns.

By leveraging these advanced techniques in TimescaleDB, you can create realistic test datasets that help validate your application’s behavior under various scenarios while maintaining good performance and maintainability.

#timescaledb
#postgresql
#testing
#data-generation