LLMs reasoning on time series data in 2026

Jan 26, 2026

I recently read this 2024 paper on using LLMs to understand time-series data. The paper outlines three different tasks to evaluate an LLM’s ability to reason about some time series:

  1. etiological reasoning
  2. question answering
  3. context-aided forecasting

In pursuit of a better repertoire of LLM-based tasks and their evals, I decided to re-implement parts of the pipeline in this paper.

In this post, I'll explore just the etiological reasoning task.

The task

A model is given

  1. a time series and
  2. four possible descriptions of the time series.

The model's task is to determine which description matches the time series.

For example, a time series might look like:

[20.88, 20.20, 20.48, 21.12, 20.93, 19.5, ..., 19.97, 19.94, 19.59, 19.96]

And four descriptions such as:

  1. A temperature reading from an industrial freezer over two weeks.
  2. Daily household water usage in liters over 3 months.
  3. Congestion at city choke points is measured every 30 minutes over a week.
  4. A time series tracking daily rainfall in a location across a year.
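To make the setup concrete, here is a rough sketch of how a prompt for this task could be assembled. This is illustrative only, not the exact prompt I used; `series` and `options` are just the raw values and the four candidate descriptions.

```python
# Illustrative sketch of building one evaluation prompt; not the exact prompt
# used in my runs.
def build_prompt(series: list[float], options: list[str]) -> str:
    numbered = "\n".join(f"{i}. {desc}" for i, desc in enumerate(options, start=1))
    return (
        "Here is a time series:\n"
        f"{[round(x, 2) for x in series]}\n\n"
        "Which of the following descriptions best matches it?\n"
        f"{numbered}\n\n"
        "Answer with a single number between 1 and 4."
    )
```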

What the paper found

Model / Baseline    Accuracy
Human annotators    66.1%
Random baseline     25%
GPT4*               33.5%
Llama2-13B          27.8%

Basically, LLMs are pretty bad at this.

(*The paper also cautions that GPT4 generated all of the evaluation data, which is worth keeping in mind when reading its score.)

What I found

Using the dataset from the paper, I ran the task on a few different models (with my own prompts).

  • Llama 2 13B (control, as this model was used in the paper)
  • Llama 3.1 8B (have open source models improved?)
  • Gemini 2.5 Flash (what about closed source models?)

Big Picture

Accuracy Comparison

These accuracy scores are the number of samples a model answered correctly, out of the number of samples it answered with a valid response. The model was asked to output a number between 1 and 4 indicating which numbered description best matched the provided time series; the open source models in particular sometimes responded with something outside this range.
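Concretely, the scoring rule looks something like the sketch below. The parsing heuristic here is my own simplification: a response counts as valid only if a choice from 1 to 4 can be pulled out of it, and accuracy is computed over those valid responses.

```python
import re

# Hypothetical scoring helpers: `responses` are raw model outputs and `labels`
# are the correct option numbers (1-4) for each sample.
def parse_answer(text: str) -> int | None:
    match = re.search(r"\b([1-4])\b", text.strip())
    return int(match.group(1)) if match else None

def score(responses: list[str], labels: list[int]) -> dict[str, float]:
    parsed = [parse_answer(r) for r in responses]
    valid = [(p, y) for p, y in zip(parsed, labels) if p is not None]
    return {
        "valid_rate": len(valid) / len(responses),
        "accuracy": sum(p == y for p, y in valid) / len(valid) if valid else 0.0,
    }
```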

Model Stats Table

Here, we see Llama 3.1 8B with a much higher valid response rate and much less variation in the unique answers it output than Llama 2 13B. Gemini 2.5 Flash always provided a valid answer. It could be argued that playing around with the prompts would improve adherence to the valid answer range for the open models, and I'd be curious to see by how much. But for now, this was something interesting to note.

Also note that only 4500 samples were run on Gemini 2.5 Flash due to cost and rate limits.

Diving Deeper

To try to figure out why the models (especially the open models) are still performing so poorly, I looked at the data from two angles:

  1. Are there any differences between the samples that are correct and the samples that are incorrect?
  2. Do different models perform similarly on the same samples?

Correct vs Incorrect Samples

In our dataset, each sample contains:

  1. a time series
  2. a correct description that accurately describes the time series
  3. three incorrect descriptions that do not describe the time series

There are three different types of descriptions that could have been used: description, description_short, and description_tiny. Here are examples of each:

  • description: A scientist is measuring the temperature in an industrial freezer over two weeks, with a new reading taken every 6 hours. An unexpected power outage occurs in the middle of this period lasting 3 days. This causes the freezer temperature to rise before it is restored and begins dropping again.
  • description_short: A temperature reading from an industrial freezer over two weeks, including a 3-day power outage.
  • description_tiny: Freezer Temperature Time Series

In my experiments, I used the description_short type of description. The plots below show an embedding of each sample's description_short. If the model answered the sample correctly, the point is green; if it answered incorrectly, the point is red.
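Roughly, the plot comes from something like the sketch below. The `samples` structure and the embedding model are assumptions for illustration: each sample is a dict with a description_short string and a per-model correctness flag, and any sentence embedder would do.

```python
# Sketch of the embedding plot. `samples` is a hypothetical list of dicts with a
# "description_short" string and a "correct" boolean for the model under analysis.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary small embedder
emb = encoder.encode([s["description_short"] for s in samples])
xy = PCA(n_components=2).fit_transform(emb)         # project to 2D for plotting

colors = ["green" if s["correct"] else "red" for s in samples]
plt.scatter(xy[:, 0], xy[:, 1], c=colors, s=8, alpha=0.6)
plt.title("description_short embeddings: correct (green) vs incorrect (red)")
plt.show()
```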

Embeddings Plot

There is no clear separation between the red and green points, so there is no immediate indication of a clear semantic difference between the correct and incorrect samples.

As another way of arriving at the same conclusion, I also (1) clustered the embeddings and (2) inspected the percentage of green vs. red points in each cluster. Each cluster contained roughly equal percentages of correct and incorrect samples -- same result.
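The cluster check is equally short; this sketch reuses the hypothetical `samples` and embeddings `emb` from the previous snippet, and k=10 is an arbitrary choice.

```python
# Cluster the same embeddings and report the share of correct answers per cluster.
import numpy as np
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)
correct = np.array([s["correct"] for s in samples])
for k in range(10):
    mask = labels == k
    print(f"cluster {k}: {mask.sum():4d} samples, {correct[mask].mean():.1%} correct")
```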

I also wanted to quickly check a few other possible culprits (see the sketch after this list):

  1. time series length
  2. description length (as a very rough proxy for complexity)
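Both checks boil down to comparing length distributions for correct vs. incorrect samples, along the lines of the sketch below. Time series length is shown; description length works the same way, and the "series" field is a hypothetical name for the raw values.

```python
# Sketch: histogram of time series length for correct vs incorrect samples.
# `samples` is the same hypothetical structure as before, with a "series" field.
import matplotlib.pyplot as plt

correct_lens = [len(s["series"]) for s in samples if s["correct"]]
wrong_lens = [len(s["series"]) for s in samples if not s["correct"]]

plt.hist(correct_lens, bins=50, alpha=0.5, label="correct")
plt.hist(wrong_lens, bins=50, alpha=0.5, label="incorrect")
plt.xlabel("time series length")
plt.ylabel("count")
plt.legend()
plt.show()
```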

Correct vs Incorrect Distributions

The easiest thing to note here is that Llama2 13B seems to show a bias towards shorter time series, while Llama3.1 8B showed no significant length bias.

It is worth noting that these graphs don't distinguish between two possible explanations for the length bias:

  1. longer time series are hard for the model to process
  2. longer time series happen to represent more difficult reasoning problems in this dataset

A deeper analysis into other ways of quantifying the "complexity" of each sample, based on factors such as the time series metrics and the descriptions, would be interesting to explore.

Different models, same samples

Some observations:

Models solve mostly different problems.

~2% of samples were correctly answered by all three models, while ~25% of samples were incorrectly answered by all three models. This could indicate that which samples are difficult varies across models.

Independence between Llama2 13B and Llama3.1 8B

  • P(Llama3 correct | Llama2 correct) = 0.228
  • P(Llama3 correct | Llama2 wrong) = 0.223

This is interesting in combination with:

  • Average time series length of samples only Llama2 13B answered correctly: 162.2
  • Average time series length of samples only Llama3.1 8B answered correctly: 482.8

For comparison:

  • Average time series length of samples only Gemini 2.5 Flash answered correctly: 330.7
  • Average time series length across all samples: 350.8
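These numbers fall out of simple boolean masks over the per-sample results. The sketch below assumes a hypothetical `results` dict of per-model correctness arrays and an aligned `ts_lengths` array; none of these names come from the paper or the dataset.

```python
# Sketch of the overlap analysis. `results` maps model name -> boolean array
# (True where that model answered correctly); `ts_lengths` is aligned with it.
import numpy as np

l2 = results["llama2-13b"]
l3 = results["llama3.1-8b"]
gem = results["gemini-2.5-flash"]

print("all three correct:", (l2 & l3 & gem).mean())
print("all three wrong:  ", (~l2 & ~l3 & ~gem).mean())

print("P(Llama3 correct | Llama2 correct):", l3[l2].mean())
print("P(Llama3 correct | Llama2 wrong):  ", l3[~l2].mean())

print("avg length, only Llama2 correct:", ts_lengths[l2 & ~l3 & ~gem].mean())
print("avg length, only Llama3 correct:", ts_lengths[l3 & ~l2 & ~gem].mean())
```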

Difficulty of samples likely isn't semantic

If we consider how many models got a sample correct as a proxy for difficulty, difficulty doesn't seem to be semantic: each tier of difficulty is similarly distributed across the semantic space.
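The difficulty tiers reuse the masks and the 2D projection from the earlier sketches; the snippet below is again illustrative rather than the exact code I ran.

```python
# Color the earlier 2D projection `xy` by how many models answered each sample
# correctly (0-3), reusing the hypothetical `results` dict from the previous sketch.
import matplotlib.pyplot as plt

difficulty = (
    results["llama2-13b"].astype(int)
    + results["llama3.1-8b"].astype(int)
    + results["gemini-2.5-flash"].astype(int)
)
plt.scatter(xy[:, 0], xy[:, 1], c=difficulty, cmap="viridis", s=8, alpha=0.6)
plt.colorbar(label="number of models correct")
plt.title("embedding colored by difficulty tier")
plt.show()
```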

Difficulty Embeddings

What I'd do next

  • adjust prompts to improve adherence to the valid answer range
  • run all samples on Gemini 2.5 Flash
  • run with different length descriptions (e.g. description_tiny)
  • rerun these experiments to check for consistency
  • explore more open source models
  • explore better ways of quantifying the "difficulty" of a sample (most interesting to me)