DataFrame

In DQL (Data Query Language), DataFrame is a two-dimensional, labeled data structure that can store and process different types of data. It is similar to a table in a relational database, where each column can be a different data type. A DataFrame is typically composed of rows and columns — each row represents a sample in the dataset, and each column represents a feature or attribute.

In DQL, DataFrames can be created in various ways, including reading data from files, querying data from databases, converting from other data structures, etc. Once created, you can perform various operations on the DataFrame, such as filtering, sorting, grouping, merging, etc., for data analysis and processing.

In summary, DataFrame is a very important and commonly used data structure in DQL, providing flexible and powerful functionality that makes data processing and analysis more convenient and efficient.

Creation Methods

Create from Query

d1 = query("select * from klines.future_15m limit 100")
return d1

Create from Dict

dict = {"name": ["js", "go"], "count": [1, 2]}
d = DataFrame(dict)
return d

Create from List of Dicts

dictList = [{ "name": "js", "count": 1 }, { "name": "go", "count": 2 }]
d = DataFrame(dictList)
return d

Create from List

d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d

Properties

`columns`

Returns the list of column names: Type list

d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d.columns # Returns column name list: ["name","count"]

`shape`

Returns the DataFrame shape (length × width): Type (int, int)

d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d.shape # Returns DataFrame shape: (2, 2)

`empty`

Checks whether the DataFrame is empty: Type bool

d = DataFrame(columns=["name","count"])
return d.empty # Empty DataFrame, returns "True"

`has_more`

Data returned by query may not be complete (max 10000 rows). Use this property to check whether there is more content to process or iterate over: Type bool

data = query("select * from klines.future_15m")
return data.has_more # More content exists, returns "True"

Methods

`head`

Returns the first few rows of the DataFrame.

df.head(n) # "n" is the number of rows to return, default is 5

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.head() # Returns first five rows

`tail`

Returns the last few rows of the DataFrame.

df.tail(n) # "n" is the number of rows to return, default is 5

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.tail() # Returns last five rows

`rename`

Renames columns of the DataFrame.

df.rename()

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
d2 = d.rename({ "name": "language" })
return d2.columns # Returns renamed columns: ["language", "count"]

`items`

Returns an items column object, including column names and column data.

df.items()

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.items() # Returns object "dataframe.DataFrameItems"

`iterrows`

Returns an items row object, including rows and row data.

df.iterrows()

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.iterrows() # Returns object "dataframe.DataFrameRows"

`to_list`

Converts the DataFrame to a list.

df.to_list()

# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.to_list() # Returns column List

`append`

Appends a row of data to the original DataFrame.

df.append()

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
d1.append({ "name": "c", "count": 16 })
return d1  # Returns DataFrame with the new row appended

`ffill`

Forward fill missing values.

df.ffill(inplace = False) # "inplace" indicates whether to update in place, default is "False"

# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.ffill() # Missing values filled with the previous non-null value

`bfill`

Backward fill missing values.

df.bfill(inplace = False)

# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.bfill() # Missing values filled with the next non-null value

`fillna`

Fill missing values with a fixed value.

df.fillna(n, inplace = False) # Fill missing values with value "n"

# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.fillna(0) # Fill missing values with 0

`apply`

Apply a function to each column of the DataFrame.

df.apply()

# Example
d1 = DataFrame({"count": [12, 13, 14, 15]})
return d1.apply(lambda x: x * 2) # Returns [24, 26, 28, 30]

`rolling`

Returns a rolling window object dataframe.Rolling based on the specified parameters.

df.rolling(window=n1, min_periods=n2) # "window" is the sliding window size, "min_periods" is the minimum computation range

# Example
d1 = DataFrame({"count": [12, 13, 14, 15]})
return d1.rolling(window=3, min_periods=1)

When time series data is involved, you can use the rolling method for sliding window calculations. You can specify a sliding period, e.g., seconds (s), minutes (m), hours (h), and add a timeline parameter as the time series index.

df.rolling(period, timeline) # "period" is the sliding time period, "timeline" is the time series index

# Example
d1 = DataFrame({
  "count": [4, 6, 1, 4, 7, 1, 5, 8, 2, 9],
  "age": [3, 2, 8, 2, 0, 8, 4, 1, 3, 5],
  "timeline": [
    Timestamp('2013-01-01 09:00:00'),
    Timestamp('2013-01-01 09:00:01'),
    Timestamp('2013-01-01 09:00:02'),
    Timestamp('2013-01-01 09:00:03'),
    Timestamp('2013-01-01 09:00:04'),
    Timestamp('2013-01-01 09:00:05'),
    Timestamp('2013-01-01 09:00:06'),
    Timestamp('2013-01-01 09:00:07'),
    Timestamp('2013-01-01 09:00:08'),
    Timestamp('2013-01-01 09:00:12'),
  ],
})
r1 = d1[["count","age"]].rolling('3s', timeline=d1["timeline"])
return r1# Returns "dataframe.Rolling" property object

`groupby`

Groups the DataFrame by specified criteria and returns a dataframe.DataFrameGroupBy object.

df.groupby()

# Example
d1 = DataFrame({ "name": ["js", "js", "js", "go", "go"], "count": [1, 2, 3, 4, 5], "age": [1, 1, 2, 3, 3] })
g1 = d1.groupby(["name"])
return g1

`resample`

Resamples time series data. Returns a resampling object dataframe.Resampler.

df.resample() # Takes two parameters: the first is the sampling interval, the second is the time index

d1 = DataFrame({
  "timeline": [
    Timestamp('2013-01-01 08:59:51'),
    Timestamp('2013-01-01 08:59:52'),
    Timestamp('2013-01-01 08:59:53'),
    Timestamp('2013-01-01 08:59:54'),
    Timestamp('2013-01-01 09:00:55'),
    Timestamp('2013-01-01 09:00:56'),
    Timestamp('2013-01-01 09:00:57'),
    Timestamp('2013-01-01 09:00:58'),
    Timestamp('2013-01-01 09:00:59'),
    Timestamp('2013-01-01 09:01:00'),
  ],
  "price": [
    4, 6, 1, 4, 7, 1, 5, 8, 2, 9,
  ]
})
r1 = d1[["price"]].resample('5s', timeline=d1["timeline"])
return r1

`sort_values`

Sorts the DataFrame by the values in specified columns.

df.sort_values(['col1'...],  ascending=False na_position="First") # The first parameter specifies which columns to sort the DataFrame by. You can pass a column name string or a list of column names. If multiple column names are passed, the DataFrame will be sorted in the order of these columns, i.e., first by the first column, then by the second when the first is equal, and so on. "ascending" indicates ascending sort, default is "True". "na_position" indicates the position of missing values, options: "First" or "Last"

# Example
d1 = DataFrame({ "name": ["js", "js", "js", "go", "go"], "count": [1, 2, 3, 4, 5], "age": [1, 1, 2, 3, 3] })
d1 = d1.sort_values(by=['name', 'count'])
return d1 # Returns the result sorted by 'name' and 'count' columns

`pivot`

Reshapes the data table by pivoting specified column values into new index and column labels. Takes three parameters:

index: The column used for the new DataFrame's row index. Can be a column name string.
columns: The column used for the new DataFrame's column index. Can be a column name string.
values: The column used to fill the new DataFrame's values. Can be a column name string.

Note: When there are duplicates in the data, directly using the pivot method will raise an error. You need to first use groupby to group and aggregate the data (e.g., mean, sum), then use pivot for reshaping.

df.pivot(index='...', columns='...', values='...')

# Example
data = {
  'date': ['2024-05-01', '2024-05-01', '2024-05-01', '2024-05-01'],
  'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai'],
  'temperature': [25, 20, 26, 21]
}
d1 = DataFrame(data)

# Directly calling pivot will error because of duplicates
assert.fails(lambda: d1.pivot(index='date', columns='city', values='temperature'), 'there are duplicates in the data, please groupby first')

# Need to groupby first
d2 = d1.groupby(['date', 'city']).mean()

# Then pivot
d3 = d2.pivot(index='date', columns='city', values='temperature')

# Check that column names match expectations
assert.eq(d3.columns, ["temperature_Beijing", "temperature_Shanghai"])

`concat`

Used to concatenate multiple data objects along a specified axis. Takes the following parameters:

objs: The objects to merge, which can be a list or dict of Series or DataFrames.
axis: The axis to concatenate along. Defaults to 0 (row-wise). Set to 1 for column-wise.
join: The merge mode. "outer" (default) for union — merges all columns or rows; "inner" for intersection — only merges columns or rows that exist in all data objects.

concat([d1, d2], axis = 0/1, join = "outer"/"inner")

# Example
d1 = DataFrame({
  "year": [2021, 2022, 2023, 2024],
  "month": [1, 2, 3, 4],
  "price": [1, 2, 3, 4],
})
d2 = DataFrame({
  "year": [2024, 2021, 2022, 2023],
  "month": [5, 1, 2, 3],
  "count": [4, 3, 2, 1],
  "price": [1, 2, 3, 4],
})
# Default outer
d3 = concat([d1, d2])
assert.eq(d3.shape, (8, 4))
assert.eq(list(d3["count"]), [None, None, None, None, 4, 3, 2, 1])
# inner
d3 = concat([d1, d2], join="inner")
assert.eq(d3.shape, (8, 3))
assert.eq(d3.columns, ["year", "month", "price"])

# Merge DataFrame and Series
s1 = Series(["a", "b", "c"], name="s1")
d3 = concat([d1, d2, s1])
assert.eq(d3.shape, (11, 5))
assert.eq(d3.columns, ["year", "month", "price", "count", "s1"])
assert.eq(list(d3['s1']), [None, None, None, None, None, None, None, None, "a", "b", "c"])

d3 = concat([d1, d2, s1], join="inner")
assert.eq(d3.shape, (0, 0))
assert.eq(d3.empty, True)

`drop_duplicates`

Returns unique rows in the DataFrame, removing duplicate rows.

DataFrame.drop_duplicates()

# Example
data = DataFrame({
  "year": [2022, 2022, 2023, 2024],
  "month": [2, 2, 2, 2],
  "price": [2, 2, 3, 3],
})

return data.drop_duplicates()
# Returns
#    year  month  price
# 0  2022      2      2
# 2  2023      2      3
# 3  2024      2      3

`min`

Returns the minimum value of each column.

df.min(skipna=True, numeric_only=False) # "skipna" indicates whether to ignore missing values, default "True". "numeric_only" indicates whether to compute only numeric columns, default "False".

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.min(numeric_only=True) # Returns count column minimum: 12

`max`

Returns the maximum value of each column.

df.max(skipna=True, numeric_only=False)

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.max(numeric_only=True) # Returns count column maximum: 15

`std`

Returns the standard deviation of each column.

df.std(skipna=True, numeric_only=False, ddof=n) # "ddof" stands for Delta Degrees of Freedom. "n" is the degrees of freedom adjustment value, specifically the value subtracted from the sample size. By default, ddof=1, which computes the sample standard deviation. Set ddof=0 for population standard deviation.

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.std(numeric_only=True) # Returns count column standard deviation: 1.2909944487358056

`var`

Returns the variance of each column.

df.var(skipna=True, numeric_only=False, ddof=n)

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.var(numeric_only=True) # Returns count column variance: 1.6666666666666667

`sum`

Returns the sum of each column.

df.sum(skipna=True, numeric_only=False)

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.sum(numeric_only=True)  # Returns "count" column sum: 54

`mean`

Returns the mean of each column.

df.mean(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.mean(numeric_only=True)   # Returns 13.5

`median`

Returns the median of each column.

df.median(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error

# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.median(numeric_only=True)   # Returns 13.5

`mode`

Returns the mode of each column.

df.mode(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error

Creation Methods​

Create from Query​

Create from Dict​

Create from List of Dicts​

Create from List​

Properties​

columns​

shape​

empty​

has_more​

Methods​

head​

tail​

rename​

items​

iterrows​

to_list​

append​

ffill​

bfill​

fillna​

apply​

rolling​

groupby​

resample​

sort_values​

pivot​

concat​

drop_duplicates​

min​

max​

std​

var​

sum​

mean​

median​

mode​