DataFrame
In DQL (Data Query Language), DataFrame is a two-dimensional, labeled data structure that can store and process different types of data. It is similar to a table in a relational database, where each column can be a different data type. A DataFrame is typically composed of rows and columns — each row represents a sample in the dataset, and each column represents a feature or attribute.
In DQL, DataFrames can be created in various ways, including reading data from files, querying data from databases, converting from other data structures, etc. Once created, you can perform various operations on the DataFrame, such as filtering, sorting, grouping, merging, etc., for data analysis and processing.
In summary, DataFrame is a very important and commonly used data structure in DQL, providing flexible and powerful functionality that makes data processing and analysis more convenient and efficient.
Creation Methods
Create from Query
d1 = query("select * from klines.future_15m limit 100")
return d1
Create from Dict
dict = {"name": ["js", "go"], "count": [1, 2]}
d = DataFrame(dict)
return d
Create from List of Dicts
dictList = [{ "name": "js", "count": 1 }, { "name": "go", "count": 2 }]
d = DataFrame(dictList)
return d
Create from List
d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d
Properties
columns
Returns the list of column names: Type list
d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d.columns # Returns column name list: ["name","count"]
shape
Returns the DataFrame shape (length × width): Type (int, int)
d = DataFrame([["js", 1],["go",2]], columns=["name","count"])
return d.shape # Returns DataFrame shape: (2, 2)
empty
Checks whether the DataFrame is empty: Type bool
d = DataFrame(columns=["name","count"])
return d.empty # Empty DataFrame, returns "True"
has_more
Data returned by query may not be complete (max 10000 rows). Use this property to check whether there is more content to process or iterate over: Type bool
data = query("select * from klines.future_15m")
return data.has_more # More content exists, returns "True"
Methods
head
Returns the first few rows of the DataFrame.
df.head(n) # "n" is the number of rows to return, default is 5
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.head() # Returns first five rows
tail
Returns the last few rows of the DataFrame.
df.tail(n) # "n" is the number of rows to return, default is 5
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.tail() # Returns last five rows
rename
Renames columns of the DataFrame.
df.rename()
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
d2 = d.rename({ "name": "language" })
return d2.columns # Returns renamed columns: ["language", "count"]
items
Returns an items column object, including column names and column data.
df.items()
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.items() # Returns object "dataframe.DataFrameItems"
iterrows
Returns an items row object, including rows and row data.
df.iterrows()
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.iterrows() # Returns object "dataframe.DataFrameRows"
to_list
Converts the DataFrame to a list.
df.to_list()
# Example
d = DataFrame([["js", 1],["go",2],["c",3],["c++",4],["c#",5],["py",6]], columns=["name","count"])
return d.to_list() # Returns column List
append
Appends a row of data to the original DataFrame.
df.append()
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
d1.append({ "name": "c", "count": 16 })
return d1 # Returns DataFrame with the new row appended
ffill
Forward fill missing values.
df.ffill(inplace = False) # "inplace" indicates whether to update in place, default is "False"
# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.ffill() # Missing values filled with the previous non-null value
bfill
Backward fill missing values.
df.bfill(inplace = False)
# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.bfill() # Missing values filled with the next non-null value
fillna
Fill missing values with a fixed value.
df.fillna(n, inplace = False) # Fill missing values with value "n"
# Example
d1 = DataFrame({ "name": ["js", "go", None, "python"], "count": [12, None, 14, 15] })
return d1.fillna(0) # Fill missing values with 0
apply
Apply a function to each column of the DataFrame.
df.apply()
# Example
d1 = DataFrame({"count": [12, 13, 14, 15]})
return d1.apply(lambda x: x * 2) # Returns [24, 26, 28, 30]
rolling
Returns a rolling window object dataframe.Rolling based on the specified parameters.
df.rolling(window=n1, min_periods=n2) # "window" is the sliding window size, "min_periods" is the minimum computation range
# Example
d1 = DataFrame({"count": [12, 13, 14, 15]})
return d1.rolling(window=3, min_periods=1)
When time series data is involved, you can use the rolling method for sliding window calculations. You can specify a sliding period, e.g., seconds (s), minutes (m), hours (h), and add a timeline parameter as the time series index.
df.rolling(period, timeline) # "period" is the sliding time period, "timeline" is the time series index
# Example
d1 = DataFrame({
"count": [4, 6, 1, 4, 7, 1, 5, 8, 2, 9],
"age": [3, 2, 8, 2, 0, 8, 4, 1, 3, 5],
"timeline": [
Timestamp('2013-01-01 09:00:00'),
Timestamp('2013-01-01 09:00:01'),
Timestamp('2013-01-01 09:00:02'),
Timestamp('2013-01-01 09:00:03'),
Timestamp('2013-01-01 09:00:04'),
Timestamp('2013-01-01 09:00:05'),
Timestamp('2013-01-01 09:00:06'),
Timestamp('2013-01-01 09:00:07'),
Timestamp('2013-01-01 09:00:08'),
Timestamp('2013-01-01 09:00:12'),
],
})
r1 = d1[["count","age"]].rolling('3s', timeline=d1["timeline"])
return r1# Returns "dataframe.Rolling" property object
groupby
Groups the DataFrame by specified criteria and returns a dataframe.DataFrameGroupBy object.
df.groupby()
# Example
d1 = DataFrame({ "name": ["js", "js", "js", "go", "go"], "count": [1, 2, 3, 4, 5], "age": [1, 1, 2, 3, 3] })
g1 = d1.groupby(["name"])
return g1
resample
Resamples time series data. Returns a resampling object dataframe.Resampler.
df.resample() # Takes two parameters: the first is the sampling interval, the second is the time index
d1 = DataFrame({
"timeline": [
Timestamp('2013-01-01 08:59:51'),
Timestamp('2013-01-01 08:59:52'),
Timestamp('2013-01-01 08:59:53'),
Timestamp('2013-01-01 08:59:54'),
Timestamp('2013-01-01 09:00:55'),
Timestamp('2013-01-01 09:00:56'),
Timestamp('2013-01-01 09:00:57'),
Timestamp('2013-01-01 09:00:58'),
Timestamp('2013-01-01 09:00:59'),
Timestamp('2013-01-01 09:01:00'),
],
"price": [
4, 6, 1, 4, 7, 1, 5, 8, 2, 9,
]
})
r1 = d1[["price"]].resample('5s', timeline=d1["timeline"])
return r1
sort_values
Sorts the DataFrame by the values in specified columns.
df.sort_values(['col1'...], ascending=False na_position="First") # The first parameter specifies which columns to sort the DataFrame by. You can pass a column name string or a list of column names. If multiple column names are passed, the DataFrame will be sorted in the order of these columns, i.e., first by the first column, then by the second when the first is equal, and so on. "ascending" indicates ascending sort, default is "True". "na_position" indicates the position of missing values, options: "First" or "Last"
# Example
d1 = DataFrame({ "name": ["js", "js", "js", "go", "go"], "count": [1, 2, 3, 4, 5], "age": [1, 1, 2, 3, 3] })
d1 = d1.sort_values(by=['name', 'count'])
return d1 # Returns the result sorted by 'name' and 'count' columns
pivot
Reshapes the data table by pivoting specified column values into new index and column labels. Takes three parameters:
index: The column used for the new DataFrame's row index. Can be a column name string.columns: The column used for the new DataFrame's column index. Can be a column name string.values: The column used to fill the new DataFrame's values. Can be a column name string.
Note: When there are duplicates in the data, directly using the pivot method will raise an error. You need to first use groupby to group and aggregate the data (e.g., mean, sum), then use pivot for reshaping.
df.pivot(index='...', columns='...', values='...')
# Example
data = {
'date': ['2024-05-01', '2024-05-01', '2024-05-01', '2024-05-01'],
'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai'],
'temperature': [25, 20, 26, 21]
}
d1 = DataFrame(data)
# Directly calling pivot will error because of duplicates
assert.fails(lambda: d1.pivot(index='date', columns='city', values='temperature'), 'there are duplicates in the data, please groupby first')
# Need to groupby first
d2 = d1.groupby(['date', 'city']).mean()
# Then pivot
d3 = d2.pivot(index='date', columns='city', values='temperature')
# Check that column names match expectations
assert.eq(d3.columns, ["temperature_Beijing", "temperature_Shanghai"])
concat
Used to concatenate multiple data objects along a specified axis. Takes the following parameters:
objs: The objects to merge, which can be a list or dict of Series or DataFrames.axis: The axis to concatenate along. Defaults to0(row-wise). Set to1for column-wise.join: The merge mode."outer"(default) for union — merges all columns or rows;"inner"for intersection — only merges columns or rows that exist in all data objects.
concat([d1, d2], axis = 0/1, join = "outer"/"inner")
# Example
d1 = DataFrame({
"year": [2021, 2022, 2023, 2024],
"month": [1, 2, 3, 4],
"price": [1, 2, 3, 4],
})
d2 = DataFrame({
"year": [2024, 2021, 2022, 2023],
"month": [5, 1, 2, 3],
"count": [4, 3, 2, 1],
"price": [1, 2, 3, 4],
})
# Default outer
d3 = concat([d1, d2])
assert.eq(d3.shape, (8, 4))
assert.eq(list(d3["count"]), [None, None, None, None, 4, 3, 2, 1])
# inner
d3 = concat([d1, d2], join="inner")
assert.eq(d3.shape, (8, 3))
assert.eq(d3.columns, ["year", "month", "price"])
# Merge DataFrame and Series
s1 = Series(["a", "b", "c"], name="s1")
d3 = concat([d1, d2, s1])
assert.eq(d3.shape, (11, 5))
assert.eq(d3.columns, ["year", "month", "price", "count", "s1"])
assert.eq(list(d3['s1']), [None, None, None, None, None, None, None, None, "a", "b", "c"])
d3 = concat([d1, d2, s1], join="inner")
assert.eq(d3.shape, (0, 0))
assert.eq(d3.empty, True)
drop_duplicates
Returns unique rows in the DataFrame, removing duplicate rows.
DataFrame.drop_duplicates()
# Example
data = DataFrame({
"year": [2022, 2022, 2023, 2024],
"month": [2, 2, 2, 2],
"price": [2, 2, 3, 3],
})
return data.drop_duplicates()
# Returns
# year month price
# 0 2022 2 2
# 2 2023 2 3
# 3 2024 2 3
min
Returns the minimum value of each column.
df.min(skipna=True, numeric_only=False) # "skipna" indicates whether to ignore missing values, default "True". "numeric_only" indicates whether to compute only numeric columns, default "False".
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.min(numeric_only=True) # Returns count column minimum: 12
max
Returns the maximum value of each column.
df.max(skipna=True, numeric_only=False)
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.max(numeric_only=True) # Returns count column maximum: 15
std
Returns the standard deviation of each column.
df.std(skipna=True, numeric_only=False, ddof=n) # "ddof" stands for Delta Degrees of Freedom. "n" is the degrees of freedom adjustment value, specifically the value subtracted from the sample size. By default, ddof=1, which computes the sample standard deviation. Set ddof=0 for population standard deviation.
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.std(numeric_only=True) # Returns count column standard deviation: 1.2909944487358056
var
Returns the variance of each column.
df.var(skipna=True, numeric_only=False, ddof=n)
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.var(numeric_only=True) # Returns count column variance: 1.6666666666666667
sum
Returns the sum of each column.
df.sum(skipna=True, numeric_only=False)
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.sum(numeric_only=True) # Returns "count" column sum: 54
mean
Returns the mean of each column.
df.mean(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.mean(numeric_only=True) # Returns 13.5
median
Returns the median of each column.
df.median(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error
# Example
d1 = DataFrame({ "name": ["js", "go", "rust", "python"], "count": [12, 13, 14, 15] })
return d1.median(numeric_only=True) # Returns 13.5
mode
Returns the mode of each column.
df.mode(skipna=True, numeric_only=False) # Computing with non-numeric columns will return an error