Preprocessing Datasets#
ESG Ratings Dataset#
ESG Ratings Dataset Description
The ESG ratings dataset contains environmental, social, and governance (ESG) ratings for various companies from 2011 to 2020.
Structure
The original dataset is a structured CSV file with columns:
‘type’: the type of the rating (ESG, E, S, or G).
‘code’: a unique identifier for the company (‘A’ followed by 6 digits).
‘name’: the name of the company.
‘market’: a category indicating the market in which the company operates (0 for Others, 1 for KOSPI, 2 for KOSDAQ).
‘2011’ to ‘2020’: the ESG ratings for each year.
The ratings range from ‘A+’ (best) to ‘D’ (worst), and missing ratings are denoted as ‘nan’.
Sample data:
type |
code |
name |
market |
2011 |
2012 |
2013 |
2014 |
2015 |
2016 |
2017 |
2018 |
2019 |
2020 |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 |
ESG |
A000010 |
신한은행 |
0 |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
1 |
ESG |
A000020 |
동화약품 |
1 |
B |
B |
C |
B |
C |
B |
B |
B |
C |
B |
2 |
ESG |
A000030 |
우리은행 |
1 |
nan |
nan |
nan |
nan |
B+ |
B+ |
B+ |
B+ |
nan |
nan |
3 |
ESG |
A000040 |
KR모터스 |
1 |
C |
B |
C |
B |
C |
C |
C |
D |
C |
C |
4 |
ESG |
A000050 |
경방 |
1 |
nan |
B |
B |
C |
C |
C |
C |
C |
C |
C |
Conversion Procedure
The dataset was transformed for further analysis using the following procedure:
The ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) were first mapped to numerical scores (6, 5, 4, 3, 2, 1, and NaN respectively) for easy comparison.
The DataFrame was reshaped (“melted”) to convert the year columns into rows, resulting in a DataFrame with ‘year’ and ‘rating_score’ columns. This makes the DataFrame longer but less wide, and puts it in a format that’s more convenient for many types of analysis.
The ‘year’ column was converted to integer for correct chronological order sorting.
The DataFrame was sorted by ‘type’, ‘code’, and ‘year’ to ensure correct grouping and comparison for each type of rating, company, and year.
The ‘rating_change’ column was calculated by finding the difference in rating scores for each group of ‘type’ and ‘code’. The resulting values indicate the direction and magnitude of change in the rating score from the previous year.
The numerical scores in the ‘rating_score’ column were mapped back to their original ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) to create a new ‘rating’ column.
The ‘rating_change’ values were adjusted to indicate the direction of change: ‘1’ if the rating improved from the previous year, ‘0’ if it stayed the same, ‘-1’ if it worsened, and ‘nan’ if the rating was not available in either the previous year or the current year.
The resulting dataset contains the original ESG ratings along with additional columns for the numerical rating scores and the direction of rating change, allowing for more detailed and flexible analysis of the ESG performance of companies over time.
Convert ratings column 2011 ~ 2022 to rows with year
column and rating
column.
Also, add rating_change column. It is the direction of rating change. 1
means up, 0
means same, -1
means down.
Converted data is saved as ESG_ratings_processed.parquet
.
Sample data:
type |
code |
name |
market |
year |
rating_score |
rating_change |
rating |
|
---|---|---|---|---|---|---|---|---|
2263 |
E |
A005930 |
삼성전자 |
1 |
2011 |
6 |
nan |
A+ |
10139 |
E |
A005930 |
삼성전자 |
1 |
2012 |
5 |
-1 |
A |
18015 |
E |
A005930 |
삼성전자 |
1 |
2013 |
5 |
0 |
A |
25891 |
E |
A005930 |
삼성전자 |
1 |
2014 |
5 |
0 |
A |
33767 |
E |
A005930 |
삼성전자 |
1 |
2015 |
6 |
1 |
A+ |
41643 |
E |
A005930 |
삼성전자 |
1 |
2016 |
5 |
-1 |
A |
49519 |
E |
A005930 |
삼성전자 |
1 |
2017 |
5 |
0 |
A |
57395 |
E |
A005930 |
삼성전자 |
1 |
2018 |
5 |
0 |
A |
65271 |
E |
A005930 |
삼성전자 |
1 |
2019 |
4 |
-1 |
B+ |
73147 |
E |
A005930 |
삼성전자 |
1 |
2020 |
5 |
1 |
A |
from esgcov import HyFI
if HyFI.is_colab():
HyFI.mount_google_drive()
prj = HyFI.init_project(
project_name="esg-coverage",
task_name="datasets",
project_root=HyFI.dotenv().DOTENV_DIR or ".",
log_level="WARNING",
verbose=True,
)
print("project directory:", prj.root_dir)
print("project workspace directory:", prj.workspace_dir)
INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.
project directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage
project workspace directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/workspace
config = HyFI.compose("workflow=datasets")
config.project.project_root = prj.project_root
wf = HyFI.workflow_config(**config)
HyFI.run_workflow(wf)
INFO:hyfi.composer:Composing `WorkflowConfig` class with `datasets` config in `workflow` group.
INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.pipeline:Applying 6 pipes
Change directory to workspace/tasks/dataset
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.106289
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
type code name market year rating_score rating_change rating
1969 E A000010 신한은행 0 2011 NaN NaN NaN
9845 E A000010 신한은행 0 2012 NaN NaN NaN
17721 E A000010 신한은행 0 2013 NaN NaN NaN
25597 E A000010 신한은행 0 2014 NaN NaN NaN
33473 E A000010 신한은행 0 2015 NaN NaN NaN
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_ratings_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.090372
type code name market year rating_score rating_change rating
1969 E A000010 신한은행 0 2011 NaN NaN NaN
9845 E A000010 신한은행 0 2012 NaN NaN NaN
17721 E A000010 신한은행 0 2013 NaN NaN NaN
25597 E A000010 신한은행 0 2014 NaN NaN NaN
33473 E A000010 신한은행 0 2015 NaN NaN NaN
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.098626
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
code year type topic_weight yearly_change coverage_change
0 A000070 2013 E 0.000017 NaN NaN
1 A000070 2015 E 0.000025 7.496203e-06 0.0
2 A000070 2016 E 0.000029 3.520852e-06 0.0
3 A000070 2017 E 0.000028 -5.451693e-08 0.0
4 A000080 2011 E 0.001494 NaN NaN
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_coverage_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.010824
code year type topic_weight yearly_change coverage_change
0 A000070 2013 E 0.000017 NaN NaN
1 A000070 2015 E 0.000025 7.496203e-06 0.0
2 A000070 2016 E 0.000029 3.520852e-06 0.0
3 A000070 2017 E 0.000028 -5.451693e-08 0.0
4 A000080 2011 E 0.001494 NaN NaN
Change directory back to /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/book/supplementary
ratings_df = HyFI.load_dataframe(
"processed/ESG_ratings_processed.parquet",
data_dir=prj.workspace_dir / "dataset",
)
print(
ratings_df[ratings_df.code == "A005930"]
.sort_values(["type", "code", "year"])
.head(10)
.to_markdown()
)
| | type | code | name | market | year | rating_score | rating_change | rating |
|------:|:-------|:--------|:---------|---------:|-------:|---------------:|----------------:|:---------|
| 2263 | E | A005930 | 삼성전자 | 1 | 2011 | 6 | nan | A+ |
| 10139 | E | A005930 | 삼성전자 | 1 | 2012 | 5 | -1 | A |
| 18015 | E | A005930 | 삼성전자 | 1 | 2013 | 5 | 0 | A |
| 25891 | E | A005930 | 삼성전자 | 1 | 2014 | 5 | 0 | A |
| 33767 | E | A005930 | 삼성전자 | 1 | 2015 | 6 | 1 | A+ |
| 41643 | E | A005930 | 삼성전자 | 1 | 2016 | 5 | -1 | A |
| 49519 | E | A005930 | 삼성전자 | 1 | 2017 | 5 | 0 | A |
| 57395 | E | A005930 | 삼성전자 | 1 | 2018 | 5 | 0 | A |
| 65271 | E | A005930 | 삼성전자 | 1 | 2019 | 4 | -1 | B+ |
| 73147 | E | A005930 | 삼성전자 | 1 | 2020 | 5 | 1 | A |
ESG Coverage Dataset#
ESG Coverage Dataset Description
The ESG coverage dataset provides data on the topic weights of different environmental, social, and governance (ESG) topics for various companies over the years. Specifically, it has the following columns:
code
: a string representing the unique identifier of a company.year
: the year for which the topic weights are provided.topic_num
: a string representing the topic number, ranging from ‘topic0’ to ‘topic35’.topic_weight
: a float representing the weight of a specific topic for a given company in a given year.
Sample data:
code |
year |
topic_num |
topic_weight |
|
---|---|---|---|---|
0 |
A000070 |
2003 |
topic0 |
1.86484e-05 |
1 |
A000070 |
2003 |
topic1 |
9.6384e-05 |
2 |
A000070 |
2003 |
topic10 |
2.59255e-05 |
3 |
A000070 |
2003 |
topic11 |
3.35297e-05 |
4 |
A000070 |
2003 |
topic12 |
0.292819 |
The conversion process reshapes and enriches this raw dataset to provide more insightful views of the ESG coverage. This process involves the following steps:
Pivot the DataFrame: The
pivot()
function is used to reshape the raw data, withcode
andyear
as the index,topic_num
as the columns, andtopic_weight
as the values.Create New Categories: New columns are added to represent the total topic weights for different ESG categories. These include:
‘E’: the weight of ‘topic0’
‘S’: the weight of ‘topic1’
‘G’: the sum of the weights of ‘topic2’ and ‘topic6’
‘G2’: the weight of ‘topic2’
‘ESG’: the sum of the weights of ‘E’, ‘S’, and ‘G’
‘ESG2’: the sum of the weights of ‘E’, ‘S’, and ‘G2’
Melt the DataFrame: The DataFrame is then melted to reshape it again, turning the newly created categories into rows in a new ‘type’ column, with the corresponding topic weights listed in a ‘topic_weight’ column. Each row now represents the total topic weight for a specific ESG category, for a specific company in a given year.
Calculate Yearly Changes: For each company and each ESG category, the year-on-year changes in ‘topic_weight’ are calculated.
Calculate Coverage Change: Based on the yearly changes, a new ‘coverage_change’ column is created. If the change in ‘topic_weight’ is greater than 0.1 standard deviations, ‘coverage_change’ is set to 1. If it is less than -0.1 standard deviations, it is set to -1. Otherwise, it is set to 0. If data for the previous year does not exist or is NaN, ‘coverage_change’ is set to NaN.
The converted ESG coverage dataset is now in a format that allows for more detailed analysis and comparison with the ESG ratings. It provides an overview of the emphasis placed on different ESG topics by each company over time, as well as the yearly changes in these topic weights.
Converted data is saved as ESG_coverage_processed.parquet
.
Sample data:
code |
year |
type |
topic_weight |
yearly_change |
coverage_change |
|
---|---|---|---|---|---|---|
412 |
A005930 |
2011 |
E |
0.00133423 |
nan |
nan |
413 |
A005930 |
2012 |
E |
7.16375e-05 |
-0.00126259 |
-1 |
414 |
A005930 |
2013 |
E |
0.000591296 |
0.000519659 |
0 |
415 |
A005930 |
2014 |
E |
0.000554041 |
-3.72556e-05 |
0 |
416 |
A005930 |
2015 |
E |
0.000238747 |
-0.000315294 |
0 |
417 |
A005930 |
2016 |
E |
0.000624334 |
0.000385587 |
0 |
418 |
A005930 |
2017 |
E |
0.000190126 |
-0.000434208 |
0 |
419 |
A005930 |
2018 |
E |
0.000252869 |
6.27427e-05 |
0 |
420 |
A005930 |
2019 |
E |
0.0004799 |
0.000227031 |
0 |
421 |
A005930 |
2020 |
E |
0.00112099 |
0.000641088 |
0 |
coverage_df = HyFI.load_dataframe(
"processed/ESG_coverage_processed.parquet",
data_dir=prj.workspace_dir / "dataset",
)
print(
coverage_df[coverage_df.code == "A005930"]
.sort_values(["type", "code", "year"])
.head(10)
.to_markdown()
)
| | code | year | type | topic_weight | yearly_change | coverage_change |
|----:|:--------|-------:|:-------|---------------:|----------------:|------------------:|
| 412 | A005930 | 2011 | E | 0.00133423 | nan | nan |
| 413 | A005930 | 2012 | E | 7.16375e-05 | -0.00126259 | 0 |
| 414 | A005930 | 2013 | E | 0.000591296 | 0.000519659 | 0 |
| 415 | A005930 | 2014 | E | 0.000554041 | -3.72556e-05 | 0 |
| 416 | A005930 | 2015 | E | 0.000238747 | -0.000315294 | 0 |
| 417 | A005930 | 2016 | E | 0.000624334 | 0.000385587 | 0 |
| 418 | A005930 | 2017 | E | 0.000190126 | -0.000434208 | 0 |
| 419 | A005930 | 2018 | E | 0.000252869 | 6.27427e-05 | 0 |
| 420 | A005930 | 2019 | E | 0.0004799 | 0.000227031 | 0 |
| 421 | A005930 | 2020 | E | 0.00112099 | 0.000641088 | 0 |