Preprocessing Datasets

Preprocessing Datasets#

ESG Ratings Dataset#

ESG Ratings Dataset Description

The ESG ratings dataset contains environmental, social, and governance (ESG) ratings for various companies from 2011 to 2020.

Structure

The original dataset is a structured CSV file with columns:

‘type’: the type of the rating (ESG, E, S, or G).
‘code’: a unique identifier for the company (‘A’ followed by 6 digits).
‘name’: the name of the company.
‘market’: a category indicating the market in which the company operates (0 for Others, 1 for KOSPI, 2 for KOSDAQ).
‘2011’ to ‘2020’: the ESG ratings for each year.

The ratings range from ‘A+’ (best) to ‘D’ (worst), and missing ratings are denoted as ‘nan’.

Sample data:

	type	code	name	market	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020
0	ESG	A000010	신한은행	0	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
1	ESG	A000020	동화약품	1	B	B	C	B	C	B	B	B	C	B
2	ESG	A000030	우리은행	1	nan	nan	nan	nan	B+	B+	B+	B+	nan	nan
3	ESG	A000040	KR모터스	1	C	B	C	B	C	C	C	D	C	C
4	ESG	A000050	경방	1	nan	B	B	C	C	C	C	C	C	C

Conversion Procedure

The dataset was transformed for further analysis using the following procedure:

The ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) were first mapped to numerical scores (6, 5, 4, 3, 2, 1, and NaN respectively) for easy comparison.
The DataFrame was reshaped (“melted”) to convert the year columns into rows, resulting in a DataFrame with ‘year’ and ‘rating_score’ columns. This makes the DataFrame longer but less wide, and puts it in a format that’s more convenient for many types of analysis.
The ‘year’ column was converted to integer for correct chronological order sorting.
The DataFrame was sorted by ‘type’, ‘code’, and ‘year’ to ensure correct grouping and comparison for each type of rating, company, and year.
The ‘rating_change’ column was calculated by finding the difference in rating scores for each group of ‘type’ and ‘code’. The resulting values indicate the direction and magnitude of change in the rating score from the previous year.
The numerical scores in the ‘rating_score’ column were mapped back to their original ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) to create a new ‘rating’ column.
The ‘rating_change’ values were adjusted to indicate the direction of change: ‘1’ if the rating improved from the previous year, ‘0’ if it stayed the same, ‘-1’ if it worsened, and ‘nan’ if the rating was not available in either the previous year or the current year.

The resulting dataset contains the original ESG ratings along with additional columns for the numerical rating scores and the direction of rating change, allowing for more detailed and flexible analysis of the ESG performance of companies over time.

Convert ratings column 2011 ~ 2022 to rows with year column and rating column. Also, add rating_change column. It is the direction of rating change. 1 means up, 0 means same, -1 means down.

Converted data is saved as ESG_ratings_processed.parquet.

Sample data:

	type	code	name	market	year	rating_score	rating_change	rating
2263	E	A005930	삼성전자	1	2011	6	nan	A+
10139	E	A005930	삼성전자	1	2012	5	-1	A
18015	E	A005930	삼성전자	1	2013	5	0	A
25891	E	A005930	삼성전자	1	2014	5	0	A
33767	E	A005930	삼성전자	1	2015	6	1	A+
41643	E	A005930	삼성전자	1	2016	5	-1	A
49519	E	A005930	삼성전자	1	2017	5	0	A
57395	E	A005930	삼성전자	1	2018	5	0	A
65271	E	A005930	삼성전자	1	2019	4	-1	B+
73147	E	A005930	삼성전자	1	2020	5	1	A

from esgcov import HyFI

if HyFI.is_colab():
    HyFI.mount_google_drive()

prj = HyFI.init_project(
    project_name="esg-coverage",
    task_name="datasets",
    project_root=HyFI.dotenv().DOTENV_DIR or ".",
    log_level="WARNING",
    verbose=True,
)

print("project directory:", prj.root_dir)
print("project workspace directory:", prj.workspace_dir)

INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.

project directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage
project workspace directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/workspace

config = HyFI.compose("workflow=datasets")
config.project.project_root = prj.project_root
wf = HyFI.workflow_config(**config)
HyFI.run_workflow(wf)

INFO:hyfi.composer:Composing `WorkflowConfig` class with `datasets` config in `workflow` group.
INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.pipeline:Applying 6 pipes

 Change directory to workspace/tasks/dataset

INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.106289
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.

      type     code  name  market  year  rating_score  rating_change rating
   E  A000010  신한은행       0  2011           NaN            NaN    NaN
   E  A000010  신한은행       0  2012           NaN            NaN    NaN
  E  A000010  신한은행       0  2013           NaN            NaN    NaN
  E  A000010  신한은행       0  2014           NaN            NaN    NaN
  E  A000010  신한은행       0  2015           NaN            NaN    NaN

INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_ratings_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.090372

      type     code  name  market  year  rating_score  rating_change rating
   E  A000010  신한은행       0  2011           NaN            NaN    NaN
   E  A000010  신한은행       0  2012           NaN            NaN    NaN
  E  A000010  신한은행       0  2013           NaN            NaN    NaN
  E  A000010  신한은행       0  2014           NaN            NaN    NaN
  E  A000010  신한은행       0  2015           NaN            NaN    NaN

INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.098626
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.

      code  year type  topic_weight  yearly_change  coverage_change
A000070  2013    E      0.000017            NaN              NaN
A000070  2015    E      0.000025   7.496203e-06              0.0
A000070  2016    E      0.000029   3.520852e-06              0.0
A000070  2017    E      0.000028  -5.451693e-08              0.0
A000080  2011    E      0.001494            NaN              NaN

INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_coverage_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.010824

      code  year type  topic_weight  yearly_change  coverage_change
A000070  2013    E      0.000017            NaN              NaN
A000070  2015    E      0.000025   7.496203e-06              0.0
A000070  2016    E      0.000029   3.520852e-06              0.0
A000070  2017    E      0.000028  -5.451693e-08              0.0
A000080  2011    E      0.001494            NaN              NaN
 Change directory back to /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/book/supplementary

ratings_df = HyFI.load_dataframe(
    "processed/ESG_ratings_processed.parquet",
    data_dir=prj.workspace_dir / "dataset",
)

print(
    ratings_df[ratings_df.code == "A005930"]
    .sort_values(["type", "code", "year"])
    .head(10)
    .to_markdown()
)

|       | type   | code    | name     |   market |   year |   rating_score |   rating_change | rating   |
|------:|:-------|:--------|:---------|---------:|-------:|---------------:|----------------:|:---------|
|  2263 | E      | A005930 | 삼성전자 |        1 |   2011 |              6 |             nan | A+       |
| 10139 | E      | A005930 | 삼성전자 |        1 |   2012 |              5 |              -1 | A        |
| 18015 | E      | A005930 | 삼성전자 |        1 |   2013 |              5 |               0 | A        |
| 25891 | E      | A005930 | 삼성전자 |        1 |   2014 |              5 |               0 | A        |
| 33767 | E      | A005930 | 삼성전자 |        1 |   2015 |              6 |               1 | A+       |
| 41643 | E      | A005930 | 삼성전자 |        1 |   2016 |              5 |              -1 | A        |
| 49519 | E      | A005930 | 삼성전자 |        1 |   2017 |              5 |               0 | A        |
| 57395 | E      | A005930 | 삼성전자 |        1 |   2018 |              5 |               0 | A        |
| 65271 | E      | A005930 | 삼성전자 |        1 |   2019 |              4 |              -1 | B+       |
| 73147 | E      | A005930 | 삼성전자 |        1 |   2020 |              5 |               1 | A        |

ESG Coverage Dataset#

ESG Coverage Dataset Description

The ESG coverage dataset provides data on the topic weights of different environmental, social, and governance (ESG) topics for various companies over the years. Specifically, it has the following columns:

code: a string representing the unique identifier of a company.
year: the year for which the topic weights are provided.
topic_num: a string representing the topic number, ranging from ‘topic0’ to ‘topic35’.
topic_weight: a float representing the weight of a specific topic for a given company in a given year.

Sample data:

	code	year	topic_num	topic_weight
0	A000070	2003	topic0	1.86484e-05
1	A000070	2003	topic1	9.6384e-05
2	A000070	2003	topic10	2.59255e-05
3	A000070	2003	topic11	3.35297e-05
4	A000070	2003	topic12	0.292819

The conversion process reshapes and enriches this raw dataset to provide more insightful views of the ESG coverage. This process involves the following steps:

Pivot the DataFrame: The pivot() function is used to reshape the raw data, with code and year as the index, topic_num as the columns, and topic_weight as the values.
Create New Categories: New columns are added to represent the total topic weights for different ESG categories. These include:
- ‘E’: the weight of ‘topic0’
- ‘S’: the weight of ‘topic1’
- ‘G’: the sum of the weights of ‘topic2’ and ‘topic6’
- ‘G2’: the weight of ‘topic2’
- ‘ESG’: the sum of the weights of ‘E’, ‘S’, and ‘G’
- ‘ESG2’: the sum of the weights of ‘E’, ‘S’, and ‘G2’
Melt the DataFrame: The DataFrame is then melted to reshape it again, turning the newly created categories into rows in a new ‘type’ column, with the corresponding topic weights listed in a ‘topic_weight’ column. Each row now represents the total topic weight for a specific ESG category, for a specific company in a given year.
Calculate Yearly Changes: For each company and each ESG category, the year-on-year changes in ‘topic_weight’ are calculated.
Calculate Coverage Change: Based on the yearly changes, a new ‘coverage_change’ column is created. If the change in ‘topic_weight’ is greater than 0.1 standard deviations, ‘coverage_change’ is set to 1. If it is less than -0.1 standard deviations, it is set to -1. Otherwise, it is set to 0. If data for the previous year does not exist or is NaN, ‘coverage_change’ is set to NaN.

The converted ESG coverage dataset is now in a format that allows for more detailed analysis and comparison with the ESG ratings. It provides an overview of the emphasis placed on different ESG topics by each company over time, as well as the yearly changes in these topic weights.

Converted data is saved as ESG_coverage_processed.parquet.

Sample data:

	code	year	type	topic_weight	yearly_change	coverage_change
412	A005930	2011	E	0.00133423	nan	nan
413	A005930	2012	E	7.16375e-05	-0.00126259	-1
414	A005930	2013	E	0.000591296	0.000519659	0
415	A005930	2014	E	0.000554041	-3.72556e-05	0
416	A005930	2015	E	0.000238747	-0.000315294	0
417	A005930	2016	E	0.000624334	0.000385587	0
418	A005930	2017	E	0.000190126	-0.000434208	0
419	A005930	2018	E	0.000252869	6.27427e-05	0
420	A005930	2019	E	0.0004799	0.000227031	0
421	A005930	2020	E	0.00112099	0.000641088	0

coverage_df = HyFI.load_dataframe(
    "processed/ESG_coverage_processed.parquet",
    data_dir=prj.workspace_dir / "dataset",
)

print(
    coverage_df[coverage_df.code == "A005930"]
    .sort_values(["type", "code", "year"])
    .head(10)
    .to_markdown()
)

|     | code    |   year | type   |   topic_weight |   yearly_change |   coverage_change |
|----:|:--------|-------:|:-------|---------------:|----------------:|------------------:|
| 412 | A005930 |   2011 | E      |    0.00133423  |   nan           |               nan |
| 413 | A005930 |   2012 | E      |    7.16375e-05 |    -0.00126259  |                 0 |
| 414 | A005930 |   2013 | E      |    0.000591296 |     0.000519659 |                 0 |
| 415 | A005930 |   2014 | E      |    0.000554041 |    -3.72556e-05 |                 0 |
| 416 | A005930 |   2015 | E      |    0.000238747 |    -0.000315294 |                 0 |
| 417 | A005930 |   2016 | E      |    0.000624334 |     0.000385587 |                 0 |
| 418 | A005930 |   2017 | E      |    0.000190126 |    -0.000434208 |                 0 |
| 419 | A005930 |   2018 | E      |    0.000252869 |     6.27427e-05 |                 0 |
| 420 | A005930 |   2019 | E      |    0.0004799   |     0.000227031 |                 0 |
| 421 | A005930 |   2020 | E      |    0.00112099  |     0.000641088 |                 0 |

Preprocessing Datasets

Contents

Preprocessing Datasets#

ESG Ratings Dataset#

ESG Coverage Dataset#