Preprocessing Datasets

Preprocessing Datasets#

ESG Ratings Dataset#

ESG Ratings Dataset Description

The ESG ratings dataset contains environmental, social, and governance (ESG) ratings for various companies from 2011 to 2020.

Structure

The original dataset is a structured CSV file with columns:

  • ‘type’: the type of the rating (ESG, E, S, or G).

  • ‘code’: a unique identifier for the company (‘A’ followed by 6 digits).

  • ‘name’: the name of the company.

  • ‘market’: a category indicating the market in which the company operates (0 for Others, 1 for KOSPI, 2 for KOSDAQ).

  • ‘2011’ to ‘2020’: the ESG ratings for each year.

The ratings range from ‘A+’ (best) to ‘D’ (worst), and missing ratings are denoted as ‘nan’.

Sample data:

type

code

name

market

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

0

ESG

A000010

신한은행

0

nan

nan

nan

nan

nan

nan

nan

nan

nan

nan

1

ESG

A000020

동화약품

1

B

B

C

B

C

B

B

B

C

B

2

ESG

A000030

우리은행

1

nan

nan

nan

nan

B+

B+

B+

B+

nan

nan

3

ESG

A000040

KR모터스

1

C

B

C

B

C

C

C

D

C

C

4

ESG

A000050

경방

1

nan

B

B

C

C

C

C

C

C

C

Conversion Procedure

The dataset was transformed for further analysis using the following procedure:

  1. The ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) were first mapped to numerical scores (6, 5, 4, 3, 2, 1, and NaN respectively) for easy comparison.

  2. The DataFrame was reshaped (“melted”) to convert the year columns into rows, resulting in a DataFrame with ‘year’ and ‘rating_score’ columns. This makes the DataFrame longer but less wide, and puts it in a format that’s more convenient for many types of analysis.

  3. The ‘year’ column was converted to integer for correct chronological order sorting.

  4. The DataFrame was sorted by ‘type’, ‘code’, and ‘year’ to ensure correct grouping and comparison for each type of rating, company, and year.

  5. The ‘rating_change’ column was calculated by finding the difference in rating scores for each group of ‘type’ and ‘code’. The resulting values indicate the direction and magnitude of change in the rating score from the previous year.

  6. The numerical scores in the ‘rating_score’ column were mapped back to their original ratings (‘A+’, ‘A’, ‘B+’, ‘B’, ‘C’, ‘D’, and ‘nan’) to create a new ‘rating’ column.

  7. The ‘rating_change’ values were adjusted to indicate the direction of change: ‘1’ if the rating improved from the previous year, ‘0’ if it stayed the same, ‘-1’ if it worsened, and ‘nan’ if the rating was not available in either the previous year or the current year.

The resulting dataset contains the original ESG ratings along with additional columns for the numerical rating scores and the direction of rating change, allowing for more detailed and flexible analysis of the ESG performance of companies over time.

Convert ratings column 2011 ~ 2022 to rows with year column and rating column. Also, add rating_change column. It is the direction of rating change. 1 means up, 0 means same, -1 means down.

Converted data is saved as ESG_ratings_processed.parquet.

Sample data:

type

code

name

market

year

rating_score

rating_change

rating

2263

E

A005930

삼성전자

1

2011

6

nan

A+

10139

E

A005930

삼성전자

1

2012

5

-1

A

18015

E

A005930

삼성전자

1

2013

5

0

A

25891

E

A005930

삼성전자

1

2014

5

0

A

33767

E

A005930

삼성전자

1

2015

6

1

A+

41643

E

A005930

삼성전자

1

2016

5

-1

A

49519

E

A005930

삼성전자

1

2017

5

0

A

57395

E

A005930

삼성전자

1

2018

5

0

A

65271

E

A005930

삼성전자

1

2019

4

-1

B+

73147

E

A005930

삼성전자

1

2020

5

1

A

from esgcov import HyFI

if HyFI.is_colab():
    HyFI.mount_google_drive()

prj = HyFI.init_project(
    project_name="esg-coverage",
    task_name="datasets",
    project_root=HyFI.dotenv().DOTENV_DIR or ".",
    log_level="WARNING",
    verbose=True,
)

print("project directory:", prj.root_dir)
print("project workspace directory:", prj.workspace_dir)
INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.
project directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage
project workspace directory: /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/workspace
config = HyFI.compose("workflow=datasets")
config.project.project_root = prj.project_root
wf = HyFI.workflow_config(**config)
HyFI.run_workflow(wf)
INFO:hyfi.composer:Composing `WorkflowConfig` class with `datasets` config in `workflow` group.
INFO:hyfi.composer:Composing `ProjectConfig` class with `__init__` config in `project` group.
INFO:hyfi.composer:Composing `JobLibConfig` class with `__init__` config in `joblib` group.
INFO:hyfi.composer:Composing `PathConfig` class with `__init__` config in `path` group.
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.pipeline:Applying 6 pipes
 Change directory to workspace/tasks/dataset
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_ratings_raw.csv
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.106289
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
      type     code  name  market  year  rating_score  rating_change rating
1969     E  A000010  신한은행       0  2011           NaN            NaN    NaN
9845     E  A000010  신한은행       0  2012           NaN            NaN    NaN
17721    E  A000010  신한은행       0  2013           NaN            NaN    NaN
25597    E  A000010  신한은행       0  2014           NaN            NaN    NaN
33473    E  A000010  신한은행       0  2015           NaN            NaN    NaN
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_ratings_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.090372
      type     code  name  market  year  rating_score  rating_change rating
1969     E  A000010  신한은행       0  2011           NaN            NaN    NaN
9845     E  A000010  신한은행       0  2012           NaN            NaN    NaN
17721    E  A000010  신한은행       0  2013           NaN            NaN    NaN
25597    E  A000010  신한은행       0  2014           NaN            NaN    NaN
33473    E  A000010  신한은행       0  2015           NaN            NaN    NaN
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.iolibs:Processing [1] files from ['https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet']
INFO:hyfi.utils.datasets:Loading data from https://assets.entelecheia.ai/datasets/esg_coverage/ESG_coverage_raw.parquet
INFO:hyfi.utils.datasets: >> elapsed time to load data: 0:00:00.098626
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
      code  year type  topic_weight  yearly_change  coverage_change
0  A000070  2013    E      0.000017            NaN              NaN
1  A000070  2015    E      0.000025   7.496203e-06              0.0
2  A000070  2016    E      0.000029   3.520852e-06              0.0
3  A000070  2017    E      0.000028  -5.451693e-08              0.0
4  A000080  2011    E      0.001494            NaN              NaN
INFO:hyfi.pipeline:Running a pipe with functools.partial(<function dataframe_external_funcs at 0x7f92f436e0d0>)
INFO:hyfi.composer:Composing `TaskConfig` class with `datasets` config in `task` group.
INFO:hyfi.composer:Composing `BatchPathConfig` class with `__batch__` config in `path` group.
INFO:hyfi.utils.datasets:Saving dataframe to processed/ESG_coverage_processed.parquet
INFO:hyfi.utils.datasets: >> elapsed time to save data: 0:00:00.010824
      code  year type  topic_weight  yearly_change  coverage_change
0  A000070  2013    E      0.000017            NaN              NaN
1  A000070  2015    E      0.000025   7.496203e-06              0.0
2  A000070  2016    E      0.000029   3.520852e-06              0.0
3  A000070  2017    E      0.000028  -5.451693e-08              0.0
4  A000080  2011    E      0.001494            NaN              NaN
 Change directory back to /mnt/nvme1n1p2/home/yj.lee/workspace/projects/esg-coverage/book/supplementary
ratings_df = HyFI.load_dataframe(
    "processed/ESG_ratings_processed.parquet",
    data_dir=prj.workspace_dir / "dataset",
)

print(
    ratings_df[ratings_df.code == "A005930"]
    .sort_values(["type", "code", "year"])
    .head(10)
    .to_markdown()
)
|       | type   | code    | name     |   market |   year |   rating_score |   rating_change | rating   |
|------:|:-------|:--------|:---------|---------:|-------:|---------------:|----------------:|:---------|
|  2263 | E      | A005930 | 삼성전자 |        1 |   2011 |              6 |             nan | A+       |
| 10139 | E      | A005930 | 삼성전자 |        1 |   2012 |              5 |              -1 | A        |
| 18015 | E      | A005930 | 삼성전자 |        1 |   2013 |              5 |               0 | A        |
| 25891 | E      | A005930 | 삼성전자 |        1 |   2014 |              5 |               0 | A        |
| 33767 | E      | A005930 | 삼성전자 |        1 |   2015 |              6 |               1 | A+       |
| 41643 | E      | A005930 | 삼성전자 |        1 |   2016 |              5 |              -1 | A        |
| 49519 | E      | A005930 | 삼성전자 |        1 |   2017 |              5 |               0 | A        |
| 57395 | E      | A005930 | 삼성전자 |        1 |   2018 |              5 |               0 | A        |
| 65271 | E      | A005930 | 삼성전자 |        1 |   2019 |              4 |              -1 | B+       |
| 73147 | E      | A005930 | 삼성전자 |        1 |   2020 |              5 |               1 | A        |

ESG Coverage Dataset#

ESG Coverage Dataset Description

The ESG coverage dataset provides data on the topic weights of different environmental, social, and governance (ESG) topics for various companies over the years. Specifically, it has the following columns:

  • code: a string representing the unique identifier of a company.

  • year: the year for which the topic weights are provided.

  • topic_num: a string representing the topic number, ranging from ‘topic0’ to ‘topic35’.

  • topic_weight: a float representing the weight of a specific topic for a given company in a given year.

Sample data:

code

year

topic_num

topic_weight

0

A000070

2003

topic0

1.86484e-05

1

A000070

2003

topic1

9.6384e-05

2

A000070

2003

topic10

2.59255e-05

3

A000070

2003

topic11

3.35297e-05

4

A000070

2003

topic12

0.292819

The conversion process reshapes and enriches this raw dataset to provide more insightful views of the ESG coverage. This process involves the following steps:

  1. Pivot the DataFrame: The pivot() function is used to reshape the raw data, with code and year as the index, topic_num as the columns, and topic_weight as the values.

  2. Create New Categories: New columns are added to represent the total topic weights for different ESG categories. These include:

    • ‘E’: the weight of ‘topic0’

    • ‘S’: the weight of ‘topic1’

    • ‘G’: the sum of the weights of ‘topic2’ and ‘topic6’

    • ‘G2’: the weight of ‘topic2’

    • ‘ESG’: the sum of the weights of ‘E’, ‘S’, and ‘G’

    • ‘ESG2’: the sum of the weights of ‘E’, ‘S’, and ‘G2’

  3. Melt the DataFrame: The DataFrame is then melted to reshape it again, turning the newly created categories into rows in a new ‘type’ column, with the corresponding topic weights listed in a ‘topic_weight’ column. Each row now represents the total topic weight for a specific ESG category, for a specific company in a given year.

  4. Calculate Yearly Changes: For each company and each ESG category, the year-on-year changes in ‘topic_weight’ are calculated.

  5. Calculate Coverage Change: Based on the yearly changes, a new ‘coverage_change’ column is created. If the change in ‘topic_weight’ is greater than 0.1 standard deviations, ‘coverage_change’ is set to 1. If it is less than -0.1 standard deviations, it is set to -1. Otherwise, it is set to 0. If data for the previous year does not exist or is NaN, ‘coverage_change’ is set to NaN.

The converted ESG coverage dataset is now in a format that allows for more detailed analysis and comparison with the ESG ratings. It provides an overview of the emphasis placed on different ESG topics by each company over time, as well as the yearly changes in these topic weights.

Converted data is saved as ESG_coverage_processed.parquet.

Sample data:

code

year

type

topic_weight

yearly_change

coverage_change

412

A005930

2011

E

0.00133423

nan

nan

413

A005930

2012

E

7.16375e-05

-0.00126259

-1

414

A005930

2013

E

0.000591296

0.000519659

0

415

A005930

2014

E

0.000554041

-3.72556e-05

0

416

A005930

2015

E

0.000238747

-0.000315294

0

417

A005930

2016

E

0.000624334

0.000385587

0

418

A005930

2017

E

0.000190126

-0.000434208

0

419

A005930

2018

E

0.000252869

6.27427e-05

0

420

A005930

2019

E

0.0004799

0.000227031

0

421

A005930

2020

E

0.00112099

0.000641088

0

coverage_df = HyFI.load_dataframe(
    "processed/ESG_coverage_processed.parquet",
    data_dir=prj.workspace_dir / "dataset",
)

print(
    coverage_df[coverage_df.code == "A005930"]
    .sort_values(["type", "code", "year"])
    .head(10)
    .to_markdown()
)
|     | code    |   year | type   |   topic_weight |   yearly_change |   coverage_change |
|----:|:--------|-------:|:-------|---------------:|----------------:|------------------:|
| 412 | A005930 |   2011 | E      |    0.00133423  |   nan           |               nan |
| 413 | A005930 |   2012 | E      |    7.16375e-05 |    -0.00126259  |                 0 |
| 414 | A005930 |   2013 | E      |    0.000591296 |     0.000519659 |                 0 |
| 415 | A005930 |   2014 | E      |    0.000554041 |    -3.72556e-05 |                 0 |
| 416 | A005930 |   2015 | E      |    0.000238747 |    -0.000315294 |                 0 |
| 417 | A005930 |   2016 | E      |    0.000624334 |     0.000385587 |                 0 |
| 418 | A005930 |   2017 | E      |    0.000190126 |    -0.000434208 |                 0 |
| 419 | A005930 |   2018 | E      |    0.000252869 |     6.27427e-05 |                 0 |
| 420 | A005930 |   2019 | E      |    0.0004799   |     0.000227031 |                 0 |
| 421 | A005930 |   2020 | E      |    0.00112099  |     0.000641088 |                 0 |