# Project Dataset

## Overview

This repository contains the dataset associated with the research paper *"Comprehensive records of a financial social media platform"*. The data is organized into several groups of CSV files, each covering a different aspect of message processing and sentiment analysis. The files are hosted in a publicly accessible AWS S3 bucket and can be downloaded or read directly with `pandas` or `dask`.

### Recommendation: Use `Dask` for Large Datasets

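For larger datasets, we recommend `Dask` over `pandas`: it processes data in partitions, which keeps memory use manageable and enables parallel processing. Install `Dask` with DataFrame support, plus `s3fs` for reading `s3://` paths:

```bash
pip install "dask[dataframe]" s3fs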
```

## Table of Contents

1. [Repository Structure](#1-repository-structure)
2. [Accessing Dataset from S3](#2-accessing-dataset-from-s3)
3. [Reading Files from S3 with Python](#3-reading-files-from-s3-with-python)
4. [File Descriptions and Columns](#4-file-descriptions-and-columns)
5. [Example Code Using `Dask`](#5-example-code-using-dask)
6. [Usage](#usage)

### 1. Repository Structure

The dataset structure in this repository is organized as follows:

```text
dataset/
├── README.md
└── v1
    └── data
        └── csv
            ├── feature_wo_messages
            │   ├── feature_wo_messages_000.csv
            │   └── ... 
            ├── messages
            │   ├── msg_000.csv
            │   └── ... 
            ├── msg_info
            │   ├── msg_info_00.csv
            │   └── ...
            ├── sentiments
            │   ├── sentiment_00.csv
            │   └── ...
            ├── symbols
            │   ├── symbol_000.csv
            │   └── ...
            └── symbol_sentiments
                ├── symbol_sentiments_00.csv
                └── ...
```

### 2. Accessing Dataset from S3

The dataset is hosted on an AWS S3 bucket. The base URL for accessing the dataset files is as follows:

```bash
BASE_URL="s3://stocktwits-nyu"
CSV_URL="${BASE_URL}/dataset/v1/data/csv"
```

To list and download the files:

- **List Files in the S3 Bucket:**

```bash
aws s3 ls --no-sign-request "$CSV_URL/"
```

- **Download a Specific File:**

```bash
aws s3 cp --no-sign-request $CSV_URL/feature_wo_messages/feature_wo_messages_000.csv .
```

- **Sync an Entire Directory:**

```bash
aws s3 sync --no-sign-request $CSV_URL/ .
```

### 3. Reading Files from S3 with Python

To load data directly from S3 without downloading it first, use `pandas` or `dask` (both rely on the `s3fs` package for `s3://` paths):

```python
BASE_URL = "s3://stocktwits-nyu"  # or a local path, e.g. BASE_URL = "local_path"
CSV_URL = f"{BASE_URL}/dataset/v1/data/csv"
```

- **Using `pandas`:**

```python
import pandas as pd

data_url = f"{CSV_URL}/feature_wo_messages/feature_wo_messages_000.csv"
df = pd.read_csv(data_url, dtype={"sentiment": "object", "message_id": "object"})
print(df.head())
```

- **Using `dask`:**

```python
import dask.dataframe as dd

data_url = f"{CSV_URL}/feature_wo_messages/*.csv"
df_dask = dd.read_csv(data_url, dtype={"sentiment": "object", "message_id": "object"})
print(df_dask.head())
```
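
The AWS CLI examples in this README use `--no-sign-request`, which suggests the bucket allows anonymous reads. If no AWS credentials are configured on your machine, `pandas` and `dask` may need the equivalent setting, which `s3fs` exposes as `storage_options={"anon": True}`. The following is a minimal sketch rather than part of the original project: `read_csv_kwargs` and `load_table` are hypothetical helper names, and the need for anonymous access is an assumption based on the CLI flags.

```python
def read_csv_kwargs(path: str) -> dict:
    """Build keyword arguments for pd.read_csv or dd.read_csv."""
    # Same dtype overrides as the examples in this README.
    kwargs = {"dtype": {"sentiment": "object", "message_id": "object"}}
    if path.startswith("s3://"):
        # Assumption: the bucket permits unsigned (anonymous) reads, as the
        # `--no-sign-request` CLI examples suggest.
        kwargs["storage_options"] = {"anon": True}
    # Local paths get no storage_options, since pandas rejects them there.
    return kwargs


def load_table(table: str, base_url: str = "s3://stocktwits-nyu"):
    """Lazily load every CSV file of one table as a Dask DataFrame."""
    # Imported here so read_csv_kwargs stays usable without Dask installed.
    import dask.dataframe as dd

    pattern = f"{base_url}/dataset/v1/data/csv/{table}/*.csv"
    return dd.read_csv(pattern, **read_csv_kwargs(pattern))
```

For example, `load_table("messages").head()` would fetch the first rows of the `messages` table. For a local copy that mirrors the repository layout, pass its root directory as `base_url`.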

### 4. File Descriptions and Columns

Each directory holds CSV files that share the same columns and cover one aspect of the dataset:

- **`feature_wo_messages`**  
  Columns: `message_id`, `user_id`, `created_at`, `sentiment`, `parent_message_id`, `in_reply_to_message_id`, `symbol_list`
  
- **`messages`**  
  Columns: `message_id`, `message_body`
  
- **`msg_info`**  
  Columns: `message_id`, `length`, `important_words`
  
- **`sentiments`**  
  Columns: `message_id`, `user_id`, `created_at`, `sentiment`, `symbol_list`
  
- **`symbols`**  
  Columns: `message_id`, `user_id`, `created_at`, `sentiment`, `symbol_list`, `sym_number`, `symbol`
  
- **`symbol_sentiments`**  
  Columns: `message_id`, `user_id`, `created_at`, `sentiment`, `symbol_list`
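
Judging by the sample rows shown in this README, `symbol_list` is stored as text that looks like a Python list literal (for example `['ZNGA', 'META']`), and `created_at` appears either as a full UTC timestamp (`2008-05-27T15:28:28Z`) or as a plain date (`2012-10-15`). The sketch below shows one way to convert such values using only the standard library. The function names are hypothetical, and treating a date-only value as midnight UTC is an assumption, not something the dataset documents.

```python
import ast
import math
from datetime import datetime, timezone


def parse_symbol_list(value):
    """Turn the text "['ZNGA', 'META']" into the list ['ZNGA', 'META']."""
    # Missing values (None or NaN) become an empty list.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # literal_eval only accepts Python literals, so it is safer than eval().
    return [str(symbol) for symbol in ast.literal_eval(value)]


def parse_created_at(value):
    """Parse '2008-05-27T15:28:28Z' or '2012-10-15' into a UTC datetime."""
    # Assumption: date-only values are interpreted as midnight UTC.
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in value else "%Y-%m-%d"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
```

With `pandas`, these can be applied per column, for example `df["symbol_list"].map(parse_symbol_list)`. With `Dask`, the same `map` call should work, though it may ask for a `meta` argument describing the output type.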

### 5. Example Code Using `Dask`

Here is a script that demonstrates how to load and explore the different datasets with `Dask`:

```python
import dask.dataframe as dd

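# Defined here so the script runs on its own (same values as in section 3)
BASE_URL = "s3://stocktwits-nyu"
CSV_URL = f"{BASE_URL}/dataset/v1/data/csv"

# S3 glob patterns, one per table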
file_paths = {
    "Feature Without Messages": f"{CSV_URL}/feature_wo_messages/*.csv",
    "Messages": f"{CSV_URL}/messages/*.csv",
    "Message Info": f"{CSV_URL}/msg_info/*.csv",
    "Sentiments": f"{CSV_URL}/sentiments/*.csv",
    "Symbols": f"{CSV_URL}/symbols/*.csv",
    "Symbol Sentiments": f"{CSV_URL}/symbol_sentiments/*.csv"
}

# Load and display each file using Dask
for key, path in file_paths.items():
    print(f"--- {key} ---")
    df = dd.read_csv(path, dtype={"sentiment": "object", "message_id": "object"})
    print(df.head(), "\n")
```

#### Output

```text
--- Feature Without Messages ---
  message_id  user_id            created_at sentiment  parent_message_id  in_reply_to_message_id symbol_list
0          4      593  2008-05-27T15:28:28Z       NaN                NaN                     NaN       ['V']
1          5     8687  2008-05-27T16:03:34Z       NaN                NaN                     NaN     ['NES']
2          6      549  2008-05-27T17:48:41Z       NaN                6.0                     NaN    ['AAPL']
3          7      170  2008-05-27T19:11:10Z       NaN                7.0                     NaN     ['XLE']
4          9      126  2008-05-27T22:39:09Z       NaN                NaN                     NaN    ['AAPL'] 

--- Messages ---
  message_id                                       message_body
0          4                        Sorry, I mean trading $V ;)
1          5  Following HEK ($HEK for stocktweets) this morn...
2          6  Wondering when the $AAPL rocket is going to ta...
3          7  Welcome early adopters! Remember to prefix the...
4          9  My $AAPL puts are now barely profitable.  I st... 

--- Message Info ---
  message_id  length                      important_words
0          4      27         ['sorry', 'mean', 'trading']
1          5      52  ['hek', 'stocktweets', 'following']
2          6      85             ['wwdc', 'rocket', 'im']
3          7      89       ['adopters', 'xle', 'welcome']
4          9     138         ['barely', 'ok', 'absorbed'] 

--- Sentiments ---
  message_id  user_id  created_at sentiment       symbol_list
0   10000059     6472  2012-10-15      -1.0  ['ZNGA', 'META']
1   10000071   148519  2012-10-15       1.0           ['FVI']
2   10000072    75026  2012-10-15       1.0            ['GS']
3   10000084   155028  2012-10-15       1.0          ['WYNN']
4   10000088    75026  2012-10-15       1.0           ['JPM'] 

--- Symbols ---
  message_id  user_id  created_at sentiment symbol_list  sym_number symbol
0          4      593  2008-05-27       NaN       ['V']           1      V
1          5     8687  2008-05-27       NaN     ['NES']           1    NES
2          6      549  2008-05-27       NaN    ['AAPL']           1   AAPL
3          7      170  2008-05-27       NaN     ['XLE']           1    XLE
4          9      126  2008-05-27       NaN    ['AAPL']           1   AAPL 

--- Symbol Sentiments ---
  message_id  user_id  created_at sentiment       symbol_list
0   10000059     6472  2012-10-15      -1.0  ['ZNGA', 'META']
1   10000071   148519  2012-10-15       1.0           ['FVI']
2   10000072    75026  2012-10-15       1.0            ['GS']
3   10000084   155028  2012-10-15       1.0          ['WYNN']
4   10000088    75026  2012-10-15       1.0           ['JPM'] 

```
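
Message text lives in `messages`, while metadata such as `user_id` and `created_at` lives in `feature_wo_messages`, and both tables carry `message_id`. Combining them therefore comes down to a join on that column. The sketch below is illustrative only: `attach_message_text` and `demo` are hypothetical names, and the two tiny frames in `demo` are built by hand from the sample rows above rather than loaded from the dataset.

```python
def attach_message_text(features, messages):
    """Left-join message bodies onto feature rows by message_id.

    Works with pandas or Dask DataFrames, since both expose `merge`.
    """
    return features.merge(messages, on="message_id", how="left")


def demo():
    """Run the join on two hand-made frames (requires pandas)."""
    import pandas as pd

    # message_id is kept as a string, matching the dtype used in this README.
    features = pd.DataFrame({"message_id": ["4"], "user_id": [593]})
    messages = pd.DataFrame(
        {"message_id": ["4"], "message_body": ["Sorry, I mean trading $V ;)"]}
    )
    return attach_message_text(features, messages)
```

A left join keeps every row of `feature_wo_messages` even if a message body is missing. Reading `message_id` with the same dtype in both tables, as the examples here do, avoids mismatched join keys.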

### Why Use Dask?

- **Handles Large Files**: Dask splits data into partitions and processes them piece by piece, so it can work with datasets that do not fit in memory.
- **Parallel Computation**: It can use multiple CPU cores, which speeds up data processing.
- **Scalable**: The same code scales from a single machine to a distributed cluster as the data grows.

## Usage

This repository provides all data necessary for replicating the analysis and results discussed in the associated research paper. You can access and process data using the examples provided here, adapting them for either `Dask` or `pandas` depending on dataset size.

