thyrox | 1 year ago

Very nice. Since HN data spawns so many fun projects like this, there should be a monthly or weekly updates zip file or torrent with this data, which hackers can just download instead of writing a scraper and starting from scratch every time.

zX41ZdbW | 1 year ago

It is very easy to get this dataset directly from the HN API. Let me just post it here:

Table definition:

    CREATE TABLE hackernews_history
    (
        update_time DateTime DEFAULT now(),
        id UInt32,
        deleted UInt8,
        type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        by LowCardinality(String),
        time DateTime,
        text String,
        dead UInt8,
        parent UInt32,
        poll UInt32,
        kids Array(UInt32),
        url String,
        score Int32,
        title String,
        parts Array(UInt32),
        descendants Int32
    )
    ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
    
A shell script:

    # Items are fetched in batches of this size, one INSERT per batch.
    BATCH_SIZE=1000

    # Settings for the url() table function: skip missing and empty items,
    # retry transient HTTP failures, and fetch the whole batch concurrently.
    TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"

    # Fetch the current maximum item id; ids run from 1 up to this value.
    rm -f maxitem.json
    wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json

    # Emit one comma-separated batch of ids per line, newest batches first.
    clickhouse-local --query "
        SELECT arrayStringConcat(groupArray(toString(number)), ',') FROM numbers(1, $(cat maxitem.json))
        GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
    while read -r ITEMS
    do
        echo "$ITEMS"
        # The {id1,id2,...} glob in url() expands into one request per item.
        clickhouse-client $TWEAKS --query "
            INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
    done
It takes a few hours to download the data and fill the table.
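
Note that rerunning the script inserts fresh copies of items; with ReplacingMergeTree they collapse to the row with the latest update_time during background merges, and FINAL forces the collapse at query time. A quick sanity check after the load (a sketch, assuming the table above):

    -- Deduplicated count: at most one row per item id.
    SELECT count() FROM hackernews_history FINAL;

    -- Smoke test: the highest-scoring stories.
    SELECT id, title, score
    FROM hackernews_history FINAL
    WHERE type = 'story'
    ORDER BY score DESC
    LIMIT 10;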

strooper | 1 year ago

While trying the script, I get the following error:

<Trace> ReadWriteBufferFromHTTP: Failed to make request to 'https://hacker-news.firebaseio.com/v0/item/40298680.json'. Error: Timeout: connect timed out: 216.239.32.107:443. Failed at try 3/10. Will retry with current backoff wait is 200/10000 ms.

I googled it with no luck. Do you have a solution for this?
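
The only lead I have so far: ClickHouse's HTTP connect timeout defaults to one second, so I am going to try raising it together with the retry backoff (untested, just from the settings reference):

    # Untested: allow 10 s to connect, and back off longer between retries.
    TWEAKS="$TWEAKS --http_connection_timeout 10 --http_retry_initial_backoff_ms 1000 --http_retry_max_backoff_ms 30000"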

pfarrell | 1 year ago

I have a daily updated dataset with the HN data split out by month. I've published it on my web page, but it's served from my home server, so I don't want to link to it directly. Each month is about 30 MB of compressed CSV. I've wanted to torrent it, but I don't know how to get enough seeders, since each month would produce a new torrent file (unless I'm mistaken). If you're interested, send me a message. My email is mrpatfarrell. Use gmail for the domain.

noman-land | 1 year ago

I very much support this idea. Put them on IPFS and/or torrents. Put them on Hugging Face.
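
For IPFS, a rough sketch of what publishing could look like (assuming the kubo CLI; the directory and repo names are made up):

    # Add and pin the monthly dumps; prints a CID for the directory.
    ipfs add -r hn-monthly/
    # Publish the CID under an IPNS name, which stays stable across monthly updates.
    ipfs name publish /ipfs/<directory-CID>
    # Hugging Face alternative: a dataset repo gives a stable, versioned URL.
    huggingface-cli upload <user>/hn-dumps hn-monthly/ --repo-type dataset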

pfarrell | 1 year ago

I’ve had this same thought but was unsure what the licensing for the data would be.