minimaxir|10 months ago
1. The better way to get all Hacker News data instead of blasting the API is to download the data from the official BigQuery dataset, which can do the task in a single query: https://news.ycombinator.com/item?id=40644563
2. For labeling the posts, instead of label-then-explanation, it may be better to do explanation-then-label to give the model a chance to reason through the edge cases.
3. Following up on #2, when engineering the system prompt, it would likely be better to give a list of multiple valid and invalid examples (as noted after the fact) to guide reasoning.
4. Since the target label is a binary objective, it may be more practical/faster/cheaper to create a normal logistic regression model (e.g. tf-idf/BoW) from a large representative sample, then use that to predict the rest of the labels.
The more advanced way to do #4 would be to encode the posts as text embeddings first, then use them as the input for a small MLP model...which I may or may not have a project in the pipeline based around.
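A minimal sketch of the baseline in #4, assuming scikit-learn is available and that a labeled sample already exists (e.g. from the LLM pass). The function name `train_and_predict` and the toy titles are made up for illustration and are not from the original project; a real run would need a far larger, representative sample.

```python
# Sketch only: assumes scikit-learn is installed; names and toy data are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_and_predict(labeled_texts, labels, unlabeled_texts):
    """Fit tf-idf + logistic regression on a labeled sample, then label the rest."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(labeled_texts, labels)
    # Convert the NumPy array of predictions to a plain list of ints.
    return model.predict(unlabeled_texts).tolist()


# Hypothetical labeled sample: 1 = matches the target label, 0 = does not.
SAMPLE_TEXTS = [
    "Show HN: open source vector database in Rust",
    "Rust compiler performance improvements",
    "New Python library for database migrations",
    "Open source database written in Rust",
    "Senate votes on new budget bill",
    "City council election results announced",
    "Budget vote delayed in senate",
    "Election turnout hits record in city",
]
SAMPLE_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    rest = ["Rust database library released", "Senate election budget debate"]
    print(train_and_predict(SAMPLE_TEXTS, SAMPLE_LABELS, rest))
```

The same pipeline shape should carry over to the embeddings variant by swapping the tf-idf step for precomputed embedding vectors and the classifier for a small MLP.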
sethkim|10 months ago
On 4. specifically we've got some thoughts here as well. Will reach out!