seems like it is working. i was expecting models like minimax 229b or higher to succeed here but qwen3 30b also did a good job!
gpt oss 120b was terribly censored and of no use. even when the prompt explicitly says to answer based only on the text, it refuses. the huihui (derestricted) version of it was doing fine, until it failed to understand one article. so derestricted models seem to act weird, or the derestriction process makes them dumber.
overall, it looks like i can 10x the dataset. we will soon see whether the training and evals look good.