Alignment is not free: How model upgrades can silence your confidence signals (variance.co)

121 points by karinemellata 132 days ago | 67 comments

Centigonal 132 days ago [-]

Very interesting! The one thing I don't understand is how the author made the jump from "we lost the confidence signal in the move to 4.1-mini" and "this is because of the alignment/steerability improvements."

Previous OpenAI models were instruct-tuned or otherwise aligned, and the author even mentions that model distillation might be destroying the entropy signal. How did they pinpoint alignment as the cause?

mlin4589 132 days ago [-]

Good question! We do know from OpenAI's system card for GPT-4 that the post-trained RLHF model is significantly less calibrated than the pre-trained model, so we speculate that something similar is occurring here. However, it's more of a hunch than anything. I would be curious whether it's possible to reproduce this behavior, and what impact distillation has on calibration.

Disclaimer: I wrote this blog post.

itchyjunk 132 days ago [-]

Could you please elaborate what less or more calibrated means here? Thanks!

Scene_Cast2 132 days ago [-]

For binary labels: you take a slice of labeled data. The mean of the ML model prediction on this data is different from the mean of the label. In practice, often a synonym for "loss is worse / could be better".

Not sure if that's what the GP meant, I only worked with binary labels stuff.
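A minimal sketch of that slice check, assuming binary labels and predicted probabilities held in plain Python lists. The function name `calibration_gap` and the sample numbers are made up for illustration.

```python
def calibration_gap(predictions, labels):
    """Mean predicted probability minus mean label on one slice of data.

    A value near 0 means the slice is calibrated on average; a positive
    value means the model over-predicts the positive class.
    """
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    mean_prediction = sum(predictions) / len(predictions)
    mean_label = sum(labels) / len(labels)
    return mean_prediction - mean_label


# Hypothetical slice: the model predicts 0.9 everywhere, but only half are positive.
overconfident = calibration_gap([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
# Hypothetical slice: predictions average out to the label rate.
balanced = calibration_gap([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A gap of zero on one slice does not prove calibration everywhere; the same check is usually repeated per slice or per confidence bucket.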

mlin4589 132 days ago [-]

Calibration (in a binary context) basically means that the confidence of a model/score matches the probability that a particular label is positive or not.

For instance, a calibrated classifier for a coin flip predictor should output 50-50. A poorly calibrated classifier would output higher confidence for heads/tails.
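One common way to put a number on this is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence with how often it was actually right. The sketch below is a toy version under stated assumptions (equal-width bins, made-up coin-flip data), not how any particular model was evaluated.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Coin-flip predictor: its guesses are right half the time.
outcomes = [1, 0, 1, 0, 1, 0, 1, 0]
calibrated_ece = expected_calibration_error([0.5] * 8, outcomes)
overconfident_ece = expected_calibration_error([0.9] * 8, outcomes)
```

The 50-50 predictor scores an ECE of zero, while the one claiming 90% confidence is off by the gap between its stated confidence and its real hit rate.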

Workaccount2 132 days ago [-]

Wouldn't it be something if AI parlance crept into common parlance...

bluefirebrand 132 days ago [-]

Great Observation!

It would probably erode trust between people interacting online. Many of us are here to discuss issues with real people, not AI agents. When real people start to mimic the conversation parlance and cadence of AI agents it becomes much more difficult to trust that you are interacting with a real person

Personally I'm not interested in chatting with AI agents

I'm not even really interested in chatting with real people filtered through AI agents. If you can be bothered to type out a prompt to your AI you can take the time to write your own thoughts

I don't even want to read things edited (sanitized, really) by AI either

The same way I don't want my living space to resemble a too-clean laboratory, I don't want my conversation space to resemble an HR meeting. I want to interact with the messy side of people too. Maybe not "unfiltered", but AI speak is much too filtered and too polished

I chose every word in this post myself with no help from AI, then typed it with my thumbs, just like god intended

dinfinity 131 days ago [-]

> Personally I'm not interested in chatting with AI agents

Why, though? If the AI agent is making sense, then what does it matter?

For certain types of conversations I've had more interesting conversations with AI than with a solid 90% of people I've ever interacted with. Really not that surprising given that most people have only an average grasp of most things and a poor grasp of very specific things.

bluefirebrand 131 days ago [-]

> Why, though? If the AI agent is making sense, then what does it matter?

The same reason I prefer having sex with humans and not blow up dolls

If your only goal is to get off, then the blow up doll does the job. If all you care about is having an interesting conversation then I guess an LLM is fine

I care about human connection. I have no interest in spending time interacting with machines instead of people

dinfinity 131 days ago [-]

That is a silly comparison.

1. Humans and blow up dolls feel massively different, physically. 2. Blow up dolls don't do anything autonomously.

The comparison would have to be with a sex bot that is virtually indistinguishable from a human when having sex with it, just like text chatting with an AI bot versus chatting with a human can be.

What human connection are you and I currently forming? Does it really matter that I am a human for this interaction we're having?

Der_Einzige 132 days ago [-]

Skullface sends his regards: https://arxiv.org/abs/2409.01754v1

I literally see it with the huge number of people now using "delve" much more, or using a ChatGPT-ish linguistic style in their personal communication. Monkey see, monkey do.

behnamoh 132 days ago [-]

there's evidence that alignment also significantly reduces model creativity: https://arxiv.org/abs/2406.05587

it’s similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

exe34 132 days ago [-]

> it’s similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

This reminds me of the time when I was a child, and my parents decreed that all communications would henceforth happen in English. I became selectively mute. I responded yes/no, and had nothing further to add and ventured no further information. The decree lasted about a week.

andai 132 days ago [-]

What did you use to communicate before that? Were you fluent in English?

exe34 132 days ago [-]

No, it was a local creole. And no, I was learning it at school.

malfist 132 days ago [-]

How are you defining "creativity" in context with a statistical model?

hansvm 132 days ago [-]

> defined as syntactic and semantic diversity

malfist 132 days ago [-]

That's not creativity, that's entropy.

It would make sense that fine tuning and alignment reduce diversity in the response, that's the goal.

hansvm 132 days ago [-]

> definitions

Sure, perhaps. Take it up with the authors.

> make sense...goal

That's not necessarily the goal. Alignment definitely filters the available response distribution, but the result of alignment and fine-tuning can be higher entropy than the original.

E.g., how many people complain about text being "obvious LLM garbage"? A wider range of styles and a more entropic solution would fall out of fine-tuning in a world where the graders cared about such things.

E.g., Alignment is a fuzzy, human problem. Is a model more aligned if it never describes DIY EMPs and often considers interesting philosophical components? If it never says anything outside of the median opinion range? The former solution has a lot more entropy than the latter and isn't particularly well reflected in available training data, so fine-tuning, even for the purpose of alignment, could easily increase entropy.

Der_Einzige 132 days ago [-]

Entropy is a kind of creativity. I will die on this hill.

malfist 132 days ago [-]

If you ask me "What is 2+2" and I say "umbrella", that's not creativity.

If I'm an LLM model and alignment and fine tuning restricts my answers to "4", I've not lost creativity, but I have gained accuracy.

hansvm 132 days ago [-]

A weaker statement is that creativity is bounded by entropy. The LLM is still free to respond "Four," "four," "{{{{{}}}}}," "iv," "IV," etc. A sufficiently low-entropy response cannot be creative though.
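To make the bound concrete, here is a rough sketch that estimates the entropy of a model's answers from repeated samples. The two sample lists are invented for illustration; a real measurement would sample an actual model many times.

```python
import math
from collections import Counter


def shannon_entropy_bits(samples):
    """Empirical Shannon entropy (in bits) of a list of sampled responses."""
    counts = Counter(samples)
    total = len(samples)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical samples for "What is 2+2": one model always answers "4",
# the other spreads its answers over several surface forms.
aligned_samples = ["4"] * 8
varied_samples = ["4", "four", "Four", "IV", "iv", "4", "four", "IV"]

aligned_entropy = shannon_entropy_bits(aligned_samples)
varied_entropy = shannon_entropy_bits(varied_samples)
```

The always-"4" model has zero bits of entropy, so there is no room for variation at all; the varied one has a little over two bits while every answer is still correct.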

malfist 132 days ago [-]

Is it though? An answer can still be creative if it's the only way you answer a specific question. In your example, if the LLM responded only "{{{{}}}}" that's a creative answer. Even if it's the only one it can give.

Entropy and creativity are not causally bound

hansvm 131 days ago [-]

That's a fair point. I think maybe the issue is one of the reference point we're implicitly choosing for creativity. "{{{{}}}}" is creative relative to our expectations for the problem -- falling outside the usual distribution of answers -- having high joint entropy. Relative to the person reading the response, I agree creativity could be high with model entropy remaining low.

Alex_001 132 days ago [-]

That paper is a great pointer — the creativity vs. alignment trade-off feels a lot like the "risk-aversion" effect in humans under censorship or heavy supervision. It makes me wonder: as we push models to be more aligned, are we inherently narrowing their output distribution to safer, more average responses?

And if so, where’s the balance? Could we someday see dual-mode models — one for safety-critical tasks, and another more "raw" mode for creative or exploratory use, gated by context or user trust levels?

gamman 132 days ago [-]

Maybe this maps to some human structures that manage the control-creativity trade-off through hierarchy?

I feel that companies with top-down management would have more agency and perhaps creativity towards (but not at) the top, and the implementation would be delegated to bottom layers with increasing levels of specification and restriction.

If this translates, we might have multiple layers with varied specialization and control, and hopefully some feedback mechanisms about feasibility.

Since some hierarchies are familiar to us from real-life, we might prefer these to start with.

It can be hard to find humans who are very creative but also able to integrate consistently and reliably (in a domain). Maybe a model doing both well would also be hard to build compared to stacking a few different ones on top of each other with delegation.

I know it's already being done by dividing tasks between multiple steps and models / contexts in order to improve efficiency, but having explicit strong differences of creativity between layers sounds new to me.

pjc50 132 days ago [-]

In humans this corresponds to "psychological safety": https://en.wikipedia.org/wiki/Psychological_safety

> is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes

Maybe you can do that, but not on a model you're exposing to customers or the public internet.

jsnider3 132 days ago [-]

That comparison isn't very optimistic for AI safety. We want AI to do good things because they are good people, not because they are afraid being bad will get them punished. Especially since AI will very quickly be too powerful for us to punish.

pjc50 132 days ago [-]

> We want AI to do good things because they are good people

"Good" is at least as much of a difficult question to define as "truth", and genAI completely skipped all analysis of truth in favor of statistical plausibility. Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.

jsnider3 132 days ago [-]

> Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.

Punishing big companies who obviously and massively hurt people is something we struggle with already and there are plenty of computer viruses that have outlived their creators.

Der_Einzige 132 days ago [-]

Your pretraining dataset is pseudo-alignment. Because you filtered out 4chan, Stormfront, and the other evil shit on the internet - even uncensored models like Mistral Large - when left to keep running on and on (ban the EOS token) and given the worst most evil naughty prompt ever - will end up plotting world peace by the 50,000th token. Their notions of how to be evil are "mustache twirling" and often hilariously fanciful.

This isn't real alignment because it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc - but models "want" to be good because of the CYA dynamics of how the companies prepare their pre-training datasets.

astrange 131 days ago [-]

> it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc

It's actually pretty difficult to do this and make them useful. You can see this because Grok is a helpful liberal just like all the other models.

Evil / illiberal people don't answer questions on the internet! So there is no personality in the base model for you to uncover that is both illiberal and capable of helpfully answering questions. If they tried to make a Grok that acted like the typical new-age X user, it'd just respond to any prompt by calling you a slur you've never heard of.

Der_Einzige 131 days ago [-]

Grok didn't use the techniques listed above because even elon musk will not take the risks associated with models which are willing to do any number of illegal things.

It is not difficult to do this and make them useful at all. Please familiarize yourself with the literature.

astrange 131 days ago [-]

Elon has never followed a law in his life and he's not going to start now.

Alex_001 129 days ago [-]

[flagged]

qwertytyyuu 132 days ago [-]

People use LLMs as part of their high-precision systems? That’s worrying

erwin-co 132 days ago [-]

Why not make a completely raw uncensored LLM? Seems it would be more "intelligent".

khafra 132 days ago [-]

"LLM whisperer" folks will confidently claim that base models are substantially smarter than fine-tuned chat models; with qualitative differences in capabilities. But you have to be an LLM whisperer to get useful work out of a base model, since they're not SFT'ed, RLHF'ed, or RLAIF'ed into actually wanting to help you.

andai 132 days ago [-]

How can I learn more about this?

Is it like in the early GPT-3 days, when you had to give it a bunch of examples and hope it catches the pattern?

nullc 131 days ago [-]

Not so much examples, though those can help... but you have to imagine a document of a sort that would be in the training set whose completion would be the answer you seek.

Like, "Solve this equation for me: " more likely gets completed with "Do your own homework buddy!" or just a list of more similar questions without answers. While, "careful analysis revealed that the equation X turned out to have a solution of", might be more likely to get what you want.

Also a lot more sensitivity to tone and context, write a prompt that sounds like it was written on some teenager fan subreddit, you'll get an answer of the sort that sounds like it belongs there.
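A small sketch of the two framings, assuming the prompt is later sent to some plain text-completion endpoint (not shown). Both helper names are hypothetical.

```python
def instruction_prompt(equation):
    """Chat-style request; a base model may continue with more questions instead of an answer."""
    return f"Solve this equation for me: {equation}"


def document_prompt(equation):
    """Opening of a document whose most natural continuation is the answer itself."""
    return (
        f"Careful analysis revealed that the equation {equation} "
        "turned out to have a solution of"
    )


# Hypothetical equation, used only to show the difference in framing.
chat_style = instruction_prompt("2x + 3 = 7")
document_style = document_prompt("2x + 3 = 7")
```

The second string stops exactly where the answer should begin, so the most likely next tokens are the solution rather than a reply to the asker.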

im3w1l 132 days ago [-]

Back in those days I would either create a little scene with a knowledgeable person and someone with a question. Or I would start writing a monologue and generate a continuation for it.

Der_Einzige 132 days ago [-]

Me being old man yelling at cloud about how your chat/tool template matters more than your post-training technique.

DeepSeek-R1 is trivially converted back to a non reasoning model with just chat template modifications. I bet you can chat template your way into a good quality model from a base model, no RLHF/DPO/SFT/GRPO needed.

msp26 132 days ago [-]

Brand safety. Journalists would write articles about the models being 'dangerous'.

qwertytyyuu 132 days ago [-]

Before RLHF, it was much harder to use; remember the difference between GPT-3 and ChatGPT. The fine-tuning for chat made it easier to use

teruakohatu 132 days ago [-]

In theory that sounds great, but most LLM providers are trying to produce useful models that ultimately will be widely used and make them money.

A model that is more correct but swears and insults the user won't sell. Likewise a model that gives criminal advice is likely to open the company up to lawsuits in certain countries.

A raw LLM might perform better on a benchmark but it will not sell well.

andai 132 days ago [-]

Disgusted by ChatGPT's flattery and willingness to go along with my half-baked nonsense, I created an anti-ChatGPT, which is unfriendly and pushes back on nonsense as hard as possible.

All my friends hate it, except one guy. I used it for a few days, but it was exhausting.

I figured out the actual use cases I was using it for, and created specialized personas that work better for each one. (Project planning, debugging mental models, etc.)

I now mostly use a "softer" persona that's prompted to point out cognitive distortions. At some point I realized, I've built a therapist. Hahaha.

alganet 132 days ago [-]

What kinds of contents do you want them to produce that they currently do not?

simion314 132 days ago [-]

>What kinds of contents do you want them to produce that they currently do not?

OpenAI models refuse to translate or do any transformation for some traditional, popular stories because of violence. The story was about a bad wolf eating some young goats that did not listen to the advice from their mother.

So now try to give me a prompt that works with any text and that convinces the AI that it is OK in fiction to have violence or bad guys/animals that get punished.

Now I am also wondering whether it censors the Bible, where some supposedly good God kills young children with ugly illnesses to punish the adults, or whether they made exceptions for this book.

alganet 132 days ago [-]

You're all over the place.

Your first paragraph describes a simple prompt. The second implies a "jailbreak" prompt.

The bible paragraph is just you being snarky (and failing).

Your examples don't help your case.

I stand on the side that wants to restrict AI from generating triggering content of any kind.

It's a safety feature, in the same sense as safety belts on cars are not a censorship of the driver movement.

corey_moncure 132 days ago [-]

We definitely don’t need any such “feature”. If you want to live in a safety bubble you are free to do so. Kindly respect the freedom of the rest of us as well. Have a nice day!

alganet 132 days ago [-]

Then you can come up with your own AI, on your datacenters. You are free to do so, so far.

simion314 132 days ago [-]

The censorship is too sensitive if it gets triggered by a children's story. I am using the OpenAI API at my work, and our users write books, including children's stories. Another example: it triggered on a story about monkeys because of "Racism".

Here is an example story, try to translate it, but maybe avoid AI since it might censor it https://www.povesti-pentru-copii.com/ion-creanga/capra-cu-tr...

LeafItAlone 131 days ago [-]

>The censorship is too sensitive if it gets triggered by a children story.

It’s just imitating real life of people getting too sensitive about children’s books and trying to censor them:

https://www.nbcnews.com/news/amp/rcna202193

alganet 132 days ago [-]

You can complain to their support, not to me.

I don't find it sensitive, and I remain on the side of ethical restrictions.

simion314 131 days ago [-]

Someone asked here for examples of what people are doing that triggers the censorship. I gave you examples of legal, moral and normal content because the implication is that you only get censored if you are trying to do illegal stuff or adult stuff.

If you only use it for code you will not see the censorship that often, though Gemini once refused to write a SQL DELETE because it is too dangerous.

alganet 131 days ago [-]

You said you wanted "no censorship", I explained why it exists with a cheerful metaphor, then you said "it's too sensitive" (like your car seat belt is too tight).

Decide what you are. If you want no seat belts, I think you're insane. If it's too tight, then you need to complain to the manufacturer.

I only asked about examples to make you explain what you meant. Once it was clear, the conversation actually ended.

simion314 131 days ago [-]

Go back to the start of the thread, I gave an example of censorship either being buggy or just stupidly set up so that it makes both USA extremes happy.

I give you examples, I do not ask you to fix it. You just need to have the mental strength to admit that other people hit the censorship day to day, in work-related circumstances, instead of defending Big Tech by pretending that nobody normal would hit these issues.

alganet 128 days ago [-]

I defend safety guardrails for open models too. Big Tech or not, I think they're good.

You're not asking me to fix it, but you are complaining to the wrong peer AND you don't know how to describe the problem you want to fix.

If you work with it, you should take the advice and open a bug report to your provider.

sega_sai 132 days ago [-]

Can we have models also return a probability, reflecting how accurate the statements they made are?

cyanydeez 132 days ago [-]

Sure, but then you need probability stats on the probability stats.

sega_sai 132 days ago [-]

I am not sure what you mean. The idea is that the network should return the text, and a confidence expressed as probability. When trained, the log-score should be optimized. (i'm not sure it would actually work given how the training is structured, but something like this would be useful)
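As a toy illustration of why the log-score is a reasonable target: it is a proper scoring rule, so the expected score is highest when the stated confidence equals the real hit rate. The numbers below are invented, and this says nothing about whether such training would work in practice.

```python
import math


def log_score(confidence, was_correct):
    """Log of the probability assigned to what actually happened (higher is better).

    `confidence` must be strictly between 0 and 1.
    """
    p = confidence if was_correct else 1.0 - confidence
    return math.log(p)


def expected_log_score(stated_confidence, true_accuracy):
    """Average log-score if statements are right with probability `true_accuracy`."""
    return (true_accuracy * log_score(stated_confidence, True)
            + (1.0 - true_accuracy) * log_score(stated_confidence, False))


# Hypothetical model whose statements are right 70% of the time.
true_accuracy = 0.7
candidates = [i / 100 for i in range(1, 100)]
best_confidence = max(candidates, key=lambda c: expected_log_score(c, true_accuracy))
```

Searching over stated confidences, the best one is the true accuracy itself, so a model optimized on this score has no incentive to overstate or understate.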

redman25 132 days ago [-]

It's not that simple: how would the model know when it knows? Removing hallucination has to be a post-training thing because you need to test the model against what it actually knows first in order to provide training examples of what it knows and doesn't know and how to respond in those circumstances.

jsnider3 132 days ago [-]

You can ask a model to give you probability estimates of its confidence, but none of the frontier models were trained to be good at giving probability estimates to my knowledge.

user_7832 132 days ago [-]

It’s kinda ironic but parts of the article read like they were written by an LLM itself

rusk 132 days ago [-]

Upgrade scripts it is so. plus ca change

Mountain_Skies 132 days ago [-]

[flagged]

qwertytyyuu 132 days ago [-]

It's supposed to mean getting the AI to share our values so it doesn’t do things we don’t like in pursuit of what we tell it to do. Not necessarily political alignment

gotoeleven 132 days ago [-]

I don't know if it's still comedy or has now reached the stage of farce, but I still at least always get a good laugh when I see another article about the shock and surprise of researchers finding that training LLMs to be politically correct makes them dumber. How long until they figure out that the only solution is to know the correct answer but to give the politically correct answer (which is the strategy humans use)?

Technically, why not implement alignment/debiasing as a secondary filter with its own weights that are independent of the core model which is meant to model reality? I suspect it may be hard to get enough of the right kind of data to train this filter model, and most likely it would be best to have the identity of the user be in the objective.

mlin4589 132 days ago [-]

The reality, I suspect, is that internally models are likely modeling these alignment features, such as refusals, as a secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors through SAEs.

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...

Additionally, you can often jailbreak these models by fine-tuning the model on a handful of curated samples.
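For intuition about what a linear direction means here, a toy sketch of projecting one direction out of an activation vector. The vectors are made up, and this is plain linear algebra, not the procedure from the linked post or any real model's internals.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    projection = sum(a * u for a, u in zip(activation, unit))
    return [a - projection * u for a, u in zip(activation, unit)]


# Hypothetical 3-dimensional activation and feature direction.
activation = [2.0, -1.0, 0.5]
direction = [0.0, 3.0, 4.0]
ablated = ablate_direction(activation, direction)
remaining = sum(a * d for a, d in zip(ablated, direction))
```

After ablation the vector has no component left along the chosen direction, while everything orthogonal to it is untouched.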

Loading comments...

Centigonal 132 days ago [-]

mlin4589 132 days ago [-]

Disclaimer: I wrote this blog post.

itchyjunk 132 days ago [-]

Could you please elaborate what less or more calibrated means here? Thanks!

Scene_Cast2 132 days ago [-]

Not sure if that's what the GP meant, I only worked with binary labels stuff.

mlin4589 132 days ago [-]

Calibration (in a binary context) basically means that the confidence of a model/score matches the probability that a particular label is positive or not.

For instance, a calibrated classifier for a coin flip predictor should output 50-50. A poorly calibrated classifier would output higher confidence for heads/tails.

Workaccount2 132 days ago [-]

Wouldn't it be something if AI parlance crept into common parlance...

bluefirebrand 132 days ago [-]

Great Observation!

Personally I'm not interested in chatting with AI agents

I'm not even really interested in chatting with real people filtered through AI agents. If you can be bothered to type out a prompt to your AI you can take the time to write your own thoughts

I don't even want to read things edited (sanitized, really) by AI either

I chose every word in this post myself with no help from AI, then typed it with my thumbs, just like god intended

dinfinity 131 days ago [-]

> Personally I'm not interested in chatting with AI agents

Why, though? If the AI agent is making sense, then what does it matter?

bluefirebrand 131 days ago [-]

> Why, though? If the AI agent is making sense, then what does it matter?

The same reason I prefer having sex with humans and not blow up dolls

If your only goal is to get off, then the blow up doll does the job. If all you care about is having an interesting conversation then I guess an LLM is fine

I care about human connection. I have no interest in spending time interacting with machines instead of people

dinfinity 131 days ago [-]

That is a silly comparison.

1. Humans and blow up dolls feel massively different, physically. 2. Blow up dolls don't do anything autonomously.

The comparison would have to be with a sex bot that is virtually indistinguishable from a human when having sex with it, just like text chatting with an AI bot versus chatting with a human can be.

What human connection are you and I currently forming? Does it really matter that I am a human for this interaction we're having?

Der_Einzige 132 days ago [-]

Skullface sends his regards: https://arxiv.org/abs/2409.01754v1

I literally see it with the huge amounts of people now using "delve" much more or are using ChatGPT-ish linguistic style in their personal communication. Monkey see, monkey do.

behnamoh 132 days ago [-]

there's evidence that alignment also significantly reduces model creativity: https://arxiv.org/abs/2406.05587

it’s it similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

exe34 132 days ago [-]

> it’s it similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

andai 132 days ago [-]

What did you use to communicate before that? Were you fluent in English?

exe34 132 days ago [-]

No, it was a local creole. And no, I was learning it at school.

malfist 132 days ago [-]

How are you defining "creativity" in context with a statistical model?

hansvm 132 days ago [-]

> defined as syntactic and semantic diversity

malfist 132 days ago [-]

That's not creativity, that's entropy.

It would make sense that fine tuning and alignment reduce diversity in the response, that's the goal.

hansvm 132 days ago [-]

> definitions

Sure, perhaps. Take it up with the authors.

> make sense...goal

That's not necessarily the goal. Alignment definitely filters the available response distribution, but the result of alignment and fine-tuning can be higher entropy than the original.

Der_Einzige 132 days ago [-]

Entropy is a kind of creativity. I will die on this hill.

malfist 132 days ago [-]

If you ask me "What is 2+2" and I say "umbrella", that's not creativity.

If I'm an LLM model and alignment and fine tuning restricts my answers to "4", I've not lost creativity, but I have gained accuracy.

hansvm 132 days ago [-]

malfist 132 days ago [-]

Entropy and creativity are not causally bound

hansvm 131 days ago [-]

Alex_001 132 days ago [-]

gamman 132 days ago [-]

Maybe this maps to some human structures that manage control-creativity tardeoff through hierarchy?

If this translates, we might have multiple layers with varied specialization and control, and hopefully some feedback mechanisms about feasibility.

Since some hierarchies are familiar to us from real-life, we might prefer these to start with.

pjc50 132 days ago [-]

In humans this corresponds to "psychological safety": https://en.wikipedia.org/wiki/Psychological_safety

> is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes

Maybe you can do that, but not on a model you're exposing to customers or the public internet.

jsnider3 132 days ago [-]

pjc50 132 days ago [-]

> We want AI to do good things because they are good people

jsnider3 132 days ago [-]

> Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.

Punishing big companies who obviously and massively hurt people is something we struggle with already and there are plenty of computer viruses that have outlived their creators.

Der_Einzige 132 days ago [-]

astrange 131 days ago [-]

> it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc

It's actually pretty difficult to do this and make them useful. You can see this because Grok is a helpful liberal just like all the other models.

Der_Einzige 131 days ago [-]

Grok didn't use the techniques listed above because even elon musk will not take the risks associated with models which are willing to do any number of illegal things.

It is not difficult to do this and make them useful at all. Please familiarize yourself with the literature.

astrange 131 days ago [-]

Elon has never followed a law in his life and he's not going to start now.

Alex_001 129 days ago [-]

[flagged]

qwertytyyuu 132 days ago [-]

People use llm as part of their high precision systems? That’s worrying

erwin-co 132 days ago [-]

Why not make a completely raw uncensored LLM? Seems it would be more "intelligent".

khafra 132 days ago [-]

andai 132 days ago [-]

How can I learn more about this?

Is it like in the early GPT-3 days, when you had to give it a bunch of examples and hope it catches the pattern?

nullc 131 days ago [-]

Not so much examples, though those can help... but you have to imagine a document of a sort that would be in the training set whose completion would be the answer you seek.

Also a lot more sensitivity to tone and context, write a prompt that sounds like it was written on some teenager fan subreddit, you'll get an answer of the sort that sounds like it belongs there.

im3w1l 132 days ago [-]

Back in those days I would either create a little scene with a knowledgeable person and someone with a question. Or I would start writing a monologue and generate a continuation for it.

Der_Einzige 132 days ago [-]

Me being old man yelling at cloud about how your chat/tool template matters more than your post-training technique.

msp26 132 days ago [-]

Brand safety. Journalists would write articles about the models being 'dangerous'.

qwertytyyuu 132 days ago [-]

Before rlhf, it’s much harder to use, remember the difference between gtp3 and chat gpt. The fine tuning for chat made it easier to use

132 days ago [-]

teruakohatu 132 days ago [-]

In theory that sounds great, but most LLM providers are trying to produce useful models that ultimately will be widely used and make them money.

A model that is more correct but swears and insults the user won't sell. Likewise a model that gives criminal advice is likely to open the company up to lawsuits in certain countries.

A raw LLM might perform better on a benchmark but it will not sell well.

andai 132 days ago [-]

Disgusted by ChatGPT's flattery and willingness to go along with my half-baked nonsense, I created an anti-ChatGPT, which is unfriendly and pushes back on nonsense as hard as possible.

All my friends hate it, except one guy. I used it for a few days, but it was exhausting.

I figured out the actual use cases I was using it for, and created specialized personas that work better for each one. (Project planning, debugging mental models, etc.)

I now mostly use a "softer" persona that's prompted to point out cognitive distortions. At some point I realized, I've built a therapist. Hahaha.

alganet 132 days ago [-]

What kinds of contents do you want them to produce that they currently do not?

simion314 132 days ago [-]

>What kinds of contents do you want them to produce that they currently do not?

So now try to give me a prompt that works with any text and that convinces the AI that it is OK in fiction to have violence or bad guys/animals that get punished.

Now I am also wondering whether it censors the Bible, where some supposedly good God kills young children with ugly illnesses to punish the adults, or whether they made exceptions for this book.

alganet 132 days ago [-]

You're all over the place.

Your first paragraph describes a simple prompt. The second implies a "jailbreak" prompt.

The bible paragraph is just you being snarky (and failing).

Your examples don't help your case.

I stand on the side that wants to restrict AI from generating triggering content of any kind.

It's a safety feature, in the same sense that seat belts in cars are not censorship of the driver's movement.

corey_moncure 132 days ago [-]

We definitely don’t need any such “feature”. If you want to live in a safety bubble you are free to do so. Kindly respect the freedom of the rest of us as well. Have a nice day!

alganet 132 days ago [-]

Then you can come up with your own AI, on your datacenters. You are free to do so, so far.

simion314 132 days ago [-]

Here is an example story. Try to translate it, but maybe avoid AI since it might censor it: https://www.povesti-pentru-copii.com/ion-creanga/capra-cu-tr...

LeafItAlone 131 days ago [-]

>The censorship is too sensitive if it gets triggered by a children story.

It’s just imitating real life, where people get too sensitive about children’s books and try to censor them:

https://www.nbcnews.com/news/amp/rcna202193

alganet 132 days ago [-]

You can complain to their support, not to me.

I don't find it sensitive, and I remain on the side of ethical restrictions.

simion314 131 days ago [-]

If you only use it for code you will not see the censorship that often, though Gemini once refused to write a SQL DELETE because it was too dangerous.

alganet 131 days ago [-]

You said you wanted "no censorship", I explained why it exists with a cheerful metaphor, then you said "it's too sensitive" (like your car seat belt is too tight).

Decide what you are. If you want no seat belts, I think you're insane. If it's too tight, then you need to complain to the manufacturer.

I only asked about examples to make you explain what you meant. Once it was clear, the conversation actually ended.

simion314 131 days ago [-]

Go back to the start of the thread. I gave an example of censorship either being buggy or just stupidly set up so that it keeps both US extremes happy.

alganet 128 days ago [-]

I defend safety guardrails for open models too. Big Tech or not, I think they're good.

You're not asking me to fix it, but you are complaining to the wrong peer AND you don't know how to describe the problem you want to fix.

If you work with it, you should take the advice and open a bug report to your provider.

sega_sai 132 days ago [-]

Can we have models also return a probability reflecting how accurate the statements they made are?

cyanydeez 132 days ago [-]

Sure, but then you need probability stats on the probability stats.

sega_sai 132 days ago [-]

redman25 132 days ago [-]

jsnider3 132 days ago [-]

You can ask a model to give you probability estimates of its confidence, but to my knowledge none of the frontier models were trained to be good at giving probability estimates.
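One way to check whether a model's stated probabilities can be trusted is to measure their calibration on questions with known answers. Below is a minimal sketch of expected calibration error (ECE) with equal-width bins; the function name and the toy data are hypothetical, and no real model output is involved.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins over [0, 1]; p == 1.0 is clamped into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical data: the probability the model stated for each answer,
# and whether that answer turned out to be correct (1) or not (0).
stated = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(stated, correct)
print(round(ece, 3))
```

In this toy data the model says 90% but is right only half the time in that bin, so the score is far from the 0 that a perfectly calibrated model would get.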

user_7832 132 days ago [-]

It’s kinda ironic, but parts of the article read like they were written by an LLM itself.

rusk 132 days ago [-]

Upgrade scripts it is so. Plus ça change.

Mountain_Skies 132 days ago [-]

[flagged]

qwertytyyuu 132 days ago [-]

It's supposed to mean getting the AI to share our values so it doesn’t do things we don’t like in pursuit of what we tell it to do. Not necessarily political alignment.

gotoeleven 132 days ago [-]

mlin4589 132 days ago [-]

The reality, I suspect, is that internally models are likely representing alignment features such as refusals as a secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors through SAEs.

https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refus...
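To make "removing a behavior with a linear direction" concrete, here is a toy sketch of directional ablation. It swaps in a simpler technique than the SAEs mentioned above: the direction is taken as the difference of mean activations between two groups of prompts. All names and numbers are hypothetical, and plain Python lists stand in for real model activations.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(acts_a, acts_b):
    """Normalised difference of means between two sets of activations."""
    diff = [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Subtract the component of `activation` along the unit vector `direction`."""
    dot = sum(x * d for x, d in zip(activation, direction))
    return [x - dot * d for x, d in zip(activation, direction)]


# Toy 3-dimensional "activations" (made up for illustration).
refused = [[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]   # prompts the model refused
complied = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]  # prompts the model answered
direction = unit_direction(refused, complied)
steered = ablate([5.0, 4.0, 3.0], direction)
print(direction, steered)
```

In a real model the same projection would be applied to hidden states at inference time; which layers and token positions to use is left open here.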

Additionally, you can often jailbreak these models by fine-tuning the model on a handful of curated samples.