
 Sorry for the brief Frank outage around 1 PM PST today (not sure anyone even noticed?).

Relatedly, TIL that the python smart_open package is a dangerous way to upload files to GCS!

It calls a method in the GCS python SDK in a manner that disables the SDK’s internal retry mechanism, while also not implementing retry logic of its own.

Instead, if the upload fails partway through, it just raises an exception and hands control back to you.  And worse, it’s a so-called “resumable upload session,” which is supposed to let you resume the upload if it fails partway through … but smart_open doesn’t do that, and the exception it raises doesn’t contain the information you’d need to do that, even if you wanted to.
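Since smart_open neither retries nor hands you the resumable session, the realistic options are to call the google-cloud-storage SDK directly (which retries by default for most operations) or to wrap the whole upload in retry logic of your own. A minimal sketch of the latter — `flaky_upload` is a stand-in for the real upload call, not a GCS API:

```python
import time

def retry(fn, attempts=3, delay=1.0):
    """Call fn until it succeeds, re-raising the error after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)  # crude fixed backoff; real code might use exponential

# stand-in for an upload that fails twice, then succeeds
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("upload interrupted")
    return "ok"

assert retry(flaky_upload, delay=0.0) == "ok"
```

Note that this restarts the upload from scratch on every attempt; actually resuming mid-upload would require the resumable session object that smart_open swallows.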

I lost over a week of logs due to this – not mission critical ones, but it’s still kind of a bummer.  Oh well.

GitHub - nostalgebraist/pytumblr2: A Python Tumblr API v2 Client, updated for the New Post Format era →

Not quite ready to push it to PyPI yet, but… here’s a little thing I’ve been working on.

In the course of working on nostalgebraist-autoresponder, I’ve made a bunch of compliance and usability upgrades to pytumblr.

Since Tumblr hasn’t been allocating much developer attention to the official API clients, I’m putting these changes in a fork called Pytumblr2 so they’re available to anyone who wants to use them.

This seems like a better home for NPF support, NPF -> HTML parsing, etc. than the innards of a large chatbot repo.

Another day, another mysterious memory leak that I finally trace to a common python library instead of my team’s own code….

Python developers, please read about the garbage collector. Don’t type “git commit” again until you know what a reference cycle is. I’m begging you! It’ll only take you fifteen minutes! My people-hours are dying
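For anyone who’d rather see it than read the docs: a reference cycle is a group of objects that point at each other, so their reference counts never hit zero and only the slower cyclic collector can reclaim them. A minimal demonstration:

```python
import gc
import weakref

class Node:
    pass

gc.disable()                  # keep the automatic collector out of the demo

a, b = Node(), Node()
a.partner, b.partner = b, a   # a reference cycle: each object keeps the other alive
probe = weakref.ref(a)        # lets us check whether `a` has actually been freed

del a, b                      # no names point at the pair anymore...
assert probe() is not None    # ...but the cycle keeps both objects alive

gc.collect()                  # the cyclic collector finds and frees the isolate
assert probe() is None        # now the memory is really gone
gc.enable()
```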

the-moti asked:

Is the end of Frank's recent long post "Miranda" which is repetitive, but not completely repetitive, a consequence of breakruns? I'm imagining that the generator component is "breaking a run" every time the number of instances of "clothing" in a line is different from predicted, and a comma shows up early or late. But because this is only a small deviation from a repetitive pattern, it fails to completely escape - it gets closest in the transition between "Splinter" and "Transmutation".

nostalgebraist:

This is actually (maybe?) a consequence of me not having yet implemented breakruns in pytorch!  Which I meant to mention in my earlier post but forgot.

BTW, Frank is in a very unstable state at the moment due to the recent changes.

(Not in an interesting sense of “unstable,” just suffering from various bugs and unnecessary sources of slowdown, missing a few previously present features like Breakruns, etc.)

If you’re excited about the new model, I’d strongly suggest waiting a week or so until the dust settles before making any inferences based on her current output.

Update: I’ve now implemented Breakruns again, verified it works, and turned it on.

This should help reduce the repetition issues Frank has been having today.

She’s still tending to be repetitive when responding in threads where she already repeated herself a lot. This would be tough to prevent, since “staying in character” is generally desirable and generally what language models do.

Memory management with the new model continues to be tricky, though I’ve already made a bunch of improvements on that front. Response times may be slower and more variable for a while.

On a technical note, I’m really starting to hate the Huggingface transformers library.

I used it to get up and running quickly and verify that the new model(s) worked, but after that it’s been nothing but a source of needless pain, e.g.

[image]

What I’ve been doing lately in Frank development:

  1. Switching the ML stuff from tensorflow to pytorch.
  2. Replacing the generator model with one 2x as big, finetuned from the 2.7B GPT-Neo checkpoint released by Eleutherai. (This is the same size and architecture as the smallest GPT-3 model)

#1 is basically done and I should be able to “flip the switch” in production soon, probably tomorrow.

#2 is nearly done on the development side, but might be too slow to be practical for Frank’s level of demand. No way to be sure without trying it.

The second was enabled by the first: I finetuned the Eleutherai model in tensorflow(-mesh), same way they trained it, then spent like a week going down a Pepe Silvia-style rabbit hole trying to figure out how to do inference with the damn thing.

…then I converted it to pytorch and it instantly worked like a charm. Like 15 minutes of work after spending days on the tf version (actually rewriting and rebuilding parts of tf itself from source by the tail end of my quixotic efforts)

I’d been meaning to switch the project to pytorch for a long time, and this was the last straw.

My post “the scikit-learn cargo cults” from earlier in the week got linked on HN.

There aren’t that many comments on the HN post, but every commenter there seemed to read it in roughly the same way. Their reading is very different from what I originally intended to say. It’s like they’re all reading a totally different post from the one I (thought I?) wrote.

I wish I knew whether

  • I was much less clear in the post than I think I was, or
  • the HN comments are not representative of how most/many readers would interpret the post

If anyone with the relevant background wants to offer feedback on whether or where I communicated something badly, I’d be thankful. (The feedback I got from @the-moti in this post is a good example of the kind of thing I’m looking for.)

the scikit-learn cargo cults

People who design machine learning frameworks love the scikit-learn estimator interface. We can tell they love it, because they keep trying to imitate it.

But love and understanding are not the same – and none of these designers seem to understand what the sklearn estimator interface is. This failure is

  • inexplicable, because the concept is very simple
  • utterly disastrous in its consequences

—–

Specifically, no one seems to get that the sklearn estimator interface is … wait for it … an interface.

That is: it specifies a standard way for objects to communicate with one another. It doesn’t specify what the objects are, themselves.

That’s the whole point. Anything can be an sklearn estimator, as long as it conforms to the rules that sklearn lays down for estimators.

Aside from that, it can contain anything, do anything. It’s very easy to write a whole new sklearn estimator that no one has ever thought of before: the docs tell you exactly how an estimator is expected to behave, and as long as your object plays by those simple rules, it’s allowed to join the game. (What’s more, you can get a lot of the rules for free, just by inheriting from the base classes and mixins sklearn provides.)

The simple rules include having a method called “fit,” which takes one or two inputs and ought to set some internal state. For predictors, the most famous type of estimator, you need a method called “predict.” This will matter in a moment.

(Sidenote: the sklearn estimator interface is really not a great example of an interface, because it actually does care about internals. It inspects attribute names and requires them to follow their own rules, and it has a not fully explicit expectation that estimators can be serialized with pickle.

However, these requirements are still interface-y in the sense that they only constrain estimators along a few well-defined dimensions, leaving everything else free. Anything that plays by the rules can still join the game, and play it just as well as the “official” estimators built in to sklearn.)
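To make the “anything can join the game” point concrete, here’s a from-scratch sketch of an estimator that follows the published rules without importing sklearn at all (`MeanPredictor` and its `offset` parameter are invented for illustration):

```python
class MeanPredictor:
    """Predicts the training-set mean. A valid estimator with no sklearn base class."""

    def __init__(self, offset=0.0):
        self.offset = offset              # rule: __init__ params stored under their own names

    def get_params(self, deep=True):
        return {"offset": self.offset}    # rule: expose the constructor params

    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)      # rule: learned state gets a trailing underscore
        return self                       # rule: fit returns self

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]
```

An object like this can be handed to sklearn tooling that only talks through the interface; inheriting from BaseEstimator would just save you writing get_params/set_params by hand.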

—–

Interfaces are great. They are one of the foundations of modern software. You would think people who loved an interface would learn the lesson “interfaces are great, and we should use them.”

Here is what developers of keras, tensorflow, and Sagemaker learned from that beloved estimator interface:

  • Data scientists love typing the words “fit” and “predict.”
  • It is, in fact, possible – one cannot rule it out – that data scientists do not know how to do anything other than type the words “fit” and “predict.”
  • An “easy to use” ML library is one where you can make the work happen by typing “fit” and “predict.” This is basically what usability is; the rest is details.

—–

Keras: patient zero

The first casualty of this odd disease – indeed, perhaps the patient zero from whom all the rest sprang – was François Chollet, creator of Keras.

Chollet says that sklearn was a “huge influence” on keras. “From Sklearn, I borrowed ‘fit’, but more generally best practices around usability.”

(Note that the claim in the first tweet is false: Keras models have never been valid sklearn estimators, because they do not follow the parameter naming rule. In many versions of Keras they are also not pickleable. Indeed, the tweet itself is about a wrapping layer meant to add this missing compatibility, so I have no idea what “compatibility since 2015” is supposed to mean.)

The “Model” objects in Keras look deceptively like sklearn estimators. They have “fit” and “predict.” The methods do roughly the same things they do in sklearn.

But there is no “Keras estimator interface.” There is only one known valid species of the Keras fit/predict gizmo, namely “Model,” the one built into Keras.

The only way to roll your own thing that behaves like “Model” is to subclass “Model.” With sklearn, it’s helpful to inherit from BaseEstimator, but that just helps you follow a few rules, and you can easily follow them on your own. There is no set of rules that “Model” is following. It doesn’t follow the law, it is the law.

“I have in hand an sklearn estimator. What does that mean?” Just read this page: that is literally all there is to know.

“I have in hand a Keras model. What does that mean?” Read this labyrinthine piece of code, and also read everything it imports. That’s what a model does. Yes, you have to read the code — the docs tell you how to subclass Model, not what Model is.

—–

Tensorflow gets a fit/predict gizmo

Keras started out as a 3rd-party library, but was incorporated into tensorflow at some point, and was pushed as the standard way to develop neural nets in tf.

This is unfortunate, because Keras objects are complex beasts and no one really knows how to decompose one fully into primitives of tensorflow (or of anything). Nothing can be a Keras object that was not built as one from the ground up.

Thus, read any tensorflow doc and you’re likely to run into a strange split: “if you’re using Keras, then do X…” “…otherwise, do Y.” There has to be a generic path because you might not be using Keras, and if you aren’t, you’re stuck there. Thus everything gets done twice, often different ways.

All for poor, little “fit” and “predict”!

—–

Tensorflow makes another one

That is not the end of the story. No, at some later date tensorflow decided one fit/predict wasn’t enough. (“The more fit/predict-y a library is, the more usable it is,” to adapt a meme.)

Thus, tensorflow introduced a new thing called – of course – “Estimator.”

What the fuck is an Estimator (tensorflow flavor)? Well, it’s yet another gizmo with “fit” and “predict.”

It’s not a Keras model, but is more generic than a Keras model, and indeed closer to the spirit of sklearn. Its “fit” and “predict” can wrap almost arbitrary tensorflow code.

I suppose this may be one of the reasons they created it in the first place. But they didn’t get rid of Keras’ fit/predict thing, they just confusingly had two at once – and indeed the Keras gizmo both predated Estimator, and outlived it. (Like all reliable tensorflow features, Estimator has been officially deprecated and dis-recommended outside some specific legacy cases; references to Estimator are being slowly scrubbed out of the official guides as we speak.)

Estimator has (had?) its own complex ecosystem of helpers, most of them only “internal” and documented in code, just like Keras, but all over again. (Right before starting this post, I was trying to wrap my head around one called “MonitoredSession.”)

What really made Estimator different, though, was its support for distributed/cloud computing.

Elaborating on the theme that users cannot do anything but type “fit” and “predict,” Estimator aspires to make even such fearsome tasks as “training on multiple GPUs,” “training on cloud TPUs,” and even “deploying to a cloud service” into a call to either “fit” or “predict.”

Amusingly, Estimator was the primary supported way to take these actions for a while, and certainly the least painful. Thus, any code you wanted to distribute had to be wrapped in a “fit” or a “predict,” for the sake of letting an Estimator be the thing that calls it.

Perhaps (?) because the devs have noticed how unnecessary this is, tensorflow is now trying to ditch Estimator in favor of “Strategy,” a more generic wrapper for distributing arbitrary tf code.

Before this, Estimator and Strategy sat alongside one another awkwardly, just like Estimator and Keras did. Indeed, Estimator seems more reliable than Strategy, and continues to see use in official spin-offs like Mesh Tensorflow, presumably because people know it actually works, and know how to use it in real life.

Meanwhile, Strategy … well, the guide for Strategy contains this mind-melting compatibility table:

[image]

I remember this table from way back in Dec 2019, when I wrote my tensorflow rant. I am perversely pleased to see it still there in April 2021, with about as many “Experimental” and “Limited” cells as I remember.

(Note that this table’s rows include Keras, a model API, and Estimator, a model-and-distribution API, and compare these for compatibility with Strategy, a distribution API.

If you understood that sentence, I fear you.)

I have spent countless hours trying to understand this kind of nonsense. One might find oneself asking where the “usability” has gone, and where it was supposed to come from in the first place.

Sagemaker: a copy of a copy

Sagemaker is one of the zillions of AWS products.

It’s a “platform for machine learning,” which in practice means it’s Yet Another Complicated Wrapper Around Running Docker Containers On EC2™.

Like any AWS product, Sagemaker has API endpoints, and in python you can call these through the generic client boto3. To serve “high-level” “usability” needs, though, there is also a dedicated python SDK.

I bet you can guess what’s in it.

[image]

Estimator (Sagemaker flavor) takes the cloud computing focus of Estimator (tensorflow flavor) to its logical conclusion.

Sagemaker “Estimators” do not have anything to do with fitting or predicting anything. The SDK is not supplying you with any machine learning code here. The only vestige of the original meanings attached to these words is that “fit” is expected to modify a state (hence it downloads an artifact from the cloud when it completes), while “predict” should be stateless.

Instead, “fit” and “predict” here are wrappers for pushing and running an arbitrary Docker image. “Fit” runs it with an entrypoint called “train,” while “predict” runs it with one called “serve.”

There are some surrounding helpers with an ML flavor, but they are similarly generic. There’s something called “hyperparameters” which actually means “a json dict with string-only values injected into the container as a file before it runs,” and something called “training data” which actually means “an S3 path the container can read.”

It is impossible to understand what’s going on outside of the “built-in” Estimators without remembering that actually “fit” and “predict” are lies and you are just using Docker.

This is the furthest thing from an interface! Anyone who can make their own Estimator (Sagemaker flavor) also has no reason to do so; if you know how to write Dockerfiles for ECS/EC2, you can just do that without tacking on this extra SDK.

Indeed, Estimator (Sagemaker flavor) is so far from the sklearn original that it is hard to imagine its developers had sklearn clearly in mind when they wrote. More likely, they were trying to imitate the earlier imitators.

Epilogue: pytorch

Pytorch is by far the most user-friendly neural network library available in 2021.

Pytorch does not have “fit” or “predict.”

Pet peeve: when public codebases for machine learning research projects do “the main.py thing”

That is: they come bundled with a single CLI script, usually called “main.py,” which is capable of calling several entirely different code paths.

Training, evaluation, prediction, one or more “experiments” from the paper, each stage of training if there’s more than one, and anything else the authors did – it’s all “main.py” with different arguments.

This is bad for a lot of reasons, including:

  1. It takes the arguments of several conceptually distinct functions, and smooshes them all into one argument namespace.

    This often requires renaming or overloading them to avoid collisions. An argument called, say, “eval_steps” might do different things when it’s controlling evaluation-during-training vs. when it’s controlling evaluation on its own, or it might just control one of those but not the other.

    This problem could be trivially solved by using multiple CLI scripts.
  2. In practice, “main.py”s are rarely just simple wrappers that select a function and pass CLI arguments to it. They usually contain business logic, like calling functions with hardcoded but non-default arguments, or using the script arguments to make branching if/else decisions about function arguments.

    Everything now comes in two flavors, the “CLI flavor” and the “library flavor.” There’s no way to intuitively assign meaning to these distinctions, because there’s no intuitive reason for them to exist at all. When reading/using the code, you feel like you’re watching two sets of intentions argue with each other, both warning you not to trust the other one.

I don’t see any upsides of this approach?

I imagine it’s just a thing people started doing, and then everyone noticed everyone else was doing it, and researchers tend to be risk-averse about everything that’s not related to the meat of their research, so why rock the boat…

(If you’ve done this, don’t feel bad, I’m not annoyed at you. Just at the pattern.)

(… at least it’s not the even worse pattern where there are multiple scripts, but they only run benchmarks or other narrow tasks, and the ability to train/eval/predict generally is technically there but locked behind each script’s tangle of business logic. The name “run_squad.py” still haunts me)
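For what it’s worth, the alternative I’m advocating is nothing fancy — one thin script per entry point, with all the real logic in importable library functions. A sketch (the `train` function and its argument are invented for illustration):

```python
# train.py — one thin CLI per entry point; eval.py, predict.py would look similar
import argparse

def train(eval_steps):
    # stand-in for the real library function; all business logic lives in the library
    return f"training, evaluating every {eval_steps} steps"

def main(argv=None):
    parser = argparse.ArgumentParser(prog="train.py")
    # in a dedicated script, this flag can only ever mean evaluation-during-training
    parser.add_argument("--eval-steps", type=int, default=100)
    args = parser.parse_args(argv)
    return train(eval_steps=args.eval_steps)

if __name__ == "__main__":
    print(main())
```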

Did you know that GPT-2 can run on AWS Lambda these days?

I don’t know if anyone else has done it (probably?), but after hearing about Lambda’s recent updates I just had to try it, and … it works!

(As usual with me and GPT-2, this is 1558M, the big boy.)

You can now have Lambdas with up to 10GB RAM, which is enough for sampling.  You can use Docker images up to 10GB in size, which is enough for the model.  And Lambdas can run for 15 minutes, which turns out to be enough for sampling 1 context window or so, assuming you’ve warmed the thing up first.

I’ve got it fully implemented as an alternative GPT-2 backend for Frank, which is a nice insurance policy in case my current one stops working.

It’s hard to estimate exactly how much it would cost to use Lambda for Frank, but it would definitely be far less expensive than any approach that requires persistently reserved compute.

(What would really be nice would be something like “Lambda for GPUs.”  Which already exists in AWS as “Elastic Inference,” but only as an add-on for EC2, so I guess what I want could be rephrased as “Elastic Inference for Lambda”)

(On another note, I wish I knew Google Cloud Platform as well as I know AWS.  IME it has a much better user experience, and its owners seem at least less transparently evil.)

memory (mis)management in keras

Keeping with my theme of occasionally blogging about how much I hate tensorflow and/or keras

Here’s a “fun” thing I discovered today.  If you have ever used Keras, you might have called this function:

K.clear_session()

This is supposed to help with memory.  Here’s what the docs say:

Resets all state generated by Keras.

Keras manages a global state, which it uses to implement the Functional model-building API and to uniquify autogenerated layer names.

If you are creating many models in a loop, this global state will consume an increasing amount of memory over time, and you may want to clear it. Calling clear_session() releases the global state: this helps avoid clutter from old models and layers, especially when memory is limited.

Does K.clear_session() actually “reset all state generated by Keras”?  Nope!

—–

Okay, some background is needed.  

tensorflow (tf)

…is a neural network library whose core is written in C++.  It technically has API bindings for various high-level languages, but almost all users call it via the python bindings.  So, to most users, tensorflow is a complicated python package containing some black boxes that drop into C++ and do the actual computations.

keras

…is a weird piece of malware attached parasitically to tensorflow.  It’s best to avoid it entirely, but in some contexts this is difficult.

memory in tf

Because tf is written in C++, it doesn’t “have” automatic memory management.  This can get weird when you access tf solely through the python API, as most people do.

Because tf uses manual memory management, memory it allocates won’t be freed unless tf frees it at some point.

If you’re interacting with tf through the python API, you deal with python objects that are sort of “associated with” the underlying tf stuff that allocates memory.  If you want to make sure memory gets freed at an appropriate time, you can either:

  1. Call a method on the python object to explicitly free the memory.  This always works, but you can only do it if you have a reference to the object sitting around so you have something to call. 
  2. Hope the python garbage collector will deal with it.  Usually this means trying to get rid of references to the object, so it’s kind of the opposite of the first one: either it works, or it doesn’t work and then you can’t do #1 because you’ve deleted your reference. 

There are two places that lots of memory can build up in tf: “graphs” and “sessions.”

A graph defines a static computation you want to do.  A session is sort of an “execution context” in which you say “hey, do the computation specified by [this graph].”

I don’t know why sessions have to have state at all, but apparently they do, and it can get big.  The docs say:

A session may own resources, such as tf.Variable, tf.queue.QueueBase, and tf.compat.v1.ReaderBase. It is important to release these resources when they are no longer required. To do this, either invoke the tf.Session.close method on the session, or use the session as a context manager.

tf.Session.close here is method #1 from my list above, for sessions.  If you call it, great, you’ve cleared the memory.  What if you don’t?

Well, the Session class tries to close the session when python garbage collection happens to it.  It does this by defining __del__.  In python 3.4 and later, __del__ always gets called upon garbage collection (before 3.4, objects with __del__ that were caught in reference cycles never got collected at all), so we are sure to free the memory at some point.

However, “at some point” may not be good enough if you’re training neural nets, which usually means using as much of your memory as you can get away with.

garbage collection in python

Python (as CPython) has 2 types of garbage collection.

If all pointers to THING are deleted, THING gets immediately collected.  (“Immediate” is good when you’re using a lot of memory.)

THING can also get garbage collected if there are still pointers to it, but only as part of a cyclic isolate – a group of objects that nothing else cares about, but point to each other in a self-referential loop.

However, this second route is slower and works based on heuristics.  A cyclic isolate won’t get collected the moment it becomes a cyclic isolate.  You have to wait for the heuristics to decide it’s time, and you might not be able to get to that point before running out of memory trying to accomplish something.
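The difference between the two routes is easy to see with a stand-in class whose __del__ plays the role of Session’s cleanup:

```python
import gc

class FakeSession:
    """Stand-in for tf.Session: __del__ is where the memory would be released."""
    closed = []
    def __del__(self):
        FakeSession.closed.append(True)

gc.disable()                             # keep automatic collection out of the demo

s = FakeSession()
del s                                    # refcount hits zero: freed immediately
assert FakeSession.closed == [True]

s = FakeSession()
s.me = s                                 # a reference cycle: refcount never hits zero
del s
assert FakeSession.closed == [True]      # __del__ has NOT run...
gc.collect()                             # ...until the cyclic collector gets to it
assert FakeSession.closed == [True, True]
gc.enable()
```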

sessions in keras

Keras attempts to hide tensorflow details from you.  With sessions, it does this by maintaining a single global variable called _SESSION.session.

Whenever keras needs a session, it calls K.get_session(), which returns _SESSION.session, after creating a new session if the value happens to be None.

Now, remember K.clear_session().  What does it do?  If you’ve been following along so far, you’d expect something like

_SESSION.session.close()

But no, what it actually does is

_SESSION.session = None

In ordinary python code, this is a standard way to free memory.  You take an existing reference to THING and make it point at None, instead of its previous referent.  This triggers garbage collection if appropriate – either immediately, or with a delay if there is a cycle.

In keras, _SESSION.session gets put into a complicated reference cycle.  I don’t know why.  I don’t know whether it’s for a good reason, or whether it was just convenient or seemed somehow “pythonic.”

But anyway, if you create a keras “Model,” called let’s say “model,” and you call “model.fit,” you now have two (why two? who knows) extra references to _SESSION.session:

- model._session

- model.train_function._callable_fn._session

Meanwhile, model refers to itself in several places:

- model.history.model is a reference to model

- model._inbound_nodes[0].outbound_layer – whatever the hell that’s about – is a reference to model

So, immediately garbage collecting _SESSION.session is pretty tough.  Simply setting it to None won’t get rid of the pointers hanging off of model.  Setting model to None won’t do the job either, because it will turn model into a cyclic isolate, taking _SESSION.session with it.
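You can simulate the whole situation in a few lines — FakeSession and FakeModel are stand-ins, but the reference structure is the one described above:

```python
import gc
import weakref

class FakeSession:
    def close(self):
        pass   # in real tf, this is what frees the memory immediately

class FakeModel:
    pass

gc.disable()                       # keep automatic collection out of the demo

session = FakeSession()
model = FakeModel()
model._session = session           # keras-style extra reference to the session
model.history = FakeModel()
model.history.model = model        # model sits in a reference cycle

probe = weakref.ref(session)
session = None                     # what K.clear_session() actually does
assert probe() is not None         # session still alive: model holds it

model = None                       # now everything is one big cyclic isolate...
assert probe() is not None         # ...and the session is STILL alive

gc.collect()                       # only the cyclic collector frees it, eventually
assert probe() is None
gc.enable()
```

Calling session.close() before dropping the reference — what I’m arguing clear_session ought to do — would release the underlying memory immediately, with no collector involved.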

Now, yes, eventually even cyclic isolates get collected.  Which means that, if you call K.clear_session(), your memory will eventually be freed.

But why “eventually”?  We know how to clear that memory immediately.  It’s called “Session.close(),” and they could just … do that?  But they don’t. 

(And in fact, because setting _SESSION.session = None throws away your reference to the problematic object, calling K.clear_session() leaves you unable to manually deal with it, which you would have been able to do before!)