homebrew

utilitymonstermash:

nostalgebraist:

Homebrew, a very popular package manager for OS X, does not allow the user to install a specific version of a package.

Nor does it allow packages (“formulae” in its lingo) to specify versions or version ranges in their dependencies.

Instead, in Homebrew, packages just have names, and the names mean “the newest version released to Homebrew so far.”

—-

For example, here’s IPython on PyPI and GitHub.  There, you can see lots of different versions, and you can see the newest ones require python >= 3.7, as advised in NEP 29.

… and here’s IPython on Homebrew.  There’s only one version, the latest one, whatever the latest one happens to be at $CURRENT_DATE.

And instead of depending on python >= 3.7, it requires python 3.8, which NEP 29 will not demand until Dec 26, 2021.  And work to bump that requirement to python 3.9 is apparently underway.

Actually, it does not really require python 3.8 (remember, you cannot require versions in Homebrew).  Instead, it just requires “python,” i.e. whatever Homebrew has decided the latest version of python is.

Formulae for apps that require Python 3 should declare an unconditional dependency on "python@3.x". These apps must work with the current Homebrew Python 3.x formula.

If a package developer really wants to make multiple versions available on Homebrew at once, they can request to do so, but must pass a manual curation step, and even if they pass, their special status is provisional.

No more than five versions of a formula (including the main one) will be supported at any given time, regardless of usage. When removing formulae that violate this, we will aim to do so based on usage and support status rather than age.

[…]

Versioned formulae submitted should be expected to be used by a large number of people. If this ceases to be the case, they will be removed.

—-

Am I missing something, or is this really bad?

I’ve learned to call `brew install` as rarely as possible, because it will recursively update all dependencies of the thing I’m installing to Homebrew’s current versions – that’s the only thing it can do, no other versions “exist” – and this means replacing possibly large quantities of software that works fine with software that might not work.

And once that happens, you can’t get the old versions back.  It was installed and running on your machine a moment ago, but to Homebrew it doesn’t exist anymore.

If you need to get old versions back, because you need your computer to work or some nonsense like that, you will probably find yourself reading this Stack Overflow thread, which has been chugging along since 2010 with no fully satisfying resolution.  Some highlights:

[screenshots of several Stack Overflow answers, omitted]
¯\_(ツ)_/¯

Engineering is about trade offs. Latest version only and unconditional dependencies obviate the need for a SAT solver. Many homebrew packages expect to deal with untrusted input from the network. Latest version only greatly simplifies issues surrounding securing old versions of software and aligning lifecycles of dependencies with different release cycles. A ton of seemingly boring bugs get fixed and don’t get CVEs with backports to all stable branches because the security implications weren’t obvious to whoever found and fixed the bug.

Homebrew Python still provides pip, you can still spin up a virtualenv with a curated requirements.txt on Homebrew Python if that floats your boat.

Homebrew still needs its Python to support end user Python apps shipped as part of homebrew, including some apps that are pretty strongly evergreen. (Someone around here had a rant about youtube-dl in Ubuntu being broken by the time the distro releases).

If you need exact point releases of all your dependencies, including a specific version of postgres, docker might be the better fit for the job. I also hear good things about conda, but I can’t vouch for it, and its installers also seem to be tied to Python versions newer than NEP 29 requires.

There are a bunch of things I’d rather see homebrew change before better support for version pinning. I’d love to see them get out of a shared /usr/local that lots of other things pollute, handle conflicting binaries better, and track better data about when to rebottle due to changes in build time dependencies.

My real hot take about reproducible computing on mac is that it would be nice if macOS had a better container option for building and running macOS (not linux) software.

Most of this is over my head – which is not a criticism.  I’m not very familiar with package management in general, and I wrote the OP thinking maybe this behavior is normal and I’m just not used to it.

However, insofar as I understand your argument, I’m not convinced.  It sounds like you’re arguing that, because Homebrew forces the user into all new releases, users of Homebrew will stay up to date with security patches:

Latest version only greatly simplifies issues surrounding securing old versions of software […] A ton of seemingly boring bugs get fixed and don’t get CVEs with backports to all stable branches because the security implications weren’t obvious to whoever found and fixed the bug.

But this cuts both ways.  Experience has taught me not to ever run `brew install` or `brew update` unless I have hours of spare time set aside to deal with the fallout if necessary.  So, I never run those commands unless I’m forced to – which means that, usually, none of these patches reach my machine.

—-

Taking a step back: I don’t think I necessarily object to a lack of support for multiple package versions.  (Since Homebrew is mostly a binary installer these days, I understand that supporting these would be a large cost for their build process.)

What I really object to is the inherent instability of Homebrew-core, the collection of packages you are pulling from when you run `brew install` or `brew update` as a typical user.

Unlike virtually any other mature project I interact with, Homebrew-core does not have versions or releases.  It is a git repo with one branch, no tags, ~179000 commits to master, and ~59000 closed PRs.

Using an “up to date” Homebrew (which will happen unless you try hard to stop it) means using the very latest built commit to this master branch, which probably occurred within the last 24 hours.

—-

I’m not actually using Homebrew for development – I have a few dev tools installed through it, but I’m not looking for version pins so I can build software.  I’m just trying to install software as a normal user, so I can use it.

And if something breaks, I want to be able to say “okay, I’ll try downgrading back to version 7.3.11” or something like that.  Some pointer to the thing I had before I updated.  Like I get with any other software.

I can’t do that with Homebrew packages.  I can’t do it with Homebrew-core, the collection of Homebrew packages.  The closest things to version numbers are individual commits to homebrew-core master, and even then I don’t know which commit I was on yesterday, before I ran `brew update` (desperate times call for etc.)

I do know which commit I’m on now, though!  `brew --version` tells me:

Homebrew/homebrew-core (git revision 8a34ac; last commit 2020-10-27)

which is a commit to update something called jfrog-cli to 1.40.0, made 22 hours ago, very close to the time I ran `brew update`.

Many commits have been made in the 22 hours since then, and every one makes all prior Homebrew configurations effectively unrecoverable, if usually in a superficially harmless way.

History moves forward and the past is erased.  What will be true tomorrow?  In a month?  In six months?

And how will I even know the name of the ephemeral past I have lost?  As “8a34acb309ba9d62b2d0377fe76c1a5731ddacc7”, a hash I was careful enough to write down this time around?  Seriously?
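To make the failure mode concrete, here is a toy tap structured the way homebrew-core actually is: one branch, no tags, no releases, with each “version” existing only as a commit you must write down yourself. The formula contents and filenames here are invented for illustration.

```shell
# A toy homebrew-core: every "release" is just another commit to master.
tap=$(mktemp -d)
cd "$tap"
git init -q -b master
mkdir Formula
echo 'url ".../ipython-7.18.1.tar.gz"' > Formula/ipython.rb
git add -A
git -c user.name=demo -c user.email=demo@example.com commit -qm 'ipython 7.18.1'
old=$(git rev-parse HEAD)   # the "version number": a bare commit hash
echo 'url ".../ipython-7.19.0.tar.gz"' > Formula/ipython.rb
git add -A
git -c user.name=demo -c user.email=demo@example.com commit -qm 'ipython: update to 7.19.0'
# The old formula is gone from the tree; the only road back is the hash
# you hopefully wrote down before updating:
git checkout -q "$old" -- Formula/ipython.rb
cat Formula/ipython.rb      # back to the 7.18.1 formula
```

With a real tap, the analogous recovery is `cd "$(brew --repo homebrew/core)"`, `git log -- Formula/<name>.rb` to find the old commit, and `git checkout <hash> -- Formula/<name>.rb` before reinstalling, assuming the tap is still a full git clone, which newer Homebrew installs may not keep.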

nightpool asked:

Homebrew is awful for a number of reasons, but why use it to manage Python dependencies? Can't you just install IPython using pip? Ideally in a virtualenv?

You can, and these days I would.

However, installation pages for python applications often recommend installing them via Homebrew even if they are on PyPI, and some are not on PyPI at all.  (I’m specifically talking about standalone applications, like development utilities, not libraries I want to use in the project when I’m developing.)

So, a python developer working on a Mac probably uses some tools that have been installed with Homebrew, unless they have been careful to avoid Homebrew from the moment they received the Mac.  These then need to be kept up to date, etc. with Homebrew. 

This category includes applications commonly used to manage virtualenvs or python itself, like pipenv, virtualenv, and pyenv.

The first time I installed pipenv was via Homebrew, because a setup tutorial at work told me to type `brew install pipenv`.  This is now officially discouraged by pipenv (I don’t think it was at the time), for a reason I later encountered on my own: pipenv uses the python version which installed it to create its virtualenvs, and virtualenvs contain many symlinks to the python that created them – links which will point back to Homebrew python, and which will break if Homebrew ever decides to “update” python.
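The symlink problem is easy to see with the standard-library `venv` module, which uses the same mechanism virtualenv and pipenv do under the hood. The paths in the comments are illustrative.

```python
# Why a Homebrew Python upgrade breaks existing virtualenvs: on macOS and
# Linux, a venv's bin/python is just a symlink back to the interpreter that
# created it. Replace or delete that interpreter (which is what an upgrade
# does, since old versions cease to exist) and the link dangles, taking
# every script in the venv down with it.
import os
import sys
import tempfile
import venv

env_dir = tempfile.mkdtemp()
venv.create(env_dir, symlinks=True)  # symlinks are the default behavior on Unix

link = os.path.join(env_dir, "bin", "python")
print(os.path.islink(link))    # True: not a copy, just a pointer
print(os.path.realpath(link))  # resolves to the creating interpreter, e.g.
                               # /usr/local/Cellar/python@3.8/.../python3
```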


stumpyjoepete asked:

It seems like Python (or a similarly architected language) would need a specially designed data structure for efficiently manipulating large arrays/matrices. Furthermore, there would be a huge performance hit if you used native loops rather than broadcasting ops. So, do you object to the basic setup of Pandas, or do you think it just did a shit job of being a good library that does that?

Pandas gets these fast linear algebra tools from numpy, and I don’t object to numpy.  Pandas adds things on top of numpy, and I object to those.

Pandas is not a package for fast linear algebra, it’s a package for running queries on 2D data structures that resemble queries on a relational database.  So it introduces things like:

  • Named and typed “columns” (AKA “fields”).

    This means we are thinking about matrices from a very different perspective from abstract linear algebra: not only do we fix a preferred basis, but the columns may even have different types that cannot be added to one another (say float vs. string).

    (I mention this to emphasize that pandas is not just an obvious extension of numpy, nor is numpy obviously the right foundation for pandas.)
  • A typed “index” used to identify specific rows.
  • Operations similar to SQL select, join, group by, and order by.

In other words, interacting with a data table in pandas is similar to running SQL queries on a database.  However, the pandas experience is (IME) worse than the SQL experience in numerous ways.

I’ve used pandas almost every day of my life for around three years (kind of sad to think about tbh), and I still frequently have to look up how to do basic operations, because the API is so messy.  I never forget how to do a join in SQL: it’s just something like 

SELECT […] FROM a 

JOIN b

ON a.foo = b.bar

To do a join in pandas, I can do at least two different things.  One of them looks like

[screenshot of one pandas join API]

and the other looks like (chopped off at my screen height!)

[screenshot of the other pandas join API]

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.  Got that?

Let’s not even talk about “MultiIndices.”  Every time I have the misfortune to encounter one of those, I stare at this page for 30 minutes, my brain starts to melt, and I give up.

As mentioned earlier, the type system for columns doesn’t let them have nullable types.  This is incredibly annoying and makes the column typing next to useless.  This limitation originates in numpy’s treatment of NaN, which makes sense in numpy’s context, but pandas just inherits it in a context where it hurts.

There’s no spec, behavior is defined by API and by implementation, those change between versions.

Etc., etc.  It’s just a really cumbersome way to do some simple database-like things.
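To make that concrete, here is a tiny invented example: the same join spelled both ways (`DataFrame.merge` and `DataFrame.join`), plus the nullable-type degradation just mentioned.

```python
import pandas as pd

a = pd.DataFrame({"foo": [1, 2], "x": ["p", "q"]})
b = pd.DataFrame({"bar": [1, 2], "y": ["r", "s"]})

# Spelling 1: merge, with the join keys given as column names
m1 = a.merge(b, left_on="foo", right_on="bar")

# Spelling 2: join, which works on the *index* by default, so the same
# query needs the keys moved into the index first
m2 = a.set_index("foo").join(b.set_index("bar"))

print(m1["y"].tolist())  # ['r', 's']
print(m2["y"].tolist())  # ['r', 's']: same rows, two argument conventions

# The nullable-type complaint: one null and the column's type degrades
print(pd.Series([1, 2, 3]).dtype)            # int64
print(pd.Series([1, 2, None]).dtype)         # float64
print(pd.Series([True, False, None]).dtype)  # object
```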

urpriest asked:

Thinking about your criticism of Jupyter (which I haven't used much, but which also applies to Mathematica which I use all the time), doesn't the same criticism apply to physical pen-and-paper notebooks, and blackboards? Especially if you use an eraser to go back and fix mistakes.

I don’t think so.

When someone writes out a calculation in a physical notebook, or on a blackboard, it doesn’t produce any “state” apart from what ends up in the minds of people reading it.

Every assumption used in the calculation is either explicitly written out, or supplied by the readers as tacit background knowledge.  There is no category of “assumptions used in the calculation which the blackboard knows, but no reader could know.”  A proof on a blackboard that uses information “only the blackboard knows” is just an invalid proof on a blackboard.

When someone writes out a calculation in a computer notebook (Mathematica, Jupyter, or the like), the notebook really does “know things” in a meaningful sense, and these are not just the implications of its current written contents.  (Unless you are doing purely functional programming, which is one way to keep a notebook from getting weird.)

In the notebook’s mind, every step it computed in the current session still holds true, even if that step was later erased.  This is different from what happens in a reader’s mind: the reader only considers the outcomes of the steps that they see in front of them.  This divergence between two belief states is impossible with a physical notebook.

Generally we want the computer to know things we don’t.  This is less important when doing pure math, where you still want to obtain a “blackboard-valid” proof at the end of your work.  But it’s very important in numerics or data analysis, where we use the computer to work with huge matrices or data sets which would take huge quantities of paper just to write down, and even huger ones if we need to derive things about them.

So the set of facts which the notebook knows, but we don’t, contains important things that determine the results of our work.  If we lose track of how this set of facts came to be, we don’t know what our work means anymore.

A simple example.  Say I write this on a blackboard:

y = 2x

x = 1

⇒ y = 2

A valid derivation.  Now, I erase the first line:

x = 1

⇒ y = 2

Now it’s just an invalid derivation.  There is no sense in which the calculation is “still valid, but for mysterious reasons.”

In a computer notebook, I could write (assuming the notebook language handles assignment in a way resembling our notation on the blackboard):

>> y = 2x

>> x = 1

>> y

2

and now erase the first line:

>> x = 1

>> y

2

As on the blackboard, this is no longer a valid derivation for the reader.  But the notebook remembers what was going on.  So I can go on to derive further results that make sense, but only in light of the notebook’s secret knowledge:

>> x = 5

>> y

10

Because people regularly go back and fix their mistakes, this situation – with “secret knowledge” used but never written out – is the norm in computer notebooks.  Preventing it requires special care, attentiveness, and discipline.
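The divergence is easy to simulate in plain Python, since a kernel session is essentially one long-lived namespace that every cell is exec’d into. (Python assigns eagerly, unlike the lazy assignment assumed in the notebook example above, so the ghost shows up as a stale value rather than a re-derived one, but either way the session ends up believing something no visible cell supports.)

```python
# Each "cell" below is exec'd into one shared namespace, which is all a
# notebook kernel really is.
session = {}
exec("x = 1", session)      # cell 1
exec("y = 2 * x", session)  # cell 2
exec("x = 5", session)      # cell 1, edited in place and re-run
print(session["y"])         # 2: still derived from the erased x = 1
# No visible cell says x = 1 anymore, yet the kernel's belief that y == 2
# survives. That is the "secret knowledge" a blackboard cannot have.
```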

I was expecting the pandas criticism in this post to be more controversial than the Jupyter part, but the reverse was true.  Interesting!

Indeed, in the post, I wrote

Everyone knows Jupyter Notebook is bad.  People talk about it with amused shame, like it’s candy or an addictive drug.

I knew this generalization wasn’t literally accurate, but it seems like it was less accurate than I realized.

—-

For more Jupyter criticism, I highly recommend that Joel Grus talk I linked.

It’s fast and punchy and has a lot of memes and humor, so it may not satisfy you if you’re looking for carefully explained arguments, but it covers a lot of ground.

The most relevant parts of the talk, for me, are the parts about Jupyter encouraging bad habits and making it harder to practice good habits.  I agree with these points, and also frequently hear them echoed by colleagues, hence my “everyone knows” comment.

szhmidty:

nostalgebraist:

A few really bad tools have risen to ubiquity in data science, and they’re an immense drag on the productivity of almost everyone in the field.

Someday someone is going to create, and then successfully promote, a serious competitor to these tools, and I will be so happy.  It won’t actually be that hard, because the tools are so bad.

The tools I’m thinking of are

- Jupyter Notebook (which is such an inherently bad idea it feels like a mean joke)

- Pandas (which is much less actively harmful than Jupyter Notebook, but is a very cumbersome and confusing way of doing some very basic and foundational tasks)

- “Jupyter + Pandas,” the synergetic combination of these two tools (pandas clearly expects you to use Jupyter so you can see its HTML output) that has data science in a tighter grip than either bad tool could manage on its own

—-

Everyone knows Jupyter Notebook is bad.  People talk about it with amused shame, like it’s candy or an addictive drug.  Here’s Joel Grus ranting about it for an hour, for example.

What is Jupyter Notebook?  It’s basically an interactive interpreter that looks like an IDE.  You can write long blocks of code at once easily, and you can go back and edit/delete/rewrite your code … and all the while you are in the same interpreter session, with the same global state, which was produced by code you ran earlier and then rewrote or deleted.

The state of the session is the context in which your code executes, yet it quickly diverges from anything your code could ever have produced!  Indeed, any Jupyter Notebook quickly develops a mysterious state which is impossible for anyone to reproduce perfectly.  A huge fraction of all code written by data scientists is first executed inside one of these phantom, inexplicable states.

Yet we develop our code in this nightmare joke IDE anyway, because nothing else has the same (fairly simple, but essential) visualization tools.  And because we like doing computations that take a while, and doing all of them in a single, convoluted, stateful process running alongside development is a simple (albeit horrible) way to avoid doing them more than once.

Some people embrace this tool to an extent I do not understand, seeing some untapped potential in it.  For example, Google made Colaboratory/Colab, Amazon made Sagemaker, and Netflix built some vast complex system around it so they could … so they could do … honestly, I watched that whole video and I’m still not sure.

—-

Pandas is … okay, I guess, it’s just very un-Pythonic.  Python is great!  That’s why these ubiquitous add-ons to python are so frustrating.

Python likes having one conceptually simple way to do each thing.  Pandas has a huge, inconsistent API with 5 different ways to do everything.

Quick, do you want `pd.read_sql` or `pd.read_sql_query` or `pd.read_sql_table`?  Do you want `isna` or `isnull`?  `join` and `merge` do overlapping things with different argument syntax.  There is no concept of a field/column with nullable type, so the moment you add a null value to a typed field, its type degrades to “object.”  Everything is fuzzy and squishy and changes from version to version.

But it prints the outputs of SQL queries in a pretty way that everyone loves.  … except only if you’re in a Jupyter Notebook.  You’re in a Jupyter Notebook, right?  You’re using pandas, right?  Right???

RE: Pandas

I hate Pandas. It feels like someone took R and ported it into Python.

RE: Jupyter

Interactive interpreter + IDE is pretty much how MATLAB works also. I’m not convinced it’s all that bad. Like yes, the environment the code executes in is inconsistent, and that can cause problems, but the solution seems fairly straightforward: use the “restart kernel and run all” button.

This isn’t even specific to Jupyter or MATLAB; Maple and Mathematica have the same issue. It’s real easy in Maple to accidentally write a statement that depends on the result of a computation 3 lines down.

Really, I think you shouldn’t be doing actual code development in Jupyter. Or rather, I don’t think you should be doing code deployment from Jupyter (Jupyter doesn’t think you should either, AFAICT; there’s no way to export the notebook to straight plaintext code). It’s nice for bespoke code that is meant to only do one idiosyncratic thing.

Which is exactly what most people doing data science are doing: they’re not on github making contributions. They’re using python as a particularly sophisticated calculator to solve a problem in front of them at the moment.

You mention not wanting to repeat computations because they take a while, but I think equally important is the convenience of an uninterrupted workflow and being able to test the current state of your code. Being able to really quickly iterate on the attempts at the next step in your code is really, truly useful. Without Jupyter, what I need to do is run the python interpreter and then import my code as is. Except if I make a mistake in my imported code, then I need to exit the interpreter, fix it, then reimport it. (There’s a module that helps with this, but it’s a tad prone to failure.) Every typo, every time you forget to import a module, every time copy and paste accidentally drops some whitespace is another exit, fix, reenter, reimport.

That’s a really annoying hassle when I just want to make a few interactive queries about the state my code produces to make sure I’m on track.

I also personally just like being able to bounce between executable code and rendered markdown. It’s a nice tool for presentation.

Really, I think you shouldn’t be doing actual code development in Jupyter. […] It’s nice for bespoke code that is meant to only do one idiosyncratic thing.

Which is exactly what most people doing data science are doing: they’re not on github making contributions. They’re using python as a particularly sophisticated calculator to solve a problem in front of them at the moment.

The definition of data science aside (I think it varies by company, definitely every data scientist I’ve known has committed code as part of their job) … I guess my opinion is that no one should only write idiosyncratic one-off code.

I mean, the entire history of computer programming is a long string of people noticing “hey, we’re doing this same long thing over and over again a lot, let’s turn it into a short command.”  If it weren’t for a long line of people scratching that same itch, we wouldn’t be talking about python and Mathematica, we’d still be writing assembly.  Or byte code.  Codifying and automating repetitive actions is the soul of programming, and it’s hard for me to imagine a day-to-day programming workflow where it simply never comes up.

More prosaically, I just don’t think anyone’s work is that reliably one-off.  Even if python is a sophisticated calculator to you, you are going to notice yourself doing the same long strings of calculator steps over and over again, and you’re going to notice them failing in the same ways, and you’re going to notice yourself Googling the same terms and looking at the same Stack Overflow pages … if your job involves writing a lot of code, it quickly becomes a good idea to write some of it down permanently for later re-use.  I think that principle generalizes across all work where people frequently type out lines of code.

I used Mathematica and MATLAB a lot back in my physics/math days, and I’m not a huge fan of either one.  Mathematica is definitely a lot like Jupyter Notebook, but that’s a count against both of them IMO.  What I remember of MATLAB was more like regular python, though?  You have scripts and you have a command line.  You can’t really develop code in the command line, you have to do it in the script editor.

Jupyter is certainly more “convenient” than some other workflows for quick one-off development, but this convenience quickly fades into confusion and frustration once your code crosses some low complexity bar.  Standard IDEs are not a great solution here, but they are a great solution for regular software engineering.

My frustration is that data science doesn’t have a good, mature equivalent of that tool.  Regular software engineers have regular IDEs, which were carefully crafted over time to serve their goals.  We just have Jupyter, which wasn’t carefully crafted at all, it’s just an ugly hack someone threw out there and everyone started using because there wasn’t anything else around.


stumpyjoepete asked:

inspired by your most recent post (625458970897874944): what other important lessons have you learned in your data sciencing job? what advice would you send to your past self?

Good question.  I’m not sure I have any really good answers – the most important lesson is the one I related in that post:

Over time, I learned the value of doing exactly what you want, not something close to it.  I learned that a little bit of data in your actual domain, specifying your exact task, goes much further than any domain-general component.  Your applied needs will be oddly shaped, extremely specific, finicky, and narrow.  You rarely need the world’s greatest model to accomplish them – but you need a model with access to a very precise specification of exactly what you want.

That said, here are some other things that come to mind.

(1)

Quality metrics work very differently in academic research and in applications.

In academia, people are usually working on well-defined tasks where the community can agree on benchmark datasets and standard metrics for each task.  (Accuracy, F1, BLEU, perplexity, whatever – they may not be perfect, but there’s an agreement to use one or a few of these per task as a “good enough” proxy to make results from different researchers comparable.)

In applications, you’re usually doing something novel where it’s not too clear what “good” even means.  Additionally, you get to define your task and to some extent your dataset.  Ultimately you want something that’s good on a human level (“what the users want” or something).

This means there’s an extra step, not even present in academic research, where you take an outcome defined in human terms and frame it as an ML task.  This choice of framing can drastically affect the quality of the outcome and the effort needed to achieve it.

(1b)

A particular pattern I’ve noticed in task framing: it’s often better in applications to impose a “hardcoded” structure where a decision is made in a sequence of easily understandable stages, rather than trying to make the decision end-to-end from the raw inputs.

As a made-up example, instead of making a recommender system that just decides what a user wants to see next based on all available info, you could instead build models that extract various intuitive features like “what genres do they like” and “do we think they want something similar or different from the thing they last saw,” and then make the decision based on those features.
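A minimal sketch of what that staged shape looks like in code.  Everything here is hypothetical and stubbed with trivial logic – in practice each stage would be its own small model – the point is just that the final decision is made from named, human-readable features rather than raw logs:

```python
# Hypothetical staged recommender: each stage produces an intuitive,
# inspectable feature; the final decision composes those features.

def predict_genre_affinity(history):
    # Stage 1 (stub): score genres by how often they appear in history.
    scores = {}
    for item in history:
        scores[item["genre"]] = scores.get(item["genre"], 0) + 1
    return scores

def predict_novelty_seeking(history):
    # Stage 2 (stub): did the user just binge a single genre?
    recent = [item["genre"] for item in history[-3:]]
    return len(set(recent)) > 1  # True = they seem to want variety

def recommend(history, catalog):
    # The decision is expressed in terms of the named features, so every
    # recommendation can be explained ("high scifi affinity, not seeking
    # variety") and each stage can be debugged or reused independently.
    affinity = predict_genre_affinity(history)
    wants_variety = predict_novelty_seeking(history)
    def score(item):
        base = affinity.get(item["genre"], 0)
        return -base if wants_variety else base
    return max(catalog, key=score)

history = [{"genre": "scifi"}, {"genre": "scifi"}, {"genre": "scifi"}]
catalog = [{"title": "A", "genre": "scifi"}, {"title": "B", "genre": "drama"}]
print(recommend(history, catalog)["title"])  # A
```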

The important part is that you may want to do this even if the end-to-end approach could easily figure out the same procedure on its own.

With the staged approach, it’s easy to explain decisions in human terms, easy to diagnose what’s going on when they fail, easy to try out new ideas by expressing them as compositions of the features (maybe you re-use the genre predictor in some other project), easy to extend with new intuitive features, etc.

Whereas if you make an end-to-end model, even if it does this one thing well, you’re kind of locked in to that exact framing.  It’s hard to go back and decompose its decisions into intuitive steps; the steps will all be implicitly mixed together in its learned parameters.  (In academia it’s popular to build end-to-end models and then try to decompose them via “interpretability” methods, and much of this strikes me as a waste of time.)

(2)

Incomplete data is ubiquitous in applications, and most existing tools are not well built for it.

What I mean by incomplete data is like, say you used to only measure 5 features per interaction/user/whatever, but now you measure 12.  You want to use all 12 features when available, but still get value out of that old data, which has “missing” entries for 7/12 of the current features.

Just on a grubby technical level, standard python tools handle this really badly.  You have to keep close track of python None vs. numpy nan, and pandas/scikit-learn/etc. seem built from the ground up on the assumption you’ll never have missing values, with errors or (worse) bizarre behavior when they’re present.
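A few of the grubby gotchas in one place – nan is not equal to itself, so naive equality checks silently fail, and pandas quietly converts None to nan on ingestion:

```python
import numpy as np
import pandas as pd

x = float("nan")
print(x == x)       # False: nan never equals anything, including itself
print(x == np.nan)  # False, even "comparing nan to nan"

# pd.isna is one of the few checks that treats None and nan uniformly.
print(pd.isna(None), pd.isna(np.nan))  # True True

# pandas silently converts None to nan when building a float Series.
s = pd.Series([1.0, None])
print(s[1])  # nan, not None
```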

If there’s a lesson here, it’s something like “think upfront about how you plan to handle missing values, and write your code with that plan in mind.”  I spend an embarrassing fraction of my work time handling None/nan bugs and could probably do better if I thought more proactively.

(2b)

Another thing that’s common in applications is highly unbalanced data, e.g. a classification problem where the answer is “No” 99% of the time but you really care about the 1% that’s “Yes.”

There’s plenty of research out there on “unbalanced data” per se, but papers that aren’t explicitly “about” this topic tend to use balanced datasets, and metrics like accuracy/F1 that work best with balanced data.

In classification, the Matthews Correlation Coefficient is a wonderful metric that behaves similarly to more popular ones but has no problems with unbalanced data.  I wish I’d known about it sooner.
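To see why this matters, here’s the classic failure mode: on a 99%-“No” dataset, a classifier that always predicts “No” scores 0.99 accuracy while being useless on the 1% you care about, and MCC correctly scores it at zero (sklearn defines MCC as 0 when the denominator vanishes, as it does for a constant predictor):

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# 99 negatives, 1 positive; the model just always says "No".
y_true = np.array([0] * 99 + [1])
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks great
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- correctly worthless
```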

(3)

Much of the data science code ecosystem is very new, and much of it is poorly maintained, unstable, poorly documented, or just full of hidden assumptions.

I used to make a mistake where I’d use superficial “officialness” or sleek presentation as a proxy for maturity.  In the python context, I’d look at things like whether a package was on PyPI, whether it had a simple and generic name, whether it seemed widely used, whether it was made by a big name like Google … and if these added up to a sort of “official” or “standard” vibe, I’d view it as trustworthy.  This was a very bad, perhaps even valueless proxy.

Oddly, a better proxy is the structure of someone’s documentation.  It’s a good sign if there’s a “User Guide,” separate from the code-level API reference, that walks through the different parts of the system in human terms.  (Examples: pandas, sklearn.)  This suggests the creators think about making a holistic system that “hangs together” in a stable way across versions.  On the other hand, it’s a bad sign if the documentation is a flat list of how-to-do-X tutorials (example: tensorflow, many AWS/cloud products).

I’ve also learned that the best reference for any open-source package/library, even the best ones, is the source code itself.  If a package/library is giving you trouble, you shouldn’t be shy about just looking at the code – I find this often quickly and cleanly resolves confusions that would have been impossible to resolve otherwise, and reveals a great deal of valuable information no one ever thinks to write down elsewhere.

(Frequently what you learn is that the authors assumed no one would ever do the thing you are, in fact, trying to do.  It’s important to learn this as fast as possible so you can start working around it.

Again, trusting the “official vibe” is bad: if something looks like the one-stop solution for everyone, trust me, it’s still assuming all kinds of things about you behind the scenes.  Truly general-use software exists – I mean, programming languages and stuff – but anything in data science that looks like that is faking it with hacks and duct tape.)

a-point-in-tumblspace asked:

Hey, you're a super accomplished ML person who very frequently says Right Things about software development -- do you still endorse your earlier exhortation to "use pytorch [instead of Keras] or if you have to use tensorflow just use raw ops"? (Much of the discussion around GPT-3 is just words to me, and I want that to _not_ be the case when the end times come for real, so I'm starting to get into ML, and I want to choose a good tool to learn with.)

Yes, definitely!  Specifically, if you’re just getting started, I strongly recommend choosing Pytorch and trying to avoid tensorflow/Keras entirely.

Code and models built in one of these frameworks can be highly nontrivial to port to the other one, so this is a pretty consequential decision point.

Also, I don’t know if you’re specifically interested in transformer models like GPT-n, but if you are, the Huggingface transformers package has become the de facto standard implementation of them, and it’s based in Pytorch.