Being able to build good representations of things is important. So many programming languages lack something as basic as sum types, which puts them behind the state of the art in the 1970s. Yes, Haskell has some fancier stuff in it as well, but a lot of the value comes from simply having a good way to represent "this or that" and applying it properly.
The other thing that we do all the time in everyday life is a quotient, and as far as I know no programming language has a good way to represent that: something as basic as a fraction (i.e. a pair of integers, but two pairs are equivalent when ...) can't be expressed except in an ad-hoc way.
I appreciate the author's enthusiasm for offering a large standard library of datatypes - I certainly think things like decimals and trees ought to be more available than they are - but I don't think we can even start on that until we actually have good tools for defining datatypes. And actually the best way to get there might be to have fewer built-in datatypes, and force ourselves to go through the work of defining numbers and strings and so on "in userspace", using the normal language facilities. That would quickly highlight a lot of quite basic things that a lot of languages are missing - e.g. literals for user-defined types are incredibly painful in a lot of algol-family languages.
> The other thing that we do all the time in everyday life is a quotient, and as far as I know no programming language has a good way to represent that: something as basic as a fraction
Er, fractions are a directly-supported (core or stdlib) type in many languages. Including:
Scheme-family languages (numbers work this way by default, unless you introduce irrational numbers), Ruby (doesn't default to rational math for numbers generally, but has syntax for rational literals, and rationals use rational math), Raku, Python, Clojure, Go, Haskell, and, well, lots of languages have core/stdlib support even if they don't have a literal syntax or Scheme-style default-to-exact-rational behavior.
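In Python, for instance, the stdlib `fractions` module gives exact rational arithmetic out of the box (a quick sketch):

```python
from fractions import Fraction

# Two different pairs of integers representing the same rational.
a = Fraction(1, 2)
b = Fraction(2, 4)  # normalized to 1/2 on construction

assert a == b
print(a + Fraction(1, 3))  # exact arithmetic: 5/6
```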
I'm pretty sure the poster meant the general 'fraction' (quotient) construction (set 'a' is like set 'b', but with certain elements regarded as the same: 'a rational number is a pair of integers, where you regard (ax, ay) as the same as (x, y)'), rather than rational numbers specifically. (Similarly, ActionScript 3 had generic lists, but not generic types in general, IIRC.)
Isn't this usually trivial to implement, but pointless because you'll never want to abstract over it?
It's basically a type characterized by a base type and a canonicalization function that maps all values in that type to a subset of the same type, with those values that have the same canonicalized value being equal.
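A minimal sketch of that idea in Python (the names here are mine, not from any library): a quotient wraps a base value plus a canonicalization function, and equality and hashing go through the canonical representative. Sign handling and error checking are omitted.

```python
from math import gcd

class Quotient:
    """A value of a base type, considered up to a canonicalization function."""
    def __init__(self, value, canon):
        self.canon = canon
        self.value = canon(value)  # store the canonical representative

    def __eq__(self, other):
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)

# Fractions as pairs of ints, canonicalized by dividing out the gcd.
def canon_frac(p):
    n, d = p
    g = gcd(n, d)
    return (n // g, d // g)

half = Quotient((2, 4), canon_frac)
also_half = Quotient((1, 2), canon_frac)
assert half == also_half
```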
In order to deal with quotients it is not necessary to normalize to a single representative for each equivalence class (though sometimes that can be useful).
The more important part is a facility for detecting whether two objects are equivalent, and a guarantee that all other operations respect the structure.
> In order to deal with quotients it is not necessary to normalize to a single representative for each equivalence class
I may have been thinking about too close of an analogy with mathematical quotients where there are other operations and you are going to want to apply those to the canonical form generally, but, sure, I can see that.
My point was in languages where you can construct user-defined generics types, a general implementation is usually straightforward but not particularly useful, because it's an abstraction you'd almost never consume except in defining concrete types. General operations on quotient-like types aren't something that there is much call for. And I think that's just as true with or without canonicalization.
I would suspect that that is why, unlike say iterables (or in languages with more powerful type systems, monads) it's not something you see supported out of the box in type systems.
Quotients are a bit of an unsolved problem even in Agda. Cubical Agda makes the definition easier but makes it much harder to work with the resulting datatype, while if you don't use higher inductive types you end up in setoid hell where the language can't do any of the obvious things you should be able to do automatically.
I appreciate Haskell's power in this domain, and apparently sum types are life-changing (they get mentioned in so many language conversations).
That said, I'm mostly concerned with types like columns, rows, tables, colors, urls, documents, or images. These are implementable in lots of languages, but many languages just have a lot of ad-hoc implementations of them, of the same concept but with slightly different tilts on it. I'm saying that it would be interesting if a language focused on that part of the ecosystem, so that things could connect together.
Haskell might have a very nice type system, but it does not seem like it has many universally-accepted standard types for everything to use. For example, as soon as it gets up to the complexity of a "text" type, there are lots of slightly-different flavors of text.
I think this is the same kind of misguided as "every app has its own ad-hoc implementation of login, it's the same concept with just a slightly different tilt on it, so the language should offer a login function". The essence of programming is properly representing similarities between similar things but also properly representing the sometimes quite subtle differences. So most mature languages, rather than offering us a bunch of premade functions, try to offer us some low-level building blocks and make it easy to compose them together to suit our purposes (and I think this is why giant class libraries have gone out of fashion - even though classes are sort of extensible, customising someone else's class turns out to be a lot harder than using someone else's function). And I think that's the part that's really missing in data-land (and I'd say even Haskell is barely any way along this road) - not a bunch of predefined datatypes but really easy ways of composing basic datatypes and building them up into more complex datatypes that meet our needs. Having lots of similar but subtly distinct datatypes isn't actually a problem if your language is good at dealing with it, just as having lots of similar but subtly distinct functions isn't actually a problem.
"of the same concept but with slightly different tilts on it"
The answer is, they aren't the same concept. In English, we may look at multiple implementations and say "hey, those are all an 'image'", but in reality, each of these is a distinct concept:
A representation of the image meant for manipulation by pixel, such as for a 2d image editor.
A representation of the internal structure of the image format, such as giving access to the PNG or JPG frames.
The raw bytes of the specific serialization, such as for shipping around a network.
Representations of the image in a display context, such as the display driver might use.
Representation of the image from the point of view of metadata, such as a file explorer may need.
Representation of an image from the point of view of editing, where it may be disassembled into frames or layers for later reconstruction after applying edits, such as Photoshop may use.
You can't create an "image" class/package/library that will fit all those needs, to say nothing of the ways those needs may interact with other things (e.g. I may need metadata for that file explorer that includes non-images as well, even if there are image-specific elements to it like a thumbnail).
Moreover, the lack of connectivity between these sorts of objects isn't even in my top 25 programming problems... I'm almost always several layers composed on top of these sorts of things anyhow, with my own local data concerns. Even if I've got an "image", I've also got a ton of other things with it that nothing else can be expected to understand (like which user this is an avatar for, or where the image came from, or any number of other things).
It's sometimes annoying to deal with too many color types or something, but that's a local concern for an ecosystem, not a universal problem a general-purpose language should be spending valuable design budget trying to dictate. (DSLs can, if appropriate, precisely because they are domain-specific languages.)
Hey, you're the author! I just wanted to thank you for writing this. I saw it yesterday, actually, and it set me off on an adventure: you mentioned seaborn, so I went and learned that (it's beautiful), which mentioned xarray (http://xarray.pydata.org/en/stable/) which is the generalized solution to the problem you mention!
I'm a bit skeptical, but, xarray's demos are very impressive. They do all kinds of geospatial modeling on ocean salinity, ozone, etc, and the data model seems general enough to create beautiful plots automatically (for various reasons, but mostly because of what you were saying in your post: simplicity enforced by constraints).
I'd love your thoughts on xarray, since it was such a serendipitous discovery in relation to your post. (If you're not impressed with xarray, then that's probably a useful signal that perhaps it's a step in the wrong direction on the ladder of complexity.) But either way, thanks for reminding me that "the solution is elegance, not a battalion of special cases."
(Not the OP) I’ve been using xarray for the better part of a year now on some serious scientific projects. It certainly has some pain points and places to grow (it’s still a 0.x release after all), but when it works it’s truly excellent to use. So many things “just work”
Thanks for the review! I'll give it a serious try then.
Does your code happen to be available anywhere? I'd love to learn by example from someone who's used it for so long on real work. (Feel free to DM me on twitter: https://twitter.com/theshawwn if you'd like to share it privately.)
I think you have a point, but the situation you describe is largely because things are in flux - that is certainly the case for web pages.
When consensus is reached, things settle down - so most new languages give you associative arrays out of the box. On the other hand, there was a time when exceptions seemed to be the consensus on runtime error handling, but there has been some rethinking on that...
Then there is APL, which seems to exemplify what you are asking for. It is very well-regarded by those who get it, but it seems to have been just too difficult (or alien) for most of the engineers whose purpose it addresses.
Haskell may not be the best example to make a case from, as it is still, to a large extent, an experimental language. Spreadsheets are also something of an outlier, as they are a system of data manipulation already developed before it was practical to computerize them.
If transfer size isn't an (important) consideration, then the one true answer is: just use a record, which doesn't enforce an ordering and is explicit about which coordinate is which.
No. To satisfy the complaint, you would have to be able to create the rational-number type yourself in userspace, and moreover it would have to be built in such a way that you didn't need to worry much about cancellation.
Analogously, Rust or F# have type systems which allow you to build the Option type yourself easily: there is no magic behind them, and you can just write down whatever sum type you want. Similarly, we want to be able to build arbitrary quotient types (not just of numbers, but of general sets with equivalence relations) ourselves.
I don't think I have ever seen a mainstream language letting you design data types with built-in normalization. The reason is that you normally have algebraic, i.e., sum and product, types. In such constructions you would expect construction and deconstruction to compose to the identity function. The closest that comes to mind are languages that marry algebraic data types with OO programming (Scala, maybe OCaml, F#). There you can have constructing functions, but the normalized result would still be expressed as a plain ADT. I also don't think that GADTs help here, but I might be wrong.
In any case, it is an interesting proposition. In order to be non-trivial the type system would have to be able to express the constraints of the normal form.
I think the most interesting design space is around not requiring normalisation. After all, if you have a normal form, you can probably represent it with sum and product types. Cubical Agda gives you this ability but it's really hard to use (as you might expect from one of the first attempts to create the functionality).
> The other thing that we do all the time in everyday life is a quotient, and as far as I know no programming language has a good way to represent that: something as basic as a fraction
How would the implementation of a general equivalence relation work? Just defining the equivalence relation does not yield a method for choosing a representative. Would you check the equivalence at each operation?
If the focus is on finite data structures only and the equivalence relation is "are the types isomorphic", then each type is isomorphic to the ordinary generating functor with some coefficients C : N->N:
> Just defining the equivalence relation does not yield a method for choosing a representative. Would you check the equivalence at each operation?
Choosing a canonical representative is not feasible in the general case, so yes one would need to prove that every operation respects the equivalence relation.
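A tiny sketch of the no-normalization approach (my own illustration): fractions kept as raw pairs, with equality and addition defined so that they respect the equivalence (a/b ~ c/d iff a·d == c·b). Nothing is ever reduced; the burden is proving each operation is well-defined on equivalence classes.

```python
class RawFrac:
    """A pair of ints, never normalized; equality is cross-multiplication."""
    def __init__(self, num, den):
        assert den != 0
        self.num, self.den = num, den

    def __eq__(self, other):
        # a/b == c/d  iff  a*d == c*b
        return self.num * other.den == other.num * self.den

    def __add__(self, other):
        # Well-defined: the result's equivalence class doesn't depend
        # on which representatives we started from.
        return RawFrac(self.num * other.den + other.num * self.den,
                       self.den * other.den)

assert RawFrac(2, 4) == RawFrac(1, 2)          # never normalized, still equal
assert RawFrac(1, 2) + RawFrac(1, 3) == RawFrac(5, 6)
```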
I don't think the addition of sum types broadly will be the panacea that people think. Software does need to be maintained, simpler things are easier to maintain.
What simpler alternative is there? The OOP way would be to always hide data. This certainly has a different set of trade-offs, and sometimes requires the visitor pattern. Maintainable and simple are not words I would use to describe these cases.
The other alternative is to emulate sum types through run-time type information or using code generation like protobuf. I also fail to see how these approaches are simpler or more maintainable.
Sum types are not a panacea for all software development ills, but they're the simplest and most direct solution to a concrete problem: My data is either this type or that type.
Having had to use a bunch of different spatial packages, each defining their own `Position` type, I would love it if there were a standard type in the standard library or a very simple minimal "type only" package that they could all depend on.
I could just pass values between them, instead of having to convert constantly between the packages' individual representations.
This connects to something I call "code bureaucracy" - in my experience working on legacy code, upwards of 50% of programming is being spent on shuffling data from one place to another, converting its type several times along the way, because different parts of the system - both internal and third party - each define their own version of same or similar types.
I wish in particular we could reduce the number of equivalent types across libraries, but I feel the solution will have to involve tagging types with WikiData identifiers or equivalent. So "Color" becomes defined as type Q1075, or perhaps Q1075<Q166194> to show you mean an RGB-denoted color...
100% agree, having different point representations is a pain.
I also don't like it when they're represented as objects (which is almost always the case). It seems so much more powerful, and expressive to me to just keep it as a tuple, so that you can work with matrices of points.
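For what it's worth, a dependency-free sketch of that point (with numpy you'd get the same thing vectorized): if points are plain tuples, a "matrix of points" is just a list, and transforms are ordinary bulk operations rather than per-object method calls.

```python
# Points as plain tuples; a "matrix of points" is just a list of them.
triangle = [(0, 0), (1, 0), (0, 1)]

def translate(pts, dx, dy):
    """Shift every point at once."""
    return [(x + dx, y + dy) for (x, y) in pts]

def scale(pts, s):
    return [(x * s, y * s) for (x, y) in pts]

assert translate(triangle, 2, 3) == [(2, 3), (3, 3), (2, 4)]
assert scale(triangle, 2) == [(0, 0), (2, 0), (0, 2)]
```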
I think there is space for a programming language with first class support for typed tables. A blend of C# and SQL. F# with type providers does have some of that, or C# with linq, but in both cases, tables/relations are second-class citizen.
There is one way to represent things: tree notation. Or in other words: “space is all you need”. Everything can be encoded in a tree, from binary to this comment I'm writing now. There is a whole world of unexplored 2D and 3D languages that utilize space, and not 1D visible syntax delimiters, to communicate meaning. It's a simple but hard thing to communicate. However, I'm seeing plenty of signs that it's starting to catch on. Another way to think of it is that spreadsheets (1B users) and programming languages are about to meet and have a baby, and the result is going to be a leap across a chasm from the languages we have today.
Do you know of languages where parsing does not begin at position 0 and proceed linearly to position N?
In certain designed 2D/3D languages you can have N parsers that start and end in different places in any direction, potentially even in a random order depending upon the language.
I thought that as well: many types tightly tied and interconnected in a well-designed type coercion system. I think Red is awesome as long as it is limited to a prototype, but for serious development it will soon fall short. But I didn't spend much time with it.
Take something like the design discipline Scott Wlaschin describes in this talk "Domain Modeling Made Functional"[1]. Combine it with a catalog of standard models like David Hay's "Data Model Patterns: A Metadata Map"[2] which is like a Pattern Language for data models combined with a catalog covering most (enterprise) needs. Specify the binary formats with Erlang "bit syntax"[3]. Eh?
I was thinking about something similar the other day when I was setting up a language server in my IDE for a non-mainstream programming language.
What if there was a language we could develop APIs / libraries with that would automatically work with any other language / runtime? I guess I'm describing C interop, but that's too low level. C stands on its own. I have no idea how to solve this, but there would have to be some standard interface that the C interop is developed against, to provide additional context / metadata to allow more native interoperability.
I don't know if I'm doing a good job explaining it. But, the idea is just like a language server. A server is implemented by a specific runtime and it can automatically interop with any client-library written in any other language, natively. This just sounds like wishful thinking.
There is something like that, a Unix shell. You can use the shell environment to string together small programs written in different languages.
I mostly work in nodejs, and instead of connecting my things through language level bindings, I just write simple wrappers to call a child process and handle stdout and stderr. It might not work for all situations, but it means I can build my bits and pieces out of whatever language I want.
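The same pattern outside Node (a Python sketch, since the idea is language-agnostic): wrap the external tool in a child process and treat its stdout/stderr as the interface.

```python
import subprocess

def run_tool(cmd, *args):
    """Call an external program; its stdout is the 'return value'."""
    result = subprocess.run([cmd, *args], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout

assert run_tool("echo", "hello").strip() == "hello"
```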
I think the implicit desire of the parent is for the environment to handle data conversions for you.
Unix shell is an unstructured text environment. Data loses all semantic context the moment it's pushed to stdout, and thus every program that wants to interact with it needs to have its own, half-baked, buggy parser included. This is wasteful and a source of many maintenance and security problems.
In some Lisp machine demo by Kalman Reti, he talks about that aspect. Unix serializes everything to strings, while Lisp machines just pass pointers around. Obviously simpler and faster, but maybe less easy to move across machines (I'm speculating). I've always been saddened by the neverending amount of grep/perl fu needed in Unix to do the same thing over and over.
Similar feeling when using PowerShell: a lot of benefits from having a uniform set of interfaces for any kind of object. That said, it feels a little straitjacketing compared to crunching strings.
Yeah. I love PowerShell conceptually, but at least in practice, it's cumbersome to exploit the object-based nature. Arguably this is only a UX problem. Say I want to kill all notepad.exe instances. I open PS, and do:
Get-Process -Name notepad
(Thank you autocomplete for reminding me about -Name parameter). Now how do I kill it? Lemme see what I'm getting:
Get-Process -Name notepad | Get-Member
Aha! It says the object has a `void Kill()` method. But how do I invoke it on each of those? If I do a lot of this, I'll probably remember[0]:

Get-Process -Name notepad | ForEach-Object { $_.Kill() }
Awesome. With Tab-completion, not that cumbersome for common tasks, but it's annoying at discoverability. I have to do that Get-Member dance for each new command / type of object I'm working with, just to know what properties and methods are there for me to use. The UNIX equivalent would be something like[1]:
kill $(ps aux | grep '[n]otepad' | awk '{print $2}')
But the difference is, I can arrive here by just knowing grep, awk, and seeing the output of ps. I don't have the extra step of having to inspect the particular type of objects being returned by a command. In UNIX, I have to write parsers for everything, but parser-writing tools like grep, sed and awk are generalizable across all problems, and with them, I can massage any output I see on screen into valid input for a command I want to use. Every complex command I do uses the same set of knowledge.
PowerShell could use exposing the internal objects more, to bridge this gap. That would require something more complex than linear terminals we're used to - something that would give user an IntelliSense or point-and-click quality. At the very least, it should start exposing API documentation (i.e. descriptions, not just function signatures) in Get-Member[2], but preferably it should have IntelliSense popups telling you about the type you can expect from a command (or that you just got), and what you can do with it. PowerShell ISE is 20% there. My dream would be something McCLIM-like - that I could point at any piece of human-readable output and jump straight to the property that it represents.
--
[0] - Yes, I know there's a simpler variant for this case, but I'm showing the general pattern for an arbitrary object collection.
[1] - Again, I know there's `killall'.
[2] - Hell, they should start shipping the documentation with default installation. As it turns out, Get-Help can't help you much until you let it download the documentation package on first use. Which was super-annoying when I had to do some PS work on a VM with no direct Internet access.
Yes, I think it boils down to having to know MS's mental model of objects before being able to leverage it, while Unix and strings have no model; you're free to extract structure and data as you see fit (shooting your foot or not). Text is always kind of unbounded in possibilities, but requires you to do a bit more work on your own.
No, but someone else does. Every language comes with its own collection[0] of half-baked, bug ridden libraries for parsing JSON and CSV. JSON and CSV are particularly poor data formats, in that JSON has an impedance mismatch if the language you're using isn't JavaScript[1], and CSV means whatever the tool generating it think it means - it doesn't have to be comma, separated, or just values.
Also, if you adopt either, you lose the benefit of human-readability, which is half of the justification for the UNIX tooling to emit plaintext in the first place. So one could as well bite the bullet and treat machine readability and human readability as separate concerns, and use formats suitable for each individually.
--
[0] - It's never just one library.
[1] - Do JSON objects get mapped to structs, arrays or hash tables in your language? How do you distinguish between true, false and null?
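Concretely, in Python's stdlib `json`, objects become dicts, `null` becomes `None`, and `true`/`false` become `True`/`False`; other languages make different choices, which is exactly the impedance mismatch:

```python
import json

data = json.loads('{"flag": true, "missing": null, "nums": [1, 2.0]}')

# JSON objects map to dicts here; other languages map them to structs
# or hash tables, and numeric types (int vs float) vary by implementation.
assert data == {"flag": True, "missing": None, "nums": [1, 2.0]}
assert isinstance(data["nums"][1], float)
```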
Perhaps none of this will be new to you, but here are my thoughts:
It's challenging because of the mismatches that can arise between programming languages. We tend to use C as the lingua franca for communication between languages (i.e. foreign function interfacing, FFI), but it doesn't make the problem trivial. C++ has dynamic dispatch, whereas C doesn't. Java has garbage-collection, whereas C doesn't. Haskell has lazy evaluation. C# has async. Lots of things that can't be mapped across directly to C functions.
FFI is by no means a neglected domain. Where it's possible to make it easy, it generally is easy. C++ written in a C style, can very easily interface with C. Same for Ada. The trouble is when there's mismatch from the more advanced features of C++/Ada that don't map that naturally to C.
I don't think a 'language server' would really bring anything new to the table. Where that approach makes sense, we already have REST.
> GraalVM is a universal virtual machine for running applications written in JavaScript, Python, Ruby, R, JVM-based languages like Java, Scala, Clojure, Kotlin, and LLVM-based languages such as C and C++.
and yes, it can do interop between those languages
I think there will always be a trade-off between "one way to do things" (opinionated approach) vs flexibility - the thing is, we need both, depending on the use case and the person using the language.
But I do think we are falling behind with opinionated approaches. I am guessing that is because experts are usually the most prominent in communities, they are directing the ecosystem, and they prefer flexibility and power vs opinionated approach.
Ruby on Rails was an opinionated approach made by non-expert that grew into something really big because of that.
I am also trying to contribute in that direction with highly opinionated solution for web app development (https://wasp-lang.dev).
I appreciate the author's argument. On the other hand, isn't the reason there are not more first-class data types that they are so difficult to design for general-purpose use? Even simple data types vary so much from language to language. It seems reasonable that this is why generalizing more abstract data types doesn't seem to catch on universally very often.
It would be more convincing to see concrete examples, such as a type defined in a language such as Haskell that seems to be more or less universally applicable.
What about Pure OOP? It has one way to represent things: everything is an Object. Details of the data structures can be hidden behind the method interface.
In this system, an object is simply a pointer to some memory (where instance data can be stored), preceded by a pointer to a 'vtable'; i.e. `myObject` is a pointer to some memory, and `myObject[-1]` is a pointer to a vtable.
Vtables are themselves objects, implemented in the same way as above. There is a "vtable API" defining methods that a vtable should support, and a default 'vtable vtable' is defined during a bootstrap step (which is its own vtable).
The resulting object system has objects which are completely opaque: they're just arbitrary chunks of memory, which are only useful to the methods defined by their associated vtable. Since those vtables are also opaque objects, we can provide our own objects which implement the idea of 'method call' in whichever way we like.
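A loose Python analogy of that open object model (my sketch, with dicts standing in for raw memory and vtables; this is not Piumarta's actual COLA code): method lookup itself goes through the vtable, so swapping the vtable changes what 'method call' means.

```python
def send(obj, selector, *args):
    """Method call: resolve the selector through the object's vtable."""
    vtable = obj["__vtable__"]
    method = vtable["lookup"](vtable, selector)
    return method(obj, *args)

def default_lookup(vtable, selector):
    return vtable["methods"][selector]

point_vtable = {
    "lookup": default_lookup,  # replaceable: redefines 'method call' itself
    "methods": {"norm2": lambda self: self["x"] ** 2 + self["y"] ** 2},
}

p = {"__vtable__": point_vtable, "x": 3, "y": 4}
assert send(p, "norm2") == 25
```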
Piumarta's COLAs work, and Maru, do not get enough attention. I only understand maybe 30% of the COLAs paper, but one can intuit that there is real power there.
And everything has only one method in common: GetIdentifier().
So you've got components that cannot talk to each other, because the available methods differ (not just by name, but by what you can actually do), even though they are working on the same conceptual thing.
The article argues for more "base datatypes" that are in common use. It doesn't matter whether the language/runtime exposes them as objects or PODs.
The "base datatypes" have to be accessed somehow. If you have base-type Array you have to access its elements somehow. In C-type language you would access the elements by calculating a memory-offset and reading that location in memory. The problem is that now your code depends on elements of Array being stored in memory in specific (relative) locations.
Whereas in an Object Oriented language you would access the elements by calling a method myArray.at(123). The benefit is you don't need to know how that data is stored or where.
In OOP the elements of an Array of course can be Arrays too, whose elements you can similarly only access via a method-call.
Details could be usefully hidden, but also exposed or partially exposed in unexpected and undocumented ways.
OOP is only an attractive and helpful conceptualization of doing something with a data representation in a computer's memory, not a magical way to suppress differences. It doesn't help with incompatible representations, operations and expectations about them except by providing some tools to deal with problems.
On a more formal side, as already noted in other comments, if everything is an Object everything is nothing more than an Object: a one-type programming language is certainly coherent but no good if one wants to write meaningful programs.
"Everything is an object" is a very deep idea. Make a system where everything is an object and every object knows what it can do: it becomes dynamically self-describing.
The "formal side" assumes a static environment, but our OS and apps and browsers are constantly churning. When the runtime reality diverges from the compile-time model, you'll want dynamicism: can this object do that? What happens if I try to invoke an unsupported operation? etc.
In Smalltalk, if you really need to find this, there is the Finder in which you can provide examples of input and output and it will suggest messages that implement the desired functionality.
Also, the message does not need to be called "join," especially since its use in the culture is ambiguous and confusing.
This doesn't help you to eliminate impossible values. A large aspect of datatype modeling is to encode constraints from the problem domain. Once encoded they are automatically enforced. Maybe the Builder pattern comes close to that ideal in OO, but it is really tedious to do right.
This has been my experience as well. It can be extraordinarily difficult to design a consistent OO type system to represent things like quantities with units and references, alongside quantities with just units, and quantities with neither units nor references. For example, altitude is a quantity plus a unit of distance plus a reference (e.g. above ground level). Except some references require different data to convert. So AGL requires elevation for conversion to mean sea level, for example. Designing an OO system to represent that and constrain against invalid operations gets super complex, and the developer ergonomics rapidly degrade in most languages. I've only ever successfully done it in C#, and even then I've found maintaining the unit conversion code annoying.
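A stripped-down sketch of the ergonomic problem (all names here are hypothetical, not from any units library): an altitude is a value plus a unit plus a reference, and operations must refuse to mix references, because converting between them needs external data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Altitude:
    value: float
    unit: str       # e.g. "m", "ft"
    reference: str  # e.g. "AGL", "MSL"

    def __add__(self, delta: float) -> "Altitude":
        return Altitude(self.value + delta, self.unit, self.reference)

    def difference(self, other: "Altitude") -> float:
        # Mixing references (AGL vs MSL) would need terrain elevation data,
        # so a bare subtraction must reject it.
        if self.unit != other.unit or self.reference != other.reference:
            raise ValueError("incompatible unit or reference")
        return self.value - other.value

a = Altitude(100.0, "m", "AGL")
b = Altitude(40.0, "m", "AGL")
assert a.difference(b) == 60.0
```

Even this toy version hints at the degradation: every new reference multiplies the conversion paths you have to maintain.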
Using your example, in the limit you end up with a data type carrying a chain of conversions with it, that reference some database that you don't own, but have to maintain access to. That altitude's reference itself is a particular geographical coordinate system (of many possible GCSes) that references a particular datum ("on what planet are we, and how does it look", of many possible datums), both of which are maintained by third-party organizations and political in nature. And then when you have another altitude with a slightly different reference, you have to handle conversions between them.
At which point you may decide you just want to represent altitude with reference = "same as everywhere else in the program, because if it's the same, it doesn't matter what it is". Which is what most non-GIS code does, except it isn't documented explicitly, and it's up to the user or integrator to ensure the data fed to the program uses the same reference as the user expects to see.
You still have to stop somewhere and supply actual numeric data (which is the only real data there is).
I think it was in one Smalltalk book: so there are objects that send messages to each other, but how things are really done in Smalltalk? :) At some points there are primitive actions that actually do the work. Similarly with data we can define objects or any other higher-level concepts, but somewhere there must be actual data under all these abstractions.
It is in the final section of the Blue Book [1], where the virtual machine is described in Smalltalk itself. Basically you have tagged integer values and pointers and that's it. The VM describes a bytecode set that involves manipulation of memory structures that are the objects, along with very primitive arithmetic operations on low level integers.