The elephant in the room with data is that we don’t need a lot of the fancy and ...

chasil · on Nov 16, 2022

The article mentions this workflow:

"Let’s now execute the script multiple times, one per set of parameters, and store the results in the experiments.db SQLite database... After finishing executing the experiments, we can initialize our database (experiments.db) and explore the results."

Be warned that issuing queries while DML is in progress can result in SQLITE_BUSY, and the default behavior is to abort the transaction, resulting in lost data.

Setting WAL mode for greater concurrency between a writer and reader(s) can lead to corruption if the IPC structures are not visible:

"To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]."

If the database will not be entirely left alone during DML, then the busy handler must be addressed.
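
For anyone who wants to see what addressing the busy handler might look like, here is a minimal sketch using Python's standard `sqlite3` module. The helper name `open_experiments_db` and the 5000 ms default are illustrative assumptions, not something from the article. It sets `PRAGMA busy_timeout` so a blocked statement retries for a while before failing with SQLITE_BUSY, and it switches to WAL mode, which only makes sense when every reader and writer is on the same machine.

```python
import sqlite3


def open_experiments_db(path, busy_timeout_ms=5000):
    """Open a SQLite database with a busy timeout and WAL journaling.

    Hypothetical helper: the name and the default timeout are assumptions
    made for this sketch.
    """
    conn = sqlite3.connect(path)
    # Ask SQLite to keep retrying for up to busy_timeout_ms when the database
    # is locked, instead of failing immediately with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # WAL allows readers to proceed while a single writer is active.
    # It relies on shared memory, so all connections must be on one host.
    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    return conn, mode
```

Calling `open_experiments_db("experiments.db")` returns the connection and the journal mode SQLite actually applied, which is worth checking because the pragma can silently keep the old mode in some situations (for example, in-memory databases).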

habibur · on Nov 16, 2022

None of these are a problem for the workload discussed.

When I am working with sqlite I am more likely accessing it from a single machine.

And in this ML case, most likely from a single process that is run multiple times in series.

citizenpaul · on Nov 16, 2022

Unless your income depends on carrying out the exact demands of some money guy whose most common phrase while using a computer is "it won't let me" and who wants "big data".

Then you just suck it up and build one of the totally unnecessary big data systems that have been excreted all over the business world these days. I don't think the problem is that devs are over-engineering.

I wonder what it's called. It makes me think of the tragedy of the commons, but that's probably not quite right.

tomrod · on Nov 16, 2022

Hierarchies and bureaucracies, by Jean Tirole. I know because this was the phenomenon I wanted to study in grad school, only to find he had scooped me (on this and several other items) by several decades.

Edit: Tirole, Jean. "Hierarchies and bureaucracies: On the role of collusion in organizations." JL Econ. & Org. 2 (1986): 181.

melagonster · on Nov 17, 2022

If this research is so old, has the world tried anything to ameliorate this problem? I guess it hasn't happened yet...

tomrod · on Nov 17, 2022

36 years is not old in terms of research. 2,223 cites on Google Scholar and many in the past year. Seminal research often identifies the problem but not all solutions.

BeefWellington · on Nov 17, 2022

What gets me is how many companies paid through the nose to push their data into things like Hive and slowed down 99% of their queries to make one "run once a quarter" report run about 25% faster.

At least that was my experience a number of years back.

morelisp · on Nov 16, 2022

Maybe like 20 years ago you were right but today there's a generation that's been working for 10 years on systems built like that. They don't know any better, and in most cases nobody is around to teach them otherwise.

bob1029 · on Nov 16, 2022

> SQL against a relational database gets us extraordinarily far.

I think it gets us all the way once you consider the ability to expose domain-specific functions to SQL that are serviced by your application code.

I've always been of the mindset that you can do anything with SQL if you are clever enough.
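
As a rough illustration of exposing an application function to SQL, here is a minimal sketch with Python's `sqlite3` module and its `create_function` method. The `f1_score` function, the `runs` table, and its `prec`/`rec` columns are all made up for this example.

```python
import sqlite3


def f1_score(precision, recall):
    """Hypothetical domain function: harmonic mean of precision and recall."""
    if precision is None or recall is None or precision + recall == 0:
        return None  # surfaces as SQL NULL
    return 2 * precision * recall / (precision + recall)


def connect_with_functions(path=":memory:"):
    conn = sqlite3.connect(path)
    # Register the Python callable as a two-argument SQL function.
    conn.create_function("f1_score", 2, f1_score)
    return conn


conn = connect_with_functions()
conn.execute("CREATE TABLE runs (name TEXT, prec REAL, rec REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("a", 0.5, 0.5), ("b", 0.9, 0.6), ("c", 0.0, 0.0)],
)
# The query calls back into application code for every row.
best = conn.execute(
    "SELECT name, f1_score(prec, rec) AS f1 FROM runs ORDER BY f1 DESC LIMIT 1"
).fetchone()
```

The function runs inside the process that owns the connection, so this works for embedded databases like SQLite; server databases have their own extension mechanisms.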

isoprophlex · on Nov 16, 2022

Yeah and even if you do need to do proper big-dataset-ML... a SQL box and maybe something like a blob storage for large artifacts (S3, Azure storage account, whatever) is all you need as well. But if your boss bought The MLOps Experience, you gotta do what the cool kids are doing!
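
A minimal sketch of that split, assuming a local directory stands in for the blob store (S3 or similar would replace `save_artifact`) and SQLite stands in for the SQL box. The table layout and function names are hypothetical.

```python
import hashlib
import sqlite3
from pathlib import Path


def init_db(conn):
    # Metrics and metadata live in SQL; large artifacts are only referenced.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "name TEXT PRIMARY KEY, accuracy REAL, artifact_uri TEXT)"
    )


def save_artifact(blob_dir, payload):
    """Write bytes to the stand-in blob store, keyed by content hash."""
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(blob_dir) / digest
    path.write_bytes(payload)
    return str(path)


def record_run(conn, name, accuracy, artifact_uri):
    conn.execute(
        "INSERT INTO runs (name, accuracy, artifact_uri) VALUES (?, ?, ?)",
        (name, accuracy, artifact_uri),
    )
    conn.commit()
```

Queries such as "best run by accuracy" stay plain SQL, and the stored URI is used to fetch the model weights only when they are actually needed.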